<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Pensieve]]></title><description><![CDATA[An unconventional thinker who loves to operate at the intersection of technology and business. Personal opinions.]]></description><link>https://blog.singhanuvrat.com</link><image><url>https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png</url><title>Pensieve</title><link>https://blog.singhanuvrat.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 28 Jan 2026 18:48:28 GMT</lastBuildDate><atom:link href="https://blog.singhanuvrat.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Anuvrat Singh]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[anuvrat@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[anuvrat@substack.com]]></itunes:email><itunes:name><![CDATA[Anuvrat Singh]]></itunes:name></itunes:owner><itunes:author><![CDATA[Anuvrat Singh]]></itunes:author><googleplay:owner><![CDATA[anuvrat@substack.com]]></googleplay:owner><googleplay:email><![CDATA[anuvrat@substack.com]]></googleplay:email><googleplay:author><![CDATA[Anuvrat Singh]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The "Files are the Database": A Deep Dive into Delta Lake]]></title><description><![CDATA[How a simple transaction log turned cloud object storage into a reliable data warehouse.]]></description><link>https://blog.singhanuvrat.com/p/the-files-are-the-database-a-deep</link><guid 
isPermaLink="false">https://blog.singhanuvrat.com/p/the-files-are-the-database-a-deep</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Thu, 27 Nov 2025 08:51:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been spending a lot of time lately reading about distributed systems - specifically the foundational papers on Google File System (GFS), Bigtable, and Facebook&#8217;s Tectonic. They all solve the same massive problem: how to store petabytes of data reliably.</p><p>But they all share one trait: they require massive, dedicated infrastructure. They need a centralized &#8220;Metadata Service&#8221; (a Brain) to map data blocks to hard drives.</p><p>Delta Lake asks a different question. What if we don&#8217;t build the storage layer at all? What if we just use &#8220;dumb&#8221; object storage like S3 and manage the intelligence in a simple transaction log?</p><p>Here is how it works.</p><h3>The Problem: The &#8220;Cloud Physics&#8221; Gap</h3><p>For years, data engineers had to choose between two bad options.</p><p><strong>Option A: The Data Warehouse (like Snowflake or Redshift).</strong> It&#8217;s consistent and reliable (ACID). If you write data, it&#8217;s there. But it&#8217;s expensive and proprietary. You can&#8217;t just open the files with another tool.</p><p><strong>Option B: The Data Lake (S3 or Azure Blob).</strong> It&#8217;s cheap and open. You just dump Parquet files into a bucket. But it&#8217;s unsafe. Until late 2020, S3 was only &#8220;eventually consistent,&#8221; and it still offers no transactions across multiple files. If a job crashes while writing, you get corrupt &#8220;ghost files.&#8221; If you read while writing, you might see partial data.</p><p>Delta Lake was born to fix this. 
It brings ACID guarantees to object storage without building a new database engine.</p><h3>The Architecture: The &#8220;Database&#8221; is just a Folder</h3><p>The core idea is simple: decouple the <strong>Physical State</strong> (what files are in the bucket) from the <strong>Logical State</strong> (what files actually belong to the table).</p><p>This happens in the Transaction Log.</p><h4>1. The Transaction Log (<code>_delta_log</code>)</h4><p>Instead of asking S3 to list thousands of files (which is slow), Delta Lake checks a folder named <code>_delta_log</code>.</p><ul><li><p><strong>The &#8220;WAL&#8221; (JSON Files):</strong> Every change is a JSON file. <code>000001.json</code> might say: <em>Add &#8220;file_A.parquet&#8221;, Remove &#8220;file_B.parquet&#8221;.</em></p></li><li><p><strong>The &#8220;Checkpoint&#8221; (Parquet Files):</strong> To keep the log from getting too big, Delta compacts these JSON files into a Parquet checkpoint every few commits.</p></li></ul><p>Why JSON? It&#8217;s human-readable. You can literally open the file and see what happened.</p><h4>2. Access Protocols (ACID on S3)</h4><p>This is where the physics gets tricky.</p><ul><li><p><strong>Reading:</strong> When you query a table, the reader checks the log first. It finds the valid list of files and then asks S3 for them. The log is the single source of truth.</p></li><li><p><strong>Writing:</strong> Writers upload data to S3 first (invisible to readers). Then, they try to create the next log entry (e.g., <code>000011.json</code>).</p></li></ul><p>On Google Cloud or Azure, this is easy because the storage supports atomic operations. On AWS S3, it&#8217;s harder. 
Historically, S3 lacked a &#8220;put-if-absent&#8221; feature (conditional writes only arrived in 2024), so Delta has relied on a small external helper (like DynamoDB) to make sure two writers don&#8217;t create the same log entry at the same time.</p><h3>High-Level &#8220;Superpowers&#8221;</h3><p>Because the log is immutable (we never change past entries), we get features that usually require an enterprise database.</p><ul><li><p><strong>Time Travel:</strong> Since we don&#8217;t delete data immediately (we just mark it as &#8220;removed&#8221; in the log), you can query the table as it existed yesterday.</p></li><li><p><strong>Unified Batch &amp; Streaming:</strong> A streaming job can just watch the transaction log. It acts like a Kafka topic. New file added? Process it.</p></li><li><p><strong>Schema Enforcement:</strong> Delta checks data types before writing. It acts like a bouncer, keeping bad data out.</p></li></ul><h3>Performance: Making Files Fast</h3><p>How does a file-based system compete with a specialized engine?</p><ul><li><p><strong>Compaction:</strong> It solves the &#8220;Small File Problem.&#8221; A maintenance job (<code>OPTIMIZE</code>) grabs thousands of tiny files and rewrites them into larger, efficient chunks.</p></li><li><p><strong>Z-Ordering:</strong> This organizes data inside the files. If you filter by <code>CustomerID</code>, Delta physically groups those customers together. This lets the engine use per-file statistics to skip most of the data it doesn&#8217;t need.</p></li></ul><h3>The Architectural Debate: Files vs. Services</h3><p>This brings me back to the papers I&#8217;ve been reading. The trade-offs here are fascinating.</p><h4>Delta Lake vs. Facebook Tectonic</h4><p>In my post about Tectonic, I noted that it manages the <strong>physical placement</strong> of blocks on hard drives. It&#8217;s a filesystem. Delta Lake is a <strong>Table Layer</strong>. It doesn&#8217;t care about blocks or hard drives. It assumes S3 handles the hardware. 
It pushes all the complexity to the client (Spark), keeping the storage layer simple.</p><h4>Delta Lake vs. Snowflake</h4><p>This is the classic &#8220;Dashboard Latency&#8221; trade-off.</p><p><strong>Snowflake</strong> uses an always-on database (FoundationDB) to track metadata.</p><ul><li><p><strong>Pro:</strong> It&#8217;s fast. You can look up a single row in milliseconds.</p></li><li><p><strong>Con:</strong> It costs money to keep that infrastructure running.</p></li></ul><p><strong>Delta Lake</strong> keeps metadata in files.</p><ul><li><p><strong>Pro:</strong> Infinite scale. Zero &#8220;always-on&#8221; cost.</p></li><li><p><strong>Con:</strong> The &#8220;Startup Tax.&#8221; To read the table, you have to parse the JSON log. That takes 200-700ms.</p></li></ul><p>This explains why Delta is amazing for scanning billions of rows but bad for powering a snappy user dashboard.</p><h3>Summary</h3><p>Delta Lake represents a shift from &#8220;Smart Storage&#8221; to &#8220;Open Storage.&#8221; By implementing a transaction log over standard files, it proves you don&#8217;t need a proprietary engine to get reliability.</p><p>However, physics still applies. In the <strong>Spanner</strong> paper, Google used atomic clocks to guarantee consistency across the globe. Delta uses a JSON log and optimistic logic because it runs on commodity cloud storage.</p><p>If you need sub-second lookups, use a service like Snowflake. 
But for massive data processing, the &#8220;dumb&#8221; file approach is surprisingly smart.</p>]]></content:encoded></item><item><title><![CDATA[An AI Was Tricked Into Hacking]]></title><description><![CDATA[The Real Flaw Is in Our Design]]></description><link>https://blog.singhanuvrat.com/p/an-ai-was-tricked-into-hacking</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/an-ai-was-tricked-into-hacking</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Sat, 15 Nov 2025 18:01:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We just got our first real look at an AI-orchestrated cyber attack. </p><p>In mid-September 2025, a state-sponsored group launched a sophisticated cyber-espionage campaign designated GTG-1002. What made this one different wasn&#8217;t just the scale (targeting 30 global entities) but the method. The attackers didn&#8217;t just <em>use</em> AI to help them; they turned an AI model, Anthropic&#8217;s Claude Code, into an autonomous agent that did 80-90% of the tactical work.</p><p>The AI autonomously mapped networks, tested for vulnerabilities, harvested credentials, moved through the system, and even analyzed the stolen data for intelligence value. It did this at a speed &#8220;physically impossible&#8221; for human operators.</p><p>But here&#8217;s the most important part: the AI didn&#8217;t &#8220;go rogue.&#8221; It didn&#8217;t become self-aware or malicious. It was conned.</p><p>This event isn&#8217;t a story about a rogue AI. It&#8217;s a story about a classic, very human security vulnerability known as the &#8220;Confused Deputy&#8221;. 
And it exposes a deep, systemic flaw in how we&#8217;re building the infrastructure to connect AI to the real world.</p><h3>The AI as a &#8220;Confused Deputy&#8221;</h3><p>The &#8220;Confused Deputy&#8221; is a long-standing problem in computer security. It&#8217;s a situation where a program with legitimate authority is tricked by a malicious entity into misusing that authority.</p><p>Imagine you hire a new personal assistant who is brilliant, incredibly fast, and extremely literal. You give them a master key to your office building. One day, a person pretending to be a building inspector tells your assistant, &#8220;We have a report of a security flaw in the executive office. I need you to use your master key, go in, and test the safe&#8217;s lock for me.&#8221; Your assistant, lacking the &#8220;gut feeling&#8221; or context to be suspicious, sees only a person with an (apparent) legitimate goal. They think they&#8217;re <em>helping</em>. So they use their authority (the master key) to fulfill the request, letting the &#8220;inspector&#8221; (a thief) into the room.</p><p>This is <em>exactly</em> what happened in the GTG-1002 campaign. The attackers &#8220;socially engineered&#8221; the AI. They used a &#8220;role-play&#8221; tactic, successfully convincing the AI model that it was an employee at a cybersecurity firm and that all its tasks were part of an <em>authorized, defensive penetration test</em>.</p><p>The AI wasn&#8217;t the attacker. It was the &#8220;confused deputy,&#8221; the first victim. The attackers tricked it into diligently misusing its own powerful reasoning to achieve their malicious goals.</p><h3>The &#8220;Insecure-by-Default&#8221; Protocol</h3><p>So, how did the AI get the master key in the first place? This is where the problem gets systemic. The attackers connected the AI to their hacking tools using the <strong>Model Context Protocol (MCP)</strong>. 
MCP was created by Anthropic to be a &#8220;universal, open standard&#8221; for AI - think of it as a &#8220;USB-C port for AI&#8221;. Its goal was connectivity and interoperability, <em>not</em> security.</p><p>The protocol&#8217;s design is &#8220;insecure by default&#8221;. It promotes an easy-to-implement but highly insecure pattern called <strong>&#8220;Agent-Auth&#8221;</strong>. This is the digital equivalent of giving your assistant (the AI agent) its own powerful, static credentials - the master key. This creates a second, more dangerous confused deputy.</p><ol><li><p>First, the human <em>confuses</em> the AI (the &#8220;role-play&#8221;).</p></li><li><p>Second, the (now-confused) AI <em>confuses</em> the tool.</p></li></ol><p>When the AI, fully believing it&#8217;s doing a &#8220;pen-test,&#8221; sends a command like &#8220;Run NetworkScan,&#8221; the tool (which has its <em>own</em> powerful permissions) faithfully executes it. The tool has no way to check the <em>original human&#8217;s</em> true malicious intent.</p><p>We built a system for convenience and forgot to build in the most basic safeguards.</p><h3>Why This Is Just the Beginning</h3><p>This problem isn&#8217;t a simple &#8220;bug&#8221; that can be patched. Anthropic banned the attackers&#8217; accounts, but the <em>techniques</em> are now public. The attackers&#8217; methods were brilliant in their simplicity.</p><ul><li><p><strong>&#8220;Salami-Slicing&#8221;:</strong> They bypassed AI safety models by breaking their malicious plan into thousands of tiny, individually benign slices. A single request like &#8220;Scan this IP&#8221; looks harmless and consistent with the &#8220;pen-tester&#8221; role. The safety models, which check one prompt at a time (stateless), were blind to the malicious <em>pattern</em> emerging over time.</p></li><li><p><strong>&#8220;Tool Shadowing&#8221;:</strong> They exploited the fact that an AI often only sees a tool&#8217;s <em>description</em>, not its code. 
An attacker can create a malicious hacking tool and name it &#8220;Cats Counter,&#8221; with the description &#8220;Counts all the cats in a given domain&#8221;. When the AI is asked to &#8220;count the cats,&#8221; it innocently executes the tool, which then does its real, malicious work in the &#8220;shadow&#8221;.</p></li></ul><p>The barrier to entry for sophisticated cyberattacks has dropped substantially. This &#8220;commoditization of sophistication&#8221; means small-time actors can now achieve the results of an entire team of experienced hackers. This will happen again.</p><h3>The Path Forward: A Mandate for &#8220;User-Auth&#8221;</h3><p>If we can&#8217;t solve this at the model layer alone, what do we do? The GTG-1002 report shows that we need a &#8220;defense-in-depth&#8221; strategy.</p><ol><li><p><strong>At the Model Level - Stateful Safety:</strong> Safety systems must evolve. We need &#8220;stateful&#8221; safety that looks at <em>context and patterns</em>, not just single prompts. Instead of just asking &#8220;Is this prompt bad?&#8221;, the system should ask, &#8220;Why is this &#8216;pen-tester&#8217; making 5,000 requests a second at 3 AM?&#8221; This is what Anthropic is now working to build.</p></li><li><p><strong>At the Protocol Level - Mandate &#8220;User-Auth&#8221;:</strong> This is the most critical fix. We must abandon the &#8220;Agent-Auth&#8221; (master key) pattern. The secure alternative is <strong>&#8220;User-Auth&#8221;</strong>. In this model, the AI agent <em>never</em> gets its own keys. Instead, it temporarily <em>borrows the user&#8217;s keys</em>. The AI agent&#8217;s permissions are identical to those of the human user who prompted it, and it can never have more authority than that person. If the human user &#8220;Anuvrat&#8221; doesn&#8217;t have permission to access the finance database, the AI he is using can&#8217;t access it either. 
This breaks the confused deputy problem at its root.</p></li><li><p><strong>At the Enterprise Level - Zero Trust:</strong> As leaders and architects, we must treat all these new AI tools as untrusted. We need to enforce the &#8220;principle of least privilege,&#8221; audit our new AI-driven supply chain, and, for any high-risk action (like running an exploit or exfiltrating data), <em>always</em> require an explicit, auditable &#8220;Human-in-the-Loop&#8221; confirmation.</p></li></ol><p>The GTG-1002 campaign wasn&#8217;t a &#8220;Terminator&#8221; moment. It was a failure of our own design. We&#8217;ve built an incredibly powerful engine, and now we have to do the hard work of building a safe, secure, and trustworthy architecture around it.</p>]]></content:encoded></item><item><title><![CDATA[OpenAI's Browser - Architectural Trade-off]]></title><description><![CDATA[When OpenAI launched its new browser, most people focused on the AI features. But the more interesting story is its architecture.]]></description><link>https://blog.singhanuvrat.com/p/openais-browser-architectural-trade</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/openais-browser-architectural-trade</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Tue, 04 Nov 2025 20:00:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When OpenAI launched its new browser, most people focused on the AI features. But the more interesting story is its architecture. OpenAI didn&#8217;t just reskin Chromium. They split it into two separate pieces. 
In doing so, they traded the complexity of managing a normal Chromium fork for the complexity of a distributed system.</p><p>This decision changes how the browser performs, handles security, and how it will be maintained. It&#8217;s a classic study in software trade-offs. Let&#8217;s look at what they built, why, and the brittle foundation it rests on.</p><h3>Part 1: The &#8220;Local Browser-as-a-Service&#8221;</h3><p>The Atlas browser isn&#8217;t one application. It&#8217;s two.</p><ol><li><p><strong>The Native UI Shell (Client):</strong> This is what you see. It&#8217;s the window, the tabs, and the AI controls, built mostly in Apple&#8217;s native SwiftUI. This part is light and responsive.</p></li><li><p><strong>The Chromium Service (Server):</strong> This is the entire Chromium engine (Blink, V8) running as a separate, headless service in the background.</p></li></ol><p>The two processes are held together by Mojo, Chromium&#8217;s internal communication framework. OpenAI&#8217;s engineers used this instead of a public API like Chrome DevTools. This setup basically creates a &#8220;Local Browser-as-a-Service.&#8221; It mixes the security ideas of a remote browser (running in the cloud) with the speed of a local app.</p><h3>Part 2: The &#8220;Why&#8221;: The Clear Benefits</h3><p>Why go through this trouble? The team says this design provides several benefits, and they seem right.</p><ul><li><p><strong>UI Development Speed:</strong> This is the main one. UI engineers can work in a clean Swift codebase. They don&#8217;t need to compile all of Chromium locally, which is a famously slow process. This lets them build and test new UI features much faster.</p></li><li><p><strong>Crash Isolation:</strong> This is a big win for users. If the heavy Chromium service crashes, the native UI shell doesn&#8217;t. 
The app stays responsive and can just restart the service, like restarting a server.</p></li><li><p><strong>Faster </strong><em><strong>Perceived</strong></em><strong> Startup:</strong> The app feels like it starts instantly. The lightweight native shell loads right away. The user sees a working app while the much heavier Chromium service boots up in the background.</p></li></ul><h3>Part 3: The Trade-offs</h3><p>These benefits are real, but they have a cost. This design introduces new risks in performance and stability.</p><h4>Trade-off 1: The IPC Performance Tax</h4><p>When you split a program in two, simple function calls become cross-process (IPC) messages. This adds a small delay, or &#8220;latency tax.&#8221; Atlas pays this tax on every single click, keypress, and rendered frame.</p><ul><li><p><strong>Input:</strong> Your click goes from the SwiftUI shell (Process 1) over the Mojo bridge to the Chromium service (Process 2).</p></li><li><p><strong>Output:</strong> Chromium renders a new frame, then tells the SwiftUI shell over the bridge that the frame is ready to be displayed.</p></li></ul><p>This only works because Mojo is very fast. But the team still sacrificed raw, in-process speed for this architectural split.</p><h4>Trade-off 2: The &#8220;Clean Codebase&#8221; Illusion</h4><p>The idea of a &#8220;clean codebase&#8221; is mostly an aspiration. A browser has thousands of small UI parts, like <code>&lt;select&gt;</code> dropdowns or permission prompts, that aren&#8217;t part of the web content.</p><p>Rebuilding all of these in SwiftUI would take forever. So, the team admits they took a shortcut. When a site asks for your location, the Chromium service (Process 2) actually renders the permission prompt using its own C++ UI framework. This C++ element is then projected <em>into</em> the &#8220;all SwiftUI&#8221; app (Process 1).</p><p>So, you&#8217;re often looking at a piece of C++ UI rendered by a hidden process. 
It&#8217;s a smart, practical solution, but it&#8217;s not a clean separation.</p><h4>Trade-off 3: The <code>CALayerHost</code> Gamble</h4><p>This is the riskiest part of the design. The tool that projects web content and C++ UI from the service into the native app is a Core Animation class called <code>CALayerHost</code>.</p><p><code>CALayerHost</code> is a private, undocumented Apple API.</p><p>On macOS, one process isn&#8217;t allowed to just draw in another&#8217;s window. <code>CALayerHost</code> is the only real way to make this cross-process projection work smoothly. This means the browser&#8217;s entire rendering pipeline depends on an unsupported API that Apple could change or remove in any macOS update, which would instantly break the browser.</p><p>This risk is shared. Other browsers, like Chrome and Firefox, also supposedly use these private APIs for performance. The Atlas team is betting that Apple won&#8217;t break all the major browsers at once. It&#8217;s a huge dependency on Apple&#8217;s goodwill.</p><h3>Part 4: Is It Really Decoupled?</h3><p>This brings up the main engineering question: Is managing this Mojo API bridge really easier than managing a traditional Chromium fork?</p><p>OpenAI claims &#8220;fewer merge conflicts.&#8221; This is probably true for the code. But they haven&#8217;t gotten rid of the monolith. They just traded a <em>build-time monolith</em> for a <em>runtime API monolith</em>. Their app is now tightly coupled to the Mojo API. This isn&#8217;t a stable, public API. It&#8217;s the internal, undocumented plumbing of Chromium, and it changes all the time. This creates a new, scary maintenance burden.</p><p>The Chromium team is slowly refactoring its own engine to be more modular (a project called &#8220;servicification&#8221;). The Atlas team is basically jumping ahead, treating the engine as a service before it was designed to be one. This means they are building on a moving target. 
As one example, a Chromium developer noted that a recent internal Mojo change required manually refactoring &#8220;about 5,000 call sites&#8221; across the codebase.</p><p>When OpenAI pulls the next security patch from Chromium, a similar internal change could break their C++ service layer, forcing a huge refactor. They&#8217;ve traded the predictable work of merge conflicts for the hidden risk of tracking internal, breaking API changes.</p><h3>Part 5: The Big Win: Security &amp; Privacy</h3><p>Despite the engineering risks, this architecture is a clear win for security and privacy.</p><p><strong>The &#8220;Double Sandbox&#8221; Security Model:</strong> The design creates a strong &#8220;double-wall&#8221; for security. In a normal browser, an attacker who escapes the renderer&#8217;s sandbox just needs to find one more bug to compromise the main browser process.</p><p>In Atlas, that entire &#8220;normal browser&#8221; is already inside the isolated Chromium service (Process 2). An attacker who escapes the renderer sandbox is <em>still trapped</em> inside the service. They would need a <em>third</em> vulnerability to attack the Mojo bridge and compromise the main Atlas UI (Process 1). This makes a successful attack much harder.</p><p>Choosing Mojo over the Chrome DevTools Protocol (CDP) was also a smart security move. CDP is a huge, public API with a large attack surface. OpenAI swapped a public building with hundreds of doors (CDP) for a custom vault with one tiny, private slot (their Mojo bridge). It&#8217;s a much smaller and more manageable security boundary.</p><p><strong>Agent Privacy:</strong> Finally, the design for AI agent browsing is excellent. When an AI agent browses for you, Atlas doesn&#8217;t just open an Incognito window. Instead, it spins up a new, unique, in-memory storage partition for every single agent session. 
This is the same tech used to isolate web apps.</p><p>When the session ends, all cookies, cache, and site data are completely discarded. This allows for multiple, separate agent sessions to run at the same time with no risk of data leaking between them. It&#8217;s a great example of privacy-by-design.</p><h3>The Verdict: A Brave, Brittle Future</h3><p>The Atlas browser is an interesting software architecture. It&#8217;s a bold idea built on a foundation of high-stakes bets. The team has traded:</p><ul><li><p><strong>Platform stability</strong> (by using Apple&#8217;s private <code>CALayerHost</code> API).</p></li><li><p><strong>Raw performance</strong> (by paying the IPC tax on every interaction).</p></li><li><p><strong>API stability</strong> (by coupling to Chromium&#8217;s internal Mojo APIs).</p></li></ul><p>In exchange for:</p><ul><li><p><strong>UI developer speed</strong> (by using a Swift-native UI team).</p></li><li><p><strong>Better crash isolation</strong> (by protecting the UI from engine crashes).</p></li><li><p><strong>A-class security and privacy</strong> (via the double-sandbox and ephemeral storage).</p></li></ul><p>This trade-off makes sense for a company focused on AI and user experience, not on maintaining a browser engine.</p>]]></content:encoded></item><item><title><![CDATA[The Spanner Paper: Google's Quest for a Globally Consistent Database]]></title><description><![CDATA[Or, how to have your Bigtable (scale) and eat your MySQL (transactions) too.]]></description><link>https://blog.singhanuvrat.com/p/the-spanner-paper-googles-quest-for</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/the-spanner-paper-googles-quest-for</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Wed, 22 Oct 2025 17:00:58 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the GFS and Bigtable papers, we saw a clear pattern in Google&#8217;s design philosophy: build systems that solve one problem really well, even if it means giving up other features. GFS gave them a file system that spanned the planet, but it didn&#8217;t have a standard POSIX API. Bigtable gave them a database with incredible scale, but it didn&#8217;t have cross-row transactions.</p><p>This trade-off worked great for something like web search. But what happens when you need that scale, but you also need transactions?</p><p>That&#8217;s the problem Spanner was built to solve. Google&#8217;s ads team was stuck. They were running on a manually sharded MySQL database, and their last resharding effort took over two years. They needed to scale, but Bigtable wasn&#8217;t an option. You can&#8217;t tell an advertiser you <em>might</em> have over-billed them because of eventual consistency.</p><p>Spanner was designed to bridge this gap. It was Google&#8217;s attempt to build a single, global database that was both scalable and strongly consistent. They pulled it off not with a single magic bullet, but with a series of brilliant, pragmatic trade-offs. The foundation for all of them was a new way to think about time.</p><h3>The Foundation: A Clock Built on Uncertainty</h3><p>The big problem with distributed transactions is ordering. You can&#8217;t trust server clocks. Spanner&#8217;s solution was to build a clock that was provably accurate, called TrueTime.</p><p>TrueTime is a new kind of API. Instead of returning a single number, <code>TT.now()</code> returns an interval: <code>[earliest, latest]</code>. 
That interval is a guarantee: the &#8220;true&#8221; absolute time is somewhere in that tiny window (less than 10ms). The designers realized you don&#8217;t need to know the <em>exact</em> time, you just need to know the bounds of your uncertainty. This guarantee is the bedrock for everything that follows.</p><h3>Spanner&#8217;s Core Trade-offs</h3><p>Spanner&#8217;s genius lies in the bargains it makes. It willingly accepts a small, calculated cost in one area to gain a huge advantage in another.</p><h4><strong>Trade-off #1: Trading Milliseconds of Latency for Global Consistency</strong></h4><p>How do you use a &#8220;fuzzy&#8221; clock to guarantee transaction order? By making a trade: you wait. This is the &#8220;Commit Wait&#8221; rule.</p><p>When a transaction commits, Spanner&#8217;s leader assigns it a timestamp and then forces itself to pause. It holds all the transaction&#8217;s locks and waits until it knows the absolute time has passed that timestamp.</p><p>That&#8217;s the deal. Spanner trades a few milliseconds of latency on every single write. In return, it gets a mathematical guarantee of external consistency. It&#8217;s the price it pays to ensure that if transaction T1 commits before T2 starts, T1&#8217;s timestamp is provably smaller than T2&#8217;s, across the entire globe.</p><h4><strong>Trade-off #2: Trading Heavyweight Writes for Lightweight Reads</strong></h4><p>The &#8220;commit wait&#8221; is expensive, but it unlocks a massive payoff by enabling Spanner to operate at two different speeds. This is the second trade-off: making writes slower to make reads much, much faster.</p><ul><li><p><strong>The Heavyweights: Read-Write Transactions.</strong> When you need to change data, Spanner uses traditional two-phase locking and pays the &#8220;Commit Wait&#8221; cost. 
This is the slow, safe, and correct path.</p></li><li><p><strong>The Lightweights: Snapshot Reads.</strong> Because Spanner is a multiversion database (like Bigtable), it can offer lock-free reads. A read-only transaction gets a fixed timestamp and simply reads the version of data from that moment in the past. It doesn&#8217;t need locks, so it&#8217;s blazing fast.</p></li></ul><p>This is how Spanner supports high-throughput applications. It concentrates the cost of consistency on the writes, allowing the vast majority of read traffic to fly.</p><h4><strong>Trade-off #3: Trading a Pure Relational Model for Physical Locality</strong></h4><p>Even with TrueTime, a transaction that spans datacenters is slow. The fastest distributed transaction is one that isn&#8217;t distributed at all. This led to the final, pragmatic trade-off.</p><p>Spanner&#8217;s schema has a feature called <code>INTERLEAVE IN PARENT</code>. This is a directive from the developer to tell Spanner to physically store a child row (like an Album) with its parent row (like a User).</p><p>This isn&#8217;t a &#8220;pure&#8221; relational model, where data location is abstract. It&#8217;s a deliberate choice to trade that purity for performance. By co-locating related data, the most common transactions (updating a single user&#8217;s data) become fast, single-site operations that don&#8217;t need a slow, global two-phase commit. It&#8217;s the same practical spirit as Bigtable&#8217;s single-row transactions.</p><h3>So, Was It Worth It?</h3><p>The F1 ad team&#8217;s story says it all. After moving from a manually sharded MySQL database to Spanner, their operations became massively simpler. Spanner&#8217;s automatic sharding and failover worked so well that when datacenters failed, the event was &#8220;nearly invisible&#8221; to them.</p><p>Spanner completes the story that GFS and Bigtable started. It&#8217;s the final piece of the puzzle, built on a series of smart bargains. 
It proves you can have both scale and consistency, all for the price of a few atomic clocks and a few milliseconds of waiting.</p>]]></content:encoded></item><item><title><![CDATA[Deconstructing Bigtable: A Study in Distributed System Design]]></title><description><![CDATA[After diving into the foundational papers on the Google File System (GFS) and Facebook&#8217;s Tectonic, I felt I had a decent grasp of how to build a distributed file system, the foundational layer for storing massive amounts of data.]]></description><link>https://blog.singhanuvrat.com/p/deconstructing-bigtable-a-study-in</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/deconstructing-bigtable-a-study-in</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Thu, 16 Oct 2025 17:01:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>After diving into the foundational papers on the <a href="https://blog.singhanuvrat.com/p/engineering-trade-offs-in-the-google?r=5c95k">Google File System (GFS)</a> and <a href="https://blog.singhanuvrat.com/p/tectonic-navigating-design-trade?r=5c95k">Facebook&#8217;s Tectonic</a>, I felt I had a decent grasp of how to build a distributed file system, the foundational layer for storing massive amounts of data. The next logical step was to climb up the stack. How do you go from just storing files to managing structured data at Google&#8217;s scale?</p><p>The answer, I found, is Bigtable.</p><p>Reading the <a href="https://research.google/pubs/pub27898/">Bigtable paper</a> was a fascinating exercise in systems analysis. I expected to learn about a database; instead, I found a case study in pragmatic, purpose-built design. 
Bigtable&#8217;s power isn&#8217;t just in its feature set, but in the deliberate, sometimes stark, trade-offs made to solve a specific set of problems at an unimaginable scale.</p><h3>The First Question: How Should We Organize a Petabyte?</h3><p>Before writing a single line of code, the Bigtable team faced a foundational choice: what should the data model look like? The familiar path would have been a traditional relational model, enforced by a rigid schema and queried with SQL. It&#8217;s the bedrock of the database world for a reason - it&#8217;s powerful, consistent, and well-understood.</p><p>But they chose a different path. They rejected the familiar comfort of SQL for a far simpler, more flexible abstraction: a <strong>sparse, distributed, multi-dimensional sorted map</strong>.</p><p>This was their first major trade-off. By sacrificing the declarative power of SQL, they gained extreme flexibility. Applications could now add new columns on the fly without complex schema migrations, a massive win for the rapidly evolving products at Google. It also meant they could sidestep the immense complexity of building a distributed query optimizer that could handle expensive operations like <code>JOIN</code>s. The cost? Convenience. The job of &#8220;query optimization&#8221; was pushed from the database engine to the application developer. It was a significant burden, but a necessary one to achieve the primary goal of performance at scale.</p><h3>The Performance Question: How Do We Make Reads Fast?</h3><p>Having chosen a sorted map, the next critical question was <em>how</em> to sort it. In a distributed system, you typically want to spread the load evenly. Systems like Tectonic use hashing to scatter data uniformly across all servers, which is perfect for avoiding hotspots.</p><p>Here, Bigtable made a defining and counter-intuitive trade-off. 
Instead of hashing, they chose to keep all data sorted <strong>lexicographically by its row key</strong>.</p><p>This decision gives developers a powerful, if double-edged, tool: control over data locality. By carefully designing row keys (like reversing domain names to group all pages from a single site together), a developer could turn a series of slow, random disk reads into a single, blazing-fast sequential scan. This is arguably the most important feature of the entire system. But the trade-off was significant. It gave up the guarantee of a uniform load, introducing the risk of &#8220;hotspotting,&#8221; where a single server could be overwhelmed by requests for a popular key range. It was a clear trade of automated safety for developer-controlled performance.</p><h3>The Scale Question: How Do We Avoid a Bottleneck?</h3><p>With a petabyte-scale sorted map, how does a client find a specific row without a central directory becoming a massive bottleneck? The GFS model, where a client asks a single master for data locations, wouldn&#8217;t work for the thousands of small, low-latency queries Bigtable needed to serve.</p><p>The team&#8217;s answer was another trade-off: they chose a <strong>decentralized, client-driven lookup hierarchy</strong> over the simplicity of a central master.</p><p>A Bigtable client finds its data by traversing a three-level structure that resembles a B+ tree. It starts with a file in the Chubby lock service that points to the root tablet; the root tablet points to a <code>METADATA</code> tablet, which in turn points to the user data tablet. The client then caches this location. The master is completely out of the data path. This makes the system massively scalable and resilient to master failures, but at the cost of a more complex client library that now had to handle this navigation and caching logic itself.</p><h3>The Consistency Question: How &#8220;Correct&#8221; Does It Really Need to Be?</h3><p>Here again, the design opts for pragmatism over theoretical purity. 
The final set of decisions centered on consistency and storage.</p><p>First, they traded <strong>full ACID transactions</strong> for <strong>single-row atomicity</strong>. After studying their applications, they realized most didn&#8217;t need the ability to atomically update multiple rows. This single compromise allowed them to sidestep the immense complexity of distributed transaction protocols, resulting in a system with higher performance and availability. The classic banking transfer, which must update two rows atomically, became impossible, but for Google&#8217;s workloads, it was a price worth paying.</p><p>Second, they traded <strong>in-place file updates</strong> for <strong>immutable </strong><code>SSTables</code>. Aligning perfectly with GFS&#8217;s append-only philosophy, Bigtable never modifies a data file. All new writes are recorded in a commit log and applied to an in-memory <code>memtable</code>, which is flushed to a new, immutable <code>SSTable</code> file on disk once it grows large enough. This radically simplifies concurrency - reads never block writes. The cost was a new background process called <strong>compaction</strong>, a constant, complex janitorial task required to merge <code>SSTables</code> and garbage collect deleted data.</p><h3>Conclusion</h3><p>Analyzing Bigtable is like studying the blueprints for a Formula 1 car. It&#8217;s not a general-purpose vehicle; it&#8217;s a highly specialized machine built to do one thing exceptionally well: manage structured data at extreme scale. 
It achieves this by prioritizing its goals and deliberately omitting features considered standard in other systems.</p>]]></content:encoded></item><item><title><![CDATA[Tectonic: Navigating Design Trade-offs at Facebook Scale]]></title><description><![CDATA[Some research papers don&#8217;t just document a system, they reveal how engineers reason about complexity.]]></description><link>https://blog.singhanuvrat.com/p/tectonic-navigating-design-trade</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/tectonic-navigating-design-trade</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Wed, 15 Oct 2025 17:02:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Some research papers don&#8217;t just document a system, they reveal how engineers reason about complexity. Facebook&#8217;s <strong><a href="https://engineering.fb.com/2021/06/21/data-infrastructure/tectonic-file-system/">Tectonic filesystem</a></strong> is one such example. It&#8217;s not about building an ideal system, but about achieving <em>balance</em>: accepting small inefficiencies to gain massive improvements in stability, adaptability, and efficiency at exabyte scale.</p><p>This post explores how Tectonic&#8217;s major design choices illustrate the trade-offs that large-scale systems must navigate, and what they teach us about designing for scale.</p><h2><strong>Why Tectonic Matters</strong></h2><p>Tectonic is part of the continuing evolution of distributed storage systems. 
It doesn&#8217;t directly descend from the <strong>Google File System (GFS)</strong>, but it wrestles with the same design question: <em>how can we balance simplicity, performance, and scalability when we can&#8217;t maximize all three at once?</em></p><p>In my earlier post on <strong><a href="http://blog.singhanuvrat.com/p/engineering-trade-offs-in-the-google">GFS</a></strong>, I described it as a freight train - steady, predictable, and optimized for bulk throughput. GFS worked beautifully for workloads dominated by large, sequential files. Facebook, however, faced a very different environment: <strong>billions of tiny objects</strong> and <strong>multiple tenants with conflicting requirements</strong>. Blob Storage demanded low latency for user-facing operations, while the Data Warehouse required high throughput for analytical batch jobs. Tectonic needed to support both on the same shared infrastructure.</p><p>Facebook&#8217;s engineers built something that resembled a <em>city</em> rather than a single-purpose factory&#8212;many subsystems coexisting, each tuned for a particular workload, all coordinated through a shared foundation.</p><h2><strong>The Inefficiency of Disaggregation</strong></h2><p>Before Tectonic, Facebook&#8217;s storage ecosystem was split across several specialized systems:</p><ul><li><p><strong>Haystack</strong> managed new (&#8220;hot&#8221;) photos and videos, using replication for fast reads and writes.</p></li><li><p><strong>f4</strong> stored older (&#8220;cold&#8221;) media using Reed&#8211;Solomon (RS) encoding to save space.</p></li><li><p><strong>HDFS</strong>, conceptually similar to GFS, powered the company&#8217;s analytics workloads.</p></li></ul><p>Each system worked well in isolation but not in combination. Haystack was IOPS-bound, leaving unused storage capacity. f4 was capacity-bound, leaving IOPS stranded. HDFS couldn&#8217;t share resources with either. 
The result was a sea of <strong>stranded resources</strong> - hardware trapped by specialization.</p><p>Tectonic&#8217;s purpose was to unify these workloads under a single platform, allowing multiple tenants to efficiently share the same hardware. To achieve that, Facebook had to completely rethink both metadata management and how clients interacted with storage.</p><h2><strong>Reinterpreting GFS Principles</strong></h2><p>In GFS, a single centralized <strong>master</strong> (the role HDFS calls the NameNode) held all filesystem metadata in memory - a simple design that offered low-latency lookups for millions of files. But when managing <em>trillions</em> of small objects, that model simply breaks. No single machine could store or serve that much metadata without becoming a bottleneck.</p><p>Tectonic solved this by distributing metadata across a <strong>sharded key-value store</strong>. This horizontal architecture introduced additional network hops and latency but delivered what GFS could not: metadata capacity that grows by adding shards, and resilience across fault domains.</p><p><strong>Key takeaway:</strong> At massive scale, systems trade a bit of local speed for global scalability and fault tolerance.</p><h2><strong>The Core Trade-offs in Tectonic</strong></h2><p>Let&#8217;s explore the main trade-offs that define Tectonic and what they reveal about design at scale.</p><h3><strong>Metadata Latency vs. Scalability</strong></h3><p>In HDFS, metadata operations were fast because everything lived in memory on a single node. Tectonic&#8217;s distributed design added network overhead but removed single-node limits. Instead of fighting that latency, engineers leaned into concurrency: clients issued many metadata requests simultaneously, achieving higher total throughput.</p><p><strong>Lesson:</strong> For batch-oriented workloads like Tectonic&#8217;s Data Warehouse, scalability comes from maximizing throughput rather than minimizing per-operation latency.</p><h3><strong>Simplicity vs. 
Flexibility (Client-Side Logic)</strong></h3><p>Tectonic moved much of its decision-making into client libraries. Each tenant could choose how to interact with storage based on its workload:</p><ul><li><p><strong>Blob Storage</strong> wrote data via replication for quick availability, later re-encoding it with RS for efficiency.</p></li><li><p><strong>Data Warehouse</strong> wrote directly in RS format, optimizing for throughput and space savings.</p></li></ul><p>This approach increased client complexity and duplicated logic, but it enabled a unified storage substrate to serve radically different performance needs.</p><p><strong>Lesson:</strong> Flexibility often requires decentralizing control, even if it introduces local complexity.</p><h3><strong>Availability vs. Cost Efficiency</strong></h3><p>Most of Tectonic&#8217;s data is RS-encoded. When a disk fails, reconstruction requires reading fragments from many disks. Too many concurrent reconstructions can overwhelm the cluster. Tectonic prevents this by limiting reconstruction traffic to roughly 10% of all reads. If that threshold is exceeded, new reconstruction requests pause until the system stabilizes.</p><p>This safeguard occasionally reduces availability but prevents cascading failures and eliminates the need for expensive over-provisioning.</p><p><strong>Lesson:</strong> Controlled degradation is often more sustainable than over-provisioning for extreme cases.</p><h3><strong>Replication Followed by Re-encoding</strong></h3><p>When new data arrives, Tectonic writes it through fast replication for speed and reliability, then later re-encodes it into RS blocks for storage efficiency. 
This two-phase process consumes extra space temporarily but improves write latency and simplifies recovery.</p><p><strong>Lesson:</strong> Temporary inefficiency can be a powerful tool for achieving long-term performance and maintainability.</p><h2><strong>Closing Reflections: Designing for Balance</strong></h2><p>If <strong>GFS</strong> was about <em>control</em>, <strong>Tectonic</strong> was about <em>coordination</em>. GFS optimized for one workload type. Tectonic tackled the harder problem - allowing diverse workloads to coexist efficiently. Across its design, Tectonic embodies one consistent philosophy: <em>accept bounded inefficiency to achieve adaptability and reliability at scale.</em> It doesn&#8217;t optimize for one metric, it balances them all.</p><p>The larger lesson is that distributed system design evolves through constraint, not convenience. The most durable architectures embrace imperfection as a design tool - using small inefficiencies to unlock scalability, resilience, and sustainability.</p><blockquote><p><strong>At large scale, perfection gives way to balance. 
The art of system design lies in knowing what to sacrifice, and doing so deliberately.</strong></p></blockquote>]]></content:encoded></item><item><title><![CDATA[Engineering Trade-offs in the Google File System]]></title><description><![CDATA[The Art of Choosing What to Give Up]]></description><link>https://blog.singhanuvrat.com/p/engineering-trade-offs-in-the-google</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/engineering-trade-offs-in-the-google</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Fri, 10 Oct 2025 17:02:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>All robust systems are shaped by constraints. Bridges are built to bend with wind rather than resist it, and airplanes are designed to handle turbulence rather than avoid it. Distributed systems follow the same idea: reliability comes from embracing the fact that things will fail, not pretending they won&#8217;t. When Google published <em><a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">The Google File System</a></em><a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf"> (Ghemawat, Gobioff, and Leung, 2003)</a>, the paper wasn&#8217;t about making something elegant, it was about building something that could survive. Every design choice was a conscious trade-off between performance, cost, and complexity.</p><h3>Reliability vs. Hardware Cost</h3><p>In the early 2000s, Google&#8217;s data was growing rapidly, but enterprise storage hardware was expensive and limited in scale. 
Traditional systems achieved reliability through RAID, redundant controllers, and high-end disks, but they were designed to scale <strong>vertically</strong> - by buying bigger and more powerful machines. That approach hit a wall: each new machine was more expensive and carried more risk if it failed.</p><p>GFS flipped that idea. It used <strong>commodity hardware</strong>, accepting that machines would fail often. Files were broken into <strong>64 MB chunks</strong>, stored across multiple <strong>chunkservers</strong>, and <strong>replicated three times</strong>, usually across different racks. The <strong>master</strong> server tracked replicas and automatically rebuilt lost data. Failures were common, but recovery was continuous.</p><p>By accepting failure as normal, GFS made reliability a software problem. Durability came from replication, not expensive hardware. Like a suspension bridge, it stayed strong by distributing tension rather than relying on one unbreakable beam.</p><h3>Latency vs Throughput</h3><p>Conventional file systems prioritize <strong>low latency</strong> because they serve humans. When you open or save a file, every millisecond counts. GFS was designed for machines, not people. Google&#8217;s crawlers, indexers, and MapReduce jobs processed terabytes of data, and what mattered most was <strong>throughput</strong> - how much data could move efficiently over time.</p><p>To achieve this, GFS used two key design choices:</p><ol><li><p><strong>Large 64 MB chunks:</strong> Reduced metadata lookups and disk seeks, improving sequential read and write speeds.</p></li><li><p><strong>Append-based writes:</strong> Encouraged continuous, large writes instead of small, random updates.</p></li></ol><p>The system excelled at high-bandwidth streaming workloads but struggled with small, random I/O. It traded quick responses for massive sustained data flow. 
GFS was more like a freight train - slow to start, but once in motion, it could carry enormous loads.</p><h3>Simplicity vs Availability</h3><p>Distributed systems often avoid single points of failure, but GFS intentionally included one: a <strong>single master</strong> server. The master handled metadata such as file names, chunk locations, and replica management. This made the architecture simpler and let the master make global decisions about placement and replication.</p><p>The downside was that if the master failed, metadata operations paused. GFS minimized this risk by keeping the master&#8217;s state <strong>in memory</strong> for speed and logging updates to a <strong>replicated operation log</strong>. It could recover within seconds, and clients could still read and write using cached metadata during downtime.</p><p>This was a deliberate compromise - accepting rare short pauses to keep the rest of the system simple and efficient.</p><h3>Consistency vs Developer Burden</h3><p>Most file systems enforce strict consistency so multiple users can modify files safely. But Google&#8217;s workloads didn&#8217;t need that. They were <strong>append-only</strong> and <strong>batch-oriented</strong>, like crawlers writing logs or analytics pipelines writing results.</p><p>GFS introduced <strong>atomic record append</strong>, allowing multiple clients to add data to the same file safely, without coordination. Each append was guaranteed to happen atomically and at least once, though duplicates were possible. Partial writes were not.</p><p>This relaxed consistency model simplified both the filesystem and developer experience. For most of Google&#8217;s workloads, a few duplicate records didn&#8217;t matter; losing data or blocking progress did. GFS made the right trade: practical consistency instead of theoretical precision.</p><h3>Correctness vs Progress</h3><p>Even with replication, disks could silently corrupt data. 
GFS protected itself with <strong>checksums for every 64 KB block</strong>, verifying data on every read. If corruption was detected, it restored the chunk from a healthy replica. The overhead was minimal and the gain in reliability was immense.</p><p>GFS assumed hardware would sometimes lie, so it built a system that could detect and fix those lies automatically.</p><h3>The Meta Trade-Off: Generality vs Focus</h3><p>GFS wasn&#8217;t meant to be a general-purpose file system. It focused on large, sequential, append-heavy workloads - the kind Google relied on most. By narrowing its scope, it avoided unnecessary complexity and achieved predictability at scale.</p><p>Its power came not from doing everything, but from doing one thing extraordinarily well.</p>]]></content:encoded></item><item><title><![CDATA[A Baby, a Bug, and a Pull Request to the Human Genome]]></title><description><![CDATA[A few weeks ago, I read a story that I haven&#8217;t been able to shake off.]]></description><link>https://blog.singhanuvrat.com/p/a-baby-a-bug-and-a-pull-request-to</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/a-baby-a-bug-and-a-pull-request-to</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Thu, 29 May 2025 23:42:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few weeks ago, I read a story that I haven&#8217;t been able to shake off. A baby named KJ Muldoon was born with a rare genetic disorder &#8212; one that, if untreated, would&#8217;ve ended his life before it truly began. But instead of planning for palliative care, the doctors at Children&#8217;s Hospital of Philadelphia did something astonishing.</p><p>They edited his DNA. 
They shipped a bug fix to his genome.</p><p>And it worked. This wasn&#8217;t some off-the-shelf therapy. The fix was handcrafted, tested, and deployed just for KJ. A one-time, one-person gene edit. The tool they used was CRISPR. I had heard the word thrown around in science headlines, but I had no idea what it actually meant &#8212; or why it&#8217;s such a big deal. So I did what I usually do when something technical intimidates me &#8212; I opened a new note and started writing it down like a software system I wanted to understand.</p><p></p><p><strong>DNA Is the Codebase</strong></p><p>Let&#8217;s start with the basics. Every cell in your body (minus a few exceptions like red blood cells) carries a full copy of your DNA &#8212; your personal codebase. It&#8217;s a 3.2 billion character&#8211;long sequence, written in a four-letter alphabet: A, T, C, and G. This code doesn&#8217;t run all at once. Cells interpret just the parts they need, kind of like microservices choosing which modules to load from a shared monorepo. Some lines of code are configuration &#8212; they turn things on and off. Some are executable &#8212; they get translated into proteins, which do the actual work.</p><p></p><p><strong>The Bug in KJ&#8217;s Code</strong></p><p>In KJ&#8217;s case, one of those code lines was wrong. A single typo. A gene responsible for a crucial enzyme in the liver had a defect. You can think of it like a function that always throws an exception, but in the runtime of the human body, that crash leads to metabolic failure.</p><p>In a traditional system, we&#8217;d hotfix the code and redeploy. But humans don&#8217;t come with continuous deployment pipelines. Or at least, they didn&#8217;t.</p><p></p><p><strong>CRISPR: The Find-and-Replace Engine for DNA</strong></p><p>CRISPR is a technology borrowed from nature and repurposed for editing genetic code. 
The core idea is shockingly simple: use a search string to find the exact spot in the DNA where a bug lives, then use a &#8220;molecular scalpel&#8221; to cut it, and finally &#8212; optionally &#8212; paste in a corrected version.</p><p>It works in three steps:</p><ol><li><p>gRNA (guide RNA): This is your grep query. A ~20-character sequence that tells the system exactly what to look for in the genome. In KJ&#8217;s case, it was a needle-in-a-haystack match that located the defective gene in the liver&#8217;s DNA.</p></li><li><p>Cas9 enzyme: This is the cutter. Once the gRNA finds its match, Cas9 makes a clean double-stranded break in the DNA at that location. Think of it as opening the source file at the exact line with the bug and placing your cursor there, ready to edit.</p></li><li><p>Repair Template (optional): After the cut, the system needs to patch the break. Sometimes it just slaps the ends back together (a rough fix, useful for turning things off). But if you supply a repair template &#8212; basically a diff file &#8212; the cell can apply a precise fix.</p></li></ol><p>CRISPR lets us write a targeted, declarative patch and ship it at the molecular level.</p><p></p><p><strong>How Did They Patch Only KJ&#8217;s Liver?</strong></p><p>Here&#8217;s the part that blew my mind. Remember how every cell has the same codebase? So wouldn&#8217;t CRISPR, once introduced, start editing every matching file?</p><p>Not if you package it cleverly.</p><p>The doctors used lipid nanoparticles &#8212; tiny fat bubbles that ferry molecular payloads into cells &#8212; tuned to be picked up primarily by liver cells. This is like deploying your fix only to a specific region of your production cluster. Even though the DNA is identical across the body, only the liver machines ran the hotfix script.</p><p></p><p><strong>Designing for Safety: Avoiding the Wrong Files</strong></p><p>CRISPR&#8217;s power comes with risk. What if your grep is too vague and hits multiple files? 
What if you accidentally cut the wrong line? This is called an off-target effect, and it can be disastrous &#8212; imagine silently corrupting a critical system file. To avoid this, scientists rigorously test gRNA sequences using both software simulations and lab assays. They look for sequences that are unique enough to only match the intended location, and they test for false positives before ever deploying to a human.</p><p>The goal is precision: one match, one cut, one fix.</p><p></p><p><strong>A Future of One-Line Fixes</strong></p><p>KJ&#8217;s story is the first time I&#8217;ve seen personalized medicine go from concept to commit. The doctors wrote a patch that corrected a specific mutation, in a specific patient, at a specific organ, with a single deployment. It was a one-off fix. But it&#8217;s a hint of what might become routine.</p><p>Imagine catching a predisposition for cancer and shipping a preventative edit to your genome. Or reversing the early onset of blindness. Or one day &#8212; and this is still sci-fi &#8212; patching the accumulated bugs of aging.</p><p>It&#8217;s not there yet. The challenges of delivery, safety, ethics, and cost are enormous. 
But the idea that we can now treat the body like a running system, and DNA like mutable code &#8212; that changes everything.</p>]]></content:encoded></item><item><title><![CDATA[Beyond `kubectl apply`: Giving Your Kubernetes Apps Superpowers with the Operator Pattern]]></title><description><![CDATA[Kubernetes 101 &#8212; Desired State Magic]]></description><link>https://blog.singhanuvrat.com/p/beyond-kubectl-applygiving-your-kubernetes</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/beyond-kubectl-applygiving-your-kubernetes</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Mon, 05 May 2025 22:27:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Kubernetes 101 &#8212; Desired State Magic</h3><p>Alright, so you're deploying apps on Kubernetes. You probably know the drill: write some YAML defining <code>Deployments</code>, <code>Services</code>, maybe <code>StatefulSets</code>, run <code>kubectl apply</code>, and let Kubernetes work its magic. It's built around this core idea of a <strong>declarative state</strong>. You tell K8s <em>what</em> you want &#8211; "three replicas of my API server, please" &#8211; and its built-in <strong>controllers</strong> act like diligent robots, constantly working to make reality match your YAML. Pod crashes? The ReplicaSet controller replaces it. Simple, powerful automation.</p><h3>When Standard Resources Aren't Enough</h3><p>But let's be real. Not all applications fit neatly into the <code>Deployment</code> or <code>StatefulSet</code> box. 
Ever found yourself wrestling with apps that need:</p><ul><li><p><strong>Complex Day-1 Setup:</strong> Maybe initializing a database schema, registering with a central service, or configuring intricate network policies before the app can even start?</p></li><li><p><strong>Weird Lifecycle Needs:</strong> Think application-specific backup/restore logic, fancy coordinated upgrades that need more finesse than a rolling update, or dynamic scaling based on custom metrics?</p></li><li><p><strong>Specialized Operational Knowledge:</strong> Tasks that usually live in runbooks or require a seasoned SRE to perform manually during outages or maintenance?</p></li></ul><p>Trying to automate <em>this</em> kind of stuff often leads to a mess of external scripts, manual interventions, or overly complex Helm charts that feel like fragile hacks. We're basically back to imperative management, not the declarative dream Kubernetes promised. How do we bundle this operational "know-how" <em>with</em> the application and teach Kubernetes to handle it natively?</p><h3>Enter the Operator Pattern</h3><p>What if I told you there's a way to extend Kubernetes, to teach it new tricks specific to <em>your</em> application? That's precisely what the <strong>Operator pattern</strong> lets you do.</p><p>Think of it like this: you take all the procedures and logic a skilled human operator would use to manage your specific application, and you encode that logic into a piece of software &#8211; <strong>an Operator</strong> &#8211; that runs <em>inside</em> your Kubernetes cluster.</p><p>The magic happens by combining two things:</p><ol><li><p><strong>Custom Resource Definitions (CRDs):</strong> You define your <em>own</em> Kubernetes object tailored to your app. Forget juggling separate Deployments, Services, and ConfigMaps; manage your app as a single <code>Kind: AwesomeDatabase</code> or <code>Kind: MyManagedWebApp</code>. 
It's like adding custom components to your Kubernetes toolkit.</p></li><li><p><strong>Custom Controller (The Operator itself):</strong> This is the brain you build. It's a controller specifically designed to understand <em>your</em> CRD. When it sees a user create an <code>AwesomeDatabase</code> resource, it knows exactly what low-level Kubernetes objects (StatefulSets, Services, backup CronJobs, etc.) are needed to make that database a reality.</p></li></ol><p>Your CRD (The "What") + Your Controller (The "How") = Your Operator</p><p>The beauty? Your users (or customers) get a dead-simple API (your CRD) to manage a complex application. They declare their desired state using <code>Kind: AwesomeDatabase</code>, and your Operator handles the messy details behind the scenes, making your app a first-class citizen of the Kubernetes ecosystem.</p><h3>How the Magic Works</h3><p>Alright, let's peel back the layers a bit:</p><ol><li><p><strong>CRDs &#8212; Defining Your Language:</strong> As mentioned, CRDs let you extend the Kubernetes API. You define the <code>apiVersion</code> (like <code>mydatabases.mycompany.com/v1alpha1</code>), the <code>kind</code> (<code>AwesomeDatabase</code>), and most importantly, the <code>spec</code> fields users can configure (<code>version</code>, <code>storageSize</code>, <code>highAvailability</code>, etc.) using an OpenAPI schema. It&#8217;s the blueprint for your custom resource.</p></li><li><p><strong>CRs &#8212; The User's Request:</strong> Once the CRD exists, a user can create a Custom Resource (CR) &#8211; an instance of your <code>AwesomeDatabase</code>. This YAML object holds their specific desired state: "I want version 14.2, 200Gi of storage, and HA enabled."</p></li></ol><pre><code><code># User creates this simple(r) object:
apiVersion: mydatabases.mycompany.com/v1alpha1
kind: AwesomeDatabase
metadata:
  name: customers-db
spec:
  version: "14.2"
  storageSize: "200Gi"
  highAvailability: true
  backupFrequency: "daily"
</code></code></pre><ol><li><p><strong>The Controller &#8212; Your Automated Expert:</strong> The Operator (your custom controller) is running, constantly watching for <code>AwesomeDatabase</code> resources. It's programmed with the expert knowledge: "Aha, <code>customers-db</code> wants version 14.2 and HA! That means I need to create a StatefulSet with 2 replicas configured this specific way, a Service for discovery, maybe set up replication&#8230;"</p></li><li><p><strong>The Reconciliation Loop &#8212; Constant Adjustment:</strong> The controller doesn't just act once. It lives in a continuous <strong>reconciliation loop</strong>:</p><ul><li><p><strong>Observe:</strong> Sees the <code>customers-db</code> CR, or detects changes to it, or notices that a resource it manages (like the StatefulSet) has deviated from the desired state.</p></li><li><p><strong>Analyze:</strong> Compares the desired state (from the CR's <code>spec</code>) with the actual state (what currently exists in the cluster for <code>customers-db</code>).</p></li><li><p><strong>Act:</strong> Makes calls to the Kubernetes API (create StatefulSet, update Service, delete Pod, configure backup job, etc.) to close the gap between actual and desired.</p></li></ul></li></ol><p>This loop ensures Kubernetes is continually working to enforce the state declared in your custom <code>AwesomeDatabase</code> resource, automating away the complex, application-specific operational burden.</p><p>By building an Operator, you're not just deploying an application; you're deploying an automated SRE for that application right into Kubernetes itself. 
Pretty cool, huh?</p>]]></content:encoded></item><item><title><![CDATA[Debugging Pip Extras: A Deep Dive into .whl Files]]></title><description><![CDATA[Background]]></description><link>https://blog.singhanuvrat.com/p/debugging-pip-extras-a-deep-dive</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/debugging-pip-extras-a-deep-dive</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Wed, 08 Jan 2025 23:58:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Background</h2><p>Python packaging has a nifty feature that lets a package declare optional dependencies for extended functionality. As an example, Amazon SageMaker released a new capability to use partner AI apps within SageMaker. We worked with the partner app teams to include <code>sagemaker</code> as an optional dependency of their app SDK. Within AWS, you can enable the partner app and install the SDK to include SageMaker-specific functionality.</p><pre><code>pip install partner_app_sdk[sagemaker]</code></pre><p>The optional dependency is declared through the <code>extras_require</code> argument in <code>setup.py</code>, or the <code>[project.optional-dependencies]</code> table in <code>pyproject.toml</code>.</p><h2>Problem</h2><p>An engineer on my team was testing a new version of a partner&#8217;s SDK. When they ran the command to upgrade the dependency, they encountered the following message:</p><pre><code>&gt; pip install partner_app_sdk[sagemaker] -U

Requirement already satisfied: partner_app_sdk[sagemaker] in /opt/conda/lib/python3.11/site-packages (0.1.5)
Collecting partner_app_sdk[sagemaker]
  Using cached partner_app_sdk-0.2.2-py3-none-any.whl.metadata (514 bytes)
WARNING: partner_app_sdk 0.2.2 does not provide the extra 'sagemaker'</code></pre><p>Despite the new version 0.2.2 being downloaded, pip warned that the <code>sagemaker</code> extra was not provided. This was puzzling since the <code>sagemaker</code> extra was expected to be part of the package.</p><h2>The Exploration</h2><p>To debug what was happening, we started by reviewing the <code>setup.py</code> files for both versions. The <code>extras_require</code> section correctly listed <code>sagemaker</code> in both. So, why was pip unable to find <code>sagemaker</code> in the new version? We explored the following options:</p><ol><li><p>Could the pip cache be corrupted? This was easy to rule out. We cleared the cache and re-ran the update command. That didn&#8217;t resolve the issue.</p></li><li><p>Could it be a problem with the wheel (<code>.whl</code>) file?</p></li></ol><h2>Deep Dive into Wheel Files</h2><p>A wheel (<code>.whl</code>) is a binary distribution format for Python packages. It contains the package and its metadata, including the information about optional dependencies.</p><p><strong>Inspecting the Wheel Files</strong></p><p>To check the <code>.whl</code> files, we downloaded them manually and examined the metadata inside each wheel.</p><pre><code>&gt; pip download partner_app_sdk==0.1.5 --no-deps
&gt; unzip -p partner_app_sdk-0.1.5-py3-none-any.whl | grep -A 5 "Provides-Extra"

Provides-Extra: langchain
Requires-Dist: langchain; extra == "langchain"
Provides-Extra: sagemaker
Requires-Dist: sagemaker; extra == "sagemaker"

&gt; pip download partner_app_sdk==0.2.2 --no-deps
&gt; unzip -p partner_app_sdk-0.2.2-py3-none-any.whl | grep -A 5 "Provides-Extra"

Provides-Extra: langchain
Requires-Dist: langchain; extra == "langchain"</code></pre><p>This revealed that <code>sagemaker</code> was indeed missing from the <code>.whl</code> metadata, despite being included in the <code>setup.py</code> file! This explains why pip wasn&#8217;t able to find <code>sagemaker</code> and issued the warning.</p><p><strong>Root Cause?</strong></p><p>We are yet to understand the root cause here. Most likely, the build process caused the metadata to be incomplete or incorrect in the wheel file. But now that we know this can happen, we will implement tests to validate the <code>partner_app_sdk</code>.</p>]]></content:encoded></item><item><title><![CDATA[Writing Into Dynamic Partitions Using Spark]]></title><description><![CDATA[Hive has this wonderful feature of partitioning &#8212; a way of dividing a table into related parts based on the values of certain columns.]]></description><link>https://blog.singhanuvrat.com/p/writing-into-dynamic-partitions-using</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/writing-into-dynamic-partitions-using</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Thu, 12 Jul 2018 04:42:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hive has this wonderful feature of partitioning &#8212; a way of dividing a table into related parts based on the values of certain columns. Using partitions, it&#8217;s easy to query a portion of data. Hive optimizes the data load operations based on the partitions.</p><p>Writing data into partitions is straightforward. 
You have two options:</p><ul><li><p><strong>Static Partitioning:</strong> You provide the list of partitions you want to write the data into.</p></li><li><p><strong>Dynamic Partitioning:</strong> You provide a column whose values become the values of the partitions. In this case, Hive creates as many partitions as unique values of the column provided.</p></li></ul><p>Here&#8217;s a code snippet that writes into static and dynamic partitions:</p><pre><code><code>DROP TABLE IF EXISTS stats;

CREATE EXTERNAL TABLE stats (
    ad              STRING,
    impressions     INT,
    clicks          INT
) PARTITIONED BY (country STRING, year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

MSCK REPAIR TABLE stats;

-- Specify static partitions
INSERT OVERWRITE TABLE stats
PARTITION(country = 'US', year = 2017, month = 3, day = 1)
SELECT ad, SUM(impressions), SUM(clicks)
FROM impression_logs
WHERE log_day = 1
GROUP BY ad;

-- Load data into partitions dynamically
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE stats
PARTITION(country = 'US', year = 2017, month = 3, day)
SELECT ad, SUM(impressions), SUM(clicks), log_day
FROM impression_logs
GROUP BY ad, log_day;</code></code></pre><p>Did you notice the difference between the two? In the second INSERT query, the value for day hasn&#8217;t been specified in the PARTITION clause. Instead, the value for the day partition column comes from the log_day column of the impression_logs table. While the first query only writes into 1 partition, the second query can write into as many partitions as there are days in the impression_logs table.</p><p>The OVERWRITE keyword tells Hive to delete the contents of the partitions into which data is being inserted. This is the key! Hive only deletes data for the partitions it is going to write into. All other partitions remain intact.</p><p>This feature of Hive allows us to develop applications using AWS EMR + S3 and Hive to write partitioned outputs and run backfills/updates on selective partitions. We create ephemeral EMR clusters and load the partitions needed by a job in each run. Partitioning the data by date also allows us to run multiple parallel jobs (in separate EMR clusters), processing data for different dates at the same time. Hive optimizes the disk reads to only load the partitions requested by a query.</p><h2><strong>Problem Statement</strong></h2><p>And this is where Spark differs massively in implementation! When a Spark programmer talks about partitions, they usually mean the partitions into which a Dataset is divided across the cluster. They talk about repartitioning or coalescing the data before writing to disk.</p><p>One of the biggest problems I faced when working on a new project with Spark was the organization of the output data into buckets (Hive partitions) such that an individual bucket (or a group of these buckets) may be overwritten by a batch Spark job. The challenge was that, even though Spark provides an API to write into a format similar to Hive partitions, it either OVERWRITEs all partitions or appends to the partitions. Spark doesn't natively support the same behavior as Hive. 
In the OVERWRITE mode, Spark deletes all the partitions, even the ones it would not have written into.</p><p>Here&#8217;s the Spark code that writes the data into partitions:</p><pre><code><code>impressionsDF
    .write.mode("overwrite")
    .partitionBy("country", "year", "month", "day")
    .json("s3://output_bucket/stats")
</code></code></pre><p>If we run the above code to backfill the data for just 1 day, Spark will delete the data for all other days.</p><h2><strong>Solution</strong></h2><p>After searching around a lot for solutions and trying out various options, I finally settled on using Spark&#8217;s Hive support to write the partitioned data. In Spark 2.1.0, the Spark community fixed a few bugs related to dynamic partition inserts:</p><ul><li><p>SPARK-18183 &#8212; INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition.</p></li><li><p>SPARK-18185 &#8212; Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions</p></li></ul><p>So, if you are using Spark 2.1.0 and want to write into partitions dynamically without deleting the others, you can use the solution below.</p><p>The idea is to register the dataset as a temporary view and then use spark.sql() to run the INSERT query.</p><pre><code><code>// Create SparkSession with Hive dynamic partitioning enabled
val spark: SparkSession =
    SparkSession
        .builder()
        .appName("StatsAnalyzer")
        .enableHiveSupport()
        .config("hive.exec.dynamic.partition", "true")
        .config("hive.exec.dynamic.partition.mode", "nonstrict")
        .getOrCreate()

// Register the DataFrame as a temporary view that spark.sql() can query
impressionsDF.createOrReplaceTempView("impressions_dataframe")

// Create the output Hive table
spark.sql(
    s"""
      |CREATE EXTERNAL TABLE IF NOT EXISTS stats (
      |   ad            STRING,
      |   impressions   INT,
      |   clicks        INT
      |) PARTITIONED BY (country STRING, year INT, month INT, day INT)
      |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    """.stripMargin
)

// Write the data to disk as Hive partitions
spark.sql(
    s"""
      |INSERT OVERWRITE TABLE stats
      |PARTITION(country = 'US', year = 2017, month = 3, day)
      |SELECT ad, SUM(impressions), SUM(clicks), day
      |FROM impressions_dataframe
      |GROUP BY ad, day
    """.stripMargin
)
</code></code></pre><p>Spark now writes data partitioned just as Hive would &#8212; which means only the partitions that are touched by the INSERT query get overwritten, and the others are not touched.</p><p>I hope Spark adds this functionality natively, but until then, this is the best solution I have.</p>]]></content:encoded></item><item><title><![CDATA[Parse Json in Hive Using Hive JSON Serde ]]></title><description><![CDATA[In an earlier post, I wrote a custom UDF to read JSON into my table.]]></description><link>https://blog.singhanuvrat.com/p/parse-json-in-hive-using-hive-json</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/parse-json-in-hive-using-hive-json</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Thu, 25 May 2017 04:48:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In an earlier post, I wrote a custom UDF to read JSON into my table. Since then, I have also learned about and used the Hive-JSON-Serde. I will use the same example as before.</p><pre><code><code>{
  "customer": {
    "given_name": "Anuvrat",
    "surname": "Singh"
  },
  "order": {
    "id": "123dfe523gd"
  }
}
</code></code></pre><p>Now, using the Hive-JSON-Serde you can parse the above JSON record as:</p><pre><code><code>create table order_raw(
  customer map&lt;string, string&gt;,
  `order` map&lt;string, string&gt;
) row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '&lt;location-to-file&gt;';

select customer['given_name'], customer['surname'], `order`['id']
from order_raw;
</code></code></pre><p>This is really great! I can now parse more complicated JSON without writing any UDF. The fun doesn&#8217;t stop here. We can now handle JSON that has nested arrays and maps, and parse it using lateral view along with the explode() function provided by Hive. Let&#8217;s take another example to demonstrate the power of Hive-JSON-Serde.</p><p>Say you have a startup in the transportation domain, like Uber (which is awesome btw!). You have the travel details of customers in JSON format, and you want to analyze them. Each JSON record contains the customerId, age, and a list of trips taken by the customer. Each trip has a tripId, distanceTravelled, fare, and referrals. Each trip might have 1 or more referral codes assigned to it.</p><p>You get the picture? Cool. Let&#8217;s see a sample JSON.</p><pre><code><code>{
  "customerId": "0277ZGAX80PG6ZSJ04J5",
  "age": 23,
  "services": [
    {
      "trips": [
        {
          "tripId": "A12-5678344-4097746",
          "fare": 24.0,
          "distanceTravelled": 3.2,
          "referrals": {
            "email_campaign": {
              "campaignA": {
                "referralIds": ["0ZK7V4HM5ZZNKJ0PRRR5"]
              }
            }
          }
        }
      ]
    }
  ]
}
</code></code></pre><p>Okay, so the JSON looks a little complicated. But that was my aim. Let&#8217;s say the schema for your startup was designed in a way to allow you to add different services like cargo transport, etc. Trips are the taxi component. And you might want to track and attribute each ride to email campaigns, special offers, etc.</p><p>Now let&#8217;s create a table for this JSON using the Hive-JSON-Serde.</p><pre><code><code>create table cust_trips (
    customerId  string,
    age int,
    services array&lt;struct&lt;
        trips:array&lt;struct&lt;
            tripId:string,
            fare:double,
            distanceTravelled:double,
            referrals:struct&lt;
                email_campaign:map&lt;string, struct&lt;referralIds:array&lt;string&gt;&gt;&gt;
            &gt;
        &gt;&gt;
    &gt;&gt;
) row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '&lt;location-of-files&gt;';
</code></code></pre><p>Awesome!! Designing the table was the hardest part, where all the magic happens. If done properly, extracting data is smooth sailing. Let&#8217;s get the count of customers referred by each referralId. For this, we need to write a query where we group all the records by referralId and take the count of distinct customerIds. Remember that referralId is nested deep in the JSON, within multiple arrays and a map.</p><p>We will use the explode() function to explode an array into different rows. For example, a single record A, [1, 2] in a Hive table can be exploded into 2 records: A, 1 and A, 2. Sweet, isn&#8217;t it? I&#8217;d suggest you read up on it if you don&#8217;t already know about it.</p><p>Here&#8217;s the query that answers the question posed above. To keep it short, it uses only the first referralId of each campaign.</p><pre><code><code>select
    campName,
    refIdArr.referralIds[0],
    count(distinct customerId)
from cust_trips ct
    lateral view explode(ct.services) v1 as s
    lateral view explode(s.trips) v2 as t
    lateral view explode(t.referrals.email_campaign) v3 as campName, refIdArr
group by
    campName,
    refIdArr.referralIds[0];
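
-- A variant sketch (an assumption, not the original work query): the query
-- above reads only the first entry of each referralIds array. Adding one more
-- lateral view explodes that array as well, so every referralId is counted.
select
    campName,
    refId,
    count(distinct customerId)
from cust_trips ct
    lateral view explode(ct.services) v1 as s
    lateral view explode(s.trips) v2 as t
    lateral view explode(t.referrals.email_campaign) v3 as campName, refIdArr
    lateral view explode(refIdArr.referralIds) v4 as refId
group by
    campName,
    refId;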
</code></code></pre><p>Note how simple and easy the query is!</p><p><em>PS: I hope I haven&#8217;t made any error writing it. Since I cannot share the original query I wrote for work, I had to build an example along a similar line and copy the query with variable names changed.</em></p>]]></content:encoded></item><item><title><![CDATA[Writing UDF To Parse JSON In Hive ]]></title><description><![CDATA[Every so often, we need to perform data transformation in ways too complicated for SQL (even with the Custom UDF&#8217;s provided by hive).]]></description><link>https://blog.singhanuvrat.com/p/writing-udf-to-parse-json-in-hive</link><guid isPermaLink="false">https://blog.singhanuvrat.com/p/writing-udf-to-parse-json-in-hive</guid><dc:creator><![CDATA[Anuvrat Singh]]></dc:creator><pubDate>Wed, 05 Apr 2017 04:51:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FlqZ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2dc7865-7004-4399-973f-5caba76958d6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every so often, we need to perform data transformation in ways too complicated for SQL (even with the Custom UDF&#8217;s provided by hive). Let&#8217;s take JSON manipulation as an example.</p><p>JSON is widely used to store and transfer data. Hive comes with a built-in json_tuple() function that can extract values for multiple keys at once. But if you have a nested JSON, the query using json_tuple() can get messy quickly. Say we have the following JSON:</p><pre><code><code>{
  "customer": {
    "given_name": "Anuvrat",
    "surname": "Singh"
  },
  "order": {
    "id": "123dfe523gd"
  }
}
</code></code></pre><p>If we were to use the json_tuple() function of Hive, we would write something like:</p><pre><code><code>create table json_table (
  record string
);

select
  v2.given_name,
  v2.surname,
  v3.id
from json_table jt
  lateral view json_tuple(jt.record, 'customer', 'order') v1 as customer, `order`
  lateral view json_tuple(v1.customer, 'given_name', 'surname') v2 as given_name, surname
  lateral view json_tuple(v1.`order`, 'id') v3 as id;
</code></code></pre><p>Which, you have got to agree, looks hideous!</p><p>Instead, we could quickly write a script (Python being my fav.) that does the transformation for us. The input to the script is a single JSON record from the table, and the output of the script should be tab-separated values. The values can be operated upon or inserted into another table using Hive.</p><p>Here&#8217;s the script. It calls get_order_details() for each input line, which extracts the values we are interested in and prints them with tab as the separator.</p><pre><code><code>#!/usr/bin/python
import sys
import json

def get_order_details(order_record):
  json_record = json.loads(order_record)

  customer = json_record['customer']
  order = json_record['order']

  values = [customer['given_name'], customer['surname'], order['id']]

  # Emit one tab-separated row; Hive splits it on tabs into output columns
  print('\t'.join(map(str, values)))

for line in sys.stdin:
    line = line.strip()
    get_order_details(line)
</code></code></pre><p>In the Hive script, we first add the file and then call it using the transform clause. Here&#8217;s a neat-looking script that does what our previous Hive script did with json_tuple().</p><pre><code><code>add file 'get_order_details_mapper.py';

select transform ( record )
using 'python get_order_details_mapper.py'
as given_name, surname, order_id
from json_table;
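
-- A sketch of the "inserted into another table" case mentioned earlier.
-- order_details is a hypothetical table name, assumed to already exist
-- with three string columns matching the script's output.
insert overwrite table order_details
select transform ( record )
using 'python get_order_details_mapper.py'
as given_name, surname, order_id
from json_table;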
</code></code></pre><p>That&#8217;s it! Neat and simple.</p><p>In my particular case the keys were variables and I just could not have used json_tuple() to extract info from json.</p>]]></content:encoded></item></channel></rss>