<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[The Pragmatic Engineer]]></title><description><![CDATA[Observations across the software engineering industry.]]></description><link>https://blog.pragmaticengineer.com/</link><image><url>https://blog.pragmaticengineer.com/favicon.png</url><title>The Pragmatic Engineer</title><link>https://blog.pragmaticengineer.com/</link></image><generator>Ghost 6.39</generator><lastBuildDate>Mon, 18 May 2026 11:55:35 GMT</lastBuildDate><atom:link href="https://blog.pragmaticengineer.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The Pulse: Did capacity shortages turn Anthropic hostile to devs?]]></title><description><![CDATA[For the past few weeks, Anthropic has continually upset devs with its “dumber” model, and by removing Claude Code access from some paid accounts. After securing lots of compute from SpaceX, could the reason have been to conceal capacity issues?]]></description><link>https://blog.pragmaticengineer.com/the-pulse-did-capacity-shortages-turn-anthropic-hostile-to-devs/</link><guid isPermaLink="false">6a05f20bbfd90c000141d3d3</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Thu, 14 May 2026 16:10:59 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of five topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-did-capacity-shortages?ref=blog.pragmaticengineer.com" rel="noreferrer"><em><u>last week&#x2019;s The Pulse</u></em></a><em> issue. Full subscribers received the article below seven days ago. 
If you&#x2019;ve been forwarded this email, you can</em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com"><em> <u>subscribe here</u></em></a><em>.</em></p><p>Last week, we reported on Anthropic <a href="https://newsletter.pragmaticengineer.com/i/196004322/2-anthropics-speed-run-to-break-peoples-goodwill?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">seemingly being on a speed run</a> to break devs&#x2019; goodwill by silently &#x201C;nerfing&#x201D; Claude Code, banning corporate accounts without warning, and a weird growth experiment involving revoking Claude Code and then restoring it. This week, a dev on the $20/month Pro plan had Claude Code removed just days into their subscription:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-7.png" class="kg-image" alt loading="lazy" width="1182" height="960" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image-7.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image-7.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-7.png 1182w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Claude Code turned out to be a trial for seven days for some paying customers. 
Source: </em></i><a href="https://x.com/jgeigerm/status/2051142221702087149?s=20&amp;ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">Jaime Geiger</em></i></a></figcaption></figure><p><strong>This week, Anthropic announced a big data center expansion and the relaxation of previous usage limits, </strong>while Elon Musk&#x2019;s SpaceX / xAI (a single company after a merger) is renting its entire Colossus 1 data center to Anthropic. From <a href="https://x.ai/news/anthropic-compute-partnership?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">the announcement:</a></p><blockquote>&#x201C;Colossus 1 features over 220,000 NVIDIA GPUs, including dense deployments of H100, H200, and next-generation GB200 accelerators. The cluster delivers extreme parallel performance for large language models, multimodal systems, scientific simulations, and generative AI at frontier scale.<br><br>Anthropic plans to use this additional compute to directly improve capacity for Claude Pro and Claude Max subscribers.&#x201D;</blockquote><p>In parallel with this release, Anthropic <a href="https://x.com/ClaudeDevs/status/2052064938840228237?s=20&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">announced</a>:</p><ul><li>Doubling Claude Code&#x2019;s current 5-hour limits for Pro, Max, Team, and seat-based Enterprise plans</li><li>Removing the peak-hours limit reduction on Claude Code for Pro and Max plans</li><li>Substantially raising API rate limits for Opus models</li></ul><p><strong>Is it possible that capacity issues are what led Anthropic to make Claude worse? </strong>It&#x2019;s confirmed that the company has struggled with capacity for months. Conveniently, Claude Code being &#x201C;nerfed&#x201D; led to lower compute load, while removing Claude Code access from cheap plans could look like rate limiting. 
Even the banning of corporate accounts could be seen as scaling back at a time when the business has struggled to serve existing growth. Yesterday (6 May), at the Code with Claude event hosted by Anthropic, CEO Dario Amodei said:</p><blockquote>&#x201C;We originally planned for 10x growth, and we&#x2019;ve seen something more like 80x growth in revenue and usage over the last period of time.&#x201D;</blockquote><p><strong>SpaceX / xAI renting a good chunk of its capacity to Anthropic is ironic, </strong>considering that xAI (Musk&#x2019;s AI startup) builds Grok, a frontier model and direct rival of Claude, and that in January, Anthropic banned xAI developers from using Claude. As <a href="https://newsletter.pragmaticengineer.com/i/184676515/xai-devs-used-claude-for-coding-and-got-cut-off?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">covered at the time:</a></p><blockquote>&#x201C;It&#x2019;s common for an AI lab to not allow another AI lab to use its model, like at OpenAI, Anthropic, and Google. On the other side, there&#x2019;s also the pertinent question of why a leading AI lab would even want to use a rival for its own day-to-day work?<br><br>Turns out, xAI (Elon Musk&#x2019;s AI lab) was relying on Cursor to write code, which we know because they got cut off.&#x201D;</blockquote><p>Anthropic likely banned xAI to stop Claude from being potentially <a href="https://labelbox.com/guides/model-distillation/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">distilled</a> while xAI tried to improve Grok&#x2019;s coding capability. Meanwhile, Musk <a href="https://x.com/FredLambert/status/2052166477818839416?s=20&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">called</a> Anthropic &#x201C;misanthropic and evil&#x201D; earlier this year, and said the new tenant &#x201C;hates Western civilization&#x201D;. 
But both parties seem happy to put that behind them and strike a deal, so perhaps there&#x2019;s something else at play.</p><p><strong>Could SpaceX / xAI be checking out of the frontier-AI model wars? </strong>Leasing a good chunk of its data center capacity might suggest that. SpaceX / xAI has two data centers: Colossus 1 and Colossus 2. Colossus 1 represents somewhere <a href="https://x.com/tanayj/status/2052078899744714908?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">around</a> 45% of current SpaceX / xAI capacity, and 20-25% of planned total capacity.</p><p>Giving up as much capacity as this might indicate a lack of demand, or capacity sitting idle. It may also mean Grok is losing market share to Claude, ChatGPT, and other leading models. <em>In </em><a href="https://newsletter.pragmaticengineer.com/i/189777574/3-popular-models?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>February&#x2019;s AI tooling survey</em></a><em> we found scarce mention of Grok, which lagged in usage behind open models like DeepSeek and Qwen.</em></p><p>To be fair, unlike Anthropic and OpenAI, Grok never had a B2C or B2B business that took off. The biggest consumer use case for Grok seems to be its integration into the social media platform X; at least, I don&#x2019;t know of any tech company using the model for serious work.</p><p><strong>&#x201C;The enemy of my enemy is my friend&#x201D;, says the maxim, </strong>and if there&#x2019;s one company Musk hates, it&#x2019;s OpenAI. He is currently suing OpenAI, claiming it betrayed its founding nonprofit mission to develop safe AGI for humanity&#x2019;s benefit by shifting to a profit-driven model backed by Microsoft. Musk also claims that despite investing about $40M, he has no ownership of the company.</p><p>He wants $150B in damages, the removal of Sam Altman and Greg Brockman, and for OpenAI to return to being a full nonprofit, as it was when he invested in the company. 
<em>We covered more about OpenAI&#x2019;s own ethical challenges between nonprofit and for-profit right after the firing of Sam Altman in 2023, in the deepdive </em><a href="https://newsletter.pragmaticengineer.com/p/what-is-openai?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>What is OpenAI, really?</em></a></p><p>Similarly, Anthropic may well have an issue with OpenAI, if CEO Dario Amodei&#x2019;s failure to join hands with Sam Altman while sharing a stage with the Prime Minister of India earlier this year is anything to go by.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-8.png" class="kg-image" alt loading="lazy" width="1280" height="853" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image-8.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image-8.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-8.png 1280w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">(Most) AI leaders join hands at the AI Impact Summit with India&#x2019;s Prime Minister. Source: </em></i><a href="https://fortune.com/2026/02/19/openai-anthropic-sam-altman-dario-amodei-refused-to-hold-hands-ai-super-bowl-ad-war-ceos-big-tech-conflict/?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">Fortune</em></i></a></figcaption></figure><p>Capacity issues hurting Anthropic would benefit OpenAI, and so by offering significant capacity to Anthropic, Musk is making it harder for OpenAI to win the market. 
That would be ironic, given he&#x2019;s a former investor.</p><p><em>Read the full issue of </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-did-capacity-shortages?ref=blog.pragmaticengineer.com" rel="noreferrer"><em><u>last week&#x2019;s The Pulse</u></em></a><em>, or check out </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-forward-deployed-engineering?ref=blog.pragmaticengineer.com" rel="noreferrer"><em><u>this week&#x2019;s The Pulse</u></em></a><em>. This week&#x2019;s issue covers:</em></p><ol><li><strong>Forward deployed engineering heats up again.&#xA0;</strong>Massive demand for the role at Google, OpenAI, and Anthropic. The latest version of the FDE role looks like the consultant / solution architect role done by many early-career engineers.</li><li><strong>Why are layoffs spiking?&#xA0;</strong>Tech job cuts are at their highest since early 2023, for various reasons: smaller teams prompt reorgs and reduce the need for middle management. Meanwhile, poorly performing companies make layoffs without the influence of AI.</li><li><strong>New trend: self-reporting 100% AI generated code at Microsoft.&#xA0;</strong>With mid-year performance reviews looming, some managers advise their reports to claim they use AI for everything.</li><li><strong>Industry Pulse.&#xA0;</strong>Tokenmaxxing at Amazon, too; SaaS companies grow faster than before &#x2013; perhaps partly due to AI; Bun rewritten in Rust with AI works well; Anthropic overtakes OpenAI in enterprise spend; and more.</li><li><strong>Vibe coding &amp; agentic engineering get uncomfortably close</strong>. 
A relatable observation by software engineer, Simon Willison, about reviewing AI agents&#x2019; code less than would be ideal.</li></ol>]]></content:encoded></item><item><title><![CDATA[TechPays has been acquired by Levels.fyi]]></title><description><![CDATA[<p><em>tl;dr:</em><a href="https://techpays.com/?ref=blog.pragmaticengineer.com"><em> <u>TechPays</u></em></a><em> is joining</em><a href="https://www.levels.fyi/?ref=blog.pragmaticengineer.com"><em> <u>Levels.fyi</u></em></a><em>: so the leading tech salary site in Europe gets the love and care it deserves. Thanks to</em><a href="https://www.linkedin.com/in/rzsombor/?ref=blog.pragmaticengineer.com"><em> <u>Zsombor</u></em></a><em> for building this project with me for so many years.</em></p><p>Pay transparency has always been an issue in tech, <em>especially</em> in Europe. For a while,</p>]]></description><link>https://blog.pragmaticengineer.com/techpays-has-been-acquired-levels-fyi/</link><guid isPermaLink="false">6a0348f9c21cab000134339f</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Tue, 12 May 2026 16:06:08 GMT</pubDate><content:encoded><![CDATA[<p><em>tl;dr:</em><a href="https://techpays.com/?ref=blog.pragmaticengineer.com"><em> <u>TechPays</u></em></a><em> is joining</em><a href="https://www.levels.fyi/?ref=blog.pragmaticengineer.com"><em> <u>Levels.fyi</u></em></a><em>: so the leading tech salary site in Europe gets the love and care it deserves. Thanks to</em><a href="https://www.linkedin.com/in/rzsombor/?ref=blog.pragmaticengineer.com"><em> <u>Zsombor</u></em></a><em> for building this project with me for so many years.</em></p><p>Pay transparency has always been an issue in tech, <em>especially</em> in Europe. For a while, I assumed that the most that a senior+ software engineer could make in London or Amsterdam would be in the realm of &#xA3;100K / &#x20AC;100K. Once you reach that level, you&apos;ve made it. You&#x2019;re now at the very top of the market! 
<em>Or are you?</em></p><p>So when I was making &#xA3;93K in London, working as a principal engineer at Skyscanner in 2016, I was not expecting that I could be compensated meaningfully better. Pay surveys kept confirming that I was well above the median, and in the 90th percentile of pay grades.</p><p>Imagine my surprise when I got an offer from Uber, in Amsterdam, that effectively doubled my compensation, into the realm of around &#x20AC;220-250K ($260-295K). By year four, I made &#x20AC;283K ($332K):</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/Screenshot-2026-05-12-at-17.46.16.png" class="kg-image" alt loading="lazy" width="1210" height="1358" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/Screenshot-2026-05-12-at-17.46.16.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/Screenshot-2026-05-12-at-17.46.16.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/Screenshot-2026-05-12-at-17.46.16.png 1210w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">My total compensation at Uber, per year, 2016-2019. Blue is base salary, yellow is equity, green is cash bonus. Note how by year 5 (2020), my compensation dropped to below year 2, thanks to hitting my 4-year vesting cliff for the initial equity grant.</span></figcaption></figure><p><strong>It felt like I had discovered a &quot;secret&quot; upper tier of the market that no one else knew about. </strong>When I became a manager at Uber, and started hiring for my team, several strong software engineers were hesitant to move forward with the process, because they <em>assumed</em> that they were at the very top of the market &#x2013; but they still made ~half of what we would have offered! 
I had no way of telling them &quot;your data is wrong, this place pays a lot more!&quot; and so several of them just never bothered to interview, assuming the biggest raise they would get would be 5-10%, when they could have potentially doubled their compensation&#x2026;</p><p><strong>I saw first-hand that not having good compensation information works against us, developers, and decided to try and change this. </strong>I collected data points from close to 200 engineers working in the Netherlands, and explained that there&apos;s a third, &quot;hidden&quot; tier of compensation in <a href="https://blog.pragmaticengineer.com/software-engineering-salaries-in-the-netherlands-and-europe/" rel="noreferrer">The Trimodal Nature of Software Engineering Salaries in the Netherlands and Europe</a>.</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-6.png" class="kg-image" alt loading="lazy" width="1600" height="1026" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image-6.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image-6.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-6.png 1600w" sizes="(min-width: 720px) 720px"></figure><p>After the success of the article, I decided to &quot;open source&quot; the compensation data points I had collected, and thus <a href="https://techpays.com/?ref=blog.pragmaticengineer.com" rel="noreferrer">TechPays</a> was born:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/Screenshot-2026-05-12-at-17.53.55.png" class="kg-image" alt loading="lazy" width="2000" height="1318" 
srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/Screenshot-2026-05-12-at-17.53.55.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/Screenshot-2026-05-12-at-17.53.55.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1600/2026/05/Screenshot-2026-05-12-at-17.53.55.png 1600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/Screenshot-2026-05-12-at-17.53.55.png 2336w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">TechPays</span></figcaption></figure><p>I built this site together with <a href="https://www.linkedin.com/in/rzsombor/?ref=blog.pragmaticengineer.com" rel="noreferrer">Zsombor Erd&#x151;dy-Nagy</a>. We took care to support compensation anonymization, capture freelancer compensation, and break down how compensation packages were put together. We&apos;ve received so many heart-warming stories about how you&apos;ve been able to negotiate better compensation packages, thanks to having access to this information.</p><p>Knowing that we&apos;re making a difference kept us going for a few years, as a side project. However, over time, both Zsombor and I got busier with other projects. For me, it was <a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com"><u>The Pragmatic Engineer</u></a> taking up more of my time. We wanted to find a way to keep TechPays running, and for it to get the care it deserves.</p><p><strong>Levels.fyi will be taking over operating TechPays</strong> &#x2013; and will integrate what it learns about European compensation packages into its global pay transparency platform. 
I&apos;ve known Levels.fyi founders <a href="https://www.linkedin.com/in/zuhayeer/?ref=blog.pragmaticengineer.com" rel="noreferrer">Zuhayeer</a> and <a href="https://www.linkedin.com/in/zmohiuddin/?ref=blog.pragmaticengineer.com" rel="noreferrer">Zaheer</a> for years, and we share our drive to make compensation as transparent as possible, across the tech industry.</p><p>With TechPays, there are no changes: you get to browse the data, as before. And expect even more, high-quality data points on Levels.fyi, for Europe, and globally.</p><p>To get more details on compensation, check out Levels.fyi. And read the <a href="https://newsletter.pragmaticengineer.com/p/trimodal?ref=blog.pragmaticengineer.com"><u>Trimodal nature of tech compensation in the US, UK and India</u></a>, based on Levels.fyi data points:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/data-src-image-fb8866c5-3b41-45b6-a0b9-1ebf281adf7d.png" class="kg-image" alt loading="lazy" width="739" height="712" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/data-src-image-fb8866c5-3b41-45b6-a0b9-1ebf281adf7d.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/data-src-image-fb8866c5-3b41-45b6-a0b9-1ebf281adf7d.png 739w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">From the deepdive </span><a href="https://newsletter.pragmaticengineer.com/p/trimodal?ref=blog.pragmaticengineer.com"><u><span class="underline" style="white-space: pre-wrap;">The trimodal nature of tech compensation in the US, UK and India</span></u></a></figcaption></figure>]]></content:encoded></item><item><title><![CDATA[The Pulse: AI load breaks GitHub – why not other vendors?]]></title><description><![CDATA[GitHub’s leadership blames the 3.5x increase in service load as the cause of 
degradation – or it might be self-inflicted.]]></description><link>https://blog.pragmaticengineer.com/the-pulse-ai-load-breaks-github/</link><guid isPermaLink="false">69fccc856562a30001f428bb</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Thu, 07 May 2026 17:33:18 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of four topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-github-breaks?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em> issue. Full subscribers received the article below seven days ago. If you&#x2019;ve been forwarded this email, you can</em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em> subscribe here</em></a><em>.</em></p><p>GitHub&#x2019;s reliability has been beyond unacceptable recently: last month, third party measurements pinned it at <a href="https://newsletter.pragmaticengineer.com/i/192229275/1-does-github-still-merit-top-git-platform-for-ai-native-development-status?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">one nine</a> (right at 90%). 
This month, reliability has been down to <em>zero</em> nines &#x2013; 86% &#x2013; as per <a href="https://mrshu.github.io/github-statuses/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">a third-party tracker</a>, and last week, things got even worse: a frankly embarrassing data integrity incident, more outages, and a partial explanation from GitHub, eventually.</p><h3 id="data-integrity-incident">Data integrity incident</h3><p>Last Thursday (23 April), <a href="https://www.githubstatus.com/incidents/zsg1lk7w13cf?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">this happened</a>: PRs merged via the merge queue using the squash merge method produced incorrect merge commits when the merge group contained more than one PR. Commits were reverted by subsequent merges: basically, commits were &#x201C;lost&#x201D; in the code that was merged!</p><p>Thanks to <a href="https://www.githubstatus.com/incidents/zsg1lk7w13cf?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">a bug</a> GitHub introduced, the service broke its integrity promise that pull requests would be merged as expected when using <a href="https://docs.gitlab.com/user/project/merge_requests/squash_and_merge/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">squash merge</a>, which is a technique typically used to merge multiple small commits into a single, meaningful commit. This is a big deal, as data integrity promises are among the most important ones for services like GitHub.</p><p>A total of 2,092 pull requests were impacted, and companies hit by the outage included Modal and Zipline. Effectively, GitHub pushed a bunch of work onto affected customers, who had to manually untangle and recover lost commits, which GitHub could offer zero assistance with.</p><p>Customers had to manually go through their git history and restore missing code. 
After following manual recovery steps (reverting the squash commit and re-applying commits one by one), all commits should have been recovered.</p><p>GitHub <a href="https://x.com/davidxia_/status/2047642368724120043?s=20&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">later emailed</a> the list of affected commits to customers, but it&#x2019;s odd that GitHub executives seemed to downplay the nature of this outage. After all, an outage that messes with data integrity is a much bigger deal than something like a fall in availability where no data is corrupted.</p><p>Can Duruk, software engineer at Modal, <a href="https://x.com/can/status/2047823390342324572?s=20&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">was unhappy</a> about GitHub&#x2019;s muted response to the outage:</p><blockquote>&#x201C;The COO going out of their way to find a huge denominator to make the impact appear small feels very dishonest; versus a sincere apology about how this invalidates their entire promise to their customers. 
We had to dig into their status page about this to even realize they just casually f***ed up our repo.&#x201D;</blockquote><h3 id="outages-don%E2%80%99t-stop">Outages don&#x2019;t stop</h3><p>On Monday (27 April), pull requests and issues disappeared from GitHub&#x2019;s web UI:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image.png" class="kg-image" alt loading="lazy" width="1456" height="788" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image.png 1456w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Pull requests go missing. Source: </em></i><a href="https://x.com/badlogicgames/status/2048803113683788138?s=20&amp;ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">Mario Zechner</em></i></a></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-1.png" class="kg-image" alt loading="lazy" width="1198" height="298" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image-1.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image-1.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-1.png 1198w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Issues also not to be found. 
Source: </em></i><a href="https://x.com/zeeg/status/2048810616849355252?s=20&amp;ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">David Cramer</em></i></a></figcaption></figure><p>This had to do with an Elasticsearch outage on GitHub&#x2019;s backend: the cluster became overloaded and went down. So, while pull requests, issues, and projects didn&#x2019;t vanish altogether, they also didn&#x2019;t show up during <a href="https://www.githubstatus.com/incidents/ql942tw29yl6?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">the 6-hour-long outage.</a></p><p>There were other outages this week:</p><ul><li><a href="https://www.githubstatus.com/incidents/x69zbgdyfzg0?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Some pull requests not showing up</a> (Tuesday, 28 April)</li><li><a href="https://www.githubstatus.com/incidents/dbypmw7h77l5?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Problems with some GitHub Actions</a> (the same day)</li><li><a href="https://www.githubstatus.com/incidents/x69zbgdyfzg0?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Incomplete pull requests in repositories</a> (Wednesday, 29 April)</li></ul><p>Also on Tuesday (28 April), security firm Wiz <a href="https://www.wiz.io/blog/github-rce-vulnerability-cve-2026-3854?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">disclosed a critical security issue</a>, where a bad actor could get access to all repositories on GitHub and GitHub Enterprise server by using only a <em>git push</em> command. 
<em>GitHub fixed the issue on GitHub.com within six hours, but GitHub Enterprise servers that were not updated remain vulnerable.</em></p><h3 id="famous-open-source-contributor-quits-github-in-frustration">Famous open source contributor quits GitHub in frustration</h3><p>On Tuesday, Mitchell Hashimoto, founder of HashiCorp and creator of Ghostty, announced that GitHub was unfit for professional work and that he was moving Ghostty, the open source terminal that&#x2019;s his main focus, off the platform. Mitchell&#x2019;s reasoning <a href="https://mitchellh.com/writing/ghostty-leaving-github?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">was dead simple</a>: being on GitHub makes him unproductive (emphasis mine):</p><blockquote>&#x201C;The past month I&#x2019;ve kept a journal where I put an &#x201C;X&#x201D; next to every date where a GitHub outage has negatively impacted my ability to work. Almost every day has an X. On the day I am writing this post, I&#x2019;ve been unable to do any PR review for ~2 hours because there is a GitHub Actions outage. <strong>This is no longer a place for serious work if it just blocks you out for hours per day, every day.</strong><br><br>It&#x2019;s not a fun place for me to be anymore. I want to be there, but it doesn&#x2019;t want me to be there. I want to get work done and it doesn&#x2019;t want me to get work done. I want to ship software and it doesn&#x2019;t want me to ship software.<br><br>I want it to be better, but I also want to code. And I can&#x2019;t code with GitHub anymore. I&#x2019;m sorry. After 18 years, I&#x2019;ve got to go. I&#x2019;d love to come back one day, but this will have to be predicated on real results and improvements, not words and promises.&#x201D;</blockquote><p>Mitchell&#x2019;s experience suggests that GitHub&#x2019;s official status page is inaccurate from the point of view of a heavy user like himself. 
The third-party &#x201C;<a href="https://mrshu.github.io/github-statuses/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">missing GitHub status page</a>&#x201D; is likely a better estimate: it puts GitHub&#x2019;s reliability at zero nines, with 85.51% uptime. That means some part of GitHub was down for around 3.5 hours per day, on average, over the last 90 days (!!)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-2.png" class="kg-image" alt loading="lazy" width="1456" height="516" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image-2.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image-2.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-2.png 1456w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Reliability woes: GitHub &#x201C;not a place for serious work.&#x201D; Source: </em></i><a href="https://mrshu.github.io/github-statuses/?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">The Missing GitHub Status Page</em></i></a></figcaption></figure><p>Mitchell&#x2019;s complaint sounds straightforward:</p><ol><li>As a professional software engineer, it&#x2019;s important to have tools that help you get work done</li><li>For months, GitHub has got in the way of his work on open source projects via a flood of outages</li><li>It makes no sense to use a product unfit for professional work</li><li>As GitHub shows no signs of improvement, it&#x2019;s worthwhile to move to a different solution which <em>just works</em></li></ol><h3 id="cto-blames-ai-agent-fuelled-load-spike">CTO blames AI agent-fuelled 
load spike</h3><p>GitHub CTO, Vlad Fedorov, <a href="https://github.blog/news-insights/company-news/an-update-on-github-availability/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">shared an update</a> on why reliability has been terrible for months at GitHub. He identified the load from agents being much bigger than expected as the culprit. Charts illustrating this were shared by GitHub:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-3.png" class="kg-image" alt loading="lazy" width="1200" height="671" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image-3.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image-3.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-3.png 1200w" sizes="(min-width: 720px) 720px"></figure><p>This chart looks eye-catching &#x2013; but there&#x2019;s just one tiny issue: no Y axis! So, while it tells the story of the load going up slowly and then very fast, we&#x2019;re not told by how much. 
However, I managed to get data from GitHub, and below is the chart showing the actual load increase over two years:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-4.png" class="kg-image" alt loading="lazy" width="1456" height="913" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image-4.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image-4.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-4.png 1456w" sizes="(min-width: 720px) 720px"></figure><p><strong>A load increase of ~3.5x, spread across two years, doesn&#x2019;t seem so brutal at first glance. </strong>It is nothing like a load increase of 10x in a month, and a good chunk of it occurred in recent months. So, why can&#x2019;t GitHub handle it? In a blog post, Fedorov said:</p><blockquote>&#x201C;A pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. 
At large scale, small inefficiencies compound: queues deepen, cache misses become database load, indexes fall behind, retries amplify traffic, and one slow dependency can affect several product experiences.&#x201D;</blockquote><p>Here&#x2019;s how the per-second load numbers from January 2023 and today compare:</p><figure class="kg-card kg-image-card"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-5.png" class="kg-image" alt loading="lazy" width="1456" height="633" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/05/image-5.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/05/image-5.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/05/image-5.png 1456w" sizes="(min-width: 720px) 720px"></figure><p>GitHub took 15 years to achieve the 2023 numbers, and maybe it expected to continue growing in a comparable way in the future. If so, some engineering decisions about long-term infrastructure improvements would have been made obsolete by the arrival of AI agents.</p><p><strong>To add to GitHub&#x2019;s challenges, the company is in the midst of a migration from its own data centers &#x2192; Azure. </strong>In October last year, GitHub <a href="https://thenewstack.io/github-will-prioritize-migrating-to-azure-over-feature-development/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">started to</a> move over to Azure &#x2013; a project expected to take 12 months &#x2013; because it already had constraints on its own data center capacity.</p><p>Such large-scale infrastructure migrations are hard enough when the load on a service is relatively stable; just making sure nothing breaks takes a lot of effort. But moving at a time when load is spiking means that bugs can cause more visible outages. 
Of course, GitHub can secure a lot more compute capacity on Azure, now that it knows what to expect.</p><p><strong>But other major companies prepared for a 10x increase in infra load, so why not Microsoft / GitHub? </strong>A year ago, I did research on how Big Tech was preparing to respond to the impact of AI on their business. Google was improving its internal systems to accommodate a 10x increase in load. As we covered in The Pragmatic Engineer, <a href="https://newsletter.pragmaticengineer.com/i/167269400/google?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">in July last year:</a></p><blockquote>&#x201C;Google is preparing for 10x more code to be shipped. A former Google Site Reliability Engineer (SRE) told me:<br><br>&#x201C;What I&#x2019;m hearing from SRE friends is that they are preparing for 10x the lines of code making their way into production.&#x201D;<br><br>If any company has data on the likely impact of AI tools, it&#x2019;s Google. 10x as much code generated will likely also mean 10x more: code review, deployments, feature flags, source control footprint and, perhaps, even bugs and outages, if not handled with care.&#x201D;</blockquote><p>Predictions of enormous load increases were not secret knowledge within the industry, yet it seems GitHub was blissfully ignorant of their potential size. According to Vlad, GitHub did <em>eventually</em> plan for a need to increase capacity by 10x, but this was in October 2025, months later. By February 2026, the company had raised that expectation to 30x. <a href="https://github.blog/news-insights/company-news/an-update-on-github-availability/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">He wrote</a>:</p><blockquote>&#x201C;We started executing our plan to increase GitHub&#x2019;s capacity by 10X in October 2025 with a goal of substantially improving reliability and failover. 
By February 2026, it was clear that we needed to design for a future that requires 30X today&#x2019;s scale.&#x201D;</blockquote><p>There&#x2019;s also the question of whether GitHub miscalculated how much time it had to prepare for explosive load growth, and whether it was caught off guard when that growth materialized months sooner than expected at the start of this year.</p><p><strong>Given GitHub only started to prepare for a major load increase in October, its current problems are unsurprising. </strong>At the scale of GitHub, it&#x2019;s common enough for each team owning a service to plan a year ahead on how much load their service will have, and hardware resources like storage, VMs, and networking are allocated accordingly. Load planning can account for up to half of the preparations, and when reality doesn&#x2019;t conform to plans, some systems can struggle to scale up.</p><p>So, on one hand, dealing with a 3.5x increase in load over 2 years should not be such a big deal for most services; especially not ones which can be horizontally scaled (when there&#x2019;s not much state, and scaling is achieved simply by adding new nodes.) But GitHub probably stores a lot more state with pull requests, workflows, projects, etc. This probably makes scaling more tricky when it comes to databases and systems running workflows.</p><p><strong>GitHub also has 18 years of tech debt on its hands, and thousands of staff to align as &#x201C;organizational overhead.&#x201D; </strong>As its service load grows faster than before, responding is harder due to all that accumulated &#x201C;debt&#x201D;:</p><ul><li>Tech debt: many systems at the company are 10+ years old and are likely patched up, making them more difficult and risky to change</li><li>Organizational debt: around 4,000 people work at GitHub, of whom 1,000 are engineers. 
Teams have dependencies with each other, and even seemingly simple work can require dozens of engineers to work together</li><li>Customer expectations: GitHub cannot break customer workflows, even if doing so would mean changes to systems happen faster</li></ul><p>GitHub finds itself in the &#x2018;innovator&#x2019;s dilemma&#x2019;: the company became successful because it built developer workflows that made sense, pre-AI, and it used to be able to accurately forecast service load changes. But now that engineering teams&#x2019; workflows include AI agents, GitHub&#x2019;s own workflows are not necessarily the best fit, and the company failed to forecast service-level changes.</p><h3 id="other-vendors-floored-by-ai-load-not-really">Other vendors floored by AI load? Not really</h3><p>One thing that doesn&#x2019;t add up about the situation is that other vendors who are presumably experiencing similar load spikes don&#x2019;t appear to be suffering with reliability issues as much. Vercel, Linear, Resend, Railway, Sentry, and other infra providers see record-level growth thanks to AI, but keep up with the load.</p><p>Yes, it&#x2019;s true that AI vendors like Anthropic, OpenAI, and Cursor have some reliability issues, but it&#x2019;s not at the scale of GitHub&#x2019;s. GitHub&#x2019;s direct competitors, GitLab and Bitbucket, presumably see load going up similarly, but they&#x2019;re not going down as much.</p><p><strong>An obvious question is how much of GitHub&#x2019;s pain is self-inflicted? </strong>With Microsoft as owner, it has more resources at its disposal than any competitor or startup, and yet failed to predict load increases and is too big to respond with the nimbleness of a startup.</p><p>It&#x2019;s undeniable that solving for a major load increase is a hard challenge; it&#x2019;s when the difference between average and standout engineering teams is apparent. 
GitHub hasn&#x2019;t been responding like a world-class engineering org.</p><h3 id="github-alternatives">GitHub alternatives?</h3><p>Every regular user of GitHub feels the pain of ongoing outages. As a dev, you can either hope Microsoft will <em>eventually</em> improve reliability, or seek alternatives. As covered above, Mitchell has chosen to quit and is currently deciding where to take Ghostty.</p><p>The obvious alternatives are GitHub&#x2019;s biggest competitors, GitLab and Bitbucket. Both offer Git hosting, and neither comes with the uptime woes that GitHub is suffering from.</p><p><strong>Self-hosted </strong>solutions are also an option, like self-hosting your Git repo, or going with a <strong>self-hosted forge </strong>like <a href="https://tailscale.com/blog/self-hosted-git-server-tailscale-forgejo?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Forgejo</a>, which is an open source, local-first GitHub alternative.</p><p>I also suspect that, soon enough, we&#x2019;ll see startups offering GitHub-like code hosting capabilities, while offering more robust uptime and being architected to handle the 30x-or-more scale which GitHub hopes one day to support.</p><p><em>Read the full issue of </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-github-breaks?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em>, or check out </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-did-capacity-shortages?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>this week&#x2019;s The Pulse</em></a><em>. 
This week&#x2019;s issue covers:</em></p><ol><li>Did Anthropic turn hostile on devs because capacity was running low?</li><li>Amazon finally allows Claude Code and Codex usage</li><li>Meta forcibly assigns engineers to data labelling ahead of job cuts</li><li>New trend: small &#x201C;AI-forward&#x201D; teams</li><li>Industry Pulse: why Meta tracks employees&#x2019; computer activity, OpenAI starts to move off Datadog, Apple lets slip it uses Claude Code, GitHub &#x2192; Xbox transfers at Microsoft, VS Code inserted &#x201C;co-authored by Copilot&#x201D; even when Copilot did nothing, analysis of the Coinbase layoffs</li></ol>]]></content:encoded></item><item><title><![CDATA[The Pulse: token spend breaks budgets – what next?]]></title><description><![CDATA[In the past 2-3 months, spending on AI agents has exploded at many tech companies. Here are details from 15 of them, including the different ways they are coping with this realization.]]></description><link>https://blog.pragmaticengineer.com/the-pulse-token-spend-breaks-budgets-what-next/</link><guid isPermaLink="false">69f36c81ac26b70001aa306d</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Thu, 30 Apr 2026 14:52:36 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of three topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-ai-token-spending-out-of?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em> issue. Full subscribers received the article below seven days ago. 
If you&#x2019;ve been forwarded this email, you can</em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em> subscribe here</em></a><em>.</em></p><p>Last week, we covered <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-tokenmaxxing-as-a-weird?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">the slightly perverse trend of &#x201C;tokenmaxxing&#x201D;</a> across the industry, where devs run agents with the sole aim of boosting their personal &#x201C;token stats&#x201D; in an effort to rank higher on internal token leaderboards, and not be seen as a Luddite who doesn&#x2019;t use AI tools enough compared to peers.</p><p>This week, I spoke with a software engineer at a large company and another at a seed-stage place. Both shared almost identical stories: at their latest all-hands, company leadership expressed concerns about the fast-rising costs of tokens. At both places, token spend has increased by ~10x in the last six months &#x2013; with no signs of slowing down.</p><p>I wanted to find out more about this trend, so I talked to devs at 15 businesses. Below is what I learned about what&#x2019;s happening in workplaces of all sizes. Names are anonymized.</p><h2 id="large-companies">Large companies</h2><h4 id="setting-the-default-model-to-a-cheaper-one-10000-person-saas-company-offices-on-all-continents">Setting the default model to a cheaper one: 10,000+ person SaaS company, offices on all continents</h4><p>Inside a large SaaS company, most devs use an internal background coding tool. This tool defaults to Claude Sonnet, which is the cheaper Claude model. Model selection is not persisted, so devs who prefer working with Opus, for instance, must reselect it on every subsequent startup.</p><p>The tool supports all major frontier models such as Sonnet, Opus, GPT, and Gemini. 
Devs at the company whom I talked to are very heavy users of the tool and have not encountered usage limitations.</p><h4 id="fintech-company-us-series-d-8000-people-staff-engineer">Fintech company, US, Series D, ~8,000 people. Staff engineer:</h4><blockquote>&#x201C;The cost in token spend is off the charts &#x2013; and leadership has shared this trend with us. They have not said anything beyond showing growth in spend, and mentioning that this won&#x2019;t be sustainable. So, nothing specific yet, but my sense is that something will have to change. Limits or prioritizing cheaper models, cutting back on hiring? Who knows.&#x201D;</blockquote><h4 id="infra-company-us-publicly-traded-5000-people-engineering-director">Infra company, US, publicly traded, ~5,000 people. Engineering Director:</h4><blockquote><strong>&#x201C;We&#x2019;re monitoring but not restricting.</strong> We are spot checking the heaviest users, but we are seeing the business cases working out.<br><br>We are offering some guidance on model selection - e.g., turn off the new high-effort setting in Claude. Some users are trying open source models &#x2013; but open source model usage is a bottom-up initiative, not a top-down one.&#x201D;</blockquote><h4 id="information-technology-us-10000-people-director-of-engineering">Information technology, US, 10,000+ people. Director of Engineering:</h4><blockquote>&#x201C;We have already had to raise our API budget limits multiple times in April. We recently switched to a much higher-effort level for Claude, which significantly increased the cost per PR.<br><br><strong>One reason for the cost spike is using state-of-the-art models for demanding tasks.</strong> We are using that high-effort setting even for fairly trivial tasks that could have been handled by much cheaper models, or even by lower-effort Claude loops. 
Despite a few of us pointing this out, leadership has basically said budget is not the concern right now.<br><br>I sense that the budget increase has not been forecasted, and we&#x2019;re in for a reckoning.<strong> </strong>I suspect the attitude changes once finance and other cost-conscious parts of the org realize we are spending hundreds of dollars per day, per highly-engaged developer. For now, fear of missing out and not wanting to fall behind seems to be outweighing cost discipline.&#x201D;</blockquote><h4 id="games-studio-useurope-5000-people-senior-developer">Games studio, US+Europe, ~5,000 people. Senior developer:</h4><blockquote>&#x201C;What budget increase? It&#x2019;s very hard to get a budget for AI here! Claude Code is still not rolled out because $200/month/dev is seen as too high a cost. I talk with people at startups where $1,000/month in spending is totally normal, and it&#x2019;s night and day here.&#x201D;</blockquote><h4 id="fintech-company-useurope-late-stage-5000-people-staff-engineer">Fintech company, US+Europe, late stage, ~5,000 people. Staff engineer:</h4><blockquote><strong>&#x201C;Some developers are now spending $500 a day (!!) on Claude Code.</strong> Practically speaking, this means that employee costs have doubled. Productivity has increased, in my view, but now the bottleneck is code reviews. AI can spit out code quite quickly, but we still have human reviews in place. Leadership encourages using AI for code review, but my team will not blindly trust AI.<br><br>The push from AI is coming from the top. This year&#x2019;s performance review had a section on AI, rating devs by how well they used AI, so this is another reason everyone just uses it as much as they can.&#x201D;</blockquote><h2 id="mid-sized-companies">Mid-sized companies</h2><h4 id="saas-industry-us-2000-people-dev-productivity-lead">SaaS industry, US, ~2,000 people. 
Dev Productivity Lead:</h4><blockquote>&#x201C;Model routing helped keep our costs growing less dramatically. For example, changing the default model reduced cost by 30%. This is our strategy with AI spend, summarized:<br><br><strong>Short term: spend, spend, spend!</strong> Experiment and use whatever models make sense.<br><br><strong>Measure the impact.</strong> Measure key outcomes and report on spend, monthly.<br><br><strong>When spend vs results diverge: adjust. </strong>When our spend increases dramatically, but outcomes don&#x2019;t follow: see what we can do to adjust the delta. More spend should mean better outcomes. If not, we are doing something wrong.&#x201D;</blockquote><h4 id="finance-industry-us-2000-people-vp-of-ai">Finance industry, US, ~2,000 people. VP of AI:</h4><blockquote>&#x201C;We have Cursor and Claude Desktop, both of which have around 800-1,200 total users. Token usage is growing somewhat unexpectedly. Estimates are being adjusted on the fly; the initial plan to have strict limits (say, $100 per user) is breaking when reality hits, and people exhaust them in 3-5 working days.<br><br>Using expensive models is a problem. In regards to Cursor, many devs are defaulting to the most expensive models without realizing that going with Opus gives single percentage gains in intelligence compared to Sonnet, for example, while exhausting their budgets almost immediately.<br><br><strong>We are working on blocking/managing out the most expensive models [with Cursor]</strong>, as going into thousands of dollars per user, per month is not sustainable on our scale. Cursor is a good partner and we&#x2019;re working with them to switch to a &#x201C;pooled spend&#x201D; model where heavy users can tap into a pool of extra spend.<br><br>Claude is a similar story. 
We were at $100 of Claude Desktop limit for everyone, but as we are moving forward, I can see that we would need to go much higher, especially for business-critical use cases.&#x201D;</blockquote><h4 id="infra-company-us-late-stage-700-people-founder">Infra company, US, late-stage, ~700 people. Founder:</h4><blockquote>&#x201C;We haven&#x2019;t had much of an issue. Most folks police themselves for runaway costs; for example, we had someone hit like $10K in a week because they messed up caching, but it was caught and they corrected their harness.<br><br><strong>For the most part, we don&#x2019;t see our high-end folks spending more than ~$1K/week.</strong> Now, to be clear, this is not a small amount! BUT it&#x2019;s already a small subset of the population.<br><br>We&#x2019;re just factoring it into engineering costs at this point: if it&#x2019;s, say, $2K/month per employee, that&#x2019;s $24K per year.<br><br>Who cares, then, when engineers already cost $200-400K/year in cash comp? Okay, so what if it&#x2019;s $5K/month. That&#x2019;s $60K/year.<br><br><strong>Our bet is that token costs will stabilize and we&#x2019;ll eventually end up with local-ish models.</strong><br><br>Now, it could be five years before they stabilize, but overall, spend today isn&#x2019;t that insane to me.<br><br>There&#x2019;s a lot of people who are just dumb about it, but most legit execs push back on this. Take the <a href="https://newsletter.pragmaticengineer.com/i/183931240/ralph-mania?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Ralph loops</a> or other insanity where someone spends $1K/day, $5K/week or stuff like this. 
That&#x2019;s all just people being fools thinking they&#x2019;re doing &#x201C;R&amp;D,&#x201D; or somehow that they&#x2019;re smarter than everyone else, but they&#x2019;re just producing junk that never ships or is not useful.<br><br><strong>We saw a bit of &#x201C;stupid overspend&#x201D; in the first couple months, but that&#x2019;s all gone now. </strong>Costs could go up even more if we would &#x201C;crack the whip&#x201D; in wanting to see even more output, but we&#x2019;re not doing that.&#x201D;</blockquote><h4 id="healthcare-industry-us-500-people-senior-engineering-manager">Healthcare industry, US, ~500 people. Senior engineering manager:</h4><blockquote>&#x201C;<strong>We are not holding back on spend, and have a monthly spend leaderboard.</strong> And we WANT devs to spend more on tokens! For example, one of my engineers spent $1,400 on a long Claude Code session in a single day.<br><br><strong>We are seeing massive leverage, and we do more with the same number of people. </strong>This is why we are okay with our spending spiking. Our traffic is growing more than 10x, year-on-year, and we have managed to keep things running with the same team, and these AI tools.<br><br>Engineering is now blocked on Product and Design &#x2013; which never happened before! This is how fast execution has become. We now have Staff+ engineers writing Product PRDs so we can move faster.<br><br>I&#x2019;ve been in tech for close to 15 years and I never saw dramatic change like this. I just came back after a 3-month break, and every single thing is different in my day! I feel these AI agents are the biggest change in the industry since high-level languages became widespread.&#x201D;</blockquote><h4 id="e-commerce-company-us-europe-2000-devs-head-of-engineering">E-commerce company, US &amp; Europe, ~2,000 devs. Head of Engineering:</h4><blockquote>&#x201C;The increase in spend is INSANE. It&#x2019;s about usage going up, with no signs of stopping. 
Usage is off the charts.<br><br>We currently do not have limits in place, and are not pausing now. Our CEO is AI-pilled and won&#x2019;t let us slow down.<br><br><strong>We do buy tokens at a discount. </strong>They start from 5% and go up with usage with the vendors we use (the usual suspects.)<br><br>We don&#x2019;t let devs use anything lower than Opus 4.7 for coding. Cheaper models might work better, but a slight error pushed to prod would result in hours of toil.&#x201D;</blockquote><h2 id="small-companies">Small companies</h2><h4 id="series-a-us-50-people-principal-engineer">Series A, US, ~50 people. Principal Engineer:</h4><blockquote>&#x201C;About 15 devs are heavy users of AI and costs are rising very fast. Almost everyone uses Claude and Claude Code. We are considering four potential options:<br><br>1. <strong>Increase AI budget, and start measuring more.</strong> Continue doing what we are, but allow devs to use more tokens instead of hiring limits. The precise ROI is hard to quantify, but we&#x2019;ll start to measure and track both AI adoption and impact.<br><br>2. <strong>Optimize token consumption. </strong>Use cheaper models for simpler tasks, review token usage, and see where we can cut usage. Downside: this approach could become one with diminishing returns, fast.<br><br>3. <strong>Integrate more AI providers in the company.</strong> Find wrappers to abstract LLMs. The problem is: how do you replace Claude Code, for instance?<br><br>4. <strong>Pivot to local models:</strong> such as Kimi, Qwen, and so on. The problem is it&#x2019;s a big investment in high-end hardware or cloud GPUs. Upside: it offers better long-term cost control, once done.<br><br>We are likely to go with option #1: increase spend BUT maintain momentum and put the right measurements in place. We can do #2, #3 and #4 later. But if we kill AI usage momentum inside the company, the outcome will probably be worse.&#x201D;</blockquote><h4 id="ai-infra-us-seed-stage-15-people-founder">AI infra, US, seed stage, ~15 people. 
Founder:</h4><blockquote>&#x201C;<strong>We saw a 15x increase in 6 months:</strong><br>Six months ago, our spend per developer was ~$200/month.<br>Today, it&#x2019;s around $3,000/developer/month, for our seven devs.<br><br>We&#x2019;re not slowing usage, especially as we are building an AI infra product. The increase was much faster than expected, though.&#x201D;</blockquote><h4 id="small-bootstrapped-company-europe-founding-engineer">Small, bootstrapped company, Europe. Founding engineer:</h4><blockquote>&#x201C;Our current strategy in dealing with the increase in costs is to switch to a cheaper model; unfortunately, from Opus to Sonnet in our case. That said, Sonnet is quite decent.&#x201D;</blockquote><h3 id="how-businesses-manage-token-spend">How businesses manage token spend</h3><p>Regardless of company size, there seem to be two strategies for how companies deal with increased spending. A summary:</p><p><strong>Strategy #1: &#x201C;let it rip and start measuring.&#x201D; </strong>Around half of respondents say AI spend is rising dramatically, and they have decided to do nothing about it. They <em>want</em> devs to use AI as much as it makes sense to, and to help the work as much as possible.</p><p>However, because the cost is rising dramatically, these companies are now starting to measure usage and attempting to measure the impact of their AI tools.</p><p>There are a few companies where the impact seems to be very positive, already. 
Smaller startups whose business is exploding in numbers of customers, load, and revenue, see that they don&#x2019;t need to hire more staff because existing engineers can keep supporting the growth with AI tools.</p><p><strong>Strategy #2: curb spending.</strong> Commonly mentioned cost-saving approaches:</p><ul><li>Use cheaper models for simpler tasks</li><li>Set default models to less capable ones</li><li>Set a spending cap and make it hard for engineers to exceed it, or require consent for doing so</li></ul><p>Most companies using strategy #1 have briefly considered going with this approach, but threw it away, because they see this approach as optimizing on the wrong thing: cutting costs before the productivity impact of using state-of-the-art tools is even known!</p><p><strong>Discounts exist when the spend is in the millions of dollars. </strong>I asked several people if they are getting discounts from vendors when buying tokens at scale. There were no exact numbers, but this is what I gathered in aggregate about possible custom agreements:</p><ul><li><strong>Cursor: open to discounts above a few million dollars in spend. </strong>Companies have negotiated discounts with Cursor after crossing $1M of spending. Some companies negotiated tiered discounts from this level, starting at 5% and going higher as their spend goes up.</li><li><strong>Anthropic: no discounts. </strong>I talked with companies spending $5M+ per year on Claude which have received no discounts. If Anthropic offers discounts, it will likely be at a much higher tier.</li><li><strong>All discounts are custom, so try to negotiate &#x2013; it&#x2019;s free! </strong>Pricing discounts are on a per-customer basis, and highly custom. 
The easiest way to see if a discount is available is to ask the vendors!</li></ul><p><em>&#x2014;-</em></p><p><em>Read the full issue of </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-ai-token-spending-out-of?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em>, or check out </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-github-breaks?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>this week&#x2019;s The Pulse</em></a><em>. This week&#x2019;s issue covers:</em></p><ol><li><strong>Load from AI breaks GitHub &#x2013; but why not other vendors? </strong>GitHub&#x2019;s reliability is less than one nine, and getting worse. Prolific open source contributor, Mitchell Hashimoto, is quitting GitHub because he thinks it&#x2019;s not suited for professional work. GitHub&#x2019;s leadership blames the 3.5x increase in service load as the cause of degradation &#x2013; or it might be self-inflicted.</li><li><strong>Anthropic&#x2019;s speedrun to destroy trust.</strong> Anthropic could do no wrong until recently, but in the past month, that&#x2019;s all changed. Silently nerfing Claude Code, banning companies from Claude, and baffling price rises all add to a sense that Anthropic is in its &#x201C;extraction&#x201D; era of generating more revenue for the same or worse service.</li><li><strong>Industry pulse. 
</strong>Dramatic price increases at GitHub Copilot, explosive growth at Codex, Google scrambling to build a good coding model, Cursor might be bought by SpaceX, AI agent deletes car business, and more.</li><li><strong>Mitchell Hashimoto &amp; the &#x201C;building block economy</strong>.<strong>&#x201D; </strong>Ghostty&#x2019;s creator finds that open source &#x201C;building blocks&#x201D; are the best way to win massive adoption by software components &#x2013; but it&#x2019;s got harder to build a business on top of open building blocks.</li></ol>]]></content:encoded></item><item><title><![CDATA[The Pulse: ‘Tokenmaxxing’ as a weird new trend]]></title><description><![CDATA[At Meta, Microsoft, Salesforce and other large companies, devs are purposefully burning tokens (and money!) to inflate their AI usage and hit AI usage metrics which they treat as targets.]]></description><link>https://blog.pragmaticengineer.com/the-pulse-tokenmaxxing-as-a-weird-new-trend/</link><guid isPermaLink="false">69ea4ef45d681300012e37e5</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Thu, 23 Apr 2026 16:55:40 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of four topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-tokenmaxxing-as-a-weird?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em> issue. Full subscribers received the article below seven days ago. 
If you&#x2019;ve been forwarded this email, you can</em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em> subscribe here</em></a><em>.</em></p><p>Inside Meta, an engineer created a &#x201C;token leaderboard&#x201D; that ranks employees by token usage. Last week, The Information <a href="https://www.theinformation.com/articles/meta-employees-vie-ai-token-legend-status?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">reported</a>:</p><blockquote>&#x201C;Employees at Meta Platforms who want to show off their AI superuser chops are competing on an internal leaderboard for status as a &#x201C;Session Immortal&#x201D;&#x2014; or, even better, &#x201C;Token Legend.&#x201D;<br><br>The rankings, set up by a Meta employee on its intranet using company data, measure how many tokens &#x2014; the units of data processed by AI models &#x2014; employees are burning through. Dubbed &#x201C;Claudeonomics&#x201D; after the flagship product of AI startup Anthropic, the leaderboard aggregates AI usage from more than 85,000 Meta employees, listing the top 250 power users.<br><br>The practice is emblematic of Silicon Valley&#x2019;s newest form of conspicuous consumption, known as &#x201C;tokenmaxxing,&#x201D; which has turned token usage into a benchmark for productivity and a competitive measure of who is most AI native. Workers are maximizing their prompts, coding sessions and the number of agents working in parallel to climb internal rankings at Meta and other companies and demonstrate their value as AI automates functions such as coding.&#x201D;</blockquote><p>I spoke with a few engineers at Meta about what&#x2019;s happening, and this is what they said:</p><ul><li><strong>Massive waste. </strong>Plenty of devs are running an OpenClaw-like internal agent that burns massive amounts of tokens for little to no outcome.</li><li><strong>Outages caused by AI overuse. 
</strong>A dev mentioned that some SEVs were caused by what looked like careless AI code generation; almost as if the dev behind the SEV was more concerned with churning out massive amounts of code with AI than with product quality.</li><li><strong>Gamified leaderboard. </strong>Those at the top of the leaderboard produce throwaway, wasteful work. This is painfully clear to anyone who checks Trajectories (AI prompts), which can be viewed.</li></ul><p>As per The Information, Meta employees used a total of 60.2 trillion AI tokens (!!) in 30 days. If this were charged at Anthropic&#x2019;s API prices, it would cost $900M. Of course, Meta is likely purchasing tokens at a discount, but that could still come in at $100M+ &#x2013; in large part from senseless &#x201C;tokenmaxxing&#x201D;.</p><p><strong>After backlash on social media, Meta abolished the internal leaderboard last week. </strong>One day after The Information revealed details about the incredible tokenmaxxing numbers, I confirmed that Meta has taken down its leaderboard; perhaps they realized that the incentive created enormous and unnecessary waste. If so, it&#x2019;s a bit surprising that it took media coverage for the social media giant to reach that conclusion.</p><p><strong>One engineer at Meta told me they think Meta had a different goal with the token leaderboard. </strong>A long-tenured engineer suspects increasing AI usage actually was the real goal. They said:</p><blockquote>&#x201C;Putting a leaderboard in place was always going to incentivize much more AI usage. And more AI usage means producing a lot more real-world traces. 
These traces can then be used to train Meta&#x2019;s next-generation coding model better.<br><br>I believe this was the goal, even if no one said it out loud.<br><br>It&#x2019;s an expensive way to generate data for training, but if any company has the means to do so, it&#x2019;s Meta.&#x201D;</blockquote><h3 id="microsoft-full-force-tokenmaxxing"><strong>Microsoft: full-force tokenmaxxing</strong></h3><p>Similarly, Microsoft has had an internal token leaderboard like Meta&#x2019;s since January, and it started pretty well, as I reported back at the time: there&#x2019;s an internal token dashboard that displays the individuals who use the most tokens in order to promote the use of tokens and experimentation with LLMs. At the Windows maker, this leaderboard is interesting:</p><ul><li>Very senior engineers &#x2013; distinguished-level folks &#x2013; are in the top 5 across the whole company, despite the fact that this group generally wrote little code in the past.</li><li>VP-level folks make the top 10 and top 20, despite often being in meetings for most of the day and rarely writing code.</li></ul><p>However, what starts as a metric for performance reviews or promotions can quickly become a target for devs. I talked with a software engineer at the Windows maker who admitted they&#x2019;re full-on &#x201C;tokenmaxxing&#x201D; &#x2013; not to get on the leaderboard, but rather because they don&#x2019;t want to be seen as using too few tokens:</p><blockquote>&#x201C;We have internal dashboards and metrics tracking AI usage, token usage, percentage of code written by AI vs hand-written code.<br><br>I am conscious of not wanting to be seen as &#x201C;uses too little AI,&#x201D; and I&#x2019;m not ashamed to say I need to do tokenmaxxing to do this. Things I do to inflate my token usage metrics:<ul><li>Ask AI questions about the code already in the documentation. The AI pulls up the documentation, processes it, and gives me results 10x slower, but while burning lots of tokens. 
I could use &#x201C;readthedocs&#x201D; [an internal product], but then my token numbers would be lower</li><li>Ask the AI to prototype a feature that I have no intention of working on. Prompt it a few more times, then throw the whole thing away</li><li>Default to always using the agent, even when I know I could do the work by hand much faster. Then watch it fail&#x201D;</li></ul></blockquote><p>This engineer is relatively new at the company and concerned about job security, so they are playing this game, burning far more tokens than necessary, to avoid being tagged as insufficiently &#x201C;AI-native.&#x201D;</p><h3 id="salesforce-burning-tokens-to-hit-%E2%80%9Cminimum%E2%80%9D-%E2%80%9Cideal%E2%80%9D-targets"><strong>Salesforce: burning tokens to hit &#x201C;minimum&#x201D; &amp; &#x201C;ideal&#x201D; targets</strong></h3><p>Elsewhere, Salesforce has created &#x201C;tokenmaxxing&#x201D; incentives as well. Talking with an engineer there, I learned that the company built two tools that effectively incentivize excessive spending on tokens:</p><ol><li><strong>&#x201C;Minimum&#x201D; incentives with a tracking tool.</strong> There&#x2019;s a Mac widget that shows your own spend, updated every 15 minutes. It also displays minimum expected spend. Last week, the target was $100 on Claude Code, and $70 on Cursor.</li><li><strong>Showing everyone&#x2019;s spend. </strong>A web-based tool to see the token spend of any colleague. It&#x2019;s used to check where teammates&#x2019; usage is at.</li><li><strong>&#x201C;Maximum&#x201D; spend limits that can be exceeded. </strong>Up to a week ago, there was also a <em>maximum</em> monthly limit of $250 for Claude Code and $170 for Cursor. <em>However, this can be exceeded with the simple press of a button if the limit is reached. 
I&#x2019;ve learned that last week, some engineering organisations at Salesforce had their &#x201C;maximum&#x201D; limit removed in order to &#x201C;remove any friction from the development process.&#x201D;</em></li></ol><p>The message Salesforce sends to staff is clear: &#x201C;use a minimum of $170/month tokens or be flagged.&#x201D; Who wants to get flagged for using too few tokens? The outcome is somewhat wasteful token spend:</p><ul><li><strong>Burning tokens for nothing. </strong>Devs ask Claude or Cursor: &#x201C;build me X,&#x201D; where X is a project or product with nothing to do with their work, and not something they&#x2019;d ever ship. It&#x2019;s just a way to burn tokens</li><li><strong>Calibrating token spend to be above average. </strong>Plenty of devs browse peers&#x2019; token spend to figure out the slightly-above average point, then use the tokens needed to hit that mark</li></ul><h3 id="shopify-an-example-on-how-to-avoid-tokenmaxxing"><strong>Shopify: an example on how to avoid tokenmaxxing</strong></h3><p>The first-ever token leaderboard that I&#x2019;m aware of was built by Shopify in 2025. And it worked well! Last June, the Head of Engineering at Shopify, Farhan Thawar, told me <a href="https://newsletter.pragmaticengineer.com/p/how-ai-is-changing-software-engineering?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">on The Pragmatic Engineer Podcast</a>:</p><blockquote>&#x201C;We have a leaderboard where we actively celebrate the people who use the most tokens because we want to make sure they are [celebrated] if they&#x2019;re doing great work with AI.<br><br>[And for the top people on the leaderboard,] I want to see why they spent say $1,000 a month in credits for Cursor. Maybe that&#x2019;s because they&#x2019;re building something great and they have an agent workforce underneath them!&#x201D;</blockquote><p>I asked Farhan for details on how it&#x2019;s gone since. 
Here&#x2019;s what he told me:</p><blockquote>&#x201C;We have since renamed the token leaderboard to usage dashboard: for obvious reasons, as we don&#x2019;t want to encourage &#x201C;competing&#x201D; to make it to the top of this board. We have token spend on our internal wiki profile as well as on the usage dashboard.<br><br><strong>We also have circuit breakers to catch &#x201C;runaway agents.&#x201D;</strong> So if personal spend spikes within a day, we can cut off access immediately, and you can renew if the usage spike was deliberate, or if it was a runaway agent. The circuit breaker worked well for us: we&#x2019;ve not only caught runaway agents, but found bugs in our infra this way!&#x201D;</blockquote><p>Shopify&#x2019;s approach seems to have worked for a few reasons:</p><ul><li><strong>The usage dashboard served as a &#x201C;push&#x201D; for devs to use AI tools early on. </strong>Last year, devs were mostly experimenting with AI tools because they were not as performant as today. The usage dashboard encouraged developers to try new tools, and highlighted power users.</li><li><strong>Circuit breakers helped.</strong> Cutting off spend when usage spikes helped catch &#x201C;runaway agents.&#x201D;</li><li><strong>High usage is looked at.</strong> Farhan checks in with top-spending individuals to understand the use cases. 
Any tokenmaxxing would likely have been spotted at this stage, which would have been a bit embarrassing for the user!</li></ul><p>One more interesting learning Farhan shared with me: it&#x2019;s more interesting to not look at &#x201C;who spent the most in <em>overall</em> token cost?&#x201D; but instead, &#x201C;whose <em>tokens</em> cost the most?&#x201D; Devs who generate tokens that come out as expensive have turned out to do in-depth work that was interesting to learn about!</p><h3 id="tokenmaxxing-great-for-ai-vendors-bad-for-everyone-else"><strong>Tokenmaxxing: great for AI vendors, bad for everyone else</strong></h3><p>I see very few rational reasons why incentivizing tokenmaxxing makes sense for any company. It results in increasing AI spend &#x2013; by a lot! &#x2013; in return for little to no value. Heck, in some cases it actually incentivises slower work &#x2013; as shown by devs using the AI to answer questions when documentation is readily available &#x2013; and encouraging &#x2018;busywork&#x2019; where devs prompt projects that they don&#x2019;t even want to ship. Tokenmaxxing seems to push devs to focus on stuff that makes no difference to a business.</p><p>It feels to me that a good part of the industry is using token count numbers similarly to how the lines-of-code-produced metric was used years ago. There was a time when the number of lines written daily or monthly was an important metric in programmer productivity, until it became clear that it&#x2019;s a terrible thing to focus on. A lines-of-code metric can easily be gamed by writing boilerplate or throwaway code. Also, the best developers are not necessarily those who write the most code; they&#x2019;re the ones who solve hard problems for the business quickly and reliably with &#x2013; or without &#x2013; code!</p><p>Similarly, the number of tokens a dev generates can easily be gamed, and if this metric is measured then devs will indeed game it. 
But doing so generates a massive accompanying AI bill!</p><p><em>&#x2014;-</em></p><p><em>Read the full issue of </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-tokenmaxxing-as-a-weird?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em>, or check out </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-ai-token-spending-out-of?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>this week&#x2019;s The Pulse</em></a><em>. This week&#x2019;s issue covers:</em></p><ol><li><strong>New trend: token spend breaks budgets &#x2013; what next? </strong>In the past 2-3 months, spending on AI agents has exploded at many tech companies, and the ramifications of this are starting to dawn on engineering leaders. We&#x2019;ve sourced details from 15 companies, including the different ways they are coping with this realization.</li><li><strong>New trend: more AI vendors can&#x2019;t keep up with demand. </strong>Related to massively increased spending, GitHub Copilot and Anthropic are starting to limit less-profitable individual users, so they can serve business users whose spend has easily 10x&#x2019;d in the last few months. The exception is OpenAI and Codex.</li><li><strong>Morale at Meta hits all-time low? </strong>Business is booming but devs at Meta are furious and worried due to looming layoffs, and an invasive tracking program rolled out to all US employees.</li></ol>]]></content:encoded></item><item><title><![CDATA[The Pulse: is GitHub still best for AI-native development?]]></title><description><![CDATA[Availability has dropped to one nine (~90% – !!), partly due to not being able to handle increased traffic from AI coding agents. 
There’s also no CEO and an apparent lack of direction.]]></description><link>https://blog.pragmaticengineer.com/the-pulse-is-github-still-best-for-ai-native-development/</link><guid isPermaLink="false">69cfc72da95ab10001e47e0c</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Fri, 03 Apr 2026 15:03:38 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of four topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-is-github-still-best-for?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em> issue. Full subscribers received the article below eight days ago. If you&#x2019;ve been forwarded this email, you can</em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em> subscribe here</em></a><em>.</em></p><p>We&#x2019;re used to highly reliable systems that target four nines of availability (99.99%, meaning about 52 minutes of downtime per year), and to it being embarrassing to barely hit three nines (99.9%, or around 9 hours of downtime per year). And yet, in the past month, GitHub&#x2019;s reliability is down to one nine!</p><p>Here&#x2019;s data from the third-party &#x201C;<a href="https://mrshu.github.io/github-statuses/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">missing GitHub status page</a>&#x201D;, which was built after GitHub stopped updating its own status page due to terrible availability. 
Recently, things have looked poor:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/04/image.png" class="kg-image" alt loading="lazy" width="1456" height="399" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/04/image.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/04/image.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/04/image.png 1456w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">GitHub down at one nine. Source: </em></i><a href="https://mrshu.github.io/github-statuses/?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">The Missing GitHub Status Page</em></i></a></figcaption></figure><p>This is equivalent to GitHub having issues on 3 days out of every 30, or issues/degradations for around 2.4 hours daily (around 10% of the time).</p><p><strong>GitHub seems unable to keep up with the massive increase in infra load from agents. </strong>One software engineer built a clever website called &#x201C;<a href="https://www.claudescode.dev/?window=90d&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Claude&#x2019;s Code</a>&#x201D; that tracks Claude Code bot contributions across GitHub. 
Growth in the past three months has been enormous:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/04/image-1.png" class="kg-image" alt loading="lazy" width="1456" height="909" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/04/image-1.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/04/image-1.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/04/image-1.png 1456w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Load from Claude Code has 6x&#x2019;d in 3 months. Source: </em></i><a href="https://www.claudescode.dev/?window=90d&amp;ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">Claude&#x2019;s Code</em></i></a></figcaption></figure><h3 id="stream-of-github-outages-from-infra-overload">Stream of GitHub outages from infra overload</h3><p>GitHub&#x2019;s CTO, Vladimir Fedorov, addressed availability issues <a href="https://github.blog/news-insights/company-news/addressing-githubs-recent-availability-issues-2/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">in a blog post</a> and covered three major incidents:</p><ul><li><a href="https://www.githubstatus.com/incidents/xwn6hjps36ty?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">2 February</a>: security policies unintentionally blocked access to virtual machine metadata</li><li><a href="https://www.githubstatus.com/incidents/lcw3tg2f6zsd?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">9 February</a>: a database cluster got overloaded</li><li><a href="https://www.githubstatus.com/incidents/g5gnt5l5hf56?ref=blog.pragmaticengineer.com" rel="noopener 
noreferrer nofollow">5 March</a>: writes failed on a Redis cluster</li></ul><p>Software engineer Lorin Hochstein did <a href="https://surfingcomplexity.blog/2026/03/12/quick-thoughts-on-github-ctos-post-on-availability/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">a helpful analysis</a> of these outages and the CTO&#x2019;s response, and has interesting observations:</p><ul><li><strong>Saturation</strong>: the database cluster incident (9 Feb) was a case of the database getting saturated, due to higher-than-expected usage. Databases are harder to scale up than stateless services. GitHub also underestimated how much additional traffic there would be.</li><li><strong>Failover + telemetry gap</strong>: the 2 Feb incident was a combination of an infra issue in one region failing over to a healthy region, and making things worse with a telemetry gap (incorrect security policies were applied in the new regions which blocked access to VM metadata)</li><li><strong>Failover + configuration issue</strong>: the 5 March incident was uncannily similar: after a failover, a configuration issue blocked writes on a Redis cluster</li></ul><p>It is certainly nice to get details from GitHub on these outages. It feels to me that infra strains are causing more infra issues &#x2192; they trigger constraints faster &#x2192; failovers are not as smooth as they should be. Could it be because GitHub keeps changing their existing systems?</p><h3 id="startup-shows-github-how-it%E2%80%99s-done">Startup shows GitHub how it&#x2019;s done</h3><p>While GitHub struggles to keep up with the increase in load from AI agents generating more code and pull requests, a new startup called Pierre Computer claims to have built an &#x201C;AI-native&#x201D; solution for AI agents pushing code, which scales far beyond what GitHub can do. 
Pierre was founded by <a href="https://www.linkedin.com/in/jacob-thornton-13a6a5162/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Jacob Thornton</a>: formerly an engineer at Coinbase, Medium, and Twitter, and also the creator of the once very popular <a href="https://getbootstrap.com/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Bootstrap</a> CSS library.</p><p>Here&#x2019;s what Pierre supports, which GitHub does not:</p><blockquote>&#x201C;In October [2025], Github shared they were averaging ~230 new repos per minute.<br><br>Last week we [at Pierre Computer] hit a sustained peak of &gt; 15,000 repos per minute for 3 hours.<br><br>And in the last 30 days customers have created &gt; 9M repos&#x201D;</blockquote><p>These are incredible numbers &#x2013; if also self-reported &#x2013; and something that GitHub clearly cannot get close to, at least not today! There are few details about customers, while the product &#x2013; called <a href="https://code.storage/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Code.storage</a> &#x2013; seems to be in closed beta.</p><p>Still, this is the type of &#x201C;git for AI agents&#x201D; that GitHub has failed to build, and the type of infrastructure it needs badly.</p><h3 id="has-github-lost-focus-and-purpose">Has GitHub lost focus and purpose?</h3><p>GitHub&#x2019;s reliability issues are acute enough that, if this keeps up, teams will start giving alternatives a try, such as small startups like Pierre, or perhaps even consider self-hosting Git. 
But how did the largest Git host in the world neglect its customers, and fail to prepare its infra for an increase in code commits and pull requests?</p><p>Mitchell Hashimoto, founder of Ghostty, and a heavy user of GitHub himself, had advice on what he would do if he were in charge of GitHub, after growing frustrated with the state of its core offering. 
He <a href="https://x.com/mitchellh/status/2036866220449030168?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">writes</a> (emphasis mine)</p><blockquote>&#x201C;Here&#x2019;s what I&#x2019;d do if I was in charge of GitHub, in order:<br><br><strong>1. Establish a North Star plan around being critical infrastructure for agentic code</strong> lifecycles and determine a set of ways to measure that.<br><br><strong>2. Fire everyone who works on or advocates for Copilot and shut it down.</strong> It&#x2019;s not about the people, I&#x2019;m sure there&#x2019;s many talented people; you&#x2019;re just working at the wrong company.<br><br><strong>3. Buy Pierre and launch agentic repo hosting as the first agentic product.</strong> Repos would be separate from the legacy web product to start, since they&#x2019;re likely burdened with legacy cross product interactions.<br><br><strong>4. Re-evaluate all product lines and initiatives against the new North Star. </strong>I suspect 50% get cut (to make room for different ones).<br><br>The big idea is all agentic interactions should critically rely on GitHub APIs. Code review should be agentic but the labs should be building that into GH (not bolted in through GHA like today, real first class platform primitives). GH should absolutely launch an agent chat primitive, agent mailboxes are obviously good. GH should be a platform and not an agent itself.<br><br>This is going to be very obviously lacking since I only have external ideas to work off of and have no idea how GitHub internals are working, what their KPIs are or what North Star they define, etc.<br><br>But, with imperfect information, this is what I&#x2019;d do.&#x201D;</blockquote><p>My sense is that GitHub has three concurrent problems:</p><ul><li><strong>GitHub and Copilot are entangled with Microsoft&#x2019;s internal politics. 
</strong>GitHub&#x2019;s Copilot in 2021 was the first massively successful &#x201C;AI product.&#x201D; Microsoft took the &#x201C;Copilot&#x201D; brand and used it across all of their product lines, creating low-quality AI integrations. Simultaneously, internal Microsoft orgs like Azure and Microsoft AI were trying to get their hands on GitHub, which is one of the most positive developer brands at Microsoft.</li><li><strong>GitHub has no leader, seemingly by design. </strong>GitHub&#x2019;s last CEO was Thomas Dohmke, who stepped down voluntarily, and Microsoft never backfilled the CEO role; instead carrying out a reorg to make GitHub part of Microsoft&#x2019;s AI group and stripping its independence. It seems the &#x201C;Microsoft AI&#x201D; side won that battle.</li><li><strong>GitHub has no focus, and is stuck chasing Copilot as a revenue source. </strong>GitHub has no CEO and is caught up in internal politics, so, what can GitHub teams do? The safest bet is to increase revenue and the best way to do that is by investing more into GitHub Copilot, and ignoring long-term issues like reliability.</li></ul><p>I agree with Mitchell: GitHub has no &#x201C;North Star&#x201D; and we see a large org being dysfunctional. That lack of vision &#x2013; and CEO &#x2013; is hitting hard:</p><ul><li>GitHub Copilot went from the most-used AI agent in 2021, to be overtaken by Claude Code, and is soon to be overtaken by Cursor.</li><li>As a platform, GitHub has no vision for how to evolve to support AI agents. Sure, GitHub has an MCP server, but it has no &#x201C;AI-native git platform&#x201D; that can handle the massive load AI agents generate.</li><li>GitHub keeps shipping small features and improvements without direction. For example, in October 2025, they <a href="https://x.com/jaredpalmer/status/1980619222918262842?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">started to work on</a> stacked diffs. 
However, when it ships, the stacked diffs workflow might be mostly obsolete &#x2013; at least with AI agents!</li></ul><p>It&#x2019;s easy to win a market when you do one thing better than anyone else in the world. Right now, GitHub is doing too many things and doing a subpar job with Copilot, its platform, and AI infra.</p><hr><p>Read the full issue of <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-is-github-still-best-for?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">last week&#x2019;s The Pulse</a>, or check out <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-industry-leaders-return?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">this week&#x2019;s The Pulse</a>.</p><p>Catch up with recent The Pragmatic Engineer issues:</p><ul><li><a href="https://newsletter.pragmaticengineer.com/p/scaling-uber-with-thuan-pham-ubers?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>Scaling Uber with Thuan Pham</strong></a> (Uber&#x2019;s first CTO &#x2014; podcast). We went into topics like scaling Uber from constant outages to global infrastructure, the shift to microservices and platform teams, and how AI is reshaping engineering.</li><li><a href="https://newsletter.pragmaticengineer.com/p/building-whatsapp-with-jean-lee?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>Building WhatsApp with Jean Lee</strong></a> (podcast): Jean Lee, engineer #19 at WhatsApp, on scaling the app with a tiny team, the Facebook acquisition, and what it reveals about the future of engineering.</li><li><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-what-will-the-staff-engineer?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>What will the Staff Engineer role look like in 2027 and beyond</strong></a><strong>?</strong> What happens to the Staff engineer role when agents write more code? 
Actually, they could be more in demand than ever!</li></ul>]]></content:encoded></item><item><title><![CDATA[Is the FDE role becoming less desirable?]]></title><description><![CDATA[Job postings for Forward Deployed Engineers (FDEs) have surged, but many professionals don’t want the role because it’s more like solutions engineering than software development.]]></description><link>https://blog.pragmaticengineer.com/is-the-fde-role-becoming-less-desirable/</link><guid isPermaLink="false">69c5918c3f13830001776a97</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Fri, 27 Mar 2026 10:29:33 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of four topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-is-the-fde-role-becoming?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em> issue. Full subscribers received the article below seven days ago. If you&#x2019;ve been forwarded this email, you can</em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em> subscribe here</em></a><em>.</em></p><p>An interesting trend highlighted <a href="https://www.wsj.com/cio-journal/the-hottest-job-in-tech-isnt-very-glamorous-dc29ab3e?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">by The Wall Street Journal</a>: companies want to hire for FDE roles, but devs are just not that interested:</p><blockquote>&#x201C;Job postings on Indeed grew more than 10-fold in 2025 compared with 2024. The number of public company transcripts mentioning the role jumped to 50 from eight over the same period, according to data from AlphaSense.<br><br>The only problem? 
Few engineers want the job, which has historically been seen as demanding, undesirable, and less prestigious than product-focused engineering roles.<br><br>&#x201C;Everyone wants them and there&#x2019;s only maybe 10% of the market that wants that role,&#x201D; said Patrick Kellenberger, president and chief operating officer at Betts Recruiting.&#x201D;</blockquote><p>Last summer, we covered <a href="https://newsletter.pragmaticengineer.com/p/forward-deployed-engineers?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">the rise of the FDE role</a>, and looked into what it&#x2019;s like. Back then, this is how I visualized what was then a very hot role:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image-3.png" class="kg-image" alt loading="lazy" width="1280" height="798" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/03/image-3.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/03/image-3.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image-3.png 1280w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">My 2025 visualization of the FDE role</em></i></figcaption></figure><p>At the companies where I interviewed FDE folks &#x2013; OpenAI and Ramp &#x2013; the role seemed to live up to this visualization. However, I&#x2019;ve since talked with two engineers who took FDE roles and were disappointed. 
This is how they saw it, in practice:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image-4.png" class="kg-image" alt loading="lazy" width="1400" height="1094" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/03/image-4.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/03/image-4.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image-4.png 1400w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Reality of the FDE role: less software engineering, and even less platform engineering</em></i></figcaption></figure><p>The role seems akin to a &#x201C;sales engineer&#x201D; where FDEs help close the deals, or a solutions engineer (or even consultant), where FDEs deploy to a customer to build them a solution. They don&#x2019;t contribute back into the platform, and don&#x2019;t do much that&#x2019;s considered &#x201C;software engineering&#x201D; beyond integrating software which the product team built.</p><p>Some engineers figure out the nature of the role during the interview process and pass on it. Meanwhile, some others take the job and later quit. Here&#x2019;s what I heard from a dev who accepted an FDE role at a company, but didn&#x2019;t find what they expected:</p><blockquote>&#x201C;This FDE job was a typical IT services mindset. The company wanted to use me more on the engagement lead side, and nothing on software development. It&#x2019;s not what I signed up for, and I didn&#x2019;t like the vibe and culture. 
I quit 4 weeks later.&#x201D;</blockquote><p>In today&#x2019;s job market, if there&#x2019;s high demand for a role which pays decently but attracts little interest from engineers, there&#x2019;s always a reason!</p><hr><p>Read the full issue of <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-is-the-fde-role-becoming?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">last week&#x2019;s The Pulse</a>, or check out <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-is-github-still-best-for?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">this week&#x2019;s The Pulse</a>.</p><p>Catch up with recent The Pragmatic Engineer issues:</p><ul><li><a href="https://newsletter.pragmaticengineer.com/p/building-whatsapp-with-jean-lee?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>Building WhatsApp with Jean Lee</strong></a> (podcast): Jean Lee, engineer #19 at WhatsApp, on scaling the app with a tiny team, the Facebook acquisition, and what it reveals about the future of engineering.</li><li><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-what-will-the-staff-engineer?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>The Pulse: What will the Staff Engineer role look like in 2027 and beyond?</strong></a><strong> </strong>What happens to the Staff engineer role when agents write more code? 
Actually, they could be more in demand than ever!</li><li><a href="https://newsletter.pragmaticengineer.com/p/from-ides-to-ai-agents-with-steve?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>From IDEs to AI Agents with Steve Yegge (podcast):</strong></a> Steve Yegge on how AI is reshaping software engineering, the rise of &#x201C;vibe coding,&#x201D; and why developers must adapt to a rapidly changing craft.</li></ul>]]></content:encoded></item><item><title><![CDATA[The Pulse: Cloudflare rewrites Next.js as AI rewrites commercial open source]]></title><description><![CDATA[An engineer at Cloudflare rewrote most of Vercel’s Next.js in one week with AI agents. It looks like a sign of how AI will disrupt existing moats and business models. Analysis]]></description><link>https://blog.pragmaticengineer.com/the-pulse-cloudflare-rewrites-next-js-as-ai-rewrites-commercial-open-source/</link><guid isPermaLink="false">69a9c3bb4c4eb80001b25ced</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Thu, 05 Mar 2026 18:03:16 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the </em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com" rel="noreferrer"><em>Pragmatic Engineer</em></a><em>. This issue is the </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-164-nextjs?ref=blog.pragmaticengineer.com" rel="noreferrer"><em>entire The Pulse issue</em></a><em> from the past week, which paying subscribers received seven days ago. 
This piece generated </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-164-nextjs/comments?ref=blog.pragmaticengineer.com" rel="noreferrer"><em>quite a few comments across subscribers</em></a><em>, and so I&apos;m sharing it more broadly, especially as it raises questions on what is defensible and what is not with open source.</em></p><p><em>If you&#x2019;ve been forwarded this email, you can</em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com"><em> <u>subscribe here</u></em></a><em> to get issues like this in your inbox.</em></p><p>Today&#x2019;s issue of The Pulse focuses on a single event because it&#x2019;s a significant one with major potential ripple effects. On Tuesday, Cloudflare shocked the dev world by announcing that they have rewritten&#xA0;<a href="http://next.js/?ref=blog.pragmaticengineer.com">Next.js</a>&#xA0;in just one week, with a single developer who used only $1,100 in tokens:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image.png" class="kg-image" alt loading="lazy" width="1186" height="1342" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/03/image.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/03/image.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image.png 1186w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Cloudflare CTO Dane Knecht&#xA0;</em></i><a href="https://x.com/dok2001/status/2026386974580330830?s=20&amp;ref=blog.pragmaticengineer.com" rel><i><em class="italic" style="white-space: pre-wrap;">on X</em></i></a></figcaption></figure><p>There are several layers to dig into here:</p><ol><li><strong>The Next.js ecosystem: a 
recap</strong>. Close to half of React devs use Next.js, and the best place to deploy Next.js is on Vercel &#x2013; partly thanks to its proprietary build output.</li><li><strong>What Cloudflare did with Next.js</strong>. Replacing the build engine in Next.js with the more standard Vite one, allowing Next.js apps to be easily deployed on Cloudflare.</li><li><strong>AI brings the impossible within reach</strong>. What would take years in engineering terms was executed in one week with some tokens.</li><li><strong>&#x201C;AI slop&#x201D; still an issue.</strong>&#xA0;Contrary to Cloudflare&#x2019;s claims, vinext is not production-ready, and will need plenty of cleanup and auditing to make it on par with Next.js.</li></ol><h2 id="1-the-nextjs-ecosystem-a-recap"><br>1. The Next.js ecosystem: a recap</h2><p>First, some background.&#xA0;<a href="https://nextjs.org/?ref=blog.pragmaticengineer.com">Next.js</a>&#xA0;is the most popular fullstack React framework and around half of all React devs use it, as per recent research such as the 2025 Stack Overflow developer survey. Next.js is an open source project, built and mostly maintained by Vercel, which is the preferred deployment target for Next.js applications for many reasons. One of them is that Next.js is ideal to deploy to Vercel because Next.js applications are built with Vercel&#x2019;s Turbopack build tool. The output of a build is a proprietary format. As Netlify engineer Eduardo Bou&#xE7;as&#xA0;<a href="https://eduardoboucas.com/posts/2025-03-25-you-should-know-this-before-choosing-nextjs/?ref=blog.pragmaticengineer.com">writes</a>:</p><blockquote>&#x201C;The output of a Next.js build has a proprietary and undocumented format that is used in Vercel deployments to provision the infrastructure needed to power the application.<br><br>This means that any hosting providers other than Vercel must build on top of undocumented APIs that can introduce unannounced breaking changes in minor or patch releases. 
(And they have)&#x201D;.</blockquote><p>Next.js is an interestingly built project, where everything is open source, and the best place to deploy a Next.js application is on Vercel, as it&#x2019;s optimized to run undocumented build artifacts the most efficiently. This is a smart strategy from Vercel which competitors will dislike, as any hosting provider would prefer Next.js to produce a standard build format. To do this, the build engine, Turbopack, would need to be replaced with something more standard.</p><p><strong>Let&#x2019;s talk about build tools for web development.&#xA0;</strong>According to the&#xA0;<a href="https://2025.stateofjs.com/en-US/libraries/?ref=blog.pragmaticengineer.com">State of JS 2025 survey</a>, the most popular in the web ecosystem are:</p><ol><li><a href="https://vite.dev/?ref=blog.pragmaticengineer.com"><strong>Vite</strong></a>: the most popular choice for new projects due to its speed and developer experience. Uses projects like&#xA0;<a href="https://esbuild.github.io/?ref=blog.pragmaticengineer.com">esbuild</a>&#xA0;and&#xA0;<a href="https://rollupjs.org/?ref=blog.pragmaticengineer.com">Rollup</a>&#xA0;under the hood</li><li><a href="https://webpack.js.org/?ref=blog.pragmaticengineer.com"><strong>Webpack</strong></a>: a legacy tool that&#x2019;s not very performant, but still widely deployed in older projects</li><li><a href="https://nextjs.org/docs/app/api-reference/turbopack?ref=blog.pragmaticengineer.com"><strong>Turbopack</strong></a>: Created by Vercel and optimized for larger&#xA0;<a href="http://next.js/?ref=blog.pragmaticengineer.com">Next.js</a>&#xA0;applications. Built in Rust and intended to be more performant</li><li><a href="https://bun.com/?ref=blog.pragmaticengineer.com"><strong>Bun</strong></a>: a relatively new, all-in-one runtime and bundler. 
Anthropic acquired the team&#xA0;<a href="https://newsletter.pragmaticengineer.com/i/180722007/anthropic-acquires-javascript-runtime-bun?ref=blog.pragmaticengineer.com">in December</a>, and some Bun folks are now focused on improving Claude Code&#x2019;s performance.</li></ol><p>So, most of the web ecosystem uses Vite as a build tool; Next.js uses Turbopack, and the majority of React applications with a full-stack React framework use Next.js. Basically, most devs using Next.js are likely to use Vite as their build tool.</p><h2 id="2-what-cloudflare-did-with-nextjs"><br>2. What Cloudflare did with Next.js</h2><p>Here&#x2019;s a naive idea: what if Next.js used Vite to generate build outputs? In that case, build outputs would be standardized and would run equally well on any cloud provider, as there would be nothing proprietary or undocumented to Vercel.</p><p>And this is what Cloudflare did: replace Turbopack with Vite and call the new package &#x2018;vinext&#x2019;:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image-1.png" class="kg-image" alt loading="lazy" width="1442" height="1024" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/03/image-1.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/03/image-1.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image-1.png 1442w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Cloudflare replaced the Turbopack build dependency with Vite to create vinext</em></i></figcaption></figure><p>Buried midway in the announcement is how this project&#xA0;<a href="https://blog.cloudflare.com/vinext/?ref=blog.pragmaticengineer.com#status-experimental">is experimental</a>&#xA0;and not at all guaranteed 
to work okay: it&#x2019;s a &#x2018;use-at-own-risk&#x2019; project. Still, the mere fact of this development feels like an earthquake in the tech world because of&#xA0;<em>how</em>&#xA0;it was pulled off.</p><h2 id="3-ai-brings-the-impossible-within-reach"><br>3. AI brings the impossible within reach</h2><p>In a blog post announcing the project, Cloudflare claims only one engineer &#x201C;rebuilt&#x201D; the whole thing in a way that&#x2019;s trivial to deploy to Cloudflare&#x2019;s own infrastructure, and only cost $1,100 in tokens. From Cloudflare&#x2019;s&#xA0;<a href="https://blog.cloudflare.com/vinext/?ref=blog.pragmaticengineer.com">statement</a>:</p><blockquote>&#x201C;Last week, one engineer and an AI model rebuilt the most popular front-end framework from scratch. The result, vinext (pronounced &#x201C;vee-next&#x201D;), is a drop-in replacement for Next.js, built on Vite, that deploys to Cloudflare Workers with a single command. In early benchmarks, it builds production apps up to 4x faster and produces client bundles up to 57% smaller. And we already have customers running it in production.<br><br>The whole thing cost about $1,100 in tokens&#x201D;.</blockquote><p>What Cloudflare did:</p><ul><li>Took the Next.js public API</li><li>Reimplemented behaviour using Vite</li><li>Created build output whose behaviour matches the &#x201C;original&#x201D; Next.js implementation</li></ul><p>After 10 years, the core of Next has around 194,000 lines of code (LOC)**. 
Meanwhile,&#xA0;<a href="https://github.com/cloudflare/vinext?ref=blog.pragmaticengineer.com">vinext</a>&#xA0;is about 67,000 lines of code which suggests a much leaner implementation: for example, vinext does not need to support legacy Next APIs, and vinext currently supports 94% of the Next.js API (and it&#x2019;s safe to assume they left complex edge cases in the remaining 6%).<br><br>** the Next.js repository is closer to 2M lines of code: 1M is bundled dependencies (eg React bundles, CSS build etc), tests are 308,000 LOC, Turbopack 311,000 LOC.</p><p><strong>Pre-AI, this reimplementation would have taken years of engineering time to complete.&#xA0;</strong>Doing what Cloudflare did was always possible<em>&#xA0;in theory</em>, but never seemed practical. I mean, why have a team of engineers spend potentially years on generating a standardized build output for Next.js apps? Even if they did, the dev community would have doubts about whether Cloudflare would maintain the project.</p><p>This is the thing with forking or rewriting open source projects: a major value proposition for commercial open source is to know that they will be&#xA0;<em>maintained</em>. Vercel has proved it&#x2019;s a reliable custodian of Next.js for the past 10 years. Without AI, it could be assumed that any new reimplementation would eventually run out of steam.</p><p><strong>Separately but relatedly, Cloudflare has now proved that the cost of rewriting&#xA0;<em>existing</em>&#xA0;software has become ~100x cheaper, thanks to AI, and this economy is likely to be the case for maintenance, too.&#xA0;</strong>Considering how trivial it was to rebuild one of the more complex open source projects, this augurs well for it being trivial and much cheaper to maintain in the future. 
Potentially, Cloudflare no longer needs to budget an engineering team only for maintenance, if a single engineer could maintain the project, part-time!</p><p>Cloudflare had a project measured in engineering years, and completed it in&#xA0;<em>one engineering week</em>! It just took a single engineer using&#xA0;<a href="https://opencode.ai/?ref=blog.pragmaticengineer.com">OpenCode</a>&#xA0;(open source coding agent), Opus 4.5, and a bunch of tokens, then: &#x2018;<em>boom&#x2019;</em>,&#xA0;<em>vinext</em>&#xA0;was born.</p><h2 id="4-%E2%80%9Cai-slop%E2%80%9D-still-an-issue">4. &#x201C;AI slop&#x201D; still an issue</h2><p>There are questions about the quality of vinext, though.<strong>&#xA0;</strong>Vercel, naturally, is unhappy and hit out at the obvious weakness that vinext is unfit for production usage because it&#x2019;s insecure. Vercel CEO, Guillermo Rauch, did not miss a beat by tying Cloudflare&#x2019;s effort to the &#x201C;vibe coding&#x201D; stereotype of sloppy work executed with a lack of understanding:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image-2.png" class="kg-image" alt loading="lazy" width="1194" height="794" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/03/image-2.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/03/image-2.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/03/image-2.png 1194w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Guillermo Rauch&#xA0;</em></i><a href="https://x.com/rauchg/status/2026864132423823499?s=20&amp;ref=blog.pragmaticengineer.com" rel><i><em class="italic" style="white-space: pre-wrap;">on X</em></i></a></figcaption></figure><p>Guillermo has a point: anyone who 
stopped reading&#xA0;<a href="https://blog.cloudflare.com/vinext/?ref=blog.pragmaticengineer.com">Cloudflare&#x2019;s launch announcement</a>&#xA0;after the first few sentences would assume it&#x2019;s production-ready, with the first paragraph of this announcement closing with:</p><p>&#x201C;And we already have customers running it in production.&#x201D;</p><p>However, Cloudflare doesn&#x2019;t&#xA0;<a href="https://blog.cloudflare.com/vinext/?ref=blog.pragmaticengineer.com#status-experimental">share</a>&#xA0;the rather crucial detail that &#x201C;running in production&#x201D; means that vinext has been deployed onto a beta site, until more than 1,000 words (around 2&#x2013;3 pages) into the announcement:</p><blockquote>&#x201C;We want to be clear: vinext is experimental. It&#x2019;s not even one week old, and it has not yet been battle-tested with any meaningful traffic at scale. (...)<br><br>We&#x2019;ve been working with National Design Studio, a team that&#x2019;s aiming to modernize every government interface,&#xA0;<strong>on one of their beta sites</strong>, CIO.gov.</blockquote><p>Oh. So, &#x201C;customers running it in production&#x201D; at Cloudflare apparently means &#x201C;customer running a beta site in production without meaningful traffic.&#x201D; This is a first from the infrastructure giant, which usually prides itself on accurate statements!</p><p>This detail was also absent when Cloudflare&#x2019;s CEO and CTO&#xA0;<a href="https://x.com/eastdakota/status/2026389179345916255?s=20&amp;ref=blog.pragmaticengineer.com">were boosting</a>&#xA0;vinext like it was a mature, battle-tested product. 
In that context, Vercel&#x2019;s raising of the issue of security vulnerabilities is more than fair game, in my view.</p><p>Still, all that doesn&#x2019;t alter the core learning from this project: that AI has the power to drastically reduce engineering time by up to ~100x and deliver&#xA0;<em>usable-enough</em>&#xA0;output, for relatively negligible financial cost.&#xA0;<em>Just keep in mind that security and reliability issues will probably take plenty of extra time and effort to address.</em></p><h2 id="5-new-attack-vector-on-commercial-open-source">5. New attack vector on commercial open source?</h2><p>If arch-rivalries exist in tech, then Cloudflare and Vercel are a prime example. Both are gunning to become the most popular platform for developers to deploy their code, and the CEOs are regularly seen in public taking shots at the other side. One such spat happened&#xA0;<a href="https://newsletter.pragmaticengineer.com/i/160004343/ceos-scrap?ref=blog.pragmaticengineer.com">in March</a>, as covered at the time:</p><blockquote>&#x201C;Things kicked off on social media, with developers confused about the severity of the incident, and about why Next.js seemed silent, and also why Cloudflare sites were breaking due to its fix for the CVE causing its own issues. It was at that point that Cloudflare&#x2019;s CEO, Matthew Prince, entered the chat to accuse Vercel of&#xA0;<a href="https://x.com/rauchg/status/1903590962498326771?ref=blog.pragmaticengineer.com">not caring about security</a>:<br><br>Given the security incident was ongoing, this felt a bit &#x201C;below the belt&#x201D; by the Cloudflare chief. Criticizing rivals is fair game, but why not wait until the incident is over? 
The punch landed, and Vercel&#x2019;s CEO Guillermo Rauch is not someone to take it lying down, so he&#xA0;<a href="https://x.com/rauchg/status/1903590962498326771?ref=blog.pragmaticengineer.com">hit back</a>.<br><br>Cloudflare&#x2019;s CEO then responded with a cartoon&#xA0;<a href="https://x.com/eastdakota/status/1903690805576909227?ref=blog.pragmaticengineer.com">implying</a>&#xA0;that although Vercel is much larger than its competitor Netlify, Cloudflare is 100x bigger than both, and could stomp them into the ground at will.&#x201D;</blockquote><p>Serving the public interest wasn&#x2019;t why Cloudflare rewrote&#xA0;<a href="http://next.js/?ref=blog.pragmaticengineer.com">Next.js</a>: they did it because they want Next.js sites to be deployed onto Cloudflare, but doing so made little sense until now because Next.js produced bespoke build output optimized for Vercel&#x2019;s infrastructure. With this change, Cloudflare&#xA0;<a href="https://blog.cloudflare.com/vinext/?ref=blog.pragmaticengineer.com">claims</a>&#xA0;it provides&#xA0;<em>superior&#xA0;</em>performance when hosting Next.js apps, according to their own measurements.</p><p><em>I&#x2019;d just add that performance is important for developers, but other things matter, too. Cost, reliability, developer experience, and how much devs like a company, are all factors in choosing between vendors. 
Also, performance measurements from a vendor about its own service must be taken with a large pinch of salt.</em></p><p><strong>Zooming out from this episode, it seems that AI is bringing the value of existing commercial open source moats into question.&#xA0;</strong>Vercel carved out a clever open source strategy that helped turn its open source investment into business revenue:</p><ol><li>Build and maintain Next.js, delivering the best developer experience (DX).</li><li>Optimize Vercel to serve the specific (and undocumented) build output of Next.js.</li><li>Most developers onboarding to Next.js will decide to deploy on Vercel to get the most benefit, in terms of DX and performance.</li><li>&#x2026; repeat for years while the business becomes worth billions! (Vercel was&#xA0;<a href="https://startupwired.com/2025/10/01/vercel-raises-300-million-reaches-9-3-billion-valuation/?ref=blog.pragmaticengineer.com">valued</a>&#xA0;at $9B last October).</li></ol><p>Underpinning this success are some assumptions:</p><ol><li>Next.js will remain the #1 choice for developers to build React applications, thanks to ongoing investment.</li><li>It is expensive to rewrite Next.js to be deployable and performant on another cloud vendor.</li><li>Even if someone did #2, developers would be skeptical and not switch over.</li></ol><p>Vercel can invest in #1 to keep Next as best-in-class, while knowing that the risk of #2 occurring is minor. 
However, Cloudflare has now &#x201C;cloned&#x201D; Next, and can easily keep up with all changes in the future, and port them back to vinext.</p><p><strong>But AI makes it trivial to &#x201C;piggyback&#x201D; off any commercial open source project, which is a massive problem for commercial open source startups.&#xA0;</strong>Vercel puts all the effort and investment into building and maintaining&#xA0;<a href="http://next.js/?ref=blog.pragmaticengineer.com">Next.js</a>, while Cloudflare enjoys the benefit of this hard work (the Next.js public API) which is easily deployable to Cloudflare, and it can now undercut Vercel on price. For all future Next.js changes, Cloudflare will just sync it to vinext, using AI!</p><p>WordPress had&#xA0;<a href="https://newsletter.pragmaticengineer.com/i/149770356/2-open-source-business-model-struggles-wordpress?ref=blog.pragmaticengineer.com">a similar problem</a>, with WP Engine &#x201C;piggybacking&#x201D; off its work and undercutting their pricing in 2024. As I analyzed at the time:</p><blockquote>&#x201C;Free-riding on permissive open source is too tempting to pass on for other vendors. WP Engine uses a common loophole of contributing almost nothing in R&amp;D to WordPress, while selling it as a managed service. This means that they could either easily undercut the pricing of larger players like Automattic which do spend on WordPress&#x2019;s R&amp;D. Alternatively, a company like WP Engine could charge as much, or more, as Automattic, but be able to spend a lot more on marketing, while being similarly profitable. &#x201C;Saving&#x201D; on R&amp;D gives the &#x201C;free-riders&#x201D; plenty of options to grow their businesses: options not necessarily open to Automattic while they invest as much into R&amp;D as they do.<br><br>Commercial open source vendors pressure to end &#x201C;freeriding&#x201D;. 
Automattic is likely facing lower revenue growth, with customers choosing vendors like WP Engine which offer a similar service &#x2014; getting these customers either via a cheaper price or thanks to more marketing spend. This legal fight could be an effort to force WP Engine to stop eating Automattic&#x2019;s lunch, or perhaps get WP Engine to sell to Automattic, which would cement its leading status in managed Wordpress, while also boosting revenue by $400M a year &#x2013; according to its own figures&#x201D;.</blockquote><p>Vercel managed to avoid the &#x201C;free-riding&#x201D; problem with&#xA0;<a href="http://next.js/?ref=blog.pragmaticengineer.com">Next.js</a>, but that&#x2019;s no longer possible now that AI makes it trivial to rewrite.</p><h2 id="6-defense-or-offense"><br>6. Defense or offense?</h2><p>How should commercial open source companies respond to the threat that a competitor can easily rewrite the software behind the managed solutions which they sell as services?</p><p><strong>One obvious response is to make tests private, so that replication is harder for AI.&#xA0;</strong>One thing that made it so easy for Cloudflare to rewrite Next was the project&#x2019;s comprehensive test suite. From&#xA0;<a href="https://blog.cloudflare.com/vinext/?ref=blog.pragmaticengineer.com">their announcement<u>&#xA0;</u></a>(emphasis mine):</p><blockquote>&#x201C;We also want to acknowledge the Next.js team. They&#x2019;ve spent years building a framework that raised the bar for what React development could look like.&#xA0;<strong>The fact that their</strong>&#xA0;API surface is so well-documented and their&#xA0;<strong>test suite so comprehensive</strong>&#xA0;is a big part of what made this project possible.&#x201D;</blockquote><p>Database solution SQLite is famous for its incredible test suite. 
What some people don&#x2019;t know is that while core&#xA0;<a href="https://sqlite.org/?ref=blog.pragmaticengineer.com">SQLite</a>&#xA0;tests are open source, its most comprehensive test suite &#x2013;&#xA0;<a href="https://sqlite.org/testing.html?ref=blog.pragmaticengineer.com">TH3</a>&#xA0;&#x2013; is closed source. SQLite monetizes its advanced infrastructure as a&#xA0;<a href="https://sqlite.org/prosupport.html?ref=blog.pragmaticengineer.com">service</a>&#xA0;for purchase. This is a fair tradeoff: for most contributors, the basic open source tests work well enough. For enterprise users or customers who really care about correctness, it makes sense to purchase advanced testing services from the service&#x2019;s creator.</p><p>Open source canvas project, tldraw,&#xA0;<a href="https://github.com/tldraw/tldraw/issues/8082?ref=blog.pragmaticengineer.com">announced</a>&#xA0;it will relocate its test suite to a closed source repository; a move which makes plenty of sense. Here&#x2019;s commentary from Simon Willison:</p><blockquote>&#x201C;It&#x2019;s become very apparent over the past few months that a comprehensive test suite is enough to build a completely fresh implementation of any open source library from scratch, potentially in a different language.&#x201D;</blockquote><p>In the event, tldraw&#x2019;s announcement turned out&#xA0;<a href="https://github.com/tldraw/tldraw/issues/8082?ref=blog.pragmaticengineer.com#issuecomment-3964650501">to be a joke</a>, but who&#x2019;s laughing now? An open source project with excellent tests is an easy target for an AI agent to execute a full rewrite of it.</p><p><strong>Could new licenses be created for the AI era?&#xA0;</strong>Existing open source licenses were created on the assumption that humans read open source code, and humans modify it. Agents break that assumption.</p><p>Could we see new license types emerge to ban AI agents from modifying projects&#x2019; source code? 
It seems pretty far-fetched and hard to implement, but not beyond the realms of possibility.</p><p>AI agents are still very new, and going mainstream in tech. Once they break into other industries, I wouldn&#x2019;t be surprised if legal frameworks are reworded to also apply to AI agents. If and when this happens, it would open the path for open source licenses to distinguish between agents and humans.</p><p><strong>What is a moat, if code can be trivially ported?&#xA0;</strong>A team operating a popular open source project can no longer assume that it&#x2019;s expensive to fork or completely rewrite the project, meaning it makes sense to focus on other moats, such as:</p><ul><li><strong>Outstanding (paid) support.</strong>&#xA0;AI could make this much easier at a higher quality, if done right.</li><li><strong>Smaller open core, larger closed source part.&#xA0;</strong>&#x201C;Open core&#x201D; as a business model has been dominant for commercial open source: keep the core of the software open source, while advanced enterprise features are source available or closed source. I would expect more companies to move their additional services to closed source, not source available.</li><li><strong>In-person connection and community.</strong>&#xA0;Projects with a real-world community will form a sense of connection that goes beyond code. For example, it&#x2019;s hard to imagine vinext meetups popping up &#x2013; whereas there are many Next.js communities.</li><li><strong>Infrastructure and hardware remains a massive moat.&#xA0;</strong>In a world where software is trivial to copy, infrastructure remains a moat. Commercial open source might make most sense for players that own and operate infrastructure layers superior to those of their rivals, and that can offer lower cost, higher reliability, lower latency, higher performance, or a combination of these.</li></ul><h2 id="7-ai-world-reality"><br>7. 
AI-world reality</h2><p><strong>One of the single best AI use cases is full-on rewrites of well-tested products.&#xA0;</strong>I estimate that AI sped up the creation of vinext by at least 100x, which is massive. But we don&#x2019;t really see efficiency boosts of anything like that with AI tools, in general. As Laura Tacho&#xA0;<a href="https://newsletter.pragmaticengineer.com/i/189035949/1-data-vs-hype-how-orgs-actually-win-with-ai?ref=blog.pragmaticengineer.com">shared</a>&#xA0;at The Pragmatic Summit in San Francisco, the average self-reported efficiency &#x2018;AI gain&#x2019; seems to be circa 10%.</p><p>I suspect this vast chasm in efficiency boosts is because AI is many times more efficient at &#x201C;no-brainer tasks&#x201D; where correctness can be verified with tests, versus those which are more open ended or involve more creativity.</p><p><strong>In general, tests are incredibly important for efficient AI usage.&#xA0;</strong>On The Pragmatic Engineer Podcast, Peter Steinberger stressed how important &#x201C;closing the loop&#x201D; in his developer flow is by instructing the AI to test itself, and ensuring the AI has tests to run that verify correctness.</p><p>Automated tests were always considered a best practice for creating maintainable code. Now, having a codebase with extensive tests is the baseline to make AI agents work productively for refactors, rewrites &#x2013; or even adding new features and verifying that things did not break!</p><p><strong>Vendors will start to deploy &#x201C;migration AI agents&#x201D; to move customers over to their own stacks.&#xA0;</strong>This got lost in Cloudflare&#x2019;s announcement, but it&#x2019;s&#xA0;<a href="https://github.com/cloudflare/vinext?ref=blog.pragmaticengineer.com">important</a>:</p><blockquote>vinext includes an Agent Skill that handles migration for you. It works with Claude Code, OpenCode, Cursor, Codex, and dozens of other AI coding tools. 
Install it, open your Next.js project, and tell the AI to migrate:<br><br><em>&gt; npx skills add cloudflare/vinext</em><br><br>Then open your Next.js project in any supported tool and say:<br><br><em>&gt; migrate this project to vinext</em><br><br>The skill handles compatibility checking, dependency installation, config generation, and dev server startup. It knows what vinext supports and will flag anything that needs manual attention.</blockquote><p>This is very clever from Cloudflare, and a true &#x201C;AI-native&#x201D; move. They have not only used AI to migrate Next.js, but also built an &#x201C;AI plugin&#x201D; (a skill) to help customers migrate their existing codebases over to vinext &#x2013; and deploy on Cloudflare!</p><p>This move will surely be copied by other vendors, since migrations which are tedious for humans are much less effort with agents.</p><p><strong>AI is making the tech industry more ruthless when it comes to business practices.&#xA0;</strong>Laura Tacho said something interesting at The Pragmatic Summit:</p><blockquote>&#x201C;AI is an accelerator, it&#x2019;s a multiplier, and it is moving organizations in different directions.&#x201D;</blockquote><p>AI seems to be accelerating the ruthlessness of competition for customers and the speed at which this happens. In one week, Cloudflare rebuilt Next.js, and it&#x2019;s attacking Vercel full-on: claiming their &#x201C;vibe coded&#x201D; alternative is more performant and production-ready, and burying at the foot of the launch announcement the crucial information that vinext is very much experimental.</p><p>I sense vendors are realizing that there&#x2019;s a limited amount of time in which to use AI to their advantage, and some will decide to use it like Cloudflare has.</p><p><strong>On the other hand, AI could be great news for non-commercial open source.&#xA0;</strong>AI presents as a threat to commercial open source because it removes existing moats which make code hard to fully rewrite. 
However, beyond that, AI could help non-commercial open source to thrive:</p><ul><li>With AI, it&#x2019;s easy to fork an open source project and keep the fork in-sync with the original.</li><li>It&#x2019;s trivial to instruct AI to rewrite an open source project to another language or framework.</li><li>&#x2026;and it&#x2019;s equally trivial for AI to add features to a fork.</li></ul><p>For these reasons, I believe there could be a lot more forks and rewrites to come, and more open source projects and code, in general.</p><h2 id="takeaways"><br>Takeaways</h2><p>Personally, I could not have imagined things changing this quickly in software. Rewriting Next.js in a single week, even to a version that is not quite there &#x2013; but mostly works? This was out of the question as recently as a few months ago.</p><p>Things changed around last December, when Opus 4.5 and GPT-5.2 came out and proved capable&#xA0;<a href="https://newsletter.pragmaticengineer.com/p/when-ai-writes-almost-all-code-what?ref=blog.pragmaticengineer.com">of writing most of the code</a>. What used to be expensive is now cheap &#x2013; like rewriting complete projects &#x2013; and we still need to learn what the &#x201C;new&#x201D; expensive parts of software engineering are.</p><p>All this is new territory for everyone. To succeed in the tech industry, you need to be able to capitalize upon change, as Cloudflare has clearly done in this case by making the most of an opportunity created by new technology. 
It&#x2019;s unclear how popular vinext will become, and how much of a moat Vercel has around the broader Next.js ecosystem, but I suspect that it&#x2019;d take more than a Next rewrite to make Cloudflare into a viable Next.js platform-as-a-service provider.</p>]]></content:encoded></item><item><title><![CDATA[I replaced a $120/year micro-SaaS in 20 minutes with LLM-generated code]]></title><description><![CDATA[ I used to pay $120/year for a SaaS that hasn’t added new features in four years, and didn’t fix its broken billing system for three years. Using an LLM, I managed to rewrite all the functionality I used to pay for in 20 minutes. Is this bad news for “write once, don’t update later” SaaS?]]></description><link>https://blog.pragmaticengineer.com/i-replaced-a-120-year-micro-saas-in-20-minutes-with-llm-generated-code/</link><guid isPermaLink="false">697ba13c7779050001e3775d</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Thu, 29 Jan 2026 18:41:45 GMT</pubDate><content:encoded><![CDATA[<p>I have been sceptical of the manifold claims that software-as-a-service (SaaS) will be killed by LLMs. The theory behind this idea is:</p><ol><li>SaaS is a pure software product. People who pay SaaS vendors do so because it&#x2019;s cheaper to buy this software than build it.</li><li>LLMs dramatically reduce the time and cost of building custom software.</li><li>Therefore, most SaaS vendors will go out of business because most companies/teams will prompt an LLM to write the software they need, such as for ticketing, meetings, customer relationship management, etc.</li></ol><p>The reason for my scepticism has been that SaaS such as HR software Workday is&#xA0;<em>more</em>&#xA0;than just software. 
Workday, for example, keeps up with compliance requirements (e.g., for holiday pay in different countries), guarantees correctness (e.g., payslips that comply with local regulations), and over time the software keeps up to date with changes in the external and internal environments.</p><p><strong>However, this week I had first-hand experience of how ridiculously easy it is now to replace SaaS with LLMs.&#xA0;</strong>On my website &#x2013;&#xA0;<a href="http://pragmaticengineer.com/?ref=blog.pragmaticengineer.com">pragmaticengineer.com</a>&#xA0;&#x2013; I have a testimonials section, which displays real LinkedIn and X posts about this publication. It cost $120/year for a small service called&#xA0;<a href="https://shoutout.io/?ref=blog.pragmaticengineer.com">Shoutout.io</a>, and looked like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/01/image.png" class="kg-image" alt loading="lazy" width="1390" height="1120" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/01/image.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/01/image.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/01/image.png 1390w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Testimonials, nicely collected and rendered by Shoutout</em></i></figcaption></figure><p>And this is the backend: nothing fancy, just a way to add, edit, reorganize, and delete testimonials.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/01/image-1.png" class="kg-image" alt loading="lazy" width="1456" height="922" 
srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/01/image-1.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/01/image-1.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/01/image-1.png 1456w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Shoutout&#x2019;s admin interface</span></figcaption></figure><p>I was a customer for four years and logged in perhaps once a year. My latest login was to get an annual invoice for my expenses. Unfortunately, the billing section was broken, so I emailed support and they sent me a broken link instead of the invoice. This was frustrating: why pay for a SaaS with broken billing? I couldn&#x2019;t even tell what they would charge me next year.</p><p><strong>So I asked myself if I could rebuild my own use case with an LLM, and do it rapidly.&#xA0;</strong>My use case was much simpler than the SaaS itself:</p><ul><li>Display existing testimonials in a similar way</li><li>Make it easy to add new ones, e.g., store testimonials in some JSON format</li><li>Make it look good</li></ul><p>To my surprise, this whole effort from start to finish took exactly 20 minutes with Codex. 
The steps I took were straightforward enough:</p><ul><li>Asked Codex to plan how to remove this third-party dependency and host all testimonials in my codebase (a GitHub repo, deployed onto Netlify)</li><li>Tweaked the plan: I pushed for a modular approach where testimonials are in a separate JSON file, and they get generated into HTML with a compile-time build step</li><li>Added this build step both locally and as a build trigger on Netlify</li><li>Tested the solution</li><li>Tweaked the UX and generated a schema</li><li>Deployed it</li></ul><p>The end result is visually the same as before, except I no longer have a third-party dependency rendering all of this!</p><h3 id="what-does-this-mean-for-saas-products-and-software-engineers">What does this mean for SaaS products and software engineers?</h3><p>What it means for software engineers:</p><ul><li><strong>Devs are (probably) a lot more comfortable than regular users with using the command line for future updates.&#xA0;</strong>To add a future testimonial, I&#x2019;ll need to turn to my AI agent to insert it in my codebase, and I&#x2019;ll then need to verify that it looks good. This is not a big deal for me, but it might be a dealbreaker for someone not comfortable with verifying the code output of an LLM.</li><li><strong>It&#x2019;s a lot faster for a dev to &#x201C;port&#x201D; a SaaS than for anyone else.&#xA0;</strong>I first told Codex to copy the UI and it got things wrong because it tried to use a flexbox model. I had to tell it that this UI layout was not what I wanted, and then make the decision on which framework to use for the UI layout. A non-developer could probably figure all this out, but it would take longer.</li><li><strong>Honestly, it&#x2019;s fun and interesting to rewrite a third-party feature. I recommend it.&#xA0;</strong>Part of why I took on this project is that I expected it to be an interesting challenge. 
I thought the effort would be more than what it was, and I&#x2019;ve learned more about how well these tools work. I also used Codex in order to experience it more.</li></ul><p>What this could mean for SaaS software:</p><ul><li><strong>Rebuilding a SaaS still feels much harder than rebuilding&#xA0;<em>your specific</em>&#xA0;use case.&#xA0;</strong>I did not &#x201C;rebuild&#x201D; Shoutout in any way. Shoutout has 10x or more features, like adding quotes from 10 different platforms, authentication, billing (which didn&#x2019;t work for me), and more.</li><li><strong>A SaaS that doesn&#x2019;t give ongoing value is at risk of being replaced by customers.&#xA0;</strong>Shoutout doesn&#x2019;t provide ongoing value after it displays my testimonials, and this static nature means it&#x2019;s easy to replace. In contrast, it would be harder to rebuild if I paid for the platform to stay compliant, provide analytics or alerting, and do other real-time things that helped my business.</li><li><strong>Buying and selling SaaS businesses could become less profitable.&#xA0;</strong>The original version of Shoutout that I signed up for in 2021 was built in 2020 by an independent developer. In 2022, this developer&#xA0;<a href="https://www.indiehackers.com/post/my-startup-shoutout-has-been-acquired-0350ae659c?ref=blog.pragmaticengineer.com">sold this micro-SaaS</a>&#xA0;to a product studio. Then, in 2025, Shoutout&#xA0;<a href="https://x.com/davidsonkyle/status/1942207611006542317?s=20&amp;ref=blog.pragmaticengineer.com">was sold</a>&#xA0;again to new developers. From my point of view, nothing changed except that the billing system broke. I assume the buyers of this SaaS figured that revenue could keep rising with zero investment. 
But perhaps at some point that ceases to be true when people get fed up with a broken product and quit &#x2013; especially when doing so is cheaper.</li></ul><p><strong>&#x201C;Broken windows&#x201D; not being fixed is less acceptable than it used to be.&#xA0;</strong>My journey away from Shoutout began with its billing system being broken. For example, below is what I saw when I went to my billing section to see the invoices:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/01/image-2.png" class="kg-image" alt loading="lazy" width="1220" height="428" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2026/01/image-2.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2026/01/image-2.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2026/01/image-2.png 1220w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A trigger to quit: Billing had been broken since 2023 and was never fixed</span></figcaption></figure><p>As well as this, the customer support sent me a broken link in response to my email. That was enough for me to decide to replace this dependency, and I was surprised by how easy this was with an LLM and knowing what I wanted it to build.&#xA0;<em>By the time customer support sent me a working link two hours later, I had finished migrating off the SaaS.</em></p>]]></content:encoded></item><item><title><![CDATA[The grief when AI writes most of the code]]></title><description><![CDATA[When AI writes almost all code, what happens to software engineering? 
There is grief involved for us developers, that's for sure.]]></description><link>https://blog.pragmaticengineer.com/the-grief-when-ai-writes-most-of-the-code/</link><guid isPermaLink="false">695eab59af96490001536b9c</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Wed, 07 Jan 2026 18:53:57 GMT</pubDate><content:encoded><![CDATA[<p>I&#x2019;m coming to terms with the high probability that AI will write most of&#xA0;<em>my</em>&#xA0;code that I ship to prod, going forward. It already does it faster, and with results similar to what I&#x2019;d get by typing it out. For languages/frameworks I&#x2019;m less familiar with, it does a better job than me.</p><p>It feels like something valuable is being taken away, and suddenly. It took a&#xA0;<em>lot</em>&#xA0;of effort to get good at coding and to learn how to write code that works, to read and understand complex code, and to debug and fix code when it doesn&#x2019;t work as it should. I still remember how daunting my first &#x201C;real&#x201D; programming class was at university (learning C), how lost I felt on my first job with a complex codebase, and how it took years of practice, learning from other devs, books, and blogs, to get better at the craft. Once you&#x2019;re pretty good, you have something that&#x2019;s valuable and easy to validate by writing code that works!</p><p>Some of my best memories of building software are about coding. Being &#x201C;locked in&#x201D; and balancing several ideas while typing them out, being in the zone, then compiling the code, running it and seeing that &#x201C;<em>YES</em>&#x201D;,&#xA0;it worked as expected!</p><p>It&#x2019;s been a love-hate relationship, to be fair, based on the amount of focus needed to write complex code. 
Then there&#x2019;s all the conflicts that time estimates caused: time passes differently when you&#x2019;re locked in and working on a hard problem.</p><p>Now, all that looks like it will be history.</p><p>I wonder if I&#x2019;ll still get the same sense of satisfaction from the fact that writing complicated code is&#xA0;<em>hard</em>? Yes, AI is convenient, but there&#x2019;s also a loss.</p><p>Or perhaps with AI agents, being &#x201C;in the zone&#x201D; will shift to thinking about higher-level problems, while instructing more complex code to be written?</p><hr><p>This was a section from my analysis piece <a href="https://newsletter.pragmaticengineer.com/p/when-ai-writes-almost-all-code-what?ref=blog.pragmaticengineer.com" rel="noreferrer">When AI writes almost all code, what happens to software engineering?</a>. Read the full one <a href="https://newsletter.pragmaticengineer.com/p/when-ai-writes-almost-all-code-what?ref=blog.pragmaticengineer.com" rel="noreferrer">here</a>.</p>]]></content:encoded></item><item><title><![CDATA[The Pulse: Cloudflare’s latest outage proves dangers of global configuration changes (again)]]></title><description><![CDATA[Deja vu: a large Cloudflare outage caused by an instantly rolled-out global config change – two weeks after a similar problem]]></description><link>https://blog.pragmaticengineer.com/the-pulse-cloudflares-latest-outage/</link><guid isPermaLink="false">69443c5d272393000120055e</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Thu, 18 Dec 2025 17:44:21 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of four topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-156?ref=blog.pragmaticengineer.com" rel="noreferrer"><em><u>last week&#x2019;s The Pulse</u></em></a><em> issue. 
Full subscribers received the below article seven days ago. If you&#x2019;ve been forwarded this email, you can</em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com"><em> <u>subscribe here</u></em></a><em>.</em></p><p>A mere two weeks after <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-154?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Cloudflare suffered a major outage</a> and took down half the internet, the same thing has happened again. Last Friday, 5th December, thousands of sites went down or partially down once more, in a global Cloudflare outage lasting 25 minutes.</p><p>As per last time, Cloudflare was speedy to share <a href="https://blog.cloudflare.com/5-december-2025-outage/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">a full postmortem</a> on the same day. It estimated that 28% of Cloudflare&#x2019;s HTTP traffic was impacted. The cause of this latest outage was Cloudflare making a seemingly innocent &#x2013; but <em>global</em> &#x2013; configuration change that went on to take out a good portion of Cloudflare, <em>globally</em>, until being reverted. Here&#x2019;s what happened:</p><ul><li>Cloudflare was rolling out a fix for a nasty React security vulnerability</li><li>The fix caused an error in an internal testing tool</li><li>The Cloudflare team disabled the testing tool with a global killswitch</li><li>As this global configuration change was made, the killswitch unexpectedly caused a bug that resulted in HTTP 500 errors across Cloudflare&#x2019;s network</li></ul><p><strong>In this latest outage, Cloudflare was burnt by yet another global configuration change. </strong>The previous outage <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-154?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">in November</a> happened thanks to a global database permissions change. 
In the postmortem of that incident, the Cloudflare team closed with this action item:</p><blockquote>&#x201C;Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input&#x201D;</blockquote><p>This change would mean that Cloudflare&#x2019;s configuration files no longer propagate immediately to the full network, as they still do now. But giving <em>all</em> global configuration files staged rollouts is a large piece of work that could take months. Evidently, there wasn&#x2019;t time to do it yet, and this has come back to bite Cloudflare.</p><p>Unfortunately for Cloudflare, customers are likely to find a second outage, with similar causes to one only weeks earlier, unacceptable. If Cloudflare proves unreliable, customers should plan to onboard to <em>backup</em> CDNs at the very least, and a backup CDN vendor will do its best to convince new customers to use it as the primary CDN.</p><p>Cloudflare&#x2019;s value-add rests on rock-solid reliability without customers needing to budget for a backup CDN. Yes, publishing postmortems on the same day as an outage helps restore trust, but that trust will crumble anyway with repeated large outages.</p><p><strong>To be fair, the company is doubling down on implementing staged configuration rollouts. </strong>In its postmortem, Cloudflare is its own biggest critic. CTO Dane Knecht <a href="https://blog.cloudflare.com/5-december-2025-outage/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">reflected</a>:</p><blockquote>&#x201C;[Global configuration changes rolling out globally] remains our first priority across the organization. 
In particular, the projects outlined below should help contain the impact of these kinds of changes:<br><br><strong>Enhanced Rollouts &amp; Versioning:</strong> Similar to how we slowly deploy software with strict health validation, data used for rapid threat response and general configuration needs to have the same safety and blast mitigation features. This includes health validation and quick rollback capabilities among other things.<br><br><strong>Streamlined break glass capabilities: </strong>Ensure that critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers.<br><br><strong>&#x201C;Fail-Open&#x201D; Error Handling: </strong>As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios. This will include drift-prevention capabilities to ensure this is enforced continuously.<br><br>These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours&#x201D;.</blockquote><h3 id="global-configuration-errors-often-trigger-large-outages">Global configuration errors often trigger large outages</h3><p>There&#x2019;s a pattern of implicit or explicit global configuration errors causing large outages, and some of the biggest ones in recent years were caused by a single change being rolled out to a whole network of machines:</p><ul><li><strong>DNS and DNS-related systems like BGP:</strong> DNS changes are global by default, so it&#x2019;s no wonder that DNS changes can cause global outages. 
Meta&#x2019;s <a href="https://en.wikipedia.org/wiki/2021_Facebook_outage?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">7-hour outage in 2021</a> was related to DNS changes (more specifically, Border Gateway Protocol changes). Meanwhile, the AWS outage in October <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-aws-takes-down-a-good-part?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">started with</a> the internal DNS system.</li><li><strong>OS updates happening at the same time, globally: </strong>Datadog&#x2019;s <a href="https://newsletter.pragmaticengineer.com/p/inside-the-datadog-outage?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">2023 outage</a> cost the company $5M and was caused by Datadog&#x2019;s Ubuntu machines executing an OS update within the same time window, globally. It caused issues with networking, and it didn&#x2019;t help that Datadog ran its infra on 3 different cloud providers across 3 networks. The same kind of Ubuntu update also <a href="https://newsletter.pragmaticengineer.com/p/why-reliability-is-hard-at-scale?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">caused a global outage</a> for Heroku in 2024.</li></ul><p><strong>Globally replicating configs: </strong><a href="https://newsletter.pragmaticengineer.com/i/168964142/google-cloud-globally-replicating-a-config-triggers-worldwide-outage?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">In 2024</a>, a configuration policy change at Google Cloud was rolled out globally and crashed every Spanner database node straight away. 
As Google concluded in <a href="https://newsletter.pragmaticengineer.com/i/168964142/google-cloud-globally-replicating-a-config-triggers-worldwide-outage?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">its postmortem</a>: &#x201C;Given the global nature of quota management, this metadata was replicated globally within seconds&#x201D;.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/12/image.png" class="kg-image" alt loading="lazy" width="1456" height="970" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2025/12/image.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2025/12/image.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/12/image.png 1456w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Step 2 &#x2013; replicating a configuration file globally across GCP &#x2013; </em></i><a href="https://newsletter.pragmaticengineer.com/i/168964142/google-cloud-globally-replicating-a-config-triggers-worldwide-outage?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">caused a global outage</em></i></a><i><em class="italic" style="white-space: pre-wrap;"> in 2024</em></i></figcaption></figure><p>Implementing gradual rollouts for <em>all</em> configuration files is a <em>lot</em> of work. It&#x2019;s also invisible labor: when done well, its only visible benefit is the absence of incidents!</p><p><strong>The largest systems in the world will likely have to implement safer ways to roll out configs &#x2013; but not everybody needs to. 
</strong>Staged configuration rollout doesn&#x2019;t make much sense for smaller companies and products because this infra work slows down product development.</p><p>It doesn&#x2019;t just slow down building, but every deployment, too, and this friction is deliberate: it is designed to slow changes down. As such, staged rollouts don&#x2019;t make much sense unless the stability of mature systems is more important than fast iteration.</p><p>Software engineering is a field where tradeoffs are a fact of life, and universal solutions don&#x2019;t exist. The development approach that worked a year ago for a system with 1/100th of the load and users may not make sense today.</p><p><em>This was one out of the four topics covered in this week&#x2019;s The Pulse. </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-156?ref=blog.pragmaticengineer.com" rel="noreferrer"><em>The full edition</em></a><em> additionally covers:</em></p><ol><li><strong>Industry Pulse.&#xA0;</strong>Poor capacity planning at AWS, Meta moves to a &#x201C;closed AI&#x201D; approach, a looming RAM shortage, early-stage startups hiring slower than before, how long it takes to earn $600K at Amazon and Meta, Apple loses execs to Meta, and more</li><li><strong>How the engineering team at Oxide uses LLMs.&#xA0;</strong>They find LLMs great for reading documents and lightweight research, mixed for coding and code review, and a poor choice for writing documents &#x2013; or any kind of writing, really!</li><li><strong>Linux officially supports Rust in the kernel.&#xA0;</strong>Rust is now a first-class language inside the Linux kernel, eight months after a Linux Foundation Fellow&#xA0;<a href="https://newsletter.pragmaticengineer.com/p/how-linux-is-built-with-greg-kroah?ref=blog.pragmaticengineer.com">predicted</a>&#xA0;more support for Rust. 
A summary of the pros and cons of Rust support for Linux</li></ol><p><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-156?ref=blog.pragmaticengineer.com" rel="noreferrer"><strong>Read the full The Pulse issue</strong></a><strong>.</strong></p>]]></content:encoded></item><item><title><![CDATA[The Pulse: Could a 5-day RTO be around the corner for Big Tech?]]></title><description><![CDATA[From next February, workers at Instagram must be in the office, five days a week. This makes Meta the second tech giant after Amazon to mandate a 5-day RTO. Will more big companies do the same?]]></description><link>https://blog.pragmaticengineer.com/the-pulse-could-a-5-day-rto-be-around-the-corner-for-big-tech/</link><guid isPermaLink="false">693b1247dd0e8a0001c79f46</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Sat, 13 Dec 2025 15:21:25 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of four topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-155?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>last week&#x2019;s The Pulse</em></a><em> issue. Full subscribers received the below article seven days ago. To get articles like this in your inbox, every week, </em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>subscribe here</em></a><em>.</em></p><hr><p>A year ago, Amazon became the first tech giant to bring staff back into the office for the full five days per week. 
Back then, I&#xA0;<a href="https://newsletter.pragmaticengineer.com/i/149104874/what-does-amazons-day-rto-mean-for-tech?ref=blog.pragmaticengineer.com">analyzed</a>&#xA0;the reasons for the change, and whether other workplaces would follow suit by dropping the widespread hybrid policy of 2-3 days/week in the office.</p><p>Now, Meta employees in the Instagram division have become the latest subjects of a full return to the office, following an announcement by the social media platform this week.</p><h3 id="instagram%E2%80%99s-5-day-return-to-office">Instagram&#x2019;s 5-day return to office</h3><p>Instagram employees&#xA0;<a href="https://sources.news/p/instagrams-return-to-office-mandate?ref=blog.pragmaticengineer.com">received the unexpected email on Monday</a>, reports fellow Substacker, Alex Heath, who acquired a copy of the message. It was sent internally by Instagram CEO Adam Mosseri, who wrote:</p><blockquote>&#x201C;<strong>1. Back to the office:</strong>&#xA0;I believe that we are more creative and collaborative when we are together in-person. (...)<br><br><strong>2. Fewer meetings:</strong>&#xA0;We all spend too much time in meetings that are not effective, and it&#x2019;s slowing us down. Every six months, we&#x2019;ll cancel all recurring meetings and only re-add the ones that are absolutely necessary (...)<br><br><strong>3. More demos, less decks:</strong>&#xA0;Most product overviews should be prototypes instead of decks.<br><br><strong>4. 
Faster decision-making:</strong>&#xA0;We&#x2019;re going to have a more formalized unblocking process with DRIs, and I&#x2019;ll be at the priorities progress unblocking meeting every week.&#x201D;</blockquote><p>This decision by Meta affects around a quarter of company staff, and it&#x2019;s hard to imagine other divisions not following Instagram&#x2019;s lead; after all, everything in Mosseri&#x2019;s memo likely applies across the business.</p><p>Five years ago, CEO Mark Zuckerberg predicted 50% of Meta staff would work remotely by now, which didn&#x2019;t happen. Indeed, with Instagram&#x2019;s new 5-day RTO, I&#x2019;d be surprised if 5% of Meta folks work remotely in two years&#x2019; time.</p><p><strong>The reason for Insta&#x2019;s RTO seems rooted in the leadership&#x2019;s belief that in-office is more productive,&#xA0;</strong>as indicated by the top bullet point of Mosseri&#x2019;s message. That message in full:</p><p>&#x201C;I believe that we are more creative and collaborative when we are together in-person. I felt this pre-COVID and I feel it any time I go to our New York office where the in-person culture is strong.</p><p>Starting February 2, I&#x2019;m asking everyone in my rollup based in a US office with assigned desks to come back full time (five days a week). The specifics:</p><ul><li>You&#x2019;ll still have the flexibility to work from home when you need to, since I recognize there will be times you won&#x2019;t be able to come into the office. I trust you all to use your best judgment in figuring out how to adapt to this schedule.</li><li>In the NY office, we won&#x2019;t expect you to come back full time until we&#x2019;ve alleviated the space constraints. We&#x2019;ll share more once we have a better sense of timeline.</li><li>In MPK [Menlo Park, the HQ], we&#x2019;ll move from MPK21 to MPK22 on January 26 so everyone has an assigned desk. 
We&#x2019;re also offering the option to transfer from the MPK to SF office for those people whose commute would be the same or better with that change. We&#x2019;ll reach out directly to those people with more info.</li><li>XFN [cross-functional] partners will continue to follow their own org norms.</li><li>There is no change for employees who are currently remote&#x201D;.</li></ul><p>From what I&#x2019;ve seen of Mosseri from afar, he seems like a pretty straight shooter. It&#x2019;s clear that he feels in-office creates more energy, and in Mosseri&#x2019;s defense, I hear similar from many startup founders and leaders who say remote work causes a bunch of headaches: it&#x2019;s harder to spot motivational problems and performance issues, information travels more slowly, and rallying teams is harder.</p><p><strong>There&#x2019;s no doubt that running a full-remote company is a lot of effort.&#xA0;</strong>There&#x2019;s often-overlooked labor involved in hiring, onboarding, performance management, team celebrations, and even company-wide meetings &#x2013; none of it is easy.</p><p>Linear is a full-remote company with nearly 50 people working there, which&#xA0;<a href="https://linear.app/now/designing-remote-work-at-linear?ref=blog.pragmaticengineer.com">recently published details about how it operates</a>. They&#x2019;re introducing the concept of &#x201C;coworking hubs&#x201D;, flying in teams for in-person events, and holding regular off-sites, while being careful to hire people who fit the culture.</p><p><strong>My feeling is that remote work policies at tech companies are going to become questions of their leaders&#x2019; preferences.&#xA0;</strong>Many devs prefer remote work: there&#x2019;s fewer interruptions, more deep focus, and less commuting. 
Most of us would probably be at least as productive working remotely as when being interrupted in-office &#x2013; and probably more so.</p><p>Leaders who prefer full-remote can cite flexibility and easier hiring from a larger pool of candidates as clear benefits. Meanwhile, those most comfortable with in-person will always have enough reasons to justify a 5-day RTO, along the lines of Mosseri&#x2019;s reasoning. Advocates of hybrid setups cite the balance of focus time and efficiency.</p><p>In today&#x2019;s job market, any company that pays closer to the top of the market can probably get away with five-days-a-week RTO. Meta is in this space, and although I&#x2019;m sure plenty of devs will dislike the change, the alternative is to go out on the job market, accept a pay cut to join a new company, and start rebuilding your internal network.</p><p>Since we&#x2019;re in the&#xA0;<a href="https://newsletter.pragmaticengineer.com/p/state-of-the-tech-market-in-2025?ref=blog.pragmaticengineer.com">midst of a weird job market</a>, switching jobs is more difficult than before, when the job market was very hot. In this respect, Instagram has external conditions on its side. For devs at Meta, one upside is that Big Tech experience&#xA0;<a href="https://newsletter.pragmaticengineer.com/p/tech-jobs-market-2025-part-3?ref=blog.pragmaticengineer.com">opens more doors</a>, even in this tough job market.</p><p>One caveat is that a 5-day RTO is unlikely in places where it&#x2019;s hard to hire the right people. So, AI engineers and those working on AI products should be pretty safe, for instance, because those roles are&#xA0;<a href="https://newsletter.pragmaticengineer.com/i/172584839/ai-engineering-trends?ref=blog.pragmaticengineer.com">incredibly in demand</a>, as indicated by the&#xA0;<a href="https://newsletter.pragmaticengineer.com/i/165280420/new-trend-higher-base-salaries-for-ai-engineers?ref=blog.pragmaticengineer.com">trend of higher base salaries for AI engineers</a>. 
Based on that, few companies should want to push those workers to quit to join competitors.</p><p></p><p><em>Many subscribers expense this newsletter to their learning and development budget. If you have such a budget, here&#x2019;s</em><a href="https://blog.pragmaticengineer.com/request-to-expense-the-pragmatic-engineer-newsletter/" rel="noopener noreferrer nofollow"><em> an email you could send to your manager</em></a><em>.</em></p>]]></content:encoded></item><item><title><![CDATA[Downdetector and the real cost of no upstream dependencies]]></title><description><![CDATA[During the Cloudflare outage, Downdetector was also unavailable. I got details from the team about why they have a hard dependency on Cloudflare, and why that won’t change anytime soon.]]></description><link>https://blog.pragmaticengineer.com/downdetector-and-the-real-cost-of-no-upstream-dependencies/</link><guid isPermaLink="false">6932a20b097ffa00013da35c</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Fri, 05 Dec 2025 09:14:50 GMT</pubDate><content:encoded><![CDATA[<p><em>The below is one out of five topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-154?ref=blog.pragmaticengineer.com" rel="noreferrer"><em>The Pulse #154.</em></a><em> Full subscribers received the below article two weeks ago. To get articles like this in your inbox, every week, </em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com"><em><u>subscribe here</u></em></a><em>.</em></p><p><em>Many subscribers expense The Pragmatic Engineer Newsletter to their learning and development budget. 
If you have such a budget, here&#x2019;s</em><a href="https://blog.pragmaticengineer.com/request-to-expense-the-pragmatic-engineer-newsletter/"><em><u> an email you could send to your manager</u></em></a><em>.</em></p><hr><p>One amusing detail of the <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-154?ref=blog.pragmaticengineer.com" rel="noreferrer">November 2025 Cloudflare outage</a> is that the realtime outage and monitoring service, Downdetector, went down, revealing a key dependency on Cloudflare. At first, this looks odd; after all, Downdetector is about monitoring uptime, so why would it take on a key dependency like Cloudflare if it means this can happen?</p><p><strong>Downdetector was built multi-region and multi-cloud,</strong>&#xA0;which<strong>&#xA0;</strong>I confirmed by talking with Senior Director of Engineering,&#xA0;<a href="https://x.com/damndhruv?ref=blog.pragmaticengineer.com">Dhruv Arora</a>, at Ookla, the company behind Downdetector. Multi-cloud resilience makes little sense for most products, but Downdetector was built to detect cloud provider outages, as well. And for this, they needed to be multi-cloud!</p><p>Still, Downdetector uses Cloudflare for DNS, Content Delivery (CDN), and Bot Protection. So, why would it take on this one key dependency, as opposed to hosting everything on its own servers?</p><p><strong>A CDN has advantages that are hard to ignore,&#xA0;</strong>such as:</p><ul><li>Drastically lower bandwidth costs &#x2013; assets cached on the CDN are much faster</li><li>Faster load times because assets on a CDN are served from Edge nodes nearer users</li><li>Protection from sudden traffic spikes, as would be common for Downdetector, especially during outages! 
Without a CDN, those spikes could overload their services</li><li>DDoS protection from bad actors taking the site offline with a distributed denial of service attack</li><li>Reduced infrastructure requirements, as Downdetector can run on fewer servers</li></ul><p>Downdetector&#x2019;s usage patterns reflect that it&#x2019;s a service very heavily used by consumers whom the business doesn&#x2019;t really monetize (Downdetector is free to use). So, Downdetector could get rid of Cloudflare, but costs would surge, the site would become slower to load, and revenue wouldn&#x2019;t change.</p><p>In the end, Downdetector&#x2019;s dependence on Cloudflare could be a pragmatic choice based on the business model, and on how expensive removing its upstream dependency on Cloudflare could get!</p><p>Dhruv confirmed this, and shared more about the design choices at Downdetector:</p><blockquote>&#x201C;<strong>Building redundancy at the DNS &amp; CDN layers would require enormous overhead.</strong>&#xA0;This is especially true as Cloudflare&#x2019;s Bot Protection is world-class, and building similar functionality would be a lot of effort. There are hyperscalers [cloud providers] that have this kind of redundancy built in. We will look into what we can do, but with a team size in the double digits, building up a core piece of infra like this is a pretty tall order: not just for us, but for any mid-sized team.<br><br>We&#x2019;ve learned that there are more things that we can improve, for the future. For example, during the outage, the Cloudflare control pane was down, but their API wasn&#x2019;t. So, us having more Infrastructure as Code could have helped bring back Downdetector sooner.<br><br>On our end, we also noticed that the outage wasn&#x2019;t global, so we were able to shift traffic around and reduce the impact.<br><br>One more interesting detail: Cloudflare&#x2019;s Bot Protection went haywire during the outage, and started to block legitimate traffic. 
So, our team had to turn that off temporarily&#x201D;.</blockquote><p>Thanks very much to Dhruv and the Downdetector team for sharing details.</p>]]></content:encoded></item><item><title><![CDATA[A startup in Mongolia translated my book]]></title><description><![CDATA[A 30-person startup called Nasha Tech translated The Software Engineer's Guidebook for the benefit of their company and the Mongolian tech ecosystem.]]></description><link>https://blog.pragmaticengineer.com/traveling-to-mongolia/</link><guid isPermaLink="false">69206cafc3b7150001d419bf</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Fri, 21 Nov 2025 13:47:17 GMT</pubDate><content:encoded><![CDATA[<p>I published <a href="https://www.engguidebook.com/?ref=blog.pragmaticengineer.com" rel="noreferrer">The Software Engineer&apos;s Guidebook</a> two years ago. <em> I shared more details on how I self-published the book, and the learnings from publishing </em><a href="https://newsletter.pragmaticengineer.com/p/the-software-engineers-guidebook?ref=blog.pragmaticengineer.com" rel="noreferrer"><em>in this post.</em></a></p><p>An unexpected highlight of publishing the book was ending up in Mongolia in June of this year, at a small-but-mighty startup called <a href="https://nashatech.com/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Nasha Tech</a>. This was because the startup translated my book into Mongolian. 
Here&apos;s the completed book:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/Screenshot-2025-11-21-at-15.34.01.png" class="kg-image" alt loading="lazy" width="1078" height="1292" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2025/11/Screenshot-2025-11-21-at-15.34.01.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2025/11/Screenshot-2025-11-21-at-15.34.01.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/Screenshot-2025-11-21-at-15.34.01.png 1078w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Software Engineer&apos;s Guidebook, in Mongolian. You can </span><a href="https://internom.mn/%D0%B1%D0%B0%D1%80%D0%B0%D0%B0/9789919053185-%D1%81%D0%BE%D1%84%D1%82%D0%B2%D1%8D%D0%B9%D1%80-%D0%B8%D0%BD%D0%B6%D0%B5%D0%BD%D0%B5%D1%80%D0%B8%D0%B9%D0%BD-%D1%85%D3%A9%D1%82%D3%A9%D1%87-%D0%BD%D0%BE%D0%BC?ref=blog.pragmaticengineer.com" rel="noreferrer"><span style="white-space: pre-wrap;">buy this translation here</span></a></figcaption></figure><p>Here&#x2019;s what happened:</p><p>A little over a year ago, a small startup from Mongolia reached out, asking if they could translate the book. I was skeptical it would happen because the unit economics appeared pretty unfavorable. Mongolia&#x2019;s population is 3.5 million; much smaller than other countries where professional publishers had offered to do a translation (Taiwan: 23M, South Korea: 51M, Germany: 84M, Japan: 122M, China: 1.43B people).</p><p>But I agreed to the initiative, and expected to hear nothing back. To my surprise, nine months later the translation was ready, and the startup printed 500 copies on the first run. 
They invited me to a book signing in the capital city of Ulaanbaatar, and soon I was on my way to meet the team, and to understand why a small tech company translated my book!</p><h3 id="japanese-startup-vibes-in-mongolia">Japanese startup vibes in Mongolia</h3><p>The startup behind the translation is called <a href="https://nashatech.com/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Nasha Tech</a>; a mix of a startup and a digital agency. Founded in 2018, its main business has been agency work, mainly for companies in Japan. They are a group of 30 people, mostly software engineers.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/image-1.png" class="kg-image" alt loading="lazy" width="1086" height="1264" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2025/11/image-1.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2025/11/image-1.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/image-1.png 1086w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Nasha Tech&#x2019;s offices in Ulaanbaatar, Mongolia</span></figcaption></figure><p>Their offices resembled a mansion more than a typical workplace, and everyone takes their shoes off when arriving at work and switches to &#x201C;office slippers&#x201D;. I encountered the same vibe later <a href="https://newsletter.pragmaticengineer.com/i/177384640/cursor-push-for-release?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">at Cursor&#x2019;s headquarters in San Francisco</a>, in the US.</p><p>Nasha Tech found a niche of working for Japanese companies thanks to one of its cofounders studying in Japan, and building up connections while there. 
Interestingly, another cofounder later moved to Silicon Valley, and advises the company from afar.</p><p><strong>The business builds the &#x201C;Uber Eats of Mongolia&#x201D;. </strong>Outside of working as an agency, Nasha Tech builds its own products. The most notable is called TokTok, the &#x201C;UberEats of Mongolia&#x201D;, which is the leading food delivery app in the capital city. The only difference between TokTok and other food delivery apps is scale: the local market is smaller than in some other cities. At a few thousand orders per day, it might not be worthwhile for an international player like Uber or Deliveroo to enter the market.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/image-2.png" class="kg-image" alt loading="lazy" width="1456" height="646" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2025/11/image-2.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2025/11/image-2.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/image-2.png 1456w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">The </em></i><a href="https://www.toktok.mn/?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">TokTok</em></i></a><i><em class="italic" style="white-space: pre-wrap;"> app: a customer base of 800K, 500 restaurants, and 400 delivery riders</em></i></figcaption></figure><p>The tech stack Nasha Tech typically uses:</p><ul><li>Frontend: React / Next, Vue / Nuxt, TypeScript, Electron, Tailwind, Element UI</li><li>Backend and API: NodeJS (Express, Hono, Deno, NestJS), Python (FastAPI, Flask), Ruby on Rails, PHP (Laravel), GraphQL, Socket, 
Recoil</li><li>Mobile: Flutter, React Native, Fastlane</li><li>Infra: AWS, GCP, Docker, Kubernetes, Terraform</li><li>AI &amp; ML: GCP Vertex, AWS Bedrock, Elasticsearch, LangChain, Langfuse</li></ul><p>AI tools are widely used, and today the team uses Cursor, GitHub Copilot, Claude Code, OpenAI Codex, and Junie by JetBrains.</p><p><strong>I detected very few differences between Nasha Tech and other &#x201C;typical&#x201D; startups I&#x2019;ve visited, in terms of the vibe and tech stack. </strong>Devs working on TokTok were very passionate about how to improve the app and reduce the tech debt accumulated by prioritizing the launch. A difference for me was the language and target market: the main language in the office is, obviously, Mongolian, and the products they build, like TokTok, also target the Mongolian market, or the Japanese one when working with clients.</p><p>One thing I learned was that awareness about the latest tools has no borders: back in June, a dev at Nasha Tech was already telling me that Claude Code was their daily driver, even though the tool had been released for barely a month at that point!</p><h3 id="why-translate-the-book-into-mongolian">Why translate the book into Mongolian?</h3><p>Nasha Tech was the only non-publisher to express interest in translating the book. But why did they do it?</p><p>I was told the idea came from software engineer <a href="https://x.com/ssuuribaatar?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Suuribaatar Sainjargal</a>, who bought and enjoyed the English-language version. He <a href="https://x.com/GergelyOrosz/status/1937160382600343964?s=20&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">suggested</a> translating the book so that everyone at the company could read it, not only those fluent in English.</p><p>Nasha Tech actually had some in-house experience of translation. 
A year earlier, in 2024, the company translated Matt Mochary&#x2019;s <a href="https://www.amazon.com/Great-CEO-Within-Tactical-Building-ebook/dp/B07ZLGQZYC?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">The Great CEO Within</a> as a way to uplevel their leadership team, and to help the broader Mongolian tech ecosystem.</p><p>Also, the company&#x2019;s General Manager, <a href="https://www.linkedin.com/in/battsengel/?originalSubdomain=mn&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Batutsengel Davaa</a>, happened to have been involved in translating more than 10 books in a previous role. He took the lead in organizing this work, and here&#x2019;s how the timelines played out:</p><ul><li>Professional translator: 3 months</li><li>Technical editor revising the draft translation: 1 month</li><li>Technical editing #2 by a Support Engineer in Japan: 2 months</li><li>Technical revision: 15 engineers at Nasha Tech revised the book, with a &#x201C;divide and conquer&#x201D; approach: 2 months</li><li>Final edit and print: 1 month</li></ul><p>This was a real team effort. Somehow, this startup managed to produce a high-quality translation in around the same time as it took professional book publishers in my part of the world to do the same!</p><p>A secondary goal that Nasha Tech had was to advance the tech ecosystem in Mongolia. There&#x2019;s understandably high demand for books in the mother tongue; I observed a number of book stands selling these books, and book fairs are also popular. 
The translation of my book has been selling well, where you can <a href="https://internom.mn/%D0%B1%D0%B0%D1%80%D0%B0%D0%B0/9789919053185-%D1%81%D0%BE%D1%84%D1%82%D0%B2%D1%8D%D0%B9%D1%80-%D0%B8%D0%BD%D0%B6%D0%B5%D0%BD%D0%B5%D1%80%D0%B8%D0%B9%D0%BD-%D1%85%D3%A9%D1%82%D3%A9%D1%87-%D0%BD%D0%BE%D0%BC?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">buy the book</a> for 70,000 MNTs (~$19).</p><h3 id="book-signing-and-the-mongolian-startup-scene">Book signing and the Mongolian startup scene</h3><p>The book launch event was at Mongolia&#x2019;s startup hub, called <a href="https://digitalnomad.itpark.mn/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">IT Park</a>, which offers space for startups to operate in. I met a few working in the AI and fintech spaces &#x2013; and even one startup producing comics.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/image-3.png" class="kg-image" alt loading="lazy" width="1378" height="1184" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2025/11/image-3.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2025/11/image-3.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/image-3.png 1378w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Book launch event, and meeting startups inside Mongolia&#x2019;s IT Park</span></figcaption></figure><p>I had the impression that the government and private sector are investing heavily in startups, and want to help more companies to become breakout success stories:</p><ul><li><a href="https://digitalnomad.itpark.mn/ds_in_mongolia?ref=blog.pragmaticengineer.com#ds" rel="noopener noreferrer nofollow">IT Park report</a>: the country&#x2019;s tech sector is growing 
~20%, year-on-year. The <em>combined</em> valuation of all startups in Mongolia is $130M today.<em> It&#x2019;s worth remembering that location is important for startups: being in hubs like the US, UK, and India confers advantages that can be reflected in valuations.</em></li><li><a href="https://www.jica.go.jp/overseas/mongolia/sjp04ove1698/__icsFiles/afieldfile/2024/08/28/Summary.pdf?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Mongolian Startup Ecosystem Report 2023</a>: the average pre-seed valuation of a startup in Mongolia is $170K, seed valuation is $330K, and Series A valuation is $870K. The numbers reflect market size; for savvy investors, this could also be an opportunity to invest early. I met a Staff Software Engineer at the book signing event who works in Silicon Valley at Google, and invests in and advises startups in Mongolia.</li><li><a href="https://drive.google.com/file/d/1Ath-eOMd4Kr924cq1AkgLekfeJlXCBfd/view?usp=sharing&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Mongolian startup ecosystem Map</a>: better-known startups in the country.</li></ul><p>Two promising startups from Mongolia: <a href="https://chimege.com/en/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Chimege</a> (an AI+voice startup) and <a href="https://and.global/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">AND Global</a> (fintech). Thanks very much to the <a href="https://nashatech.com/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Nasha Tech team</a> for translating the book &#x2013; keep up the great work!</p>]]></content:encoded></item><item><title><![CDATA[The Pulse: Cloudflare takes down half the internet – but shares a great postmortem]]></title><description><![CDATA[A database permissions change ended up knocking Cloudflare’s proxy offline. Pinpointing the root cause was tricky – but Cloudflare shared a detailed postmortem. 
Also: announcing The Pragmatic Summit]]></description><link>https://blog.pragmaticengineer.com/the-pulse-cloudflare-takes-down-half-the-internet/</link><guid isPermaLink="false">691f7b63e9904f00015006db</guid><dc:creator><![CDATA[Gergely Orosz]]></dc:creator><pubDate>Thu, 20 Nov 2025 20:36:19 GMT</pubDate><content:encoded><![CDATA[<p><em>Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of five topics from </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-154?ref=blog.pragmaticengineer.com"><em>this week&#x2019;s The Pulse</em></a><em> issue. Full subscribers received the below article seven days ago. To get articles like this in your inbox, every week, </em><a href="https://newsletter.pragmaticengineer.com/about?ref=blog.pragmaticengineer.com"><em>subscribe here</em></a><em>.</em></p><p><em>Many subscribers expense this newsletter to their learning and development budget. If you have such a budget, here&#x2019;s</em><a href="https://blog.pragmaticengineer.com/request-to-expense-the-pragmatic-engineer-newsletter/"><em> an email you could send to your manager</em></a><em>.</em></p><hr><p>Before we start: I&#x2019;m excited to share something new: <strong>The Pragmatic Summit.</strong></p><p>Four years ago, The Pragmatic Engineer started as a small newsletter: me writing about topics relevant for engineers and engineering leaders at Big Tech and startups. 
Fast forward to today, and the newsletter <a href="https://newsletter.pragmaticengineer.com/p/one-million?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">crossed one million readers</a>, and the publication expanded with <a href="https://newsletter.pragmaticengineer.com/podcast?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">a podcast</a> as well.</p><p>One thing that was always missing: meeting in person. Engineers, leaders, founders&#x2014;people who want to meet others in this community, and learn from each other. Until now that is:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/TPS_Social_RegLive_1200x627_110625.png" class="kg-image" alt loading="lazy" width="1200" height="627" srcset="https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w600/2025/11/TPS_Social_RegLive_1200x627_110625.png 600w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/size/w1000/2025/11/TPS_Social_RegLive_1200x627_110625.png 1000w, https://storage.ghost.io/c/39/f8/39f85cc7-8637-40fc-a57c-f45754453717/content/images/2025/11/TPS_Social_RegLive_1200x627_110625.png 1200w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The Pragmatic Summit. 
</span><a href="https://www.pragmaticsummit.com/?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><span style="white-space: pre-wrap;">See more details and apply to attend</span></a></figcaption></figure><p>In partnership with <a href="http://statsig.com/pragmatic?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">Statsig</a>, I&#x2019;m hosting the first-ever <a href="https://www.pragmaticsummit.com/?utm_source=the-pragmatic-engineer&amp;utm_medium=newsletter&amp;utm_campaign=nov-20-paid-edition" rel="noopener noreferrer nofollow"><strong>Pragmatic Summit</strong></a>. Seats are limited, and tickets are priced at $499, covering the venue, meals, and production &#x2013; we&#x2019;re not aiming to make any profit from this event.</p><p><a href="https://www.pragmaticsummit.com/?ref=blog.pragmaticengineer.com">Apply to attend the Summit</a></p><p>I hope to see many of you there!</p><hr><h2 id="cloudflare-takes-down-half-the-internet-%E2%80%93-but-shares-a-great-postmortem">Cloudflare takes down half the internet &#x2013; but shares a great postmortem</h2><p>On Tuesday came another reminder about how much of the internet depends on Cloudflare&#x2019;s content delivery network (CDN), when thousands of sites went fully or partially offline in an outage that lasted 6 hours. Some of the higher-profile victims included:</p><ul><li>ChatGPT and Claude</li><li>Canva, Dropbox, Spotify</li><li>Uber, Coinbase, Zoom</li><li>X and Reddit</li></ul><p>Separately, you may or may not recall that during a different recent outage caused by AWS, Elon Musk noted on his website, X, that AWS is a hard dependency for Signal, meaning an AWS outage could take down the secure messaging service at any moment. 
In response, a dev pointed out that it is the same for X with Cloudflare &#x2013; and so it proved earlier this week, when X was broken by the Cloudflare outage.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://substackcdn.com/image/fetch/$s_!IN2n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cfc94-1792-4a5e-8fb6-c1815df54ff0_1072x898.png" class="kg-image" alt loading="lazy" width="1072" height="898"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Predicting the future. Source: Mehul Mohan </em></i><a href="https://x.com/mehulmpt/status/1980382080602370144?s=20&amp;ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">on X</em></i></a></figcaption></figure><p>That AWS outage was in the company&#x2019;s us-east-1 region and <a href="https://newsletter.pragmaticengineer.com/p/the-pulse-aws-takes-down-a-good-part?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">took down a good part of the internet</a> last month. 
AWS released incident details three days later &#x2013; unusually speedy for the e-commerce giant &#x2013; although that postmortem was high-level and we never learned <em>exactly</em> what caused AWS&#x2019;s <a href="https://newsletter.pragmaticengineer.com/i/176934094/how-dynamodb-dns-management-happens?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">DNS Enactor</a> service to slow down, triggering an unexpected race condition that kicked off the outage.</p><h3 id="what-happened-this-time-with-cloudflare">What happened this time with Cloudflare?</h3><p>Within hours of mitigating the outage, Cloudflare&#x2019;s CEO Matthew Prince shared an <a href="https://blog.cloudflare.com/18-november-2025-outage/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">unusually detailed report </a>of what exactly went wrong. The root cause was to do with propagating a configuration file to Cloudflare&#x2019;s Bot Management module. The file crashed Bot Management, which took Cloudflare&#x2019;s proxy functionality offline.</p><p>Here&#x2019;s a brief overview of how Cloudflare&#x2019;s proxy layer works at a high level. It&#x2019;s the layer that protects the &#x201C;origin&#x201D; resources of customers &#x2013; minimizing network traffic to them by blocking malicious requests and caching static resources in Cloudflare&#x2019;s CDN:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://substackcdn.com/image/fetch/$s_!esOT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F132ad7a8-2c1d-4be1-8174-295941979ceb_1420x1312.png" class="kg-image" alt loading="lazy" width="1420" height="1312"><figcaption><i><em class="italic" style="white-space: pre-wrap;">How Cloudflare&#x2019;s proxy works. 
More details on </em></i><a href="https://blog.cloudflare.com/20-percent-internet-upgrade/?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">Cloudflare&#x2019;s engineering blog</em></i></a></figcaption></figure><p>Here&#x2019;s how the incident unfolded:</p><p><strong>A database permissions change in </strong><a href="https://en.wikipedia.org/wiki/ClickHouse?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>ClickHouse</strong></a><strong> kicked things off. </strong>Before the permissions changed, all queries to fetch feature metadata (to be used by the Bot Management module) ran only on distributed tables in ClickHouse, in a database called &#x201C;default&#x201D; which contains 60 features.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://substackcdn.com/image/fetch/$s_!NEwO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6f62c0a-5772-45a3-9be1-24e7c15c4e7b_1264x264.png" class="kg-image" alt loading="lazy" width="1264" height="264"><figcaption><span style="white-space: pre-wrap;">Before the permissions change: about 60 features were returned and fed to the Bot Management module</span></figcaption></figure><p>Until then, these queries ran under a shared system account. Cloudflare&#x2019;s engineering team wanted to improve system security and reliability, and move from this shared system account to individual user accounts. 
User accounts already had implicit access to another database called &#x201C;r0&#x201D;, so the team changed the database permissions to make access to r0 <em>explicit</em> instead of implicit.</p><p>As a side effect of this, the same query collecting the features to be passed to Bot Management started to fetch from the r0 database as well, and returned many more features than expected:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://substackcdn.com/image/fetch/$s_!p5bm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e62f91e-7078-4b9d-8e2f-3b3fb357aef5_1220x252.png" class="kg-image" alt loading="lazy" width="1220" height="252"><figcaption><span style="white-space: pre-wrap;">After the permissions change: the query did not change but returned twice as many results</span></figcaption></figure><p><strong>The Bot Management module does not allow loading of more than 200 features. </strong>This limit was well above the production usage of 60, and was put in place for performance reasons: the Bot Management module pre-allocates memory for up to 200 features, and it will not operate with more than this number.</p><p><strong>A </strong><a href="https://en.wikipedia.org/wiki/Kernel_panic?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>system panic</strong></a><strong> hit machines served with the incorrect feature file. 
</strong>Cloudflare was nice enough to share the exact code that caused this panic, which was this unwrap() call:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://substackcdn.com/image/fetch/$s_!qih4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8462b639-2c4c-4c8d-91b2-a468f97d7ee4_1606x666.png" class="kg-image" alt loading="lazy" width="1456" height="604"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Source: </em></i><a href="https://blog.cloudflare.com/18-november-2025-outage/?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">Cloudflare</em></i></a></figcaption></figure><p>What likely happened:</p><ul><li>The append_with_names() function likely checked for a limit of 200 features</li><li>If it saw more than 200 features, it likely returned an error</li><li>&#x2026; and when writing the code, it was not expected that append_with_names() would return an error&#x2026;</li><li>&#x2026; and so .unwrap() panicked and crashed the system!</li></ul><p><strong>Edge nodes started to crash, one by one, seemingly randomly. </strong>The feature file was regenerated every 5 minutes and gradually rolled out to Edge nodes. So, initially, it was only a few nodes that crashed, and then over time, more became non-responsive. At one point, both good and bad configuration files were being distributed, making failed nodes that received the good configuration file start working &#x2013; for a while!</p><h3 id="why-so-long-to-find-the-root-cause">Why so long to find the root cause?</h3><p>It took Cloudflare engineers unusually long &#x2013; 2.5 hours! &#x2013; to figure all this out and realize that an incorrect configuration file propagating to Edge servers was to blame for their proxy going down. 
Turns out, an unrelated failure made the Cloudflare team suspect that they were under a coordinated botnet attack, as when a few of the Edge nodes started to go offline, the company&#x2019;s status page did, too:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://substackcdn.com/image/fetch/$s_!Xa8F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F565ff3fa-112f-4500-940a-4f3f241991fd_1999x478.png" class="kg-image" alt loading="lazy" width="1456" height="348"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Cloudflare&#x2019;s status page went offline when the outage started. Source: </em></i><a href="https://blog.cloudflare.com/18-november-2025-outage/?ref=blog.pragmaticengineer.com" target="_blank" rel="noopener noreferrer nofollow"><i><em class="italic" style="white-space: pre-wrap;">Cloudflare</em></i></a></figcaption></figure><p>The team tried to gather details about the attack, but there was no attack, meaning they wasted time looking in the wrong place. In reality, the status page going down was a coincidence and unrelated to the outage. But it&#x2019;s easy to see why their first reaction was to figure out if there was a distributed denial of service (DDoS) attack.</p><p>As mentioned, it eventually took 2.5 hours to pinpoint the incorrect configuration files as the source of the outage, and another hour to stop the propagation of new files, and create a new and correct file, which was deployed 3.5 hours after the start of the incident. 
Cleanup took another 2.5 hours, and at 17:06 UTC, the outage was resolved, ~6 hours after it started.</p><p>Cloudflare shared a detailed review of the incident and learnings, which can be <a href="https://blog.cloudflare.com/18-november-2025-outage/?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">read here.</a></p><h3 id="how-did-the-postmortem-come-so-fast">How did the postmortem come so fast?</h3><p>One thing that remains surprising about Cloudflare is that it has a very detailed postmortem up less than 24 hours after an incident is resolved. Cofounder and CEO Matthew Prince <a href="https://news.ycombinator.com/user?id=eastdakota&amp;ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow">explained</a> how this was possible:</p><ul><li>Matthew was part of the outage call</li><li>After the outage was resolved, he wrote a first version of the incident review, at home. Matthew was in Lisbon, in Cloudflare&#x2019;s European HQ, so this was early evening</li><li>The team circulated a Google Doc with this initial writeup, and questions that needed to be reviewed</li><li>In a few hours, all questions were answered</li><li>Matthew: &#x201C;None of us were happy [about the incident] &#x2014; we were embarrassed by what had happened &#x2014; but we declared it [the postmortem] true and accurate.&#x201D;</li><li>The draft was sent over to the SF team, who did one more sweep, then posted it</li></ul><p>Talk about moving with the speed of a startup, despite being a publicly traded company!</p><h3 id="learnings">Learnings</h3><p>There is much to learn from this incident, such as:</p><p><strong>Be explicit about logging errors when you raise them! </strong>Cloudflare could probably have identified the root cause of this error much faster if the line of code that returned an error also logged the error, and if Cloudflare had alerts set up when certain errors spiked on its nodes. 
It could surely have shaved an hour or two off the time it took to mitigate.</p><p>Of course, logging errors before throwing them is extra work, but when done with monitoring or log analysis, it can help find the source of errors much faster.</p><p><strong>Global database changes are always risky. </strong>You never know what part of the system you might hit. The incident started with a seemingly innocuous database permissions change that impacted a wide range of queries. Unfortunately, there is no good way to test the impact of such changes (if you know one, please leave a comment below!)</p><p>Cloudflare was making the right kind of change by moving away from shared system accounts; it&#x2019;s a good direction to go in for security and reliability. It was extremely hard to predict the change would end up taking down a part of their system &#x2013; and the web.</p><p><strong>Two things going wrong at the same time can really throw an engineering team. </strong>If Cloudflare&#x2019;s status page had not gone offline, the engineering team would surely have pinpointed the problem much faster than they did. But in the heat of the moment, it&#x2019;s easy to assume that two small outages are connected, until there&#x2019;s evidence that they&#x2019;re not. Cloudflare is a service that&#x2019;s continuously under attack, so the engineering team can&#x2019;t be blamed for assuming it might be more of the same.</p><p><strong>CDNs are the backbone of the internet, and this outage doesn&#x2019;t change that. </strong>The outage hit lots of large businesses, resulting in lost revenue for many. 
But could affected companies have prepared better for Cloudflare going down?</p><p>The problem is that this is hard: using a CDN means taking on a <em>hard</em> dependency in order to reduce traffic on your own servers (the origin servers), while serving internet users faster and more cheaply:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://substackcdn.com/image/fetch/$s_!54wJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca2f86-18b2-4ba8-8fd2-bc7236b330db_1194x280.png" class="kg-image" alt loading="lazy" width="1194" height="280"><figcaption><span style="white-space: pre-wrap;">A CDN is a common way to reduce traffic to servers and serve webpages and APIs faster to users</span></figcaption></figure><p>When using a CDN, you propagate addresses that point to that CDN server&#x2019;s IP or domain. When the CDN goes down, you could start to redirect traffic to your own origin servers (and deal with the traffic spike), or utilize a backup CDN, if you prepared for this eventuality.</p><figure class="kg-card kg-image-card"><img src="https://substackcdn.com/image/fetch/$s_!fj68!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ef266a-4a28-429b-9d01-52a34e03eae0_1248x774.png" class="kg-image" alt loading="lazy" width="1248" height="774"></figure><p>Both these are expensive to pull off:</p><ul><li>Redirecting to the origin servers likely means needing to suddenly scale up backend infrastructure</li><li>Having a backup CDN means there must be a contract and payment for a CDN partner which will most likely sit idle. 
As and when it is needed, you must switch over and warm up their cache: it&#x2019;s a lot of effort and money to do this!</li></ul><p>A case study in the trickiness of dealing with a CDN going offline is the story of Downdetector, including inside details on why Downdetector went down during Cloudflare&#x2019;s latest outage, and what they learned from it.</p><hr><p><em>This was one out of the five topics covered in this week&#x2019;s The Pulse. </em><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-154?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><em>The full edition</em></a><em> additionally covers:</em></p><ol><li><strong>Downdetector &amp; the real cost of no upstream dependencies.</strong> During the Cloudflare outage, Downdetector was also unavailable. I got details from the team about why they have a hard dependency on Cloudflare, and why that won&#x2019;t change anytime soon.</li><li><strong>Antigravity: Google&#x2019;s new AI IDE &#x2013; that its devs cannot use. </strong>Google wants to become a serious player in AI coding tools, but Antigravity contains remnants of Windsurf. Interestingly, devs at Google aren&#x2019;t allowed to use Antigravity for work</li><li><strong>Industry pulse.</strong> Gemini 3 launch, Anthropic valued at $350B, Jeff Bezos funds an AI company, and unusually slow headcount growth at startups persists.</li><li><strong>Five AI fakers caught in 1 month by crypto startup. </strong>Candidates who fake their backgrounds and change their looks in remote interviews continue to plague companies hiring full-remote &#x2013; especially crypto startups.</li></ol><p><a href="https://newsletter.pragmaticengineer.com/p/the-pulse-154?ref=blog.pragmaticengineer.com" rel="noopener noreferrer nofollow"><strong>Read the full The Pulse</strong></a></p>]]></content:encoded></item></channel></rss>