Daniel Lemire's blog

X just gave us an interface that AI agents can use. I pointed it at my own posts.

Daniel Lemire — Sat, 11 Jul 2026 19:53:17 +0000

I have been on X for a long time. Like most people who post regularly, I have a gut feeling for what might interest people. I post in the morning. Longer posts seem to do better.

But gut feelings are not measurements. And until recently, digging into your own posting data meant either clicking around the web UI or writing custom scripts. Neither is particularly friendly when you want to ask ad hoc questions with an AI assistant.

X recently launched hosted MCP servers: official endpoints that AI tools can connect to. MCP is a protocol for plugging tools into language models: the model can search posts, manage bookmarks, fetch trends, and so on. In practice, I connected an AI coding agent to the X MCP server and simply started asking questions about my account.

I spent a session exploring about two months of my own activity. Here is what I found interesting.

Over roughly sixty days (mid-May through mid-July 2026), I published on the order of 435 posts that were not pure retweets of other people—mostly a mix of original posts, replies, and a few X Articles. The agent pulled them through the MCP tools, kept the public metrics (likes, views, reposts), and ran simple analyses.

I asked for every post to be binned by local hour of day (America/Toronto, Eastern time), and for each hour: how many posts, and the min / median / max view count.

My posting is heavily skewed toward the morning:

Local hour	Posts	Median views
08:00–08:59	45	454
09:00–09:59	58	1,067
10:00–10:59	30	194
11:00–11:59	42	284

The 9 a.m. hour is both my busiest and, among busy hours, my strongest by median views. The overall median across all hours was only about 188 views, so most of what I write is quiet. The distribution is heavy-tailed: a few posts get tens or hundreds of thousands of impressions; the rest are background noise.
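The agent's own code is not reproduced here. As a minimal sketch of what such an hourly binning might look like, assume each post is reduced to a (UTC timestamp, view count) pair. The function name views_by_local_hour and the sample data are made up for illustration, and a fixed UTC-4 offset stands in for America/Toronto, since daylight saving time applies from mid-May through mid-July.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import median

# Assumption: Eastern Daylight Time (UTC-4) covers the whole sampling window.
EDT = timezone(timedelta(hours=-4))


def views_by_local_hour(posts, tz=EDT):
    """Group (utc_iso_timestamp, views) pairs by local hour of day."""
    bins = defaultdict(list)
    for created_at, views in posts:
        local = datetime.fromisoformat(created_at).astimezone(tz)
        bins[local.hour].append(views)
    return {
        hour: {
            "posts": len(v),
            "min": min(v),
            "median": median(v),
            "max": max(v),
        }
        for hour, v in sorted(bins.items())
    }


# Made-up sample data, for illustration only.
sample = [
    ("2026-06-01T13:05:00+00:00", 900),   # 09:05 local
    ("2026-06-01T13:40:00+00:00", 1200),  # 09:40 local
    ("2026-06-02T12:15:00+00:00", 450),   # 08:15 local
]
summary = views_by_local_hour(sample)
for hour, stats in summary.items():
    print(f"{hour:02d}:00", stats)
```

The median is used rather than the mean because a heavy-tailed distribution lets a single viral post dominate an average.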

I then binned posts by character length in steps of 25 characters (using the text as returned by the API, including short t.co URLs).

The bulk of my writing is short, often a reply of a few dozen characters:

Characters	Posts	Median likes	Max likes
0–25	44	1	195
25–50	87	1	60
50–75	69	0	98
75–100	51	1	61
100–125	33	1	32
…
175–200	12	5	58
200–225	17	4	456
275–300	19	4	385
300–325	48	46.5	470

Under about 175 characters, the median stays at zero or one like. Around full-length posts (roughly the old 280-character regime and a bit beyond), engagement jumps. The 300–325 character band is where a large fraction of my “serious” posts live, and the median likes there are an order of magnitude higher than for short replies.
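The length binning can be sketched in the same spirit. This is an illustration under assumptions, not the agent's code: posts are reduced to (text, likes) pairs, the name likes_by_length is hypothetical, and bins are taken as half-open intervals (0 to 24 characters, 25 to 49, and so on), which is one possible reading of the table labels.

```python
from collections import defaultdict
from statistics import median


def likes_by_length(posts, step=25):
    """Group (text, likes) pairs into character-length bins of width `step`."""
    bins = defaultdict(list)
    for text, likes in posts:
        lower = (len(text) // step) * step  # 0, 25, 50, ...
        bins[lower].append(likes)
    return {
        (lower, lower + step): {
            "posts": len(vals),
            "median_likes": median(vals),
            "max_likes": max(vals),
        }
        for lower, vals in sorted(bins.items())
    }


# Made-up sample data: two short replies and two full-length posts.
sample = [("x" * 10, 0), ("x" * 20, 2), ("x" * 310, 40), ("x" * 320, 53)]
table = likes_by_length(sample)
for (lo, hi), stats in table.items():
    print(f"{lo}-{hi}", stats)
```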

I also asked the AI to identify the posts with the most likes. The following topics did best:

AI vs. “experts” claiming models are nowhere near human intelligence
Go adding SIMD-style data-parallelism to the standard library
SIMD-accelerated data processing talks and library notes (JSON, string→integer maps, vulnerability-report fatigue)
Nvidia hardware, university AI-cheating, C++ contracts

The interesting part is the workflow. I did not export a CSV by hand and open Excel. I asked an agent, connected to X’s MCP server. If AI agents can do this for one account’s metrics, they can do it for bug trackers, logs, paper drafts, and codebases.

Chatting with an AI Won’t Make You a Top Programmer

Daniel Lemire — Sun, 21 Jun 2026 17:51:16 +0000

When I was a kid, most people did not know how to type. We took typing class. The final exam was a speed test: words per minute. Today, you will not impress anyone by saying you can type. In fact, cursive writing is fading. Kids increasingly cannot read or write it. We type constantly. We forget how many skills are learned, and how often some of these skills have faded.

But not everything fades. Socrates would be immensely popular today as a teacher. I still buy and recommend paper books.

Is reading and writing code more like Socrates, or more like cursive writing? There are clear signs that code could become like cursive writing. This year, I have met more than one student who could use AI to build an application but could not read or write code. This is not new. Software has long had non-technical people who describe what they built or designed. In fact, in much of the industry, the standard view was that once you had a university degree, you no longer coded. Coding was for monkeys or low-status employees. Yet top engineers paid a million dollars a year at Google or Meta know how to write code. They often read and write assembly and TypeScript. They know it all.

Why the discrepancy?

We pay an engineer a million dollars because he understands concepts few others grasp. He outruns others because he sees the problems more deeply. Reading and writing large amounts of code is part of how you gain those insights. Chatting with an AI will not make you a top 1% programmer. In the future, top engineers might read more code than anyone could in the past. These engineers will not be everywhere, but they will pack a punch. “But Daniel, people say programming is solved. Why read or write code?” Be careful with your models. When television arrived, some predicted it would replace the university lecturer. In some respects the model was correct, yet it did not happen. The lecturer’s job was never to deliver a TV show. The Google engineer paid a million dollars was never a machine that produces code. Nobody actually wants code, any more than they want raw text.

In fact, I predict a bifurcation in the tooling. The best engineers will work with tools that maximize their understanding of the code. I believe that reading and writing code, at a high level, is more like studying Socrates than like cursive writing. It is a necessary mental labor that does not become obsolete just because we have better tools for generating output.

Parsing JSON at compile time with C++26 static reflection

Daniel Lemire — Sun, 14 Jun 2026 14:59:44 +0000

Suppose that you have a configuration file in JSON. Something like this:

{ "width": 1920, "height": 1080, "fullscreen": true,
  "title": "My Game", "volume": 0.8 }

Normally you ship this file alongside your program, open it at startup, read it, and parse it. That is a lot of work for data that never changes. What if the file is fixed at build time? Could the compiler read it, parse it, and bake the result directly into the executable as a constant?

With C++26, the answer is yes. We need two new ingredients, both of which are usable right now with the latest version of the GCC compiler (16).

#embed to pull the file into the program at compile time,
A software library supporting static reflection like simdjson.

Let me show you how far we can take this.

The new #embed directive reads a file and expands it into a comma-separated list of byte values. To read the file data.json at compile time and keep it around as a constant, we write:

constexpr const char json_data[] = {
#embed "data.json"
    , 0
};

I use constexpr because I want the compiler to be allowed to inspect these bytes during constant evaluation. The trailing , 0 simply appends a null terminator, so the array can be treated as an ordinary C string.

There is no run-time input/output of any kind. The bytes are part of the program.

But embedded bytes are not yet useful by themselves. What I really want is a typed C++ object. In my example, the target type is this configuration struct:

struct Window {
  int         width;
  int         height;
  bool        fullscreen;
  std::string title;
  double      volume;
};

The traditional way to populate such a struct from JSON is to write, by hand, one line per field: read "width", store it into width, read "height", store it into height, and so on. It is tedious. And because it runs at startup, a malformed file becomes a run-time error, discovered by your users rather than by you.

Recent versions of simdjson can parse JSON at compile time using C++26 static reflection. The entry point is simdjson::compile_time::parse_json, and it does something I still find slightly magical: it reads the JSON and, from the keys it finds, synthesises the struct type for you.

#define SIMDJSON_STATIC_REFLECTION 1
#include "simdjson.h"
constexpr const char json_data[] = {
#embed "data.json"
    , 0
};
constexpr auto window = simdjson::compile_time::parse_json<json_data>();

The variable window is a value computed entirely by the compiler. Its type is generated from the document: it has a width and a height (both 64-bit integers), a bool fullscreen, a double volume, and a title. From here on I write window.width and it behaves like any ordinary field.

How do I know the parsing really happened at compile time? Because I can assert things about the result that the compiler must check before the program even exists:

static_assert(window.width      == 1920);
static_assert(window.height     == 1080);
static_assert(window.fullscreen == true);

If I corrupt the JSON — delete a brace, misspell true, leave a trailing comma — the program no longer compiles, and the error points at the parse_json line. The broken file is caught at build time, on my machine, instead of at startup on someone else’s.

Because window is a genuine compile-time constant, any computation over it is a constant too. Consider this function:

int  screen_area()   { return window.width * window.height; }

Compiled with -O3, there is no multiplication, no field access, and certainly no parsing left, only the answer as immediate values (here on my MacBook):

screen_area:    mov  w0, #0xa400        // 0x1fa400 = 2073600
                movk w0, #0x1f, lsl #16
                ret

The JSON has vanished from the binary. It was read and parsed exactly once, by the compiler, and all that survives is the number 2073600.
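The two immediates in the assembly can be checked by hand: mov sets the low 16 bits of the register and movk places its immediate in bits 16 to 31. A quick arithmetic check, written in Python only for convenience:

```python
low = 0xA400   # immediate loaded by mov
high = 0x1F    # immediate placed in bits 16 to 31 by movk
value = (high << 16) | low

print(hex(value), value)
print(value == 1920 * 1080)
```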

Because static reflection is so new, when building with GCC 16 you need to pass the flags -std=c++26 -freflection; the -freflection flag is necessary to activate compile-time reflection. You must also define the simdjson macro SIMDJSON_STATIC_REFLECTION as 1 before including simdjson.h. It is a temporary safeguard.

The source code to reproduce these examples is available.

Reference: P2996 — Reflection for C++26 and the simdjson library.

Credit: The simdjson implementation is joint work with Francisco Geiman Thiesen.

Sovereign

Daniel Lemire — Tue, 09 Jun 2026 18:39:32 +0000

The keyword in politics these days is ‘sovereign’.

What few will admit is that it is effectively the adoption of the American strategy: Make America Great Again. In other words, reindustrialization of key sectors of the economy. The UK used to be a computing champion. Our chip designs (ARM) originated from the UK. Canada had BlackBerry, and everyone was using Canadian phones.

Like Canada, many countries have progressively slid into financialization. Huge banks and bank-related businesses, surrounded by emptied factories.

Part of it was the doing of economists who promoted globalization. We are going to make our best CPUs in Taiwan, because they have a comparative advantage (whatever that means).

Another part is the rise of the managerial class, or our version of the technocracy: the summum of the status game is to make PowerPoint presentations in a nice office. Everyone has 2 or 3 university degrees. And if you don’t have many degrees, what is wrong with you?

I think that this is breaking apart for a few reasons.

One of them is Trump. And I don’t mean bombastic statements, bad hair color or overly long ties… But rather the realization that globalization might leave your people economically better off for a time because they have cheap stuff… But it also leaves your people with few skills. You can fund a robotics factory near Montreal, and you’ll find 2000 people with robotics PhDs, but nobody actually knows how to build robots. The analogy is thus: salary is not everything (to the great chagrin of economists). I have left a much better paying job. My previous job meant that I had to sit in an office and do little concrete. I would have had a great and early retirement… but I would never have developed my skills nor would I have built anything. And that’s what we did at the country level. Great total compensation, but a dead-end skill-wise.

Another factor, I believe, is the COVID era (2020-2023) and its final outcome: empty offices, closed coffee shops. What happened at work was illegible. Lots of people in offices. Certainly, something important was happening. Entire businesses and government organizations have now migrated partially or entirely to a pajama party of some kind. Netflix in the middle of the workday is no longer a dream, but a reality.

Another element is educational misalignment. A country like Canada has the most schooled population in history. You cannot throw a rock without hitting someone with a PhD. Meanwhile, we are not making robots or microchips. You can’t even pay with your phone in the Montreal subway. It is a project for another decade, maybe. We are using a push strategy: push more people with degrees into the economy and you are going to get a fancier economy. Won’t work.

Finally, the AI breakthrough of 2022 is the final nail in the proverbial coffin. My country (Canada) claimed for decades that we were the AI powerhouse. All these PDFs online can’t lie, can they? Canada basically invented modern AI, didn’t it? We did. On paper. On paper we did a great many things. In practice? Few know how to build anything.

In a country like Canada, the population has not yet caught up. They blame the orange man for whatever trouble they see. And the politicians promise to do what they must: shower money to build tech sovereignty. It won’t work. They will try again. It won’t work again. The cycle makes things worse because it sustains a managerial class that is great at politics but terrible at building.

Meanwhile, you can’t escape preference falsification. People will use ChatGPT, Claude, Grok, Gemini. And if Elon produces robots, they’ll want them.

Trump will leave office in two years… Canada and the UK will still be flat-lined economically. The USA will still surge ahead.

Countries like Canada and the UK will have to realize that it is industry and know-how first. Build stuff and the wealth will follow. Stop the virtue signaling. Stop the credentialism. Build.

How much do amd64 microarchitecture levels help in Go?

Daniel Lemire — Sat, 06 Jun 2026 20:25:19 +0000

Our 64-bit Intel and AMD processors have evolved over decades. When you compile a Go program for a 64-bit Intel or AMD processor, the compiler targets, by default, an instruction set that is more than 20 years old. The binary that comes out runs on essentially any x64 chip, but it also leaves on the table every instruction that was added since 2003.

We often refer to microarchitecture levels. Each level bundles a set of instruction-set extensions that you can assume are present:

Level	Adds (roughly)
v1	the original AMD64 baseline (SSE2)
v2	popcnt, SSE4.2
v3	AVX2
v4	AVX-512 (F/BW/DQ/VL)

In my view, this ladder is already slightly obsolete. It was frozen around 2020, and the hardware has moved on. We would need to add the latest AVX-512 sub-extensions (VBMI, VBMI2, VNNI, BF16, FP16, VPOPCNTDQ, and so on), which recent server and consumer chips support but which v4 does not require. While v1 through v4 are a useful common language, a realistic “use everything this CPU offers” target today would need at least a v5, and arguably the whole scheme should be replaced by finer-grained feature detection.

In any case, the Go toolchain exposes this v1 through v4 ladder via the GOAMD64 environment variable. Setting GOAMD64=v3 tells the compiler it may use everything up to and including AVX2. The default is v1, the lowest common denominator.

This raises an obvious question. If I take a real, performance-sensitive library and recompile it at each level, how much do I actually gain? I picked Roaring Bitmaps, a compressed bitset data structure used in databases and search engines.

A Roaring Bitmap stores a set of 32-bit integers. It splits the 32-bit space into chunks of 65,536 values, keyed by the high 16 bits, and stores each chunk in a container that holds only the low 16 bits. A container comes in one of three shapes, and the library always keeps whichever is smallest:

an array container: a sorted list of 16-bit values, used when the chunk is sparse (a few thousand elements at most);
a bitmap container: a flat 8 KB bit vector (65,536 bits, one per possible value), used when the chunk is dense;
a run container: a list of [start, length] intervals, used when the set bits cluster into consecutive runs.
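To make the layout concrete, here is a minimal sketch of the chunking and of the "keep whichever is smallest" rule. It is not the Roaring library's code: the function names are hypothetical, and the byte sizes are simplified (2 bytes per array value, 8192 bytes per bitmap, 4 bytes per run, no headers).

```python
from collections import defaultdict

BITMAP_BYTES = 8192  # 65,536 bits


def split_into_chunks(values):
    """Key each 32-bit value by its high 16 bits; keep only the low 16 bits."""
    chunks = defaultdict(set)
    for v in values:
        chunks[v >> 16].add(v & 0xFFFF)
    return chunks


def count_runs(sorted_lows):
    """Count maximal runs of consecutive values in a sorted list."""
    runs = 0
    prev = None
    for v in sorted_lows:
        if prev is None or v != prev + 1:
            runs += 1
        prev = v
    return runs


def smallest_container(lows):
    """Pick the container kind with the smallest (simplified) byte size."""
    s = sorted(set(lows))
    sizes = {
        "array": 2 * len(s),         # one 16-bit value per element
        "bitmap": BITMAP_BYTES,      # fixed size
        "run": 4 * count_runs(s),    # [start, length] as two 16-bit values
    }
    return min(sizes, key=sizes.get)


chunks = split_into_chunks([1, 5, 9, 65536 + 7])
print(sorted(chunks))                             # high-16-bit keys
print(smallest_container([1, 5, 9]))              # sparse chunk
print(smallest_container(range(0, 65536, 2)))     # dense, no runs
print(smallest_container(range(1000, 11000)))     # one long run
```

With these simplified sizes, an array stops being the smallest choice once a chunk holds more than 4096 values, which matches the "few thousand elements at most" remark.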

I fetched the latest release of the library, then ran its own benchmark suite four times, once per level, collecting eight samples each. I did this on a single Intel Xeon Gold 6548N (Emerald Rapids, which supports all four levels, including AVX-512) under Go 1.26.2 and Roaring v2.18.2.

A population count (or popcount, also called the Hamming weight) is simply the number of bits set to 1 in a machine word. Roaring leans on it constantly: the cardinality of a bitmap container, how many values it holds, is the sum of the population counts of its 1024 64-bit words. Modern x86 chips have a dedicated popcnt instruction that does this in a single operation, but it only became available at the v2 level (SSE4.2, 2008). Without it, the compiler has to fall back to a multi-instruction bit-twiddling sequence.
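For illustration, here is the classic bit-twiddling population count for a 64-bit word, followed by the cardinality of a bitmap container as a sum over its 1024 words. This is a sketch of the general technique, written in Python for readability; I am not claiming it is the exact sequence the Go compiler emits.

```python
MASK64 = (1 << 64) - 1


def popcount64(x):
    """Count set bits in a 64-bit word without a popcnt instruction."""
    x = x - ((x >> 1) & 0x5555555555555555)                          # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                          # 8-bit sums
    return ((x * 0x0101010101010101) & MASK64) >> 56                 # add the bytes


def bitmap_cardinality(words):
    """Cardinality of a bitmap container: sum of popcounts of its words."""
    return sum(popcount64(w) for w in words)


# A container where every other value is present (bit pattern 0101...).
words = [0x5555555555555555] * 1024
print(bitmap_cardinality(words))
```

Each call costs about a dozen arithmetic operations, which is what a single popcnt instruction replaces.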

The clearest single result is population count: counting the number of set bits in a bitmap container. The v1 baseline cannot use the popcnt instruction, so Go emits a software fallback. The moment we move to v2, popcnt becomes available and the time is cut almost in half:

That is a 43% reduction, and it is free: no source change, just a compiler flag. Notice, though, that v3 and v4 do nothing more. A single popcnt instruction is already optimal; as far as the Go compiler is concerned, AVX2 and AVX-512 have nothing to add.

Population count is the easy win. What about the rest of the library?

Another clear win is building a container from a dense bitmap. The FromDense array benchmark takes a raw 8 KB bit vector and constructs the most compact container for it: it popcounts every word to learn the cardinality, then scans out the positions of the set bits. That word-at-a-time popcount-and-scan loop is exactly what the compiler can auto-vectorize once 256-bit registers are available, so the gains keep coming past v2:

v2 already cuts 21% by using scalar popcnt/tzcnt instructions, and v3 (AVX2) nearly doubles that to a 38% reduction. As with popcount, v4 adds nothing.

Set operations show the same pattern. The IntersectionCardinality benchmark counts how many values two bitmaps have in common: for bitmap containers, it ANDs the words pairwise and population-counts the result, without ever materializing the intersection. Here v2 does essentially nothing (the scalar popcnt is already in the inner loop), but v3 lets the compiler widen the AND-and-count loop to 256-bit registers, cutting the time by 22%.
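The AND-and-count loop for two bitmap containers is short enough to sketch. This is an illustration of the idea in Python, with a hypothetical function name, not the library's Go code:

```python
def intersection_cardinality(words_a, words_b):
    """Count common values of two bitmap containers without building the result."""
    return sum(bin(a & b).count("1") for a, b in zip(words_a, words_b))


# Two 1024-word containers: the low 8 bits versus the low 4 bits of each word.
a = [0xFF] * 1024
b = [0x0F] * 1024
print(intersection_cardinality(a, b))
```

Only a running total is kept, so no intersection container is ever allocated.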

Takeaways:

On modern hardware, everyone should be using v2 or better. The resulting binary will run in any data center and on any non-ancient laptop.
The v3 level might be worth investigating.
The v4 level should have helped in some of my benchmarks, but it did not. I suspect that the Go compiler is just not great at it.

(Obviously: run your own benchmarks.)

Embodied cognition and agentic AI

Daniel Lemire — Thu, 28 May 2026 23:04:35 +0000

Where is your intelligence located? In your brain?

It is a simplistic answer. A better model is that your intelligence is embodied.

Consider a cook working at an expensive restaurant. He has all his favorite knives and cooking instructions, placed exactly where he wants them. His kitchen is part of his intelligence, of his skills. The same cook working in your kitchen can probably cook better than you do, but he can’t reproduce the same meals he would prepare in his favorite kitchen.

We often assess computer programmers using whiteboard tests. It is an endless source of complaints. Programmers rightly point out that it forces them out of their element. They are just not as good when you take away their laptop. It is not an excuse, it is a real issue: you are cutting them off from part of what makes them so intelligent.

To sum it up, the model of intelligence as a brain in a jar, disconnected from anything else, is ridiculous.

If you accept the idea of embodied intelligence, then many actions that we view as a consequence of our intelligence are actually part of our intelligence. First and foremost, language. Our ability to talk or write to each other means that I am not limited by my own person. Have you ever heard of human beings isolated in small tribes making technological breakthroughs? Nah. Progress requires lots of people communicating together. Up until a few decades ago, progress required cities. Today I am less certain that it does, as I can more and more communicate with anyone in the world from anywhere. But language is still critical: we have not invented anything better. Similarly, having hands and the ability to build sophisticated tools (like laptops) allows us to extend our intelligence.

At the end of 2022, we got a breakthrough technology: ChatGPT. It built on several pre-existing ideas such as (large) language models, neural networks, and so forth. That’s the ‘GPT’ part. But an important, if underappreciated, part of the breakthrough was the ‘Chat’ component. Someone had the idea of connecting a large language model with a chat interface. Maybe this came naturally and obviously to people building this system, but it should not be assumed to be trivial or unimportant.

Language is a key component of our intelligence, and, thus, it makes sense that it would be pivotal for machine intelligence.

We embodied the AI software in a chat box.

The next step was what we call today ‘agentic AI’. We keep the chat box, but we add the ability for the AI software to interact with tools, and to make plans to use them. In effect, we give the AI more agency: it can do stuff and learn from the results as they happen. It is starting to resemble a human being with hands and tools.

I was talking with a colleague this week. My colleague is all in on the AI revolution. He uses his AI to help him write better and faster, and to get his data analysis done faster, without so much help from technical experts.

But my colleague was not aware of the agentic AI approach. I tried to explain on the phone. What does it mean to give the AI access to tools? Is this only about saving the effort of copying and pasting the AI’s response?

I ended up making a video where I start an AI in a shell within something called RStudio. It is an environment people use to program in R, to do data analysis. I don’t use R or RStudio, but thanks to the AI, I was able to build an entire climate research project in a few minutes, complete with the retrieval of the data from the web.

How did the AI do it? I recorded it. It tried a few things, initially struggling to download the data. At some point, it finds out that it needs new R packages, so it installs them, and once they are installed, it can proceed to generate figures, verifying that it works.

Agentic AI greatly extends machine intelligence by improving the embodiment of AI.

I believe that it is not yet understood as it should be.

In Montreal, the most established professor in the field of AI is Yoshua Bengio. He started his own non-trivial enterprise a few years ago (Element AI). His latest venture is LawZero, which aims to create a Scientist AI. The first goal of this project is to build AI without the agentic component. It should be a disembodied AI that has no goal of its own, no agency.

I fear that Bengio suffers from what Kevin Kelly called Thinkism. Let me quote from Kelly’s 2008 essay.

No intelligence, no matter how super duper, can figure out how human body works simply by reading all the known scientific literature in the world and then contemplating it. No super AI can simply think about all the current and past nuclear fission experiments and then come up with working nuclear fusion in a day. Between not knowing how things work and knowing how they work is a lot more than thinkism. There are tons of experiments in the real world which yields tons and tons of data that will be required to form the correct working hypothesis. Thinking about the potential data will not yield the correct data. Thinking is only part of science; maybe even a small part. (…) Thinkism is not enough. Without conducting experiments, building prototypes, having failures, and engaging in reality, an intelligence can have thoughts but not results. It cannot think its way to solving the world’s problems. (…) The Singularity is an illusion that will be constantly retreating — always “near” but never arriving. We’ll wonder why it never came after we got AI. Then one day in the future, we’ll realize it already happened. The super AI came, and all the things we thought it would bring instantly — personal nanotechnology, brain upgrades, immortality — did not come. Instead other benefits accrued, which we did not anticipate, and took long to appreciate. Since we did not see them coming, we look back and say, yes, that was the Singularity.

I believe that university professors are especially prone to thinkism. They view intelligence as being centered on what is happening in their brain. When you live in an ivory tower, it is easy to dismiss the real world as the core source of intelligence. Further, they are often people who did quite well in school, where thinkism is naturally prevalent.

I have been a professor most of my life. However, I tire quickly of talking with other professors. What I most enjoy is working with people who have new tools that they apply in the real world. Unsurprisingly, I spent most of my time working with software that people deploy in the real world.

What Kelly is saying is that a high degree of intelligence is not enough to do much of anything. The real world is not the final stage of your thinking process. It is maybe the most important part of it.

And thus, when you connect your AI with the real world, giving it the ability to run experiments (as virtually all software developers do today), you get impressive results that go much beyond what AI software can do on its own.

Agency is not a feature. Agency is primary.

Parsing IPv6 Addresses Crazily Fast with AVX-512

Daniel Lemire — Sat, 23 May 2026 02:45:11 +0000

Every machine connected to the Internet has an address called an IP address. Originally, these addresses were 32-bit integers (IPv4), giving a theoretical maximum of about four billion distinct addresses. We are all familiar with these addresses (e.g., 192.168.0.0). There was a big fuss about how we would run out of addresses. It never happened because we don’t actually need every device to have its own unique address. Your home router needs an address, but every device in your home does not need a worldwide unique address.

Nevertheless, the range was extended to cover 128 bits (IPv6). An IPv6 address is conventionally written as eight groups of four hexadecimal digits separated by colons. For example:

2001:0db8:85a3:0000:0000:8a2e:0370:7334

Because addresses often contain runs of zeros, the format allows two shortcuts:

Leading zeroes within a group may be omitted: 2001:db8:85a3:0:0:8a2e:370:7334.
A single run of all-zero groups may be replaced by a double colon (::), as in 2001:db8:85a3::8a2e:370:7334.

The double-colon trick can appear only once in an address, and can match one or more zero groups. Hence ::1 is the loopback address (all zeros except the last group), and :: is the unspecified address (all zeros).

IPv6 also accepts an embedded IPv4 address in the last 32 bits, written in the usual dotted-decimal form. This is mostly used for IPv4-mapped IPv6 addresses such as ::ffff:192.168.1.1. The longest possible textual form is 45 characters:

0000:0000:0000:0000:0000:ffff:255.255.255.255

So a parser must accept hexadecimal groups, the compressed form, and an optional IPv4 tail. It is more involved than parsing IPv4.
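To see these rules in action, Python's standard ipaddress module accepts all of the forms above and maps them to the same 16 bytes. This is only a convenient way to check the examples, not part of the benchmark:

```python
import ipaddress

# Three spellings of the same address: full, without leading zeros, compressed.
forms = [
    "2001:0db8:85a3:0000:0000:8a2e:0370:7334",
    "2001:db8:85a3:0:0:8a2e:370:7334",
    "2001:db8:85a3::8a2e:370:7334",
]
parsed = [ipaddress.IPv6Address(s) for s in forms]
packed = parsed[0].packed  # the 128-bit value as 16 bytes

# An IPv4 tail fills the last 32 bits.
mapped = ipaddress.IPv6Address("::ffff:192.168.1.1")
longest = "0000:0000:0000:0000:0000:ffff:255.255.255.255"

print(parsed[0] == parsed[1] == parsed[2])
print(packed.hex())
print(mapped.ipv4_mapped)
print(len(longest))
```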

The standard C function for the job is inet_pton, available on essentially every system.

Can we do better?

A few years ago, I showed that you could parse IPv4 addresses really fast. Can we do the same with IPv6?

The trick is to use data parallelism: we invoke the so-called SIMD instructions that all our processors support. These instructions can process potentially dozens of bytes at once.

Shreesh Adiga gave it a try with AVX-512, the powerful instruction set supported by recent Intel server processors and all new AMD CPUs. The idea is to load the entire string into a 512-bit register, find the colons with a single comparison, compute the spacing between them to drive a byte-level expand, translate hex digits via a permute, and finish with a multiply-accumulate that combines the hex digits into bytes. Almost the whole parser is branch-free, meaning that there are few if clauses.

I put together a small benchmark that generates random IPv6 addresses with inet_ntop (so the addresses are written in their canonical, compressed form) and parses each one with both inet_pton and the AVX-512 routine. The benchmark runs on an Intel Xeon Gold 6548N CPU @ 2.8 GHz (Emerald Rapids) with GCC, compiled with -march=native -O3.

function	ns/addr	speed (Mv/s)	instr/addr	instr/cycle
inet_pton	175.3	5.7	954	1.56
AVX-512	14.0	71.3	120	2.45

The AVX-512 routine is about 12 times faster than inet_pton, parsing more than 70 million addresses per second on a single core. It uses eight times fewer instructions, and runs them at a higher throughput (2.45 instructions per cycle versus 1.56).

The source code used for this benchmark is available on my blog repository.

Update. Peter Fors points out that the step in my benchmark where I sum up the bytes adds some overhead, especially under GCC. Thus I underestimate the speed slightly.

Only 17% of all 64-bit Integers are products of two 32-bit integers

Daniel Lemire — Fri, 22 May 2026 01:16:35 +0000

In software programming, the product between two integers is often computed to a fixed number of bits with overflow. Consider 8-bit integers. If you multiply 127 by 127, you get back the number 1 as an 8-bit unsigned integer, with an overflow. The actual full product is 16129. To represent 16129, you typically use 16 bits of precision.
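The 8-bit example is easy to verify. Python integers do not overflow, so the truncation to 8 bits is made explicit with a mask:

```python
a = b = 127
full = a * b             # the full product
truncated = full & 0xFF  # what an 8-bit unsigned multiplication keeps

print(full, truncated, full.bit_length())
```

The full product needs 14 bits, so it fits comfortably in a 16-bit integer.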

Thus we have the notion of the full product. The full product of two 32-bit integers is typically represented using 64 bits. The question that preoccupied me is what fraction of all 64-bit integers can be written as the product of two 32-bit integers.

You might wonder why you should care.

We often design hash functions: they are special functions that take an input and generate a random-looking output. Several years ago I designed a very fast hash function called clhash. It is a super-fast hash function for strings having a few hundred bytes or more. If you don’t know about clhash, check it out. It is interesting in its own right.

This clhash hash function uses a type of multiplication typical of cryptographic applications. I was trying to argue that our approach had benefits compared with techniques based on standard multiplications. Let me illustrate. A simple hash function for 32-bit integers could take the least significant bits and multiply them with the most significant bits.

// simpleHighLowHash is a simple (and weak) 32-bit hash
// that multiplies the high 16 bits by the low 16 bits.
func simpleHighLowHash(x uint32) uint32 {
    high := uint16(x >> 16)
    low := uint16(x & 0xFFFF)
    return uint32(high) * uint32(low)
}

Maybe you’d want the hash function to be uniform: all possible 32-bit hash values should be equally probable. It is only possible in this instance if the hash function can produce all 32-bit hash values, which is not the case.

The great mathematician Erdős showed that the proportion of all 2n-bit values that can be generated by the product of two n-bit values goes to zero as n becomes large. This means that if you have, say, 10000000-bit integers multiplying 10000000-bit integers, you’d expect relatively few 20000000-bit integers to be produced. But what about practical cases like 32-bit integers or 64-bit integers?

You can brute-force the problem easily up to the multiplication of 16-bit integers into 32-bit products. At that point, roughly one out of five 32-bit numbers is a product of two 16-bit integers. About 80% of all 32-bit integers are never produced by this hash. However, the running time grows exponentially with the number of bits, and brute force won’t scale all the way to 32 bits.
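A minimal sketch of the brute-force count, assuming we simply enumerate all pairs and collect the distinct products in a set. The function name count_distinct_products is my own, and the approach is only practical for small bit widths.

```python
def count_distinct_products(bits: int) -> int:
    """Count the distinct values a * b with 0 <= a, b < 2**bits."""
    limit = 1 << bits
    products = set()
    for a in range(limit):
        # b >= a is enough because multiplication is commutative.
        for b in range(a, limit):
            products.add(a * b)
    return len(products)


if __name__ == "__main__":
    for bits in range(1, 9):
        count = count_distinct_products(bits)
        # Fraction of all (2 * bits)-bit values that are reachable.
        print(bits, count, count / 2 ** (2 * bits))
```

The work grows as 4^bits, which is why this direct enumeration stops being reasonable well before 32-bit inputs.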

So what do we do about the 32-bit case? That is, what do you do when you multiply two 32-bit integers to produce a 64-bit product? What fraction of 64-bit values can the following function produce?

func simpleHighLowHash(x uint64) uint64 {
    high := uint32(x >> 32)
    low := uint32(x & 0xFFFFFFFF)
    return uint64(high) * uint64(low)
}

Can we get an exact result?

Yes!!!

Webster and his colleagues built the math to allow us to scale up the exact computation. He was kind enough to publish his code.

There are 3,215,709,724,700,470,902 64-bit (unsigned) integers that can be written as a product of two 32-bit integers. That’s about 17% of all possible values.

What about actually computing a pair of integers given their product? One approach consists of computing its full prime factorization, and then using those factors to build all possible divisors that are strictly less than 2^32, starting with a set of candidates containing only 1 and iteratively multiplying existing candidates by each prime factor (only keeping products that stay below 2^32). We can avoid adding duplicates to our set by processing unique prime factors with their multiplicity. Finally, we select the maximum such candidate m as the largest divisor under 2^32, compute the corresponding leftover n / m, and report whether a valid split into two 32-bit factors exists. In general, the answer (if it exists) is not unique: this returns the pair where one value is maximized. In Python, the code might look as follows.

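# Example input (illustrative values): n and its prime factorization,
# given as {prime: multiplicity}.
n = 2**40 * 3**5
factor_multiplicities = {2: 40, 3: 5}

candidates = [1]  # divisors of n found so far that are below 2**32
for p, mult in factor_multiplicities.items():
    new_candidates = []
    for c in candidates:
        # Start at i = 1: i = 0 would re-add c itself as a duplicate.
        for i in range(1, mult + 1):
            if c * (p ** i) < 2**32:
                new_candidates.append(c * (p ** i))
    candidates.extend(new_candidates)
m = max(candidates)
print(f"Maximum candidate: {m}")
leftover = n // m
print(f"Leftover: {leftover}")
if leftover >= 2**32:
    print("Leftover is too large, cannot find a suitable candidate.")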

You might be able to come up with a more efficient algorithm. I find it interesting to consider that if you pick a value at random, it will usually fail! That is, most 64-bit integers cannot be written as the product of two 32-bit integers.

SIMD-accelerated integer-to-string conversion

Daniel Lemire — Mon, 18 May 2026 19:39:49 +0000

Converting a 64-bit integer to its decimal string representation is a mundane task that shows up everywhere: logging, JSON serialization, CSV output, debug prints, etc. In C++, you might use std::to_chars, sprintf, or some library routine.

How do these functions work? At a high level, they repeatedly divide by ten. Start with your integer k. Divide it by ten, use the remainder as the last digit (it is between 0 and 9 inclusively). You then add the code point value of the character 0 to get the ASCII digit. To go faster, you can divide by 100 and use a lookup table so that the value between 0 and 99 inclusively is mapped to a string.

So far so good. Unfortunately, even with all these optimizations, this string generation may become a performance bottleneck. Can you do better?

Let us assume that you have a recent AMD processor or an Intel server. Then you have powerful data-parallel instructions (AVX-512) that can multiply eight 64-bit integers at once. We often refer to these instructions as SIMD (single instruction multiple data). My colleague Jaël Champagne Gareau and I recently published a new paper on exactly this problem. The title says it all: Converting an Integer to a Decimal String in Under Two Nanoseconds.

When you write n / 100 in code, an optimizing compiler converts the operation to a multiplication followed by a shift. It is often described as a multiplicative inverse. Generally, you can replace the division of n by d with the division of c * n or c * n + c by m for convenient integers c and m chosen so that they approximate the reciprocal: c/m ~= 1/d. We often call c * n + c a fused multiply-add. Picking m to be a power of two means that the division by m is just a shift. Then you can get the remainder of the division by using the remainder of the division by m, multiplied by d and divided again by m, which is essentially a multiplication followed by a shift (Lemire et al., 2021).
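As a rough illustration, here is a sketch of the multiply-and-shift idea for d = 100. It assumes 32-bit inputs and m = 2^64, and it uses the c * n variant with c rounded up. The names D, M, fast_div and fast_mod are mine, not taken from any library.

```python
D = 100                 # the divisor d
M = (1 << 64) // D + 1  # c: the reciprocal of d scaled by m = 2**64, rounded up
MASK64 = (1 << 64) - 1


def fast_div(n: int) -> int:
    """Quotient n // D for 0 <= n < 2**32, using a multiplication and a shift."""
    return (M * n) >> 64


def fast_mod(n: int) -> int:
    """Remainder n % D for 0 <= n < 2**32."""
    low = (M * n) & MASK64  # remainder of the division by m
    return (low * D) >> 64  # multiply by d, then divide by m again


if __name__ == "__main__":
    for n in (0, 99, 100, 12345, 2**32 - 1):
        print(n, fast_div(n), fast_mod(n))
```

In C or C++ the same computation needs a 64-by-64-bit multiplication that keeps the upper half of the product.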

We can put this to good use with the Integer Fused Multiply-Add (IFMA) instructions available on recent Intel and AMD processors. They essentially allow you to compute eight instances of (c * n + c)/m in one instruction. The expression (c * n + c)/m gives you the division, but we need the remainder, so instead we pick (c * n + c)%m which we need to multiply by the divisor.

The fun thing with AVX-512 instructions is that they can use a different c and a different divisor for each of the eight operations. Using Intel intrinsic functions, our core routine which converts a value smaller than 10^8 to eight digits looks as follows:

__m512i to_string_avx512ifma_8digits(uint64_t n) {
  __m512i bcstq_l   = _mm512_set1_epi64(n);
  constexpr uint64_t twoto52 = 0x10000000000000ULL; // 2^52
  __m512i ifma_const = _mm512_setr_epi64(
    twoto52 / 100000000, twoto52 / 10000000,
    twoto52 / 1000000, twoto52 / 100000,
    twoto52 / 10000, twoto52 / 1000, 
    twoto52 / 100, twoto52 / 10
  );
  __m512i zmmTen    = _mm512_set1_epi64(10);
  __m512i asciiZero = _mm512_set1_epi64('0');
  __m512i lowbits_l  = _mm512_madd52lo_epu64(ifma_const, 
    bcstq_l, ifma_const); // ifma_const * bcstq_l + ifma_const
  __m512i highbits_l = _mm512_madd52hi_epu64(asciiZero, 
    zmmTen, lowbits_l);
  return highbits_l;
}

It compiles down to two multiply-add instructions: vpmadd52luq followed by vpmadd52huq. That’s it. Two instructions to generate eight digits.

It works by broadcasting n across all eight 64-bit lanes of a __m512i vector (bcstq_l). It then prepares a vector of carefully chosen multiplicative inverses (ifma_const) that approximate the reciprocals of 10^8, 10^7, …, 10, each scaled by 2^52. The magic happens in the single _mm512_madd52lo_epu64 instruction, which simultaneously performs eight fused multiply-adds: each lane computes (ifma_const[i] * n + ifma_const[i]) and keeps the low 52 bits of the product, which amounts to taking the remainder modulo 2^52, that is, the fractional part of n divided by the corresponding power of ten. A second _mm512_madd52hi_epu64 instruction multiplies this value by ten, keeps the high 52 bits of the product (the digit, between 0 and 9), and adds the ASCII '0' offset, producing eight digit characters, one per 64-bit lane, in a single 512-bit register.
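To check the arithmetic without AVX-512 hardware, the eight lanes can be emulated with ordinary integers. This is only a scalar sketch of what the two instructions compute, with names of my own choosing (eight_digits, MASK52), and it is of course nowhere near as fast as the SIMD routine.

```python
MASK52 = (1 << 52) - 1


def eight_digits(n: int) -> str:
    """Emulate the two IFMA instructions lane by lane for 0 <= n < 10**8."""
    assert 0 <= n < 10**8
    chars = []
    for k in range(8, 0, -1):  # lanes use 10**8, 10**7, ..., 10
        c = (1 << 52) // 10**k
        # First instruction: c + (low 52 bits of c * n).
        low = c + ((c * n) & MASK52)
        # Second instruction: '0' + (high 52 bits of 10 * low),
        # where only the low 52 bits of each operand are multiplied.
        digit = (10 * (low & MASK52)) >> 52
        chars.append(chr(ord("0") + digit))
    return "".join(chars)


if __name__ == "__main__":
    print(eight_digits(12345678))
    print(eight_digits(42))
```

Smaller values come out zero-padded to eight characters, which is why the general case needs extra work to drop the leading zeros.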

If all your integers require eight digits, you are done. But in the general case, putting this to good use requires a bit of effort.

Thankfully, even if you, say, need only six digits, you can do the full 8-digit computation and then use a masked store if you want to store only six digits, ignoring the two leftovers. That is, instruction sets like AVX-512 allow you to write only some of the data to memory, which is quite convenient.

We have two variants. One is branch-heavy and does well on homogeneous data (numbers with similar digit lengths). The other is branch-light and better for mixed workloads. A quick profiling step can pick the right one for your dataset.

Our implementation is consistently 1.4–2× faster than the best competitors and 2–4× faster than std::to_chars across a wide range of inputs. What I find interesting is that even if the std::to_chars implementation is not at all naive, you can do significantly better in many ways. James Anhalt’s approach (jeaiii) is also quite fast on modern hardware.

Further reading
– The paper: doi:10.1002/spe.70079
– The benchmarks are on GitHub (fully reproducible)
– Shortly after our paper came online, Barend Erasmus created a software library implementing our proposed approach. I am not certain that Barend includes both the homogeneous and heterogeneous approaches.

Checking multiplication overflow

Daniel Lemire — Wed, 06 May 2026 20:15:20 +0000

Suppose that x is a variable of an unsigned type. In C/C++, it could be of type size_t for example.

You have an expression like 6 * x and you want to know whether 6 * x overflows. That is, you want to know if 6 * x exceeds the range of values that can be represented by the type. In most cases, a variable of type size_t will be able to represent all values in the range [0, 2^64-1]. Instead of 64, let me use a variable for the number of bits: [0, 2^L-1].

The easiest approach is to compare x with (2^L-1) // 6 where I use the symbol // to denote the integer division (as opposed to /).

But can you do otherwise?

If the value does not overflow, we know for sure that (6 * x)//6 == x. The interesting question is what happens when it overflows. We can answer this directly for an arbitrary non-zero constant a in the range [1, 2^L-1].

Let k = (a*x)//2^L be the number of times the multiplication wraps around. The effective (wrapped) value computed by the machine is r = a*x - k*2^L, with 0 <= r < 2^L. Overflow happens precisely when k >= 1. We have that k <= a − 1 because x<2^L.

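Performing the integer division of r = a*x - k*2^L by a, we get x plus (-k*2^L)//a, where // rounds toward negative infinity. When k is non-zero, this last value is one of (-2^L)//a, (-2*2^L)//a, …, (-(a-1)*2^L)//a. Because a < 2^L, each of these values is a strictly negative integer.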

When k = 0 (no overflow): r // a = x.
When k ≥ 1: r // a = x + (negative integer) ≠ x.

Hence we have the following result.

Theorem If x is of an unsigned type and a is a non-zero constant, then a * x overflows if and only if (a * x)//a != x.
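The theorem is easy to check exhaustively for a small word size. Here is a sketch that assumes L = 8 and emulates the wrap-around with a modulo; the function names are hypothetical.

```python
L = 8
MOD = 1 << L  # 2**L


def overflows_by_division(a: int, x: int) -> bool:
    """The check from the theorem: multiply with wrap-around, then divide back."""
    wrapped = (a * x) % MOD  # what an L-bit unsigned multiplication returns
    return wrapped // a != x


def overflows_by_comparison(a: int, x: int) -> bool:
    """The direct check: compare x with (2**L - 1) // a."""
    return x > (MOD - 1) // a


if __name__ == "__main__":
    agree = all(
        overflows_by_division(a, x) == overflows_by_comparison(a, x)
        for a in range(1, MOD)
        for x in range(MOD)
    )
    print("both checks agree:", agree)
```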

In practice, simply comparing x with (2^L-1) // a is likely more efficient. Optimizing compilers might be able to convert (a * x)//a != x to a simple comparison. Unfortunately, the Go compiler (for example) cannot.

An open question is whether there is a more mathematically elegant check.

Mapping Strings to Float Arrays in Go: How Fast Can We Go?

Daniel Lemire — Tue, 05 May 2026 19:01:09 +0000

A common pattern in modern software is to map a string key to a small array of floating-point numbers. Word embeddings, feature vectors, lookup tables for physical constants: all variations on the same theme. In Go, the obvious way to write this is a map[string][]float32. But how fast is it, really, and can we do better?

I have been working on constmap, a Go library that builds an immutable map from strings to uint64 values using the binary fuse filter construction. A lookup amounts to one hash, three array reads, and two XORs. There is no comparison, no chaining, no probing. The whole table fits in roughly 9 bytes per key, which often means it fits in cache where a Go map does not.

Go has fast maps, and you cannot easily beat them in performance. But if you build a smaller data structure that causes fewer cache misses, you can definitely go faster.

By default, the constmap returns a uint64. But what if your value is an array of eight float32 numbers? You have at least two options:

Keep the arrays in a separate slice [][]float32. The constmap returns the index.
Store a pointer to the float array directly inside the constmap’s uint64.

The second option requires the unsafe package because we are smuggling a pointer through an integer field. It has some limitations.

You cannot and should not serialize the data structure to disk: the stored pointers would be meaningless once reloaded.
You must make sure that a reference remains to your float array, or else the garbage collector could collect it and you’d be left with a dangling pointer. It is trickier than it sounds because Go can collect your memory if it sees that it is no longer used. And it cannot see through your unsafe calls converting an integer to a pointer value. Thankfully, you can just put all your arrays of floats in an array and call runtime.KeepAlive(mybigarray) at a strategic location: this will prevent Go from collecting mybigarray. The call to runtime.KeepAlive is not free but it is quite cheap, so you can possibly use a lot of such calls.

Benchmark

I built three lookups over 100,000 keys, each mapping to an 8-element []float32. We always access the first element of the array, to make sure that the benchmark is reasonably fair. We have a large set of random queries (a query is a string).

We compare map[string][]float32, the standard constmap coupled with an array (so that the constmap contains indexes), and the constmap that contains what is effectively a pointer to the location of the []float32.

Run on an Apple M4 Max with Go’s standard benchmark harness:

Lookup	Time per op
`map[string][]float32`	21 ns
ConstMap → index → `[][]float32`	11 ns
ConstMap → pointer → `*[8]float32`	8.7 ns

The constmap with an index is already about twice as fast as the Go map. Replacing the index by a raw pointer shaves another 2 ns by skipping the indirection through the [][]float32 slice header. It is a speedup of about 20% in my case.

The result is interesting on its own: a constmap lookup is fast enough that the next memory load, the slice header read, becomes a measurable fraction of the work.

The benchmark and code are in github.com/lemire/constmap. Run them with:

go test -bench 'FloatArray' -benchtime=1s

House prices and fertility

Daniel Lemire — Thu, 30 Apr 2026 22:17:34 +0000

No, rising house prices are not the driver of sharp fertility declines. The evidence shows only modest, mixed effects that cannot explain the large drops observed in places like Canada.

What the Research Actually Shows:

A well-known study by Dettling and Kearney (2014) found that rising house prices have opposing effects: they slightly increase fertility among homeowners (via a “home equity” or wealth effect) and slightly decrease it among renters (via a price effect). At average U.S. homeownership rates, the net effect was a small increase in fertility.

This pattern has held up in other countries. For example, Daysal et al. (2021) and related work confirm similar homeowner/renter dynamics in Denmark and elsewhere.

Clark (2012) found that expensive housing markets are associated with a modest delay in age at first birth (roughly 3–4 years after controls), but the overall impact on completed fertility remains limited.

Canadian evidence aligns with this. Clark and Ferrer (2019) analyzed longitudinal data and found that higher lagged house prices were positively associated with the probability of an additional birth among homeowners in some specifications, but the effects were small and did not drive large-scale declines. They noted that falling numbers of school-aged children in high-price cities like Vancouver or Toronto are better explained by selective migration of people with preferences for fewer children into urban centers, rather than existing residents having fewer kids due to prices.

These effects are too modest to explain Canada’s fertility rate of about 1.25–1.3 children per woman (a record low, far below the 2.1 replacement level).

Expensive cities do have fewer children on average, but this largely reflects self-sorting: higher-income, higher-educated people (who tend to have fewer children) cluster in costly urban areas. Correlation is not causation.

Fertility is declining across most developed countries, regardless of housing costs:

– Israel maintains a high fertility rate (~2.9 children per woman, highest in the OECD), despite expensive housing; Tel Aviv is pricier than Montreal.
– Japan has long had very low fertility (~1.2) despite more affordable housing in many areas compared to Canada.

As Richard Florida summarized in “Don’t Blame Expensive Housing for Falling Fertility”: intuition suggests high costs deter families, but demographic research points more strongly to higher education and related lifestyle shifts as drivers of lower birth rates. Richer societies and cities tend to have lower fertility.

Fertility choices are heavily shaped early in life. A 16-year-old planning her future is unlikely to base major decisions on distant future house prices in her 20s or 30s. Models ignoring how teenagers view family, career, and life priorities miss the bigger picture.

Housing costs matter at the margins (with offsetting effects by tenure), but they are not the story behind broad fertility collapses. Policies focused on housing affordability are unlikely to reverse fertility trends.

We have no reason whatsoever to believe that a collapse in housing prices in a country like Canada would drive fertility upward. It is motivated reasoning.

High house prices are a terrible thing in my opinion. Investing all your capital in houses is an odd way to get prosperity. But it does not drive our fertility collapse.

References

Clark, Jeremy. 2012. “Do Women Delay Family Formation in Expensive Housing Markets?” Demographic Research 27(1): 1–24. https://doi.org/10.4054/DemRes.2012.27.1

Clark, Jeremy, and Ana Ferrer. 2019. “The Effect of House Prices on Fertility: Evidence from Canada.” Economics: The Open-Access, Open-Assessment E-Journal 13(2019-38): 1–32. https://doi.org/10.5018/economics-ejournal.ja.2019-38

Daysal, N. Meltem, Michael F. Lovenheim, Nikolaj Siersbæk, and David N. Wasser. 2021. “Home Prices, Fertility, and Early-Life Health Outcomes.” Journal of Public Economics 198: 104366. https://doi.org/10.1016/j.jpubeco.2021.104366

Dettling, Lisa J., and Melissa S. Kearney. 2014. “House Prices and Birth Rates: The Impact of the Real Estate Market on the Decision to Have a Baby.” Journal of Public Economics 110: 82–100. https://doi.org/10.1016/j.jpubeco.2013.09.009 (NBER Working Paper 17485, 2011/2014 version)

Florida, Richard. 2018. “Don’t Blame Expensive Housing for Falling Fertility.” CityLab (Bloomberg), June 14, 2018. https://www.bloomberg.com/news/articles/2018-06-14/the-complex-relationship-between-house-prices-and-fertility

You can beat the binary search

Daniel Lemire — Mon, 27 Apr 2026 17:32:13 +0000

We sometimes have to look for a value in a sorted array. The simplest algorithm consists in just going through the values one by one, until we encounter the value, or exhaust the array. We sometimes call this algorithm a linear search. In C++, you can get the desired effect with the std::find function.

For large arrays, you can do better with a binary search. Binary search is a classic algorithm that efficiently locates a target value in a sorted array by repeatedly dividing the search interval in half. Starting with the entire array, it compares the target to the middle element: if the target is smaller, it discards the upper half; if larger, it discards the lower half. This process continues until the target is found or the interval is empty. It is much faster than linear search for large datasets. In C++, this is implemented by the std::binary_search function, which returns a boolean indicating whether the value is present.

The popular Roaring Bitmap format uses arrays of 16-bit integers of size ranging from 1 to 4096. We sometimes have to check whether a value is present. We use a binary search.

I wanted a faster approach. I had two insights.

Virtually all processors today have data parallel instructions (sometimes called SIMD) that can check several values at once. Both 64-bit ARM and x64 processors (Intel/AMD) always support comparing eight 16-bit integers with a target value using a single instruction. This suggests that you should not bother going down in the binary search to blocks that are smaller than eight elements. And you may also want to cheaply compare sixteen elements or more.
The binary search checks one value at a time. However, recent processors can load and check more than one value at once. They have excellent memory-level parallelism. This suggests that instead of a binary search, we might want to try a quaternary search: instead of splitting arrays in halves, we might split them in quarters. The net result might generate a few more instructions but the number of instructions is likely not the limiting factor.

Thus, I created something I call the SIMD Quad algorithm. It is an efficient search algorithm for sorted arrays of 16-bit unsigned integers, combining a quaternary interpolation search with SIMD (Single Instruction, Multiple Data). The algorithm divides the array into fixed-size blocks of 16 elements (except maybe for the last block) and uses the last element of each block as interpolation keys to quickly narrow down the search to a single block, then employs SIMD instructions to check all 16 elements in that block simultaneously.

The core idea is to perform a hierarchical search: first, use interpolation search on a coarser level (block boundaries) to find the likely block containing the target value, then switch to SIMD for fine-grained parallel checking within the block. This hybrid approach leverages the strengths of both algorithmic optimization (interpolation search reduces comparisons logarithmically) and hardware acceleration (SIMD checks multiple elements at once).

Initial Check: If the array has fewer than 16 elements, perform a simple linear search through all elements.
Block Division: Divide the array into blocks of 16 consecutive elements. For an array of size cardinality, there are num_blocks = cardinality / 16 full blocks.
Quaternary Interpolation Search: Use the last element of each block (at positions 16-1, 32-1, etc.) as keys for interpolation. The search performs a quaternary (base-4) interpolation to find the block where the target pos is likely located. This involves comparing the target against quarter-points of the current search range and adjusting the base accordingly.
Block Selection: After narrowing down, select the appropriate block index lo based on the interpolation results.
SIMD Check: If a valid block is found, load the 16 elements into SIMD registers (using NEON on ARM or SSE2 on x64) and perform parallel equality comparisons with the target value. If any match is found, return true.
Remainder Check: For any elements not in full blocks (remainder), perform a linear search.
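The steps above can be sketched in scalar Python, with the SIMD check replaced by an ordinary membership test on the 16-element block. This is an illustration of the control flow only, under the assumption that the input is sorted; the name simd_quad_scalar is mine.

```python
def simd_quad_scalar(arr, pos, gap=16):
    """Scalar model of the search over a sorted list of integers."""
    cardinality = len(arr)
    if cardinality < gap:
        return pos in arr  # initial check: small array, linear search
    num_blocks = cardinality // gap

    def key(block):
        # Last element of the given block.
        return arr[(block + 1) * gap - 1]

    base, n = 0, num_blocks
    while n > 3:  # quaternary phase: three keys compared per round
        quarter = n >> 2
        c1 = key(base + quarter) < pos
        c2 = key(base + 2 * quarter) < pos
        c3 = key(base + 3 * quarter) < pos
        base += (c1 + c2 + c3) * quarter
        n -= 3 * quarter
    while n > 1:  # finish with binary steps
        half = n >> 1
        if key(base + half) < pos:
            base += half
        n -= half
    lo = base + 1 if key(base) < pos else base
    if lo < num_blocks:
        # Stand-in for the SIMD equality check on one block.
        return pos in arr[lo * gap:(lo + 1) * gap]
    return pos in arr[num_blocks * gap:]  # remainder check


if __name__ == "__main__":
    data = list(range(0, 3000, 3))
    print(simd_quad_scalar(data, 300), simd_quad_scalar(data, 301))
```

The three key loads in the quaternary phase do not depend on each other, which is what lets the processor issue them in parallel.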

How does it do? I wrote a benchmark. The benchmark works as follows. For each array size from 2 to 4096 elements, it generates 100,000 sorted arrays of 16-bit unsigned integers. For each size, it performs 10 million membership queries in “cold” mode (each query searches a different array, simulating cache misses) and 10 million queries in “warm” mode (queries are grouped by array, with each array being searched 100 times consecutively, simulating cache hits). The benchmark measures the average time per query for three algorithms: linear search (std::find), binary search (std::binary_search), and the new SIMD Quad algorithm.

I use two systems: an Apple M4 with Apple LLVM and an Intel Emerald Rapids processor with GCC.

Firstly, let us compare the linear search with the binary search.

Intel/GCC:

Apple/LLVM

The result is clear. The binary search beats the linear search as soon as the arrays get large. That is to be expected.

On a cold cache, the linear search is relatively worse. That is to be expected because it accesses more data, causing more cache misses.

We have established that the binary search is the net winner over the linear search. Let us now compare with the SIMD Quad algorithm.

Intel/GCC:

Apple/LLVM

The results differ markedly between the Intel and Apple platforms. On the Intel platform, the SIMD Quad is more than twice as fast as the binary search on the warm cache. The benefits are smaller on the cold cache. On the Apple platform, the reverse is true: it is with the cold cache that the SIMD Quad is more than twice as fast, whereas the benefits are more marginal on the warm cache.

But the important point is that, in all instances, SIMD Quad is faster than the binary search.

The SIMD component of the algorithm is rather straightforward: we use specialized instructions that save work. So it is easy to see why it might make things faster. There are fewer instructions and fewer branches.

But what about the ‘quad’ part? Does it matter? So I tried a binary version of the same algorithm. It has the same SIMD optimization, but I am dropping the quaternary interpolation search and replacing it with a standard binary search.

Intel/GCC:

Apple/LLVM

To put it in simple terms, the quad approach has little effect on the Apple platform, but it is a decent optimization on the Intel platform for large arrays in the cold case. The quaternary search better exploits the memory-level parallelism on my Intel server.

My source code is available.

Conclusion. What my results suggest is that while a textbook binary search is a decent algorithm, you can do better in ways that matter. Standard algorithms were often not designed for computers that have so much parallelism. The SIMD Quad algorithm tries to leverage both the memory-level and data parallelism. Further, I suspect that we can do even better than my algorithm. Let us get creative!

Further reading: Faster intersections between sorted arrays with shotgun

Appendix (source code)

#include <cstdint>
#ifdef __ARM_NEON
#include <arm_neon.h>
#else
#include <emmintrin.h> // SSE2
#endif

bool simd_quad(const uint16_t *carr, int32_t cardinality, 
            uint16_t pos) {
    constexpr int32_t gap = 16;
    if (cardinality < gap) {
        for (int32_t j = 0; j < cardinality; j++) {
            if (carr[j] == pos) return true;
        }
        return false;
    }
    int32_t num_blocks = cardinality / gap;
    int32_t base = 0;
    int32_t n = num_blocks;
    while (n > 3) {
      int32_t quarter = n >> 2;

      int32_t k1 = carr[(base + quarter + 1) * gap - 1];
      int32_t k2 = carr[(base + 2 * quarter + 1) * gap - 1];
      int32_t k3 = carr[(base + 3 * quarter + 1) * gap - 1];

      int32_t c1 = (k1 < pos);
      int32_t c2 = (k2 < pos);
      int32_t c3 = (k3 < pos);

      base += (c1 + c2 + c3) * quarter;
      n -= 3 * quarter;
    }
    while (n > 1) {
        int32_t half = n >> 1;
        base = (carr[(base + half + 1) * gap - 1] < pos) 
                 ? base + half : base;
        n -= half;
    }
    int32_t lo = (carr[(base + 1) * gap - 1] < pos) 
                ? base + 1 : base;

    if (lo < num_blocks) {
        const uint16_t *blk = carr + lo * gap;
#ifdef __ARM_NEON
        uint16x8_t needle = vdupq_n_u16(pos);
        uint16x8_t v0 = vld1q_u16(blk);
        uint16x8_t v1 = vld1q_u16(blk + 8);
        uint16x8_t hit = vorrq_u16(vceqq_u16(v0, needle), 
                  vceqq_u16(v1, needle));
        return vmaxvq_u16(hit) != 0;
#else
        __m128i needle = _mm_set1_epi16((short)pos);
        __m128i v0 = _mm_loadu_si128((const __m128i *)blk);
        __m128i v1 = _mm_loadu_si128((const __m128i *)(blk + 8));
        __m128i hit = _mm_or_si128(_mm_cmpeq_epi16(v0, needle),
                                   _mm_cmpeq_epi16(v1, needle));
        return _mm_movemask_epi8(hit) != 0;
#endif
    }

    for (int32_t j = num_blocks * gap; j < cardinality; j++) {
        uint16_t v = carr[j];
        if (v >= pos) return (v == pos);
    }
    return false;
}

The fastest way to match characters on ARM processors?

Daniel Lemire — Sun, 19 Apr 2026 20:41:04 +0000

Consider the following problem. Given a string, you must match all of the ASCII white-space characters (\t, \n, \r, and the space) and some characters important in JSON (:, ,, [, ], {, }). JSON is a text-based data format used for web services. A toy JSON document looks as follows.

{
  "name": "Alice",
  "age": 30,
  "email": "alice@example.com",
  "tags": ["developer", "python", "open-source"],
  "active": true
}

We want to solve this problem using SIMD (single-instruction-multiple-data) instructions. With these instructions, you can compare a block of 16 bytes with another block of 16 bytes in one instruction.

It is a subproblem in the fast simdjson JSON library when we index a JSON document. We call this task vectorized classification. We also use the same technique when parsing DNS records, and so forth. In the actual simdjson library, we must also handle strings and quotes, and it gets more complicated.

I need to define what I mean by ‘matching’ the characters. In my case, it is enough to get, for each block of 64 bytes, two 64-bit masks: one for spaces and one for important characters. To illustrate, let me consider a 16-byte variant:

{"name": "Ali" }
1000000100000001 // important characters
0000000010000010 // spaces

Thus, I want to get back the numbers 0b1000000100000001 and 0b0000000010000010 in binary format (they are 33025 and 130 in decimal).
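To make the target output concrete, here is a plain Python sketch that computes the two masks for the 16-byte example one byte at a time. It has nothing to do with the SIMD implementation; the names STRUCTURAL, WHITESPACE and classify are my own.

```python
STRUCTURAL = set(b":,[]{}")   # the important JSON characters
WHITESPACE = set(b"\t\n\r ")  # ASCII white-space characters


def classify(block: bytes):
    """Return (important, spaces) as bit strings, first byte leftmost."""
    important = "".join("1" if byte in STRUCTURAL else "0" for byte in block)
    spaces = "".join("1" if byte in WHITESPACE else "0" for byte in block)
    return important, spaces


if __name__ == "__main__":
    important, spaces = classify(b'{"name": "Ali" }')
    print(important, int(important, 2))
    print(spaces, int(spaces, 2))
```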

I refer you to Langdale and Lemire (2019) for how to do it using the conventional SIMD instructions available on ARM processors (NEON). Their key idea is a table-driven, branch-free classifier: for each byte, use SIMD table lookups to map each nibble to a bitmask, and compare to decide whether the byte belongs to a target set (whitespace or structural JSON characters). This avoids doing many separate equality comparisons per character.

There is now a better way on recent ARM processors.

The 128-bit version of NEON was introduced in 2011 with the ARMv8-A architecture (AArch64). Apple played an important role and it was first used by the Apple A7 chip in the iPhone 5S. You can count on all 64-bit ARM processors to support NEON, which is convenient. (There are 32-bit ARM processors but they are mostly used for embedded systems, not mainstream computing.)

ARM NEON is good but getting old. It is no match for the AVX-512 instruction set available on x64 (AMD and Intel) processors. Not only do the AVX-512 instructions support wider registers (64 bytes as opposed to ARM NEON’s 16 bytes), but they also have more powerful instructions.

But ARM has something else to offer: Scalable Vector Extension (SVE) and its successor, SVE2. Though SVE was first introduced in 2016, it took until 2022 before we had actual access. The Neoverse V1 architecture used by the Amazon Graviton 3 is the first one I had access to. Soon after, we got SVE2 with the Neoverse V2 and N2 architectures. Today it is readily available: the Graviton4 on AWS, the Microsoft Cobalt 100 on Azure, the Google Axion on Google Cloud (and newer Google Cloud ARM CPUs), the NVIDIA Grace CPU, as well as several chips from Qualcomm, MediaTek, and Samsung. Notice who I am not including? Apple. For unclear reasons, Apple has not yet adopted SVE2.

I have mixed feelings about SVE/SVE2. Like RISC-V, it breaks with the approach from ARM NEON and x64 SIMD that uses fixed-length register sizes (16 bytes, 32 bytes, 64 bytes). This means that you are expected to code without knowing how wide the registers are.

This is convenient for chip makers because it gives them the option of adjusting the register size to better suit their market. Yet it seems to have failed. While the Graviton 3 processor from Amazon had 256-bit registers… all commodity chips have had 128-bit registers after that.

On the plus side, SVE/SVE2 has masks a bit like AVX-512, so you can load and process data only in a subset of the registers. It solves a long-standing problem with earlier SIMD instruction sets where the input is not a multiple of the register size. Both SVE/SVE2 and AVX-512 might make tail handling nicer. Being able to operate on only part of the register allows clever optimizations. Sadly, SVE/SVE2 does not allow you to move masks to and from a general-purpose register efficiently, unlike AVX-512. And that’s a direct consequence of their design with variable-length registers. Thus, even though your registers might always be 128-bit and contain 16 bytes, the instruction set is not allowed to assume that a mask fits in a 16-bit word.

I was pessimistic regarding SVE/SVE2 until I learned that it is designed to be interoperable with ARM NEON. Thus you can use the SVE/SVE2 instructions with your ARM NEON code. This works especially well if you know that the SVE/SVE2 registers match the ARM NEON registers (16 bytes).

For the work I do, there are two SVE2 instructions that are important: match and nmatch. In their 8-bit versions, what they do is the following: given two vectors a and b, each containing up to 16 bytes, match sets a predicate bit to true for each position i where a[i] equals any of the bytes in b. In other words, b acts as a small lookup set, and match tests set membership for every byte of a simultaneously. The nmatch instruction is the logical complement: it sets a predicate bit to true wherever a[i] does not match any byte in b. A single instruction thus replaces a series of equality comparisons and OR-reductions that would otherwise be needed. In the code below, op_chars holds the 6 structural JSON characters and ws_chars holds the 4 whitespace characters; calling svmatch_u8 once on a 16-byte chunk d produces a predicate that has a true bit exactly where the input byte is a structural character. The code uses SVE2 intrinsics: compiler-provided C/C++ functions that map almost one-to-one to CPU SIMD instructions, so you get near-assembly control without writing assembly.

// : , [ ] { }
uint8_t op_chars_data[16] = {
    0x3a, 0x2c, 0x5b, 0x5d, 0x7b, 0x7d, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0
};
// \t \n \r ' '
uint8_t ws_chars_data[16] = {
    0x09, 0x0a, 0x0d, 0x20, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0
};

// load the characters in SIMD registers
svuint8_t op_chars = svld1_u8(svptrue_b8(), op_chars_data);
svuint8_t ws_chars = svld1_u8(svptrue_b8(), ws_chars_data);

// load data
// const uint8_t *input = ... (svld1_u8 expects a pointer to bytes)
svbool_t pg = svptrue_pat_b8(SV_VL16);
svuint8_t d = svld1_u8(pg, input);

// matching
svbool_t op = svmatch_u8(pg, d, op_chars);
svbool_t ws = svmatch_u8(pg, d, ws_chars);

In this code snippet, svuint8_t is an SVE vector type containing unsigned 8-bit lanes (bytes). svbool_t is an SVE predicate (mask) type. svptrue_b8() builds a predicate where all 8-bit lanes are active, and svld1_u8(pg, ptr) loads bytes from memory into an SVE vector, using predicate pg to decide which lanes are actually read.

If you paid attention thus far, you might have noticed that my code is slightly wrong since I am including 0 in the character sets. But it is fine as long as I assume that the zero byte is not present in the input. In practice, I could just repeat one of the characters, or use a bogus character that I do not expect to see in my inputs (such as the byte value 0xFF, which cannot appear in a valid UTF-8 string).
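To make the semantics concrete, here is a small Python model of what match and nmatch compute on a 16-byte chunk. It is only a sketch of the behaviour described above, not SVE2 code: the names svmatch_model and svnmatch_model are made up for illustration, and the model assumes that all 16 lanes are active. The character sets are padded by repeating one of their characters rather than with zeros.

```python
def svmatch_model(data, charset):
    """Model of match: one boolean per input byte, true when the byte
    is equal to any byte of charset (assumes all lanes are active)."""
    members = set(charset)
    return [byte in members for byte in data]


def svnmatch_model(data, charset):
    """Model of nmatch: the logical complement of match."""
    return [not hit for hit in svmatch_model(data, charset)]


# Pad to 16 bytes by repeating a character from the set instead of using 0.
OP_CHARS = b":,[]{}" + b":" * 10
WS_CHARS = b"\t\n\r " + b" " * 12

chunk = b'{"a": [1, 2]}  \n'  # exactly 16 bytes of input
op = svmatch_model(chunk, OP_CHARS)
ws = svmatch_model(chunk, WS_CHARS)
structural_positions = [i for i, hit in enumerate(op) if hit]
whitespace_positions = [i for i, hit in enumerate(ws) if hit]
```

Each list plays the role of an SVE predicate: one true/false flag per byte of the chunk.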

In standard SVE/SVE2, op and ws are predicates, not integer masks. A practical trick is to materialize each predicate as bytes (0xFF for true, 0x00 for false), for example with svdup_n_u8_z.

svuint8_t opm = svdup_n_u8_z(op, 0xFF);
svuint8_t wsm = svdup_n_u8_z(ws, 0xFF);

When SVE vectors are 128 bits, this byte vector maps naturally to a NEON uint8x16_t via svget_neonq_u8, and from there we can build scalar bitmasks efficiently with NEON operations (masking plus pairwise additions). Repeating this over four 16-byte chunks gives the two 64-bit masks needed for a 64-byte block.
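The "masking plus pairwise additions" step can be modelled in a few lines of Python. This is a sketch under assumptions, not the benchmarked code: the function names are hypothetical, each list stands in for a 16-byte NEON register, and pairwise_add keeps only the useful half of what a NEON pairwise addition would produce.

```python
# Each byte lane keeps a single distinct bit, so sums never carry.
BIT_WEIGHTS = [1, 2, 4, 8, 16, 32, 64, 128] * 2


def pairwise_add(lanes):
    """Add adjacent lanes, halving the number of lanes (8-bit wraparound)."""
    return [(lanes[i] + lanes[i + 1]) & 0xFF for i in range(0, len(lanes), 2)]


def bytes_to_bitmask16(mask_bytes):
    """Turn 16 bytes (0xFF for true, 0x00 for false) into a 16-bit integer
    where bit i is set when byte i is 0xFF."""
    lanes = [m & w for m, w in zip(mask_bytes, BIT_WEIGHTS)]
    lanes = pairwise_add(lanes)  # 16 lanes -> 8 lanes
    lanes = pairwise_add(lanes)  # 8 lanes -> 4 lanes
    lanes = pairwise_add(lanes)  # 4 lanes -> 2 lanes (low byte, high byte)
    return lanes[0] | (lanes[1] << 8)


# True at positions 0, 4, 6, 8, 11 and 12.
example = [0xFF if i in (0, 4, 6, 8, 11, 12) else 0x00 for i in range(16)]
mask = bytes_to_bitmask16(example)
```

Four such 16-bit masks, shifted by 0, 16, 32 and 48 bits and combined with OR, would give one 64-bit mask for a 64-byte block.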

I wanted to quickly run my benchmarks on an AWS Graviton 4. I used LLVM clang 20 which was readily available in the images that AWS makes available (I picked RedHat 10).

The AWS Graviton 4 processor is a Neoverse V2 processor. Google has its own Neoverse V2 processors in its cloud. In my tests, it ran at 2.8 GHz.

My benchmark generates a random string of 1 MiB and computes the bitmaps indicating the positions of the characters. It is available on GitHub. My results are as follows.

method	GB/s	instructions/byte	instructions/cycle
simdjson (NEON)	15.5	0.75	4.16
SVE/SVE2 (new!)	16.0	0.55	3.13

So the SVE/SVE2 approach is faster than the NEON equivalent and uses 25% fewer instructions, and that’s without any kind of fancy optimization. Importantly, the code is relatively simple thanks to the match instruction.

It might be that the SVE2 function match is the fastest way to match characters on ARM processors.

Credit: This post was motivated by a sketch by user liuyang-664 on GitHub.

References

Langdale, G., & Lemire, D. (2019). Parsing gigabytes of JSON per second. The VLDB Journal, 28(6), 941-960.

Koekkoek, J., & Lemire, D. (2025). Parsing millions of DNS records per second. Software: Practice and Experience, 55(4), 778-788.

Lemire, D. (2025). Scanning HTML at Tens of Gigabytes Per Second on ARM Processors. Software: Practice and Experience, 55(7), 1256-1265.

A brief history of C/C++ programming languages

Daniel Lemire — Thu, 09 Apr 2026 14:58:53 +0000

Initially, we had languages like Fortran (1957), Pascal (1970), and C (1972). Fortran was designed for number crunching and scientific computing. Pascal was restrictive with respect to low-level access (it was deliberately “safe”, as meant for teaching structured programming). So C won out as a language that allowed low-level/unsafe programming (pointer arithmetic, direct memory access) while remaining general-purpose enough for systems work like Unix. To be fair, Pascal had descendants that are still around, but C clearly dominated.

Object-oriented programming became viewed as the future in the 1980s and 1990s. It turned into some kind of sect.

But C was not object-oriented.

So we got C++, which began as “C with Classes”. C++ had templates, enabling generic programming and compile-time metaprogramming. This part of the language makes C++ quite powerful, but somewhat difficult to master (with crazy error messages).

Both C and C++ became wildly successful, but writing portable applications remained difficult — you often had to target Windows or a specific Unix variant. This was a problem for a company like Sun Microsystems that sold Unix boxes and wanted to compete against the juggernaut that Microsoft was becoming.

So Java came along in 1995. It was positioned as a safe, portable alternative to C++: it eliminated raw pointer arithmetic, added mandatory garbage collection, array bounds checking everywhere, and ran on a virtual machine (JVM) with just-in-time compilation for performance.

The “write once, run anywhere” promise addressed C/C++ portability pain points directly. To this day, Java remains a strong solution for writing portable enterprise and server-side code.

We also got JavaScript in 1995. Despite the name, it has almost nothing in common with Java semantically. It is best viewed as separate from the C/C++ branch. Python is similarly quite different.

Microsoft would eventually come up with C# in 2000. It belongs to the same C-family syntax tradition as C++ and Java, but with support for ahead-of-time compilation in modern .NET. It also allows guarded pointer access within explicitly marked unsafe scopes. At this point, C# can be seen as “C++ with garbage collection” in spirit. It even competes against C++ in the game industry thanks to Unity.

Google came up with Go. It is much like a simpler, modern C: garbage-collected, with built-in bounds checking on slices/arrays, and pointers allowed but without arbitrary arithmetic in safe code (the unsafe package exists for low-level needs).

Later, Apple came up with Swift. It has C++-like performance and syntax goals but adds modern safety features (bounds checking by default, integer overflow that traps at runtime by default) and uses Automatic Reference Counting (ARC) for memory management. Swift replaced Objective-C but I still view it as a C++ successor.

At about the same time, we got Rust. Like Swift, it drops the tracing garbage collection used by Java, C# and Go. It relies instead on compile-time ownership and borrowing rules, with the tradeoff that you can leak memory with reference cycles. We also got Zig which makes memory usage fully explicit.

I think that it is fairer to describe Rust and Zig as descendants of C rather than C++. Both are much more powerful than C, of course… and the evolution of programming languages is complex. Still. They are C-like programming languages.

To this day, in much of the industry, the dominant programming languages for performance-critical, systems, enterprise, and infrastructure work remain C, C++, Java, and C#. By the Lindy effect (the longer something has survived, the longer it is likely to continue surviving), these languages, especially C, now over 50 years old, are still going to be around for a long time.

Can your AI rewrite your code in assembly?

Daniel Lemire — Sun, 05 Apr 2026 21:16:14 +0000

Suppose you have several strings and you want to count the number of instances of the character ! in your strings. In C++, you might solve the problem as follows if you are an old-school programmer.

size_t c = 0;
for (const auto &str : strings) {
    c += std::count(str.begin(), str.end(), '!');
}

You can also get fancier with ranges.

for (const auto &str : strings) {
    c += std::ranges::count(str, '!');
}

And so forth.

But what if you want to go faster? Maybe you’d want to rewrite this function in assembly. I decided to do so, and to have fun using both Grok and Claude as my AIs, setting up a friendly competition.

I started with my function and then I asked AIs to optimize it in assembly. Importantly, they knew which machine I was on, so they started to write ARM assembly.

By repeated prompting, I got the following functions.

count_classic: Uses C++ standard library std::count for reference.
count_assembly: A basic ARM64 assembly loop (byte-by-byte comparison). Written by Grok.
count_assembly_claude: Claude’s SIMD-optimized version using NEON instructions (16-byte chunks).
count_assembly_grok: Grok’s optimized version (32-byte chunks).
count_assembly_claude_2: Claude’s further optimized version (64-byte chunks with multiple accumulators).
count_assembly_grok_2: Grok’s latest version (64-byte chunks with improved accumulator handling).
count_assembly_claude_3: Claude’s most advanced version with additional optimizations.

You get the idea.

So, how is the performance? I use random strings of up to 1 kilobyte. In all cases, I test that the functions provide the correct count. I did not closely examine the code, so it is possible that mistakes could be hiding in the code.

I record the average number of instructions per string.

name	instructions/string
classic C++	1200
claude assembly	250
grok assembly	204
claude assembly 2	183
grok assembly 2	176
claude assembly 3	154

By repeated optimization, I reduced the number of instructions by a factor of eight. The running time decreases similarly.

Can we get the AIs to rewrite the best option in C? Yes, although you need SIMD intrinsics. So there is no benefit to leaving the code in assembly in this instance.

An open question is whether the AIs could find optimizations that are not possible if we use a higher-level language like C or C++. It is an intriguing question that I will seek to answer later. For the time being, the AIs can beat my C++ compiler!

My source code is available.

A Fast Immutable Map in Go

Daniel Lemire — Sun, 29 Mar 2026 18:18:01 +0000

Consider the following problem. You have a large set of strings, maybe millions. You need to map these strings to 8-byte integers (uint64). These integers are given to you.

If you are working in Go, the standard solution is to create a map. The construction is trivial, something like the following loop.

m := make(map[string]uint64, N)
for i, k := range keys {
    m[k] = values[i]
}

One downside is that the map may use over 50 bytes per entry.

In important scenarios, we might have the following conditions. The map is large (a million entries or more), you do not need to modify it dynamically (it is immutable), and all queried keys are in the set. In such conditions, you can reduce the memory usage down to almost the size of the values, so about 8 bytes per entry. One fast technique relies on binary fuse filters.

I implemented it as a Go library called constmap that provides an immutable map from strings to uint64 values using binary fuse filters. This data structure is ideal when you have a fixed set of keys at construction time and need fast, memory-efficient lookups afterward. You can even construct the map once and save it to disk, so that you do not pay the cost of constructing the map each time you need it.

The usage is just as simple.

package main

import (
    "fmt"
    "log"

    "github.com/lemire/constmap"
)

func main() {
    keys := []string{"apple", "banana", "cherry"}
    values := []uint64{100, 200, 300}

    cm, err := constmap.New(keys, values)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(cm.Map("banana")) // 200
}

The construction time is higher (as expected for any compact data structure), but lookups are optimized for speed. I ran benchmarks on my Apple M4 Max processor to compare constmap lookups against Go’s built-in map[string]uint64. The test uses 1 million keys.

Data Structure	Lookup Time	Memory Usage
ConstMap	7.4 ns/op	9 bytes/key
Go Map	20 ns/op	56 bytes/key

ConstMap is nearly 3 times faster than Go’s standard map for lookups! And we reduced the memory usage by a factor of 6.

The ConstMap may not always be faster, but it should always use significantly less memory. If it can reside in CPU cache while the map cannot, then it will be significantly faster.

Source code: the implementation is available on GitHub at github.com/lemire/constmap.

JSON and C++26 compile-time reflection: a talk

Daniel Lemire — Thu, 26 Mar 2026 00:29:38 +0000

The next C++ standard (C++26) is getting exciting new features. One of these features is compile-time reflection. It is ideally suited to serialize and deserialize data at high speed. To test it out, we extended our fast JSON library (simdjson) and we gave a talk at CppCon 2025. The video is out on YouTube.

Our slides are also available.

How many branches can your CPU predict?

Daniel Lemire — Wed, 18 Mar 2026 21:52:53 +0000

Modern processors have the ability to execute many instructions per cycle, on a single core. To be able to execute many instructions per cycle in practice, processors predict branches. I have made the point over the years that modern CPUs have an incredible ability to predict branches.

It makes benchmarking difficult because if you test on small datasets, you can get surprising results that might not work on real data.

My go-to benchmark is a function like so:

while (howmany != 0) {
    val = generate_random_value()
    if(val is odd) write to buffer
    decrement howmany
}

The processor tries to predict the branch (if clause). Because we use random values, the processor should mispredict one time out of two.

However, if we repeat the benchmark multiple times, always using the same random values, the processor learns the branches. How many can processors learn? I test using three recent processors.

The AMD Zen 5 processor can predict perfectly 30,000 branches.
The Apple M4 processor can predict perfectly 10,000 branches.
Intel Emerald Rapids can predict perfectly 5,000 branches.

Once more I am disappointed by Intel. AMD is doing wonderfully well on this benchmark.

My source code is available.

Prefix sums at tens of gigabytes per second with ARM NEON

Daniel Lemire — Sun, 08 Mar 2026 20:09:48 +0000

Suppose that you have a record of your sales per day. You might want to get a running record where, for each day, you are told how many sales you have made since the start of the year.

day	sales per day	running sales
1	10 $	10 $
2	15 $	25 $
3	5 $	30 $

Such an operation is called a prefix sum or a scan.
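As a quick illustration, the running sales from the table can be reproduced in Python with itertools.accumulate from the standard library, which computes exactly this kind of prefix sum. The variable names are made up for the example.

```python
from itertools import accumulate

sales_per_day = [10, 15, 5]  # the three days from the table above
running_sales = list(accumulate(sales_per_day))
print(running_sales)  # [10, 25, 30]
```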

Implementing it in C is not difficult. It is a simple loop.

  for (size_t i = 1; i < length; i++) {
    data[i] += data[i - 1];
  }

How fast can this function be? We can derive a speed limit rather simply: to compute the current value, you must have computed the previous one, and so forth.

data[0] -> data[1] -> data[2] -> ...

At best, you require one CPU cycle per entry in your table. Thus, on a 4 GHz processor, you might process 4 billion integer values per second. It is an upper bound but you might be able to reach close to it in practice on many modern systems. Of course, there are other instructions involved such as loads, stores and branching, but our processors can execute many instructions per cycle and they can predict branches effectively. So you should be able to process billions of integers per second on most processors today.

Not bad! But can we do better?

We can use SIMD instructions. SIMD instructions are special instructions that process several values at once. All 64-bit ARM processors support NEON instructions. NEON instructions can process four integers at once, if they are packed in one SIMD register.

But how do you do the prefix sum on a 4-value register? You can do it with two shifts and two additions. In theory, it scales as log(N) where N is the number of elements in a vector register.

input   = [A B   C     D]
shift1  = [0 A   B     C]
sum1    = [A A+B B+C   C+D]
shift2  = [0 0   A     A+B]
result  = [A A+B A+B+C A+B+C+D]

You can then extract the last value (A+B+C+D) and broadcast it to all positions so that you can add it to the next value.
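The shift-and-add recipe can be checked with a short Python model, where a list stands in for a vector register. This is a sketch of the idea rather than NEON code, and the function names are hypothetical. The loop doubles the shift distance at each step, which is where the log(N) scaling comes from.

```python
def shift_lanes(register, k):
    """Move every lane k positions toward higher indexes, filling with zeros."""
    return [0] * k + register[:len(register) - k]


def simd_style_prefix_sum(register):
    """Prefix sum using log2(N) shift-and-add steps over N lanes."""
    k = 1
    while k < len(register):
        shifted = shift_lanes(register, k)
        register = [a + b for a, b in zip(register, shifted)]
        k *= 2
    return register


result = simd_style_prefix_sum([1, 2, 3, 4])
print(result)  # [1, 3, 6, 10]
```

With four lanes, the loop runs twice (shift by 1, then shift by 2), matching the two shifts and two additions described above.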

Is this faster than the scalar approach? We have 4 instructions in sequence, plus at least one instruction if you want to use the total sum in the next block of four values.

Thus the SIMD approach might be worse. It is disappointing.

A solution might be to scale up over many more integer values.

Consider ARM NEON, which has interleaved load and store instructions. You can load 16 values at once, and get all of the first values together, all of the second values together, and so forth.

original data : ABCD EFGH IJKL MNOP
loaded data   : AEIM BFJN CGKO DHLP

Then I can do a prefix sum over the four blocks in parallel. It takes three instructions. At the end of the three instructions, we have one register which contains the local sums:

A+B+C+D E+F+G+H I+J+K+L M+N+O+P

And then we can apply our prefix sum recipe on this register (4 instructions). You might end up with something like 8 sequential instructions per block of 16 values.

It is theoretically twice as fast as the scalar approach.

In C with intrinsics, you might code it as follows.

void neon_prefixsum_fast(uint32_t *data, size_t length) {
  uint32x4_t zero = {0, 0, 0, 0};
  uint32x4_t prev = {0, 0, 0, 0};

  for (size_t i = 0; i < length / 16; i++) {
    uint32x4x4_t vals = vld4q_u32(data + 16 * i);

    // Prefix sum inside each transposed ("vertical") lane
    vals.val[1] = vaddq_u32(vals.val[1], vals.val[0]);
    vals.val[2] = vaddq_u32(vals.val[2], vals.val[1]);
    vals.val[3] = vaddq_u32(vals.val[3], vals.val[2]);

    // Now vals.val[3] contains the four local prefix sums:
    //   vals.val[3] = [s0=A+B+C+D, s1=E+F+G+H, 
    //                  s2=I+J+K+L, s3=M+N+O+P]

    // Compute prefix sum across the four local sums 
    uint32x4_t off = vextq_u32(zero, vals.val[3], 3);
    uint32x4_t ps = vaddq_u32(vals.val[3], off);       
    off = vextq_u32(zero, ps, 2);                      
    ps = vaddq_u32(ps, off);

    // Now ps contains cumulative sums across the four groups
    // Add the incoming carry from the previous 16-element block
    ps = vaddq_u32(ps, prev);

    // The add vector to apply to the original lanes is the
    // prefix up to the previous group: [carry, ps0, ps1, ps2].
    // It must be built before prev is updated, because its first
    // lane is the carry from the previous block.
    uint32x4_t add = vextq_u32(prev, ps, 3);

    // Prepare carry for next block: broadcast the last lane of ps
    prev = vdupq_laneq_u32(ps, 3);

    // Apply carry/offset to each of the four transposed lanes
    vals.val[0] = vaddq_u32(vals.val[0], add);
    vals.val[1] = vaddq_u32(vals.val[1], add);
    vals.val[2] = vaddq_u32(vals.val[2], add);
    vals.val[3] = vaddq_u32(vals.val[3], add);

    // Store back the four lanes (interleaved)
    vst4q_u32(data + 16 * i, vals);
  }

  scalar_prefixsum_leftover(data, length, 16);
}
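To check the logic without an ARM machine, the same steps can be modelled in Python and compared against a plain prefix sum. This is a sketch under assumptions, not a port of the benchmarked code: lists stand in for registers, the function name is hypothetical, and the tail is handled with a simple scalar loop.

```python
from itertools import accumulate

MASK = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound


def prefix_sum_blocks_of_16(data):
    """Model of the transposed prefix sum; returns a new list."""
    out = list(data)
    carry = 0  # running total of all previous blocks
    full = len(out) // 16 * 16
    for base in range(0, full, 16):
        block = out[base:base + 16]
        # De-interleave: vals[j][g] is element j of group g (like vld4q_u32).
        vals = [[block[4 * g + j] for g in range(4)] for j in range(4)]
        # Three vertical additions: prefix sum inside each group of four.
        for j in range(1, 4):
            vals[j] = [(a + b) & MASK for a, b in zip(vals[j], vals[j - 1])]
        # vals[3] holds the four group totals: scan them, add the carry.
        ps = [(s + carry) & MASK for s in accumulate(vals[3])]
        # Offset per group: the incoming carry, then ps[0], ps[1], ps[2].
        add = [carry] + ps[:3]
        carry = ps[3]
        for j in range(4):
            vals[j] = [(a + b) & MASK for a, b in zip(vals[j], add)]
        # Re-interleave (like vst4q_u32).
        for g in range(4):
            for j in range(4):
                out[base + 4 * g + j] = vals[j][g]
    # Scalar loop for the leftover values.
    for i in range(max(full, 1), len(out)):
        out[i] = (out[i] + out[i - 1]) & MASK
    return out


data = list(range(1, 40))  # 39 values: two full blocks plus a tail
expected = list(accumulate(data))
```

Note that the offset for the first group is the carry from the previous block, so it has to be read before the carry is replaced by the new total.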

Let us try it out on an Apple M4 processor (4.5 GHz).

method	billions of values/s
scalar	3.9
naive SIMD	3.6
fast SIMD	8.9

So the SIMD approach is about 2.3 times faster than the scalar approach. Not bad.

My source code is available on GitHub.

Appendix. Intrinsics

Intrinsic	What it does
`vld4q_u32`	Loads 16 consecutive 32-bit unsigned integers from memory and deinterleaves them into 4 separate `uint32x4_t` vectors (vector 0 = elements 0, 4, 8, 12; vector 1 = elements 1, 5, 9, 13; etc.).
`vaddq_u32`	Adds corresponding 32-bit unsigned integer lanes from two vectors (`a[i] + b[i]` for each of 4 lanes).
`vextq_u32`	Concatenates `a` and `b` into 8 lanes, then extracts 4 consecutive lanes starting at lane `n`. Used to implement lane shifts that insert zeros (when `a` is the zero vector).
`vdupq_laneq_u32`	Broadcasts (duplicates) the value from the specified lane (0–3) of the input vector to all 4 lanes of the result.
`vdupq_n_u32` (implied usage)	Sets all 4 lanes of the result to the same scalar value (commonly used for zero or broadcast).

Text formats are everywhere. Why?

Daniel Lemire — Thu, 05 Mar 2026 14:40:58 +0000

The Internet relies on text formats. Thus, we spend a lot of time producing and consuming data encoded in text.

Your web pages are HTML. The code running in them is JavaScript, sent as text (JavaScript source), not as already-parsed code. Your emails, including their attachments, are sent as text (your binary files are sent as text).

It does not stop there. The Python code that runs your server is stored as text. It queries data by sending text queries. It often gets back the answer as text that must then be decoded.

JSON is the universal data interchange format online today. We share maps as JSON (GeoJSON).

Not everything is text, of course. There is no common video or image format that is shared as text. Transmissions over the Internet are routinely compressed to binary formats. There are popular binary formats that compete with JSON.

But why is text dominant?

It is not because, back in the 1970s, programmers did not know about binary formats.

In fact, we did not start with text formats. Initially, we worked with raw binary data. Those of us old enough will remember programming in assembly using raw byte values.

Why did text win?

1. Text is efficient.

In the XML era, when everything had to be XML, there were countless proposals for binary formats. People were sometimes surprised to find that the binary approach was not much faster in practice. Remember that many text formats date back to an era when computers were much slower. Had text been a performance bottleneck, it would not have spread. Of course, there are cases where text makes things slower. You then have a choice: optimize your code further or transition to another format. Often, both are viable.

It is easy to make wrong assumptions about binary formats, such as that you can consume them without any parsing or validation. If you pick up data from the Internet, you must assume that it could have been sent by an adversary or someone who does not follow your conventions.

2. Text is easy to work with.

If you receive text from a remote source, you can often transform it, index it, search it, quote it, version it… with little effort and without in-depth knowledge of the format. Text is often self-documenting.

In an open world, when you will never speak with the person producing the data, text often makes everything easier and smoother.

If there is an issue to report and the data is in text, you can usually copy-paste the relevant section into a message. Things are much harder with a binary format.

You can use newline characters in URLs

Daniel Lemire — Sat, 28 Feb 2026 19:21:39 +0000

We locate web content using special addresses called URLs. We are all familiar with addresses like https://google.com. Sometimes, URLs can get long and they can become difficult to read. Thus, we might be tempted to format them in HTML using newline and tab characters, like so:

<a href="https://lemire.me/blog/2026/02/21/
        how-fast-do-browsers-correct-utf-16-strings/">my blog post</a>

It will work.

Let us refer to the WHATWG URL specification that browsers follow. It makes two statements in sequence.

If input contains any ASCII tab or newline, invalid-URL-unit validation error.
Remove all ASCII tab or newline from input.

Notice how it reports an error if there is a tab or newline character, but continues anyway? The specification says that "A validation error does not mean that the parser terminates" and it encourages systems to report errors somewhere. Effectively, the error is ignored although it might be logged. Thus our HTML is fine in practice.

The following is also fine:

<a href="https://go
ogle.c
om" class="button">Visit Google</a>

You can also use tabs. But you cannot arbitrarily insert any other whitespace.
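The removal step from the specification is easy to model. The following Python sketch covers only that one step (dropping ASCII tab, line feed and carriage return), not a full URL parser; the function name is made up for the example.

```python
# ASCII tab (U+0009), line feed (U+000A) and carriage return (U+000D).
TAB_OR_NEWLINE = {0x09: None, 0x0A: None, 0x0D: None}


def remove_tab_and_newline(url):
    """Drop every ASCII tab or newline, wherever it appears in the input."""
    return url.translate(TAB_OR_NEWLINE)


cleaned = remove_tab_and_newline("https://go\nogle.c\nom")
print(cleaned)  # https://google.com
```

A space inside the string would be left untouched by this step, which is why other whitespace cannot be inserted arbitrarily.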

Yet there are cases when you can use any ASCII whitespace character: data URLs. Data URLs (also called data URIs) embed small files—like images, text, or other content—directly inside a URL string, instead of linking to an external resource. Data URLs are a special kind of URL and they follow different rules.

A typical data URL might look like data:image/png;base64,iVBORw0KGgoAAAANSUhEUg... where the string iVBORw0KGgoAAAANSUhEUg... is the binary data of the image that has been encoded with base64. Base64 is a text format that can represent any binary content: we use 64 ASCII characters so that each character encodes 6 bits. Your binary email attachments are base64 encoded.

On the web, when decoding a base64 string, you ignore all ASCII whitespace (including the space character itself). Thus you can embed a PNG image in HTML as follows.

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA
                                 QAAAAECAIAAAAmkwkpAAAAEUl
                                 EQVR4nGP8z4AATEhsPBwAM9EB
                                 BzDn4UwAAAAASUVORK5CYII=" />

This HTML code is valid and will insert a tiny image in your page.
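You can check that the whitespace is harmless by decoding the same payload outside the browser. The Python sketch below strips ASCII whitespace and then decodes with the standard base64 module. It is a simplified model of what browsers do, and the function name is hypothetical.

```python
import base64

# ASCII whitespace: tab, line feed, form feed, carriage return, space.
ASCII_WHITESPACE = {0x09: None, 0x0A: None, 0x0C: None, 0x0D: None, 0x20: None}


def decode_ignoring_whitespace(text):
    """Remove ASCII whitespace, then decode strictly as base64."""
    cleaned = text.translate(ASCII_WHITESPACE)
    return base64.b64decode(cleaned, validate=True)


# The payload of the data URL above, with its line breaks and indentation.
PNG_BASE64 = """iVBORw0KGgoAAAANSUhEUgAAAA
                QAAAAECAIAAAAmkwkpAAAAEUl
                EQVR4nGP8z4AATEhsPBwAM9EB
                BzDn4UwAAAAASUVORK5CYII="""

png = decode_ignoring_whitespace(PNG_BASE64)
# Every PNG file starts with the same 8-byte signature.
is_png = png[:8] == b"\x89PNG\r\n\x1a\n"
```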

But there is more. A data URL can also be used to insert an SVG image. SVG (Scalable Vector Graphics) is an XML-based vector image format that describes 2D graphics using mathematical paths, shapes, and text instead of pixels. The following should draw a very simple sunset:

<img src='data:image/svg+xml,

   
  
    
  
    
  
' />

Observe how I was able to format the SVG code so that it is readable.

Further reading: Nizipli, Y., & Lemire, D. (2024). Parsing millions of URLs per second. Software: Practice and Experience, 54(5), 744-758.

How fast do browsers correct UTF-16 strings?

Daniel Lemire — Sat, 21 Feb 2026 20:07:17 +0000

JavaScript represents strings using Unicode, like most programming languages today. Each character in a JavaScript string is stored using one or two 16-bit words. The following JavaScript code might surprise some programmers because a single character becomes two 16-bit words.

> t="🧰"
'🧰'
> t.length
2
> t[0]
'\ud83e'
> t[1]
'\uddf0'

The convention is that \uddf0 is the 16-bit value 0xDDF0 also written U+DDF0.

The UTF-16 standard is relatively simple. There are three types of values: high surrogates (the range U+D800 to U+DBFF), low surrogates (U+DC00 to U+DFFF), and all other code units (U+0000–U+D7FF together with U+E000–U+FFFF). A high surrogate must always be followed by a low surrogate, and a low surrogate must always be preceded by a high surrogate.

What happens if you break the rules and have a high surrogate followed by a high surrogate? Then you have an invalid string. We can correct the strings by patching them: we replace the bad values by the replacement character (\ufffd). The replacement character sometimes appears as a question mark.
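The patching rule can be written down in a few lines. The following Python sketch works on a list of 16-bit code units and replaces every unpaired surrogate with U+FFFD. It is a model of the rule for illustration, with a hypothetical function name, and says nothing about how browsers implement it.

```python
REPLACEMENT = 0xFFFD


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def to_well_formed_model(units):
    """Return a copy of units where each unpaired surrogate becomes U+FFFD."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_high_surrogate(unit):
            if i + 1 < len(units) and is_low_surrogate(units[i + 1]):
                # Valid pair: keep both code units and skip past them.
                out.extend((unit, units[i + 1]))
                i += 2
                continue
            out.append(REPLACEMENT)  # high surrogate without a partner
        elif is_low_surrogate(unit):
            out.append(REPLACEMENT)  # low surrogate without a preceding high
        else:
            out.append(unit)
        i += 1
    return out


fixed = to_well_formed_model([0xDDF0, 0xDDF0])  # two lone low surrogates
```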

To correct a broken string in JavaScript, you can call the toWellFormed method.

> t = '\uddf0\uddf0'
'\uddf0\uddf0'
> t.toWellFormed()
'��'

How fast is it?

I wrote a small benchmark that you can test online to measure its speed. I use broken strings of various sizes up to a few kilobytes. I run the benchmarks on my Apple M4 processor using different browsers.

Browser	Speed
Safari 18.6	1 GiB/s
Firefox 147	3 GiB/s
Chrome 145	15 GiB/s

Quite a range of performance! The speed of other Chromium-based browsers (Brave and Edge) is much the same as Chrome.

I also tested with JavaScript runtimes.

Engine	Speed
Node.js v25.5.0	16 GiB/s
Bun 1.3.9	8.4 GiB/s

Usually Bun is faster than Node, but in this instance, Node is twice as fast as Bun.

Thus, we can correct strings in JavaScript at over ten gigabytes per second if you use Chromium-based browsers.

How bad can Python stop-the-world pauses get?

Daniel Lemire — Sun, 15 Feb 2026 20:02:29 +0000

When programming, we need to allocate memory, and then deallocate it. If you program in C, you get used to malloc/free functions. Sadly, this leaves you vulnerable to memory leaks: unrecovered memory. Most popular programming languages today use automated memory management: Java, JavaScript, Python, C#, Go, Swift and so forth.

There are essentially two types of automated memory management. The simplest method is reference counting. You track how many references there are to each object. When an object has no more references, we can free the memory associated with it. Swift and Python use reference counting.

The downside of reference counting is circular references. You may have your main program reference object A, then you add object B which references object A, and you make it so that object A also references object B. Thus object B has one reference while object A has two references. If your main program drops its reference to object A, then both objects A and B still have a reference count of one. Yet they should be freed. To solve this problem, you could just visit all of your objects to detect which are unreachable, including A and B. However, it takes time to do so.

Thus, the other popular approach to automated memory management: generational garbage collection. You use the fact that most memory gets released soon after allocation. Thus you track young objects and visit them from time to time. Then, more rarely, you do a full scan. The downside of generational garbage collection is that typical implementations stop the world to scan the memory. In many instances, your entire program is stopped. There are many variations on the implementation, with decades of research.

The common Python implementation has both types: reference counting and generational garbage collection. The generational garbage collection component can trigger pauses. A lot of servers are written in Python. It means that your service might just become unavailable for a time. We often call them ‘stop the world’ pauses. How long can this pause get?

To test this out, I wrote a Python function to create a classical linked list:

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
    def add_next(self, node):
        self.next = node

def create_linked_list(limit):
    """ create a linked list of length 'limit' """
    head = Node(0)
    current = head
    for i in range(1, limit):
        new_node = Node(i)
        current.add_next(new_node)
        current = new_node
    return head

I then create one large linked list and, in a tight loop, I create small linked lists that are immediately discarded.

x = create_linked_list(50_000_000)
for i in range(1000000):
    create_linked_list(1000)

A key characteristic of my code is the linked list with 50 million nodes. It does not get released until the end of the program, but the garbage collector may still examine it.

And I record the maximum delay between two iterations in the loop (using time.time()).
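The measurement loop might look like the following sketch. It is not the exact benchmark: the function name and its parameters are made up, and it uses time.perf_counter() rather than time.time() because it is a monotonic clock. The list classes are repeated so that the example is self-contained.

```python
import time


class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


def create_linked_list(limit):
    """Create a linked list of length 'limit' and return its head."""
    head = Node(0)
    current = head
    for i in range(1, limit):
        current.next = Node(i)
        current = current.next
    return head


def max_iteration_delay(big_size, iterations, small_size):
    """Return the longest time (in seconds) between two loop iterations."""
    keep_alive = create_linked_list(big_size)  # long-lived objects
    worst = 0.0
    previous = time.perf_counter()
    for _ in range(iterations):
        create_linked_list(small_size)  # short-lived garbage
        now = time.perf_counter()
        worst = max(worst, now - previous)
        previous = now
    assert keep_alive is not None  # the big list stays alive during the loop
    return worst


# Small sizes so that the sketch runs quickly; the post uses
# 50_000_000 nodes, 1_000_000 iterations and lists of 1000 nodes.
delay = max_iteration_delay(big_size=100_000, iterations=1_000, small_size=1_000)
print(f"max delay: {delay * 1000:.2f} ms")
```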

How bad can it get? The answer depends on the Python version. And it is not consistent from run to run. So I ran it once and picked whatever result I got. I express the delay in milliseconds.

python version	system	max delay
3.14	macOS (Apple M4)	320 ms
3.12	Linux (Intel Ice Lake)	2,200 ms

Almost all of this delay (say 320 ms) is due to the garbage collection. Creating a linked list with 1000 elements takes less than a millisecond.

How long is 320 ms? It is a third of a second, so it is long enough for human beings to notice it. For reference, a video game drawing the screen 60 times per second has less than 17 ms to draw the screen. The 2,200 ms delay could look like a server crash from the point of view of a user, and could well trigger a time-out (failed request).

I ported the Python program to Go. It is the same algorithm, but a direct comparison is likely unfair. Still, it gives us a reference.

go version	system	max delay
1.25	macOS (Apple M4)	50 ms
1.25	Linux (Intel Ice Lake)	33 ms

Thus Go has pauses that are several times shorter than Python, and there is no catastrophic 2-second pause.

Should these pauses be a concern? Most Python programs do not create so many objects in memory at the same time. Thus you are not likely to see these long pauses if you have a simple web app or a script. Python gives you a few options, such as gc.set_threshold and gc.freeze, which could help you tune the behaviour.
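
Here is a hedged sketch of how these two functions might be used. The helper names and the factor of 10 are arbitrary choices for the illustration, not recommendations; whether they help depends on the application. The idea is to freeze the long-lived objects once they are built, so that later collections skip them, and to make collections less frequent.

```python
import gc

def freeze_long_lived_objects():
    """Move all currently tracked objects to a permanent generation.

    Later collections ignore them. Returns the number of frozen objects.
    """
    gc.collect()   # clean up first so that garbage is not frozen too
    gc.freeze()
    return gc.get_freeze_count()

def collect_less_often(factor=10):
    """Raise the first threshold so that collections run less often.

    Returns the previous thresholds so that they can be restored.
    """
    previous = gc.get_threshold()
    gc.set_threshold(previous[0] * factor, *previous[1:])
    return previous

frozen = freeze_long_lived_objects()
old = collect_less_often()
print(frozen, old, gc.get_threshold())

# Undo the changes (only needed in this demonstration).
gc.set_threshold(*old)
gc.unfreeze()
```

In a server, one would typically call gc.freeze() once after start-up, when the large long-lived data structures are in place.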

My source code is available.

Video

AI: Igniting the Spark to End Stagnation

Daniel Lemire — Sun, 15 Feb 2026 15:19:31 +0000

Much of the West has been economically stagnant. Countries like Canada have failed to improve their productivity and standard of living as of late. In Canada, living standards as measured by per-person GDP have not progressed over the past five years. It is hard to overstate how anomalous this is: the USSR collapsed in part because it could only sustain a growth rate of about 1%, far below what the West was capable of. Canada is more stagnant than the USSR.

Late in 2022, some of us got access to a technical breakthrough: AI. In three years, it has become part of our lives. Nearly all students use AI to do research or write essays.

Dallas Fed economists projected the most credible effect that AI might have on our economies: AI should help reverse the post-2008 slowdown and deliver higher living standards in line with historical technological progress.

It will imply a profound, rapid but gradual transformation of our economy. There will still be teachers, accountants, and even translators in the future… but their work will change as it has changed in the past. Accountants do far less arithmetic today; that part of their work has been replaced by software. Even more of their work is about to be replaced by software, thus improving their productivity further. We will still have teachers, but all our kids, including the poorest ones, will have dedicated always-on tutors: this will not be just available in Canada or the USA, but everywhere. It is up to us to decide who is allowed to build this technology.

AI empowers the individual. An entrepreneur with a small team can get faster access to quality advice, copywriting, and so forth. Artists with an imagination can create more with fewer constraints.

I don’t have to prove these facts: they are fast becoming obvious to the whole world.

New jobs are created. Students of mine work as AI specialists. One of them helps build software providing AI assistance to pharmacists. One of my sons is an AI engineer. These are great jobs.

The conventional explanation for Canada’s stagnation is essentially that we have already harvested all the innovation we are ever going to get. The low-hanging fruit has been picked. Further progress has become inherently difficult because we are already so advanced; there is simply not much room left to improve. In this view, there is no need to rethink our institutions. Yet a sufficiently large breakthrough compels us to reconsider where we stand and what is still possible. It forces us to use our imagination again. It helps renew the culture.

We often hear claims that artificial intelligence will consume vast amounts of energy and water in the coming years. It is true that data centers, which host AI workloads along with many other computing tasks, rely on water for cooling.

But let’s look at the actual water numbers. In 2023, U.S. data centers directly consumed roughly 17.4 billion gallons of water—a figure that could potentially double or quadruple by 2028 as demand grows. By comparison, American golf courses use more than 500 billion gallons every year for irrigation, often in arid regions where this usage is widely criticized as wasteful. Even if data-center water demand were to grow exponentially, it would take decades to reach the scale of golf-course irrigation.

On the energy side, data centers are indeed taking a larger share of electricity demand. According to the International Energy Agency’s latest analysis, they consumed approximately 415 TWh in 2024—about 1.5% of global electricity consumption. This is projected to more than double to around 945 TWh by 2030 (just under 3% of global electricity). However, even this rapid growth accounts for less than 10% (roughly 8%) of the total expected increase in worldwide electricity demand through 2030. Data centers are therefore not the main driver of the much larger rise in overall energy use.

If we leave engineers in Australia, Canada, or Argentina free to innovate, we will surely see fantastic developments.

You might also have heard about the possibility that ChatGPT might decide to kill us all. Nobody can predict the future, but you are surely more likely to be killed by cancer than by a rogue AI. And AI might help you with your cancer.

We always have a choice. Nations can try to regulate AI out of existence. We can set up new government bodies to prevent the application of AI. This will surely dampen the productivity gains and marginalize some nations economically.

The European Union showed it could be done. By some reports, Europeans make more money by fining American software companies than by building their own innovation enterprises. Countries like Canada have economies dominated by finance, mining and oil (with a side of Shopify).

If you are already well off, stopping innovation sounds good. It’s not if you are trying to get a start.

AI is likely to help young people who need it so much. They, more than any other group, will find it easier to occupy the new jobs, start the new businesses.

If you are a politician and you want to lose the vote of young people: make it difficult to use AI. It will crater your credibility.

It is time to renew our prosperity. It is time to create new exciting jobs.

References:

Wynne, M. A., & Derr, L. (2025, June 24). Advances in AI will boost productivity, living standards over time. Federal Reserve Bank of Dallas.

Fraser Institute. (2025, December 16). Canada’s recent economic growth performance has been awful.

DemandSage. (2026, January 9). 75 AI in education statistics 2026 (Global trends & facts).

MIT Technology Review. (2026, January 21). Rethinking AI’s future in an augmented workplace.

Davis, J. H. (2025). Coming into view: How AI and other megatrends will shape your investments. Wiley.

Choi, J. H., & Xie, C. (2025, June 26). AI is reshaping accounting jobs by doing the boring stuff. Stanford Graduate School of Business.

International Energy Agency. (n.d.). Energy demand from AI.

University of Colorado Anschutz Medical Campus. (2025, May 19). Real talk about AI and advancing cancer treatments.

International Energy Agency. (2025). Global energy review 2025.

The cost of a function call

Daniel Lemire — Sun, 08 Feb 2026 20:11:32 +0000

When programming, we chain functions together. Function A calls function B. And so forth.

You do not have to program this way: you could write an entire program using a single function. It would be a fun exercise to write a non-trivial program using a single function… as long as you delegate the code writing to AI because human beings quickly struggle with long functions.

A key compiler optimization is ‘inlining’: the compiler takes your function definition and it tries to substitute it at the call location. It is conceptually quite simple. Consider the following example where the function add3 calls the function add.

int add(int x, int y) {
    return x + y;
}

int add3(int x, int y, int z) {
    return add(add(x, y), z);
}

You can manually inline the call as follows.

int add3(int x, int y, int z) {
    return x + y + z;
}

A function call is reasonably cheap performance-wise, but not free. If the function takes non-trivial parameters, you might need to save and restore them on the stack, so you get extra loads and stores. You need to jump into the function, and then jump out at the end. And depending on the function call convention on your system, and the type of instructions you are using, there are extra instructions at the beginning and at the end.

If a function is sufficiently simple, such as my add function, it should always be inlined when performance is critical. Let us examine a concrete example. Let me sum the integers in an array.

for (int x : numbers) {
  sum = add(sum, x);
}

I am using my MacBook (M4 processor with LLVM).

function	ns/int
regular	0.7
inline	0.03

Wow. The inline version is over 20 times faster.

Let us try to see what is happening. The call site of the ‘add’ function is just a straight loop with a call to the function.

ldr    w1, [x19], #0x4
bl     0x100021740    ; add(int, int)
cmp    x19, x20
b.ne   0x100001368    ; <+28>

The function itself is as cheap as it can be: just two instructions.

add    w0, w1, w0
ret

So, we spend 6 instructions for each addition. It takes about 3 cycles per addition.

What about the inline function?

ldp    q4, q5, [x12, #-0x20]
ldp    q6, q7, [x12], #0x40
add.4s v0, v4, v0
add.4s v1, v5, v1
add.4s v2, v6, v2
add.4s v3, v7, v3
subs   x13, x13, #0x10
b.ne   0x1000013fc    ; <+104>

It is entirely different. The compiler has converted the addition to advanced (SIMD) instructions processing blocks of 16 integers using 8 instructions. So we are down to half an instruction per integer (from 6 instructions). So we use 12 times fewer instructions. On top of having fewer instructions, the processor is able to retire more instructions per cycle, for a massive performance boost.

What if we prevented the compiler from using these fancy instructions while still inlining? We still get a significant performance boost (about 10x faster).

function	ns/int
regular	0.7
inline	0.03
inline (no SIMD)	0.07

Ok. But the add function is a bit extreme. We know it should always be inlined. What about something less trivial, like a function that counts the number of spaces in a string?

size_t count_spaces(std::string_view sv) {
    size_t count = 0;
    for (char c : sv) {
        if (c == ' ') ++count;
    }
    return count;
}

If the string is reasonably long, then the overhead of the function call should be negligible. Let us pass a string of 1000 characters.

function	ns/string
regular	111
inline	115

The inline version is not only not faster, but it is even slightly slower. I am not sure why.

What if I use short strings (say between 0 and 6 characters)? Then the inline function is measurably faster.

function	ns/string
regular	1.6
inline	1.0

Takeaways:

Short and simple functions should be inlined when possible if performance is a concern. The benefits can be impressive.
For functions that can be fast or slow, the decision as to whether to inline or not depends on the input. For string processing functions, the size of the string may determine whether inlining is necessary for best performance.

Note: My source code is available.

Converting data to hexadecimal outputs quickly

Daniel Lemire — Mon, 02 Feb 2026 15:52:27 +0000

Given any string of bytes, you can convert it to a hexadecimal string by mapping the most significant and the least significant 4 bits of each byte to characters in 0...9, A...F. There are more efficient techniques, like base64, that map 3 bytes to 4 characters. However, hexadecimal outputs are easier to understand and often sufficiently concise.

A simple function to do the conversion using a short lookup table is as follows:

static const char hex[] = "0123456789abcdef";
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = hex[val >> 4];
    dst[k + 1] = hex[val & 15];
}

This code snippet implements a straightforward byte-to-hexadecimal string conversion loop in C++. It iterates over an input byte array (src), processing one byte at a time using index i, while simultaneously building the output string in dst with index k that advances twice as fast (by 2) since each input byte produces two hexadecimal characters. For each byte, it extracts the value as an unsigned 8-bit integer (val), then isolates the high 4 bits (via right shift by 4) and low 4 bits (via bitwise AND with 15) to index into a static lookup table (hex) containing the characters ‘0’ through ‘9’ and ‘a’ through ‘f’. The loop continues until k reaches the expected output length (dlen), which should be twice the input length, ensuring all bytes are converted without bounds errors.

This lookup table approach is used in the popular Node.js JavaScript runtime. Skovoroda recently proposed to replace this lookup table approach with an arithmetic version.

char nibble(uint8_t x) { return x + '0' + ((x > 9) * 39); }
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = nibble(val >> 4);
    dst[k + 1] = nibble(val & 15);
}

Surprisingly maybe, this approach is much faster and uses far fewer instructions. At first glance, this result might be puzzling: a table lookup is cheap, and the new nibble function seemingly does more work.
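
To see why the arithmetic version is correct, recall that the character '0' has code 48 and 'a' has code 97, so values above 9 need an extra 97 - 48 - 10 = 39. The following Python transcription is only a check of the arithmetic, not a benchmark; the names HEX_TABLE and to_hex are chosen for the illustration.

```python
HEX_TABLE = "0123456789abcdef"

def nibble(x):
    """Arithmetic mapping of a value in [0, 16) to its hexadecimal character."""
    # (x > 9) is True (1) or False (0), as in the C++ version.
    return chr(x + ord("0") + (x > 9) * 39)

def to_hex(data):
    """Encode bytes using the arithmetic nibble function."""
    return "".join(nibble(b >> 4) + nibble(b & 15) for b in data)

# The arithmetic version agrees with the lookup table for all 16 values.
assert all(nibble(x) == HEX_TABLE[x] for x in range(16))
print(to_hex(b"\x00\x9a\xff"))  # 009aff
```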

The trick that Skovoroda relies upon is that compilers are smart: they will ‘autovectorize’ such number crunching functions (if you are lucky). That is, instead of using regular instructions that process byte values one at a time, they will use SIMD instructions that process 16 bytes at once or more.

Of course, instead of relying on the compiler, you can manually invoke SIMD instructions through SIMD intrinsic functions. Let us assume that you have an ARM processor (e.g., Apple Silicon). Then you can process blocks of 32 bytes as follows.

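// Requires #include <arm_neon.h>. The next two declarations are not part of
// the original excerpt: they are assumed so that the loop is self-contained.
uint8x16_t table = vld1q_u8((const uint8_t*)"0123456789abcdef");
size_t i = 0;
size_t maxv = (slen - (slen%32));
for (; i < maxv; i += 32) {
    // load 32 input bytes as two 16-byte vectors
    uint8x16_t val1 = vld1q_u8((uint8_t*)src + i);
    uint8x16_t val2 = vld1q_u8((uint8_t*)src + i + 16);
    // split each byte into its high and low 4 bits
    uint8x16_t high1 = vshrq_n_u8(val1, 4);
    uint8x16_t low1 = vandq_u8(val1, vdupq_n_u8(15));
    uint8x16_t high2 = vshrq_n_u8(val2, 4);
    uint8x16_t low2 = vandq_u8(val2, vdupq_n_u8(15));
    // map each 4-bit value to its character
    uint8x16_t high_chars1 = vqtbl1q_u8(table, high1);
    uint8x16_t low_chars1 = vqtbl1q_u8(table, low1);
    uint8x16_t high_chars2 = vqtbl1q_u8(table, high2);
    uint8x16_t low_chars2 = vqtbl1q_u8(table, low2);
    // vst2q_u8 interleaves the two vectors while storing them
    uint8x16x2_t zipped1 = {high_chars1, low_chars1};
    uint8x16x2_t zipped2 = {high_chars2, low_chars2};
    vst2q_u8((uint8_t*)dst + i*2, zipped1);
    vst2q_u8((uint8_t*)dst + i*2 + 32, zipped2);
}
// the remaining slen % 32 bytes can be handled by the scalar loop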

This SIMD code leverages ARM NEON intrinsics to accelerate hexadecimal encoding by processing 32 input bytes per iteration. It begins by loading two 16-byte vectors (val1 and val2) from the source array using vld1q_u8. For each vector, it extracts the high nibbles (via right shift by 4 with vshrq_n_u8) and low nibbles (via bitwise AND with 15 using vandq_u8 and vdupq_n_u8). The nibbles are then used as indices into a pre-loaded hex table via vqtbl1q_u8 to fetch the corresponding ASCII characters. The high and low character vectors are paired in a uint8x16x2_t value and written with vst2q_u8, which interleaves the two vectors as it stores them: each store writes 32 characters for 16 input bytes.

You could do similar work on other systems like x64. The same code with AVX-512 for recent Intel and AMD processors would probably be insanely efficient.

Benchmarking these implementations on a dataset of 10,000 random bytes reveals significant performance differences. The basic lookup table version achieves around 3 GB/s, while the arithmetic version, benefiting from compiler autovectorization, reaches 23 GB/s. The manual NEON version pushes performance further: I reach 42 GB/s in my tests.

method	speed	instructions per byte
table	3.1 GB/s	9
Skovoroda	23 GB/s	0.75
intrinsics	42 GB/s	0.69

One lesson is that intuition can be a poor guide when trying to assess performance.

My source code is available.

Converting floats to strings quickly

Daniel Lemire — Sun, 01 Feb 2026 15:23:25 +0000

When serializing data to JSON, CSV or when logging, we convert numbers to strings. Floating-point numbers are stored in binary, but we need them as decimal strings. The first formally published algorithm is Steele and White’s Dragon schemes (specifically Dragon 4) in 1990. Since then, faster methods have emerged: Grisu3, Ryū, Schubfach, Grisu-Exact, and Dragonbox. In C++17, we have a standard function called std::to_chars for this purpose. A common objective is to generate the shortest strings while still being able to uniquely identify the original number.

We recently published Converting Binary Floating-Point Numbers to Shortest Decimal Strings. We examine the full conversion, from the floating-point number to the string. In practice, the conversion implies two steps: we take the number and compute the significand and the power of 10 (step 1) and then we generate the string (step 2). E.g., for the number pi, you might need to compute 31415927 and -7 (step 1) before generating the string 3.1415927. The string generation requires placing the dot at the right location and switching to the exponential notation when needed. The generation of the string is relatively cheap and was probably a negligible cost for older schemes, but as the software got faster, it is now a more important component (using 20% to 35% of the time).
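
Step 2 can be sketched in Python. The function name format_decimal and the thresholds for switching to the exponential notation are assumptions made for the illustration: they are not the rules of any specific library. The sketch assumes a positive integer as the first argument.

```python
def format_decimal(significand, exponent):
    """Render significand * 10**exponent (significand > 0) as a string."""
    digits = str(significand)
    n = len(digits)
    point = n + exponent          # number of digits before the decimal point
    if exponent == 0:
        return digits
    if 0 < point < n:             # the dot falls inside the digits
        return digits[:point] + "." + digits[point:]
    if -4 < point <= 0:           # small number: pad with leading zeros
        return "0." + "0" * (-point) + digits
    # Otherwise, use the exponential notation with a two-digit exponent.
    mantissa = digits[0] + ("." + digits[1:] if n > 1 else "")
    return f"{mantissa}e{point - 1:+03d}"

print(format_decimal(31415927, -7))  # 3.1415927
print(format_decimal(11, -5))        # 0.00011
print(format_decimal(11, -10))       # 1.1e-09
```

The hard part (step 1) is finding the shortest digits; this second step is only string manipulation, which is why its share of the running time grows as step 1 gets faster.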

The results vary quite a bit depending on the numbers being converted. But we find that two implementations tend to do best: Dragonbox by Jeon and Schubfach by Giulietti. The Ryū implementation by Adams is close behind or just as fast. All of these techniques are about 10 times faster than the original Dragon 4 from 1990. A tenfold gain in performance over three decades is equivalent to a gain of about 8% per year, entirely due to better implementations and algorithms.
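
The yearly figure follows from compounding: a gain of r per year over 30 years multiplies performance by (1 + r) to the power 30, so r is the 30th root of 10, minus one. A quick check, with annual_rate as a name chosen for the illustration:

```python
def annual_rate(total_gain, years):
    """Constant yearly growth rate that compounds to total_gain."""
    return total_gain ** (1 / years) - 1

rate = annual_rate(10, 30)
print(f"{rate:.1%}")  # 8.0%
```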

Efficient algorithms use between 200 and 350 instructions for each string generated. We find that the standard function std::to_chars under Linux uses slightly more instructions than needed (up to nearly 2 times too many). So there is room to improve common implementations. Using the popular C++ library fmt is slightly less efficient.

A fun fact is that we found that none of the available functions generate the shortest possible string. The std::to_chars C++ function renders the number 0.00011 as 0.00011 (7 characters), while the shorter scientific form 1.1e-4 (6 characters) would do. But, by convention, when switching to the scientific notation, it is required to pad the exponent to two digits (so 1.1e-04). Beyond this technicality, we found that no implementation always generates the shortest string.
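
The same observation can be checked in Python, whose repr function also produces a short string that identifies the number uniquely. This small sketch (the helper round_trips is a made-up name) verifies that the 6-character form parses back to the same number as the 7-character form.

```python
def round_trips(text, value):
    """True if parsing 'text' gives back exactly 'value'."""
    return float(text) == value

x = 0.00011
fixed = repr(x)          # what Python prints for x
candidate = "1.1e-4"     # shorter form with an unpadded exponent

print(fixed, len(fixed), candidate, len(candidate))
print(round_trips(fixed, x), round_trips(candidate, x))
```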

All our code, datasets, and raw results are open-source. The benchmarking suite is at https://github.com/fastfloat/float_serialization_benchmark, test data at https://github.com/fastfloat/float-data.

Reference: Converting Binary Floating-Point Numbers to Shortest Decimal Strings: An Experimental Review, Software: Practice and Experience (to appear)

Optimizing Python scripts with AI

Daniel Lemire — Sun, 25 Jan 2026 23:19:12 +0000

One of the first steps we take when we want to optimize software is to look at profiling data. Software profilers are tools that try to identify where your software spends its time. Though the exact approach can vary, a typical profiler samples your software (stops it at regular intervals) and collects statistics. If your software is routinely stopped in a given function, this function is likely using a lot of time. In turn, it might be where you should put your optimization efforts.

Matteo Collina recently shared with me his work on feeding profiler data for software optimization purposes in JavaScript. Essentially, Matteo takes the profiling data, and prepares it in a way that an AI can comprehend. The insight is simple but intriguing: tell an AI how it can capture profiling data and then let it optimize your code, possibly by repeatedly profiling the code. The idea is not original since AI tools will, on their own, figure out that they can get profiling data.

How well does it work? I had to try it.

Case 1. Code amalgamation script

For the simdutf software library, we use an amalgamation script: it collects all of the C++ files on disk, does some shallow parsing and glues them together according to some rules.

I first asked the AI to optimize the script without access to profiling data. What it did immediately was to add a file cache. The script repeatedly loads the same files from disk (the script is a bit complex). This saved about 20% of the running time.

Specifically, the AI replaced this naive code…

def read_file(file):
    with open(file, 'r') as f:
        for line in f:
            yield line.rstrip()

by this version with caching…

def read_file(file):
    if file in file_cache:
        for line in file_cache[file]:
            yield line
    else:
        lines = []
        with open(file, 'r') as f:
            for line in f:
                line = line.rstrip()
                lines.append(line)
                yield line
        file_cache[file] = lines

Could the AI do better with profiling data? I instructed it to run the Python profiler: python -m cProfile -s cumtime myprogram.py. It found two additional optimizations:

1. It precompiled the regular expressions (re.compile). It replaced

if re.match('.*generic/.*.h', file):
    # ...

with

if generic_pattern.match(file):
    # ...

where elsewhere in the code, we have…

generic_pattern = re.compile(r'.*generic/.*\.h')

2. Instead of repeatedly calling re.sub to do a regular expression substitution, it filtered the strings by checking for the presence of a keyword in the string first.

if 'SIMDUTF_IMPLEMENTATION' in line: # This IF is the optimization
  print(uses_simdutf_implementation.sub(context.current_implementation+"\\1", line), file=fid)
else:
  print(line, file=fid) # Fast path

These two optimizations could probably have been arrived at by looking at the code directly, and I cannot be certain that they were driven by the profiling data. But I can tell that they do appear in the profile data.
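
The keyword check is safe only because the regular expression cannot match a line that lacks the keyword. The following sketch illustrates the idea; the pattern is hypothetical (the script's actual expression is not shown here) and the function names are made up.

```python
import re

# Hypothetical pattern: it starts with the literal keyword, so it can only
# match lines that contain that keyword.
uses_keyword = re.compile(r"SIMDUTF_IMPLEMENTATION(\w*)")

def rewrite_slow(line, prefix):
    """Always run the regular-expression substitution."""
    return uses_keyword.sub(prefix + r"\1", line)

def rewrite_fast(line, prefix):
    """Run the substitution only when the keyword is present."""
    if "SIMDUTF_IMPLEMENTATION" in line:   # cheap substring test
        return uses_keyword.sub(prefix + r"\1", line)
    return line                            # fast path: nothing to replace

lines = ["int x = 0;", "namespace SIMDUTF_IMPLEMENTATION {", ""]
print(all(rewrite_slow(l, "haswell") == rewrite_fast(l, "haswell") for l in lines))
```

Most lines take the fast path, so the cost of the regular-expression engine is paid only where it can matter.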

Unfortunately, the low-hanging fruit, caching the file access, represented the bulk of the gain. The AI was not able to further optimize the code. So the profiling data did not help much.
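
The profiler can also be driven from within Python, which makes it easy to produce a short text report for an AI (or a human) to read. This is a minimal sketch using the standard cProfile and pstats modules; the names profile_report and busy_work are made up for the illustration.

```python
import cProfile
import io
import pstats

def profile_report(func, *args, top=5):
    """Run func(*args) under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Same ordering as 'python -m cProfile -s cumtime'.
    stats.sort_stats("cumtime").print_stats(top)
    return stream.getvalue()

def busy_work(n):
    """Stand-in for the code being optimized."""
    return sum(i * i for i in range(n))

print(profile_report(busy_work, 10_000))
```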

Case 2: Check Link Script

When I design online courses, I often use a lot of links. These links break over time. So I have a simple Python script that goes through all the links, and verifies them.

I first asked my AI to optimize the code. It did the same regex trick, compiling the regular expression. It created a thread pool and made the script asynchronous.

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    url_results = {url: executor.submit(check_url, url) for url in urls_to_check}
    for url, future in url_results.items():
        url_cache[url] = future.result()

This parallelization more than doubled the speed of the script.
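
Here is a self-contained sketch of the same pattern. The check_url function is a stand-in that sleeps instead of making a network request, so the sketch runs anywhere; the URLs are placeholders.

```python
import concurrent.futures
import time

def check_url(url):
    """Hypothetical stand-in for a network check: sleeps instead of fetching."""
    time.sleep(0.05)
    return url.startswith("https://")

def check_all(urls, max_workers=10):
    """Check all URLs concurrently and return a dictionary of results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {url: executor.submit(check_url, url) for url in urls}
        return {url: future.result() for url, future in futures.items()}

urls = [f"https://example.com/{i}" for i in range(20)]
start = time.perf_counter()
results = check_all(urls)
print(len(results), f"{time.perf_counter() - start:.2f} s")
```

Threads help here because the work is dominated by waiting on the network, not by computation.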

It cached the URL checks in an interesting way, using functools:

from functools import lru_cache

@lru_cache(maxsize=None)
def check(link):
    # ...

I did not know about this nice trick. This proved useless in my context because I rarely have the same link several times.
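
To see what the decorator does, one can count how often the body of the function actually runs. In this sketch, the body of check is a placeholder and the calls list exists only for the demonstration.

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=None)
def check(link):
    calls.append(link)   # records each time the body actually runs
    return link.startswith("https://")

for link in ["https://a.example", "https://a.example", "http://b.example"]:
    check(link)

# The body ran for 2 distinct links; the repeated link was a cache hit.
print(len(calls), check.cache_info().hits)  # 2 1
```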

I then started again, and told it to use the profiler. It did much the same thing, except for the optimization of the regular expression.

As far as I can tell all optimizations were in vain, except for the multithreading. And it could do this part without the profiling data.

Conclusion so far

The Python scripts I tried were not heavily optimized, as their performance was not critical. They are relatively simple.

For the amalgamation, I got a 20% performance gain for ‘free’ thanks to the file caching. The link checker is going to be faster now that it is multithreaded. Both optimizations are valid and useful, and will make my life marginally better.

In neither case was I able to discern benefits due to the profiler data. I was initially hoping to get the AI busy optimizing the code in a loop, continuously running the profiler, but it did not happen in these simple cases. The AI optimized code segments that contributed little to the running time as per the profiler data.

To be fair, profiling data is often of limited use. The real problems are often architectural and not related to narrow bottlenecks. Even when there are identifiable bottlenecks, a simple profiling run can fail to make them clearly identifiable. Further, profilers become more useful as the code base grows, while my test cases are tiny.

Overall, I expect that the main reason for my relative failure is that I did not have the right use cases. I think that collecting profiling data and asking an AI to have a look might be a reasonable first step at this point.

A new way to call C from Java: how fast is it?

Daniel Lemire — Sat, 17 Jan 2026 23:44:38 +0000

Irrespective of your programming language of choice, calling C functions is often a necessity. For the longest time, the only standard way to call C from Java was the Java Native Interface (JNI). But it was so painful that few dared to do it. I have heard it said that it was deliberately painful so that people would be enticed to use pure Java as much as possible.

Since Java 22, there is a new approach called the Foreign Function & Memory API in java.lang.foreign. Let me go through it step by step.

You need a Linker and a SymbolLookup instance from which you will build a MethodHandle that will capture the native function you want to call.

The linker is easy:

Linker linker = Linker.nativeLinker();

To load the SymbolLookup instance for your library (called mylibrary), you may do so as follows:

System.loadLibrary("mylibrary");
SymbolLookup lookup = SymbolLookup.loaderLookup();

The native library file should be on your java.library.path path, or somewhere on the default library paths. (You can pass it to your java executable as -Djava.library.path=something).

Alternatively, you can use SymbolLookup.libraryLookup or other means of loading the library, but System.loadLibrary should work well enough.

Once you have the lookup, you can grab the address of a function like so:

lookup.find("myfunction")

This returns an Optional. You can grab the MemorySegment like so:

MemorySegment mem = lookup.find("myfunction").orElseThrow();

Once you have your MemorySegment, you can pass it to your linker to get a MethodHandle which is close to a callable function:

 MethodHandle myfunc = linker.downcallHandle(
     mem,
     functiondescr
 );

The functiondescr must describe the returned value and the function parameters that your function takes. If you pass a pointer and get back a long value, you might proceed as follows:

 MethodHandle myfunc = linker.downcallHandle(
     mem,
     FunctionDescriptor.of(
        ValueLayout.JAVA_LONG,
        ValueLayout.ADDRESS
    )
 );

That is, the first parameter is the returned value.

For function returning nothing, you use FunctionDescriptor.ofVoid.

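The MethodHandle can be called almost like a normal Java function: myfunc.invokeExact(parameters). The method is declared as returning an Object, so a cast is necessary: if your function returns a long, you write (long) myfunc.invokeExact(parameters). With invokeExact, the cast and the argument types must match the function descriptor exactly.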

It is a bit painful, but thankfully, there is a tool called jextract that can automate this task. It generates Java bindings from native library headers.

You can allocate C data structures from Java that you can pass to your native code by using an Arena. Let us say that you want to create an instance like

MemoryLayout mystruct = MemoryLayout.structLayout(
        ValueLayout.JAVA_LONG.withName("age"),
        ValueLayout.JAVA_INT.withName("friends"));

You could do it in this manner:

MemorySegment myseg = arena.allocate(mystruct);

You can then pass myseg as a pointer to a data structure in C.

You often get an arena with a try clause like so:

try (Arena arena = Arena.ofConfined()) {
       //
}

There are many types of arenas: confined, global, automatic, shared. The confined arenas are accessible from a single thread. The other arenas are accessible from several threads. Memory from an automatic arena is released by the Java garbage collector and memory from the global arena is never released, whereas the confined and shared arenas are closed explicitly, so they have a specific lifetime.

So, it is fairly complicated but manageable. Is it fast? To find out, I call from Java a C library I wrote with support for binary fuse filters. They are a fast alternative to Bloom filters.

You don’t need to know what any of this means, however. Keep in mind that I wrote a Java library called jfusebin which calls a C library. Then I also have a pure Java implementation and I can compare the speed.

I should first point out that even if calling the C function did not include any overhead, it might still be slower because the Java compiler is unlikely to inline a native function. However, if you have a pure Java function, and it is relatively small, it can get inlined and you get all sorts of nice optimizations like constant folding and so forth.

Thus I can overestimate the cost of the overhead. But that’s ok. I just want a ballpark measure.

In my benchmark, I check for the presence of a key in a set. I have one million keys in the filter. I can ask whether a key is not present in the filter.

I find that the library calling C can issue 44 million calls per second using the 8-bit binary fuse filter. I reach about 400 million calls per second using the pure Java implementation.

method	time per query in nanoseconds
Java-to-C	22.7 ns
Pure Java	2.5 ns

Thus I measure an overhead of about 20 ns per C function call from Java using a MacBook (M4 processor).
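
The times per query follow directly from the throughput figures: a rate of N calls per second corresponds to one billion divided by N nanoseconds per call. A quick check of the arithmetic (the helper name ns_per_call is made up):

```python
def ns_per_call(calls_per_second):
    """Convert a throughput into a time per call in nanoseconds."""
    return 1e9 / calls_per_second

java_to_c = ns_per_call(44e6)    # 44 million calls per second
pure_java = ns_per_call(400e6)   # 400 million calls per second
overhead = java_to_c - pure_java
print(f"{java_to_c:.1f} ns, {pure_java:.1f} ns, overhead {overhead:.1f} ns")
```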

We can do slightly better by marking the functions that are expected to be short running as critical. You achieve this result by passing an option to the linker.downcallHandle call.

binary_fuse8_contain = linker.downcallHandle(
    lookup.find("xfuse_binary_fuse8_contain").orElseThrow(),
    binary_fuse8_contain_desc,
    Linker.Option.critical(false)
);

You save about 15% of the running time in my case.

method	time per query in nanoseconds
Java-to-C	22.7 ns
Java-to-C (critical)	19.5 ns
Pure Java	2.5 ns

Obviously, in my case, because the Java library is so fast, the 20 ns becomes too much. But it is otherwise a reasonable overhead.

I did not compare with the old approach (JNI), but other folks did and they find that the new foreign function approach can be measurably faster (e.g., 50% faster). In particular, it has been reported that calling a Java function from C is now relatively fast: I have not tested this functionality myself.

One of the cool features of the new interface is that you can pass data directly from the Java heap to your C function with relative ease.

Suppose you have the following C function:

int sum_array(int* data, int count) {
    int sum = 0;
    for(int i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}

And you want the following Java array to be passed to C without a copy:

int[] javaArray = {10, 20, 30, 40, 50};

It is as simple as the following code.

System.loadLibrary("sum");
Linker linker = Linker.nativeLinker();
SymbolLookup lookup = SymbolLookup.loaderLookup();
MemorySegment sumAddress = lookup.find("sum_array").orElseThrow();

// C Signature: int sum_array(int* data, int count)
MethodHandle sumArray = linker.downcallHandle(
    sumAddress,
    FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
    Linker.Option.critical(true)
);

int[] javaArray = {10, 20, 30, 40, 50};

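// A heap segment wraps the Java array directly: no arena and no copy are needed.
// Passing it to C relies on the Linker.Option.critical(true) option above.
MemorySegment heapSegment = MemorySegment.ofArray(javaArray);
int result = (int) sumArray.invoke(heapSegment, javaArray.length);
System.out.println("The sum from C is: " + result);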

I created a complete example in a few minutes. One trick is to make sure that java finds the native library. If it is not at a standard library path, you can specify the location with -Djava.library.path like so:

java -Djava.library.path=target -cp target/classes IntArrayExample

Further reading: When Does Java’s Foreign Function & Memory API Actually Make Sense? by A N M Bazlur Rahman.

How stagnant is CPU technology?

Daniel Lemire — Wed, 14 Jan 2026 14:52:39 +0000

Sometimes, people tell me that there is no more progress in CPU performance.

Consider these three processors which had comparable prices at release time.

The AMD Ryzen 7 9800X3D (Zen 5, with up to 5.3 GHz boost) was released in 2024.
The AMD Ryzen 7 7800X3D (Zen 4, with up to 5.1 GHz boost) was released in 2023.
The AMD Ryzen 7 5800X3D (Zen 3, with 3.4 GHz base) was released in 2022.

Let us consider their results on the PartialTweets open benchmark (JSON parsing). It is a single-core benchmark.

2024 processor	12.7 GB/s
2023 processor	9 GB/s
2022 processor	5.2 GB/s

In two years, on this benchmark, AMD more than doubled the performance for the same cost.

So what is happening is that processor performance is indeed going up, sometimes dramatically so, but not all of our software can benefit from the improvements. Software developers must track the trends and adapt our software accordingly. Unfortunately, it is hard work and it requires expertise. In the case of this benchmark, the simdjson library is designed to benefit from better processor features.

Not all software can easily run much faster on new processors, and genuine progress is difficult.

Let us be ambitious. Let us move forward!

What I Got Wrong About “Hard Work” in My 20s

Daniel Lemire — Thu, 08 Jan 2026 00:39:36 +0000

When I was younger, in my 20s, I assumed that everyone was working “hard,” meaning a solid 35 hours of work a week. Especially, say, university professors and professional engineers. I’d feel terribly guilty when I would be messing around, playing video games on a workday.

Today I realize that most people become very adept at avoiding actual work. And the people you think are working really hard are often just very good at focusing on what is externally visible. They show up to the right meetings but unashamedly avoid the hard work.

It ends up being visible to the people “who know.” Why? Because working hard is how you acquire actual expertise. And lack of actual expertise ends up being visible… but only to those who have the relevant expertise.

And the effect compounds. The difference between someone who has honed their skills for 20 years and someone who has merely shown up to the right meetings becomes enormous. And so, we end up with huge competency gaps between people who are in their 30s, 40s, 50s. It becomes night and day.

A bit of glass and freedom is all you need

Daniel Lemire — Wed, 07 Jan 2026 00:47:23 +0000

Galileo Galilei was the OpenAI of his time. He helped establish modern science by emphasizing experimentation as the primary means to uncover natural truths. To this end, he built his own telescopes. He revealed to the world the moons of Jupiter, thereby changing forever how we viewed the cosmos.

How was Galileo able to design better telescopes than others? If you have ever been to Venice, you may know that it was famous for its glassmakers. There is a small island nearby (Murano) where there are still glassmakers. Further, Venice had some of the best merchants of Europe, so they could export their glass worldwide.

This is an important lesson as to what drives innovation. It is not a linear process. Living near people making fancy glasses could be the key you need.

A common misconception portrays Galileo as persecuted solely for advocating heliocentrism, the idea that Earth orbits the Sun. In reality, he spent most of his career challenging established doctrines and thrived under Church patronage. Galileo overturned the widespread belief that heavier objects fall faster, and this achievement, if nothing else, brought him greater fame. When he gathered strong evidence for heliocentrism, he initially faced only cautions rather than outright condemnation.

Pope Urban VIII had personally permitted Galileo to discuss heliocentrism as a hypothesis and even requested that his own arguments on the matter be included. However, Galileo placed these papal views in the mouth of Simplicio, a character portrayed as intellectually inadequate in defending the traditional geocentric position. This was widely interpreted as a mockery.

Galileo was sentenced to house arrest, during which he continued productive work. The ban on his Copernican writings applied mainly within Catholic territories, allowing their dissemination elsewhere in Europe.

Thus another important element that made Galileo possible was the relative freedom he enjoyed.

You want to innovate? Don’t live in the world of ideas solely. Don’t be shy about mixing with commercial interest. And make sure to have a bit of freedom.

Technology is culture

Daniel Lemire — Thu, 01 Jan 2026 14:03:25 +0000

We are experiencing one of the most significant technological breakthroughs of the last few decades. Call it what you will: AI, generative AI, large language models…

But where does it come from? Academics will tell you that it stems from decades of mathematical efforts on campus. But think about it: if this were the best model to explain what happened, where would the current breakthroughs have occurred? They would have happened on campus first, then propagated to industry. That’s the linear model of innovation—a rather indefensible one.

Technology is culture. Technological progress does not follow a path from the blackboard of a middle-aged MIT professor to your desk, via a corporation.

So what is the cultural background? Of course, there is hacker culture and the way hackers won a culture war in the 1980s by becoming cool enough to have a seat at the table.

But closer to us… I believe there are two main roots. The first is gaming. Gamers wanted photorealistic, high-performance games. They built powerful machines capable of solving linear algebra problems at very high speeds.

Powerful computing alone, however, does you no good if you want to build an AI. That’s where web culture came in. Everything was networked, published, republished. Web nerds helped build the greatest library the world had ever seen.

These two cultures came together to generate the current revolution.

If you like my model, I submit that it has a few interesting consequences. The most immediate one is that if you want to understand how and where technological progress happens, you have to look at cultural drivers—not at what professors at MIT are publishing.

The culture war that we won

Daniel Lemire — Wed, 31 Dec 2025 15:26:25 +0000

Culture wars are real. They occur when a dominant culture faces a serious challenge. But unless you pay close attention, you might miss them entirely.

As a kid, I was a “nerd.” I read a lot and spent hours on my computer. I devoured science and technology magazines. I taught myself programming. “Great!” you might think. Not at all. This was not valued where and when I grew up. Computers were seen as toys. A kid who spent a lot of time on a computer was viewed as obsessed with a rather dull plaything. We had a computer club, but it was essentially a gathering of “social rejects.” No one looked up to us. Working with computers carried no prestige. Dungeons & Dragons was outright “dangerous”—you had to hide any interest in such games.

The 1983 movie WarGames stands out precisely because the computer-obsessed kid gets the girl and saves the world. Bill Gates was becoming famous around that time, but this marked only the beginning of a decade-long culture war in which hacker culture gradually rose to dominance.

Today, most people can speak the language of hackers. It did not have to turn out this way, and it did not unfold identically everywhere. The status of hacker culture is high in the United States, but it remains lower in many other places even now. Even so, in many organizations today, even in the United States, the « computer people » are stored in the basement. We do not let them out too often. They are not « people persons ». So the culture war was won by the hackers: the victory is undeniable. But as with all wars, the result is more nuanced than one might think. Many would like nothing more than to send the computer people back to the bottom of the prestige ladder.

Salaries are a good indicator of prestige. In the USA, in Australia and in Switzerland, « computer people » have high salaries and relatively high status. In the UK as a whole? Not so much. I bet you do better as a « financial analyst » over there.

What is worth watching is the effect that « AI » will have on the status battles. In some sense, building software that can do financial, political and legal analysis is the latest weapon in the arsenal of the computer people. Many despair about what AI might do to software developers: I recommend looking at it in the context of the hacker culture war.

By how much does your memory allocator overallocate?

Daniel Lemire — Tue, 30 Dec 2025 19:15:55 +0000

How much virtual memory does the following C++ expression allocate on the heap?

new char[4096]

The answer is at least 4 kibibytes but surely more.

Firstly, each heap memory allocation requires some memory to keep track of what has been allocated. You are likely using 8 bytes or so of overhead that your program cannot access.

Secondly, the memory allocator may allocate a bit more than the 4096 bytes you requested. On a Linux machine, I found that it would allocate 4104 bytes, so 8 extra bytes that are usable by your program. You can check this value by calling malloc_usable_size under Linux.

Thus, overall, you may end up with an extra 16 bytes allocated when you requested 4096 bytes. It is an overhead of about 0.4%. You are basically wasting a byte for every 256 bytes that you allocate.

But that is not the worst possible case. On macOS, let us consider the following line of code.

new char[3585]

The system reports an allocation of 4096 bytes: a 14% overhead. What is happening is that macOS rounds up the memory allocation to the next 512-byte boundary for moderately small allocations. If you try allocating even larger memory blocks, it starts rounding up even more.

Freedom from incompetence

Daniel Lemire — Mon, 29 Dec 2025 14:41:00 +0000

Many people say that they crave more freedom.
But what do we mean by “freedom”?
Being free from constraints? Is that what we mean? Would you feel “freer” if you could walk outside in your underwear?
It is almost surely not what you mean by “freedom.”
I submit to you that it is almost always the case that if you are frustrated at work by your lack of freedom, the actual problem is competence.
Imagine two scenarios.

Scenario A: You work for a highly directive boss. You are constantly accountable for what you do. But everyone around you is highly competent. You need to wear a jacket and a tie, but you work in the best team in the world.
Scenario B: You work in a context where you hardly know who your boss is. You come to work in your underwear. However, everyone is incompetent. You work with the least competent team in the world.

Assuming that the salary is the same, which job do you prefer?
I cannot answer for you, but most people I know prefer Scenario A.

Don’t be so eager to rewrite your code

Daniel Lemire — Sun, 28 Dec 2025 03:02:44 +0000

I used to always want to rewrite my code. Maybe even use another programming language. « If only I could rewrite my code, it would be so much better now. »

If you maintain software projects, you see it all the time. Someone new comes along and they want to start rewriting everything. They always have subjective arguments: it is going to be more maintainable or safer or just more elegant.

If your code is battle tested… then the correct instinct is to be conservative and keep your current code. Sometimes you need to rewrite your code: you made a mistake or must change your architecture. But most times, the old code is fine and investing time in updating your current code is better than starting anew.

The great intellectual Robin Hanson argues that software ages. One of his arguments is that software engineers say that it does. That’s what engineers feel but whether it is true is another matter.

« Before Borland’s new spreadsheet for Windows shipped, Philippe Kahn, the colorful founder of Borland, was quoted a lot in the press bragging about how Quattro Pro would be much better than Microsoft Excel, because it was written from scratch. All new source code! As if source code rusted. The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it. It doesn’t acquire bugs just by sitting around on your hard drive. Au contraire, baby! Is software supposed to be like an old Dodge Dart, that rusts just sitting in the garage? Is software like a teddy bear that’s kind of gross if it’s not made out of all new material? » (Joel Spolsky)

Parsing IP addresses quickly (portably, without SIMD magic)

Daniel Lemire — Sat, 27 Dec 2025 23:39:57 +0000

Most programmers are familiar with IP addresses. They take the form of four numbers between 0 and 255 separated by dots: 192.168.0.1. In some sense, it is a convoluted way to represent a 32-bit integer. The modern version of an IP address is IPv6, which is usually surrounded by square brackets when it appears in a URL. It is less common in my experience.

Using fancy techniques, you can parse IP addresses with as few as 50 instructions. It is a bit complicated and not necessarily portable.

What if you want high speed without too much work or a specialized library? You can try to roll your own. But since I am a civilized programmer, I just asked my favorite AI to write it for me.

// Parse an IPv4 address starting at 'p'.
// p : start pointer, pend: end of the string
std::expected<uint32_t, parse_error> parse_manual(const char *p, const char *pend) {
    uint32_t ip = 0;
    int octets = 0;
    while (p < pend && octets < 4) {
        uint32_t val = 0;
        const char *start = p;
        while (p < pend && *p >= '0' && *p <= '9') {
            val = val * 10 + (*p - '0');
            if (val > 255) {
                return std::unexpected(invalid_format);
            }
            p++;
        }
        if (p == start || (p - start > 1 && *start == '0')) {
            return std::unexpected(invalid_format);
        }
        ip = (ip << 8) | val;
        octets++;

        if (octets < 4) {
            if (p == pend || *p != '.') {
                return std::unexpected(invalid_format);
            }
            p++; // Skip dot
        }
    }
    if (octets == 4 && p == pend) {
        return ip;
    } else {
        return std::unexpected(invalid_format);
    }
}

It was immediately clear to me that this function was not as fast as it could be. I then asked the AI to improve the result by using the fact that each number is made of between one and three digits. I got the following reasonable function.

std::expected<uint32_t, parse_error> parse_manual_unrolled(const char *p, const char *pend) {
    uint32_t ip = 0;
    int octets = 0;
    while (p < pend && octets < 4) {
        uint32_t val = 0;
        if (p < pend && *p >= '0' && *p <= '9') {
            val = (*p++ - '0');
            if (p < pend && *p >= '0' && *p <= '9') {
                if (val == 0) { 
                  return std::unexpected(invalid_format);
                }
                val = val * 10 + (*p++ - '0');
                if (p < pend && *p >= '0' && *p <= '9') {
                    val = val * 10 + (*p++ - '0');
                    if (val > 255) { 
                      return std::unexpected(invalid_format);
                    }
                }
            }
        } else {
            return std::unexpected(parse_error::invalid_format);
        }
        ip = (ip << 8) | val;
        octets++;
        if (octets < 4) {
            if (p == pend || *p != '.') {
              return std::unexpected(invalid_format);
            }
            p++; // Skip the dot
        }
    }
    if (octets == 4 && p == pend) {
        return ip;
    } else {
        return std::unexpected(invalid_format);
    }
}

Nice work AI!

In C++, we have standard functions to parse numbers (std::from_chars) which can significantly simplify the code.

std::expected<uint32_t, parse_error> parse_ip(const char *p, const char *pend) {
  const char *current = p;
  uint32_t ip = 0;
  for (int i = 0; i < 4; ++i) {
    uint8_t value;
    auto r = std::from_chars(current, pend, value);
    if (r.ec != std::errc()) {
      return std::unexpected(invalid_format);
    }
    current = r.ptr;
    ip = (ip << 8) | value;
    if (i < 3) {
      if (current == pend || *current++ != '.') {
        return std::unexpected(invalid_format);
      }
    }
  }
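  if (current != pend) { // reject trailing characters, like the manual versions
    return std::unexpected(invalid_format);
  }
  return ip;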
}

You can also use the fast_float library as a substitute for std::from_chars. The latest version of fast_float has faster 8-bit integer parsing thanks to Shikhar Soni (with a fix by Pavel Novikov).

I wrote a benchmark for this problem. Let us first consider the results using an Apple M4 processor (4.5 GHz) with LLVM 17.

function	instructions/ip	ns/ip
manual	185	6.2
manual (unrolled)	114	3.3
from_chars	381	14
fast_float	181	7.2

Let us try with an Intel Ice Lake processor (3.2 GHz) using GCC 12.

function	instructions/ip	ns/ip
manual	219	30
manual (unrolled)	154	24
from_chars	220	29
fast_float	211	18

And finally, let us try with a Chinese Loongson 3A6000 processor (2.5 GHz) using LLVM 21.

function	instructions/ip	ns/ip
manual	187	29
manual (unrolled)	109	21
from_chars	191	39
fast_float	193	27

The optimization work on the fast_float library paid off. The difference is especially striking on the x64 processor.

What is also interesting in my little experiment is that I was able to get the AI to produce faster code with relatively little effort on my part. I did have to ‘guide’ the AI. Does that mean that I can retire? Not yet. But I am happy that I can more quickly get good reference baselines, which allows me to better focus my work where it matters.

Reference: The fast_float C++ library is a fast number-parsing library that is part of GCC and major web browsers.

Performance trick: optimistic vs pessimistic checks

Daniel Lemire — Sat, 20 Dec 2025 23:26:09 +0000

Strings in programming are often represented as arrays of 8-bit words. The string is ASCII if and only if all 8-bit words have their most significant bit unset. In other words, the byte values must be no larger than 127 (or 0x7F in hexadecimal).

A decent C++ function to check that the string is ASCII is as follows.

bool is_ascii_pessimistic(const char *data, size_t length) {
  for (size_t i = 0; i < length; i++) {
    if (static_cast<unsigned char>(data[i]) > 0x7F) {
      return false;
    }
  }
  return true;
}

We go over each character, we compare it with 0x7F and continue if the value is no larger than 0x7F. If you have scanned the entire string and all tests have passed, you know that your string is ASCII.

Notice how I called this function pessimistic. What do I mean? I mean that it expects, in some sense, that it will find some non-ASCII character. If so, the best option is to immediately return and not scan the whole string.

What if you expect the string to almost always be ASCII? An alternative then is to effectively do a bitwise OR reduction of the string: you OR all characters together and you check just once that the result is bounded by 0x7F. If any character has its most significant bit set, then the bitwise OR of all characters will also have its most significant bit set. So you might write your function as follows.

bool is_ascii_optimistic(const char *data, size_t length) {
  unsigned char result = 0;
  for (size_t i = 0; i < length; i++) {
    result |= static_cast<unsigned char>(data[i]);
  }
  return result <= 0x7F;
}

If you have strings that are all pure ASCII, which function will be fastest? Maybe surprisingly, the optimistic function might be several times faster. I wrote a benchmark and ran it with GCC 15 on an Intel Ice Lake processor. I get the following results.

function	speed
pessimistic	1.8 GB/s
optimistic	13 GB/s

Why is the optimistic function faster? Mostly because the compiler is better able to optimize it. Among other possibilities, it can use autovectorization to automatically exploit data-level parallelism (e.g., SIMD instructions).

Which function is best depends on your use case.

What if you would prefer a pessimistic function, that is, one that returns early when non-ASCII characters are encountered, but you still want high speed? Then you can use a dedicated library like simdutf where we have hand-coded the logic. In simdutf, the pessimistic function is called validate_ascii_with_errors. Your results will vary but I found that it has about the same speed as the optimistic function.

function	speed
pessimistic	1.8 GB/s
pessimistic (simdutf)	14 GB/s
optimistic	13 GB/s

So it is possible to combine the benefits of pessimism and optimism although it requires a bit of care.