Matt Stockton

There’s More Public Data Than People Exploring It

2026-06-04T00:00:00+00:00

I recently built WattsOpen, an interactive site about where the US grid has room. It maps the large fossil plants retiring over the next several years, scores those sites on how close they are to existing transmission and gas lines, and digs into interconnection queues, retirement schedules, and the hour-by-hour fuel mix on each grid. Everything on it comes from open federal data, and I made the whole thing with Claude Code.

WattsOpen started as a way to test two theses I’ve had for a while. The first is that there’s a huge amount of data out there, structured and unstructured, that’s largely underexplored. People aren’t looking at it, and they’re mostly not trying to see how different datasets connect. FRED alone has over 800,000 economic time series. The EIA publishes monthly filings on every generator in the country. And honestly, this kind of exploration is one of the things AI is most useful for - understanding what a dataset is, writing the code to process it, figuring out how it relates to other data.

The second thesis is that these tools have gotten incredibly good at creating the right user experience for visualizing information. They can build something interactive that gets you to the signal quickly, and they can do it mostly automatically.

Why the Grid

I’ve been curious about how the grid fits into what’s happening in AI right now. Power keeps coming up as the bottleneck for data center construction, and when I started asking basic questions about it - what’s retiring, where does new generation connect, why does it take years - I realized I didn’t really understand anything about how the grid works. I didn’t know what an interconnection queue was or what a balancing authority does.

I could have read explainers, but I learn better by building. The plan was to pull the real data, work through it with Claude Code, and end up with something I could keep using and updating as I learn more. The site is as much a tool for my own learning as it is something to share.

What WattsOpen Is

The site tells five short stories on top of the same data:

A co-location atlas - every US fossil plant of 100 MW or more retiring between 2024 and 2030, scored on distance to high-voltage transmission, distance to interstate gas, site size, and recent utilization. These are sites where grid infrastructure is already in place and about to be freed up.
A retirement schedule - a county-by-county map of the roughly 73 GW of fossil capacity scheduled to retire through 2035.
A headroom time machine - a replay of ten years of monthly EIA-860M filings, showing when retirement plans first appeared in the public record, county by county.
A queue reality check - withdrawal rates, time-to-operation, and resource mix for the interconnection queues in nine regions, built on LBNL’s Queued Up data.
An hourly fuel-mix clock - what’s on the wire hour by hour across nine balancing authorities, from EIA-930 data.

The sources are EIA generator and utility filings, hourly grid data from EIA-930, LBNL’s queue research, and HIFLD’s maps of transmission lines and gas pipelines. All of it is free, though for most of the sources I had to sign up for an API key.

How I Worked

The first commit was May 16. The site was live about two and a half weeks later - 33 commits across nine days of touching the project, all in spare time.

Claude Code was the thought partner for the whole thing. I used it to understand what the datasets actually are, to build the code that processes them, to figure out how they relate to each other, and to build the visualizations. Before processing a dataset I’d have it explain what I was looking at - who files EIA-860M and why, what “nameplate capacity” means, how LBNL assembles queue data from nine regions that each define project status differently. Then it wrote the pipeline (fetch each source, normalize to Parquet, load into DuckDB, export JSON for the site) and built the site itself in static HTML and D3 with no build step.

The loop was the same every session. Ask what datasets exist for a question I had, fetch one, have Claude explain the columns and how the data gets collected, look at it together, then figure out how it could join to what we already had. The most useful results came from the joins - retirement filings matched to transmission line locations tell you something neither dataset does on its own.

What I Learned

My mental model of this space is still crude, but a few things stuck:

About 1.8 TW of nameplate capacity sits in US interconnection queues, which is more than the entire installed generating fleet. Historically most of it never gets built. Across the projects in the data, about 85% never reach commercial operation, and the ones that do typically wait two to five years.
Meanwhile, fossil plants retire on a published schedule, at sites that are already wired. The median retiring plant in the atlas is a quarter mile from a high-voltage substation and about five miles from an interstate gas pipeline.
The fuel mix on each grid changes hour by hour. Solar pushes gas off the grid at noon in some markets while coal runs flat through the day in others - patterns that annual averages smooth over.

None of this is news to anyone in the industry, but it was all news to me, and working through the data myself made it stick in a way that reading about it wouldn’t have.

What This Isn’t

I’m sure I got a ton of things wrong here. If you’re an expert in this space, you’ll probably look at WattsOpen and see a naive view of how the grid works, and maybe some things that make no sense at all. Every page on the site has a banner saying it’s a personal learning project, not expert analysis, and I mean that. There’s no parcel-level analysis, no per-RTO transfer rules, and the queue data is an annual snapshot with status definitions that vary by region. None of that made it less useful as a way to dig into the space and learn. Just don’t use it to make siting decisions.

Final Thoughts

The grid was the subject here, but the thing I’m most enthused about is the workflow: exploring data in an applied fashion to learn a domain, and ending up with something I can keep using to learn more. The same approach would work on macro series from FRED, SEC filings, court records, FDA databases - any field with a pile of public data behind it. I wrote a few months ago about domain experts being the biggest AI opportunity. This project was sort of the inverse. I had no expertise, and the tools still got me to something I learned a lot from and that might be useful to other people. Someone who knows energy cold could skip the tutorial phase and go a lot further.

If you know this space, I’d like to hear what WattsOpen gets wrong - that’s part of the learning. And I’m curious whether others are doing this kind of applied exploration with public data in their own fields.

Automating the Path from Frontier Models to Fine-Tuned Models

2026-03-02T00:00:00+00:00

I saw a post from Virat at findatasets recently about using GPT 5.2 to parse 8-K filings. It works well but it’s expensive at scale. His plan was to accumulate examples and fine-tune an open model to bring cost down. I replied that this loop - frontier model generates outputs, outputs become training data, training data fine-tunes a cheaper model - could be almost fully automated.

I haven’t done fine-tuning in a while. Frontier models just work for most of what I’ve needed, so I haven’t had a reason to. But Virat’s post got me thinking about this pattern more carefully, and it seems like something worth exploring.

Structured Extraction and Evaluation

The reason this pattern seems especially interesting for structured extraction is that it’s easier to evaluate. If you’re pulling known fields out of semi-structured documents - dates, entities, financial figures - there’s a clearer definition of “correct” than with freeform text. Some checks can be automated (does this parse as a valid date? does this value appear in the source document?), and people can review the rest.

If you’re using a frontier model for extraction at any real volume, you should already be evaluating its outputs - having people review results, label quality, build an evaluation set. That evaluation work produces labeled (input, output) pairs. And if you orchestrate the system correctly, those labeled pairs can double as training data for a cheaper model.

What the Retraining Loop Would Look Like

I think the workflow would look something like this:

Run the frontier model on your extraction task.
Evaluate outputs. People review and label results. Automated checks can help with the obvious stuff (schema validation, format checks, cross-referencing source documents). This is ongoing work, not a one-time thing.
Evaluated outputs become your training dataset. Every (input, output) pair that’s been reviewed and labeled goes into your training set.
Fine-tune a smaller, cheaper model on this data. Once you have enough high-quality examples, fine-tune an open model like Llama or Mistral.
Evaluate the fine-tuned model out of sample. Check how it performs on examples it wasn’t trained on. This tells you whether it’s actually good enough to deploy.
Deploy the fine-tuned model for bulk traffic. If it meets your quality bar, route extraction volume through the cheaper model. Use the frontier model as a fallback for cases where the fine-tuned model’s confidence is low. You’d also want a way for people to flag bad outputs, so those can feed back into your training data.
Continue accumulating data and periodically retrain. The frontier model fallback path keeps generating new examples for people to evaluate. Each cycle grows the training set.

If you set up the right hooks - a pipeline for people to label outputs, a process for checking fine-tuned model performance out of sample, off-ramps for routing traffic to cheaper models - the retraining loop falls out of the orchestration. You’re connecting evaluation work you should already be doing to a fine-tuning pipeline.

This Is Just MLOps

The thing that struck me about this pattern is that it’s not new. It’s the same retraining loop that classical ML teams have been running for years. Collect labeled data, train a model, deploy it, monitor performance, collect more data, retrain.

In traditional ML, labeling was always the bottleneck. You’d hire annotators to label documents from scratch, and it was tedious, expensive, and slow.

Now frontier models are good enough that they can do the initial extraction at high quality, and you can use techniques like LLM-as-judge to get reasonable confidence that the outputs are correct. You still need humans in the loop reviewing labels, but you’re spot-checking outputs that are already mostly right rather than creating labels from zero. That makes the human review much less costly, which is a big part of why this pattern feels more viable now than it would have a couple years ago.

I’d imagine teams with classical ML experience have an advantage here. They already know how to build data pipelines, version datasets, run A/B tests between model versions, and monitor for drift. If you’ve ever built a retraining pipeline for a classification model or an NER system, you probably already know most of what this requires.

Where I Think This Does and Doesn’t Apply

I haven’t validated all of this myself, but based on what I’ve read, it seems like the pattern fits well when:

There’s a clear definition of “correct” - structured extraction, classification, entity recognition
Volume is high enough to justify the investment (thousands of examples)
The domain is relatively stable - document formats don’t change weekly
Cost matters at scale - you’re spending real money on frontier model API calls

And it probably fits less well when:

Outputs are subjective or hard to evaluate - summarization, creative writing, open-ended Q&A
The domain shifts frequently, invalidating your training data
You need the frontier model’s breadth and general reasoning, not narrow pattern-following
Volume is low - if you’re processing a hundred documents a month, just keep using the frontier model

From what I’ve read, fine-tuning starts making sense around a few hundred to a few thousand high-quality examples.

Fine-Tuning Has Gotten Easier

The last time I experimented with fine-tuning was through the OpenAI fine-tuning API, and it was mostly hello-world level stuff. I didn’t have a real use case to push it further.

The tooling seems like it’s gotten a lot better since then, and there are more options now. Unsloth is one I’ve seen mentioned frequently for running fine-tuning on a single GPU. I have some projects on the horizon where this kind of loop might be worth experimenting with, so I want to spend more time here.

Final Thoughts

The human evaluation part of this doesn’t go away, and I don’t want to downplay that. Someone needs to define what “correct” means, review outputs, and make judgment calls about when the fine-tuned model is good enough. But the orchestration around it - the pipeline from evaluated outputs to training data to fine-tuned model to deployment with off-ramps - that part seems like it can be set up once and mostly run itself.

I’m curious whether teams are actually doing this today, and what their experience has been. If you’re running frontier models on structured extraction at volume and you’re already evaluating outputs, it seems like you’re close to having what you need. I want to spend more time exploring this and see what it takes to get a working version of this loop running end to end.

What Is An Agent?

2026-02-21T00:00:00+00:00

This is my attempt to describe what an agent is and why it’s so incredible, yet simple. I’m not going to edit this and I’m not going to run this through an LLM. I’m also going to try to do this in five minutes or less. Let me know how I did, and what I missed that you think is important.

An agent is simply an LLM that can call tools
A tool is a computer application or a piece of code. For example, opening a file is a tool. Searching for a word in a file is a combination of tools.
Command line tools have been in existence on computers for a very long time.
These tools are composable and can solve almost any problem as it relates to files that are on the computer.
As an example, this ‘tool’:

cat notes/*.txt | tr '[:upper:]' '[:lower:]' | grep -oE '\b[a-z]{4,}\b' | sort | uniq -c | sort -nr | head

Takes all your text notes, normalizes the text, pulls out some words, counts them, and then shows the most common words.
The tool as read may make no sense to you if you’re not an experienced engineer, but the models know exactly how to ‘do this’ - meaning ‘generate that text and do it’
With a written instruction like ‘Find the most common topics in my notes.’ - The model has been tuned enough so that it can generate the above tool call.
Any computer application on your computer is a tool. Excel? It’s a tool. Slack? It’s a tool. Web Browser? Same
So if you blur your eyes, an agent is something that can simply control your computer. Pretty much any aspect of it.
So what if a tool doesn’t exist that does what you need the agent to do?
The agent has a tool that lets it write computer code. This tool is one of the best tools that it has because the foundation labs have spent a lot of time making sure this tool works well.
And computer code can solve almost any problem.
So you can have an instruction like: ‘Write me some code that analyzes this image and gives me the exact coordinates of where the red balloon is.’
Coding agents like Claude Code will see this and then try to write you some code that does that. It will often use other tools in the code itself. So it might download some libraries that it can use or some other techniques that it knows to be able to compose some software to solve the above problem.
Using a tool like Claude Code, eventually you’ll be able to build some code that solves the problem you’re talking about.
So what do you have now?
You have a new tool called the Find the Red Balloon tool.
And now you can give that tool to the agent so it can just use it next time.
Basically, you use an agent to build a tool that you can hand to another agent and it can use that tool whenever it needs to.
So now you can just say, “Find the red balloon in the image.” And the agent will use the tool to do that.
If the tool works, then it’s going to get it right. You can build deterministic tools that the agent can use. Even though LLMs are at their core, non-deterministic, you can bake in lots of determinism.
This is the magic, but also the simplicity of agents.
It’s just an LLM using a computer.
But it is incredibly flexible, incredibly generalizable, and incredibly composable. So basically you can solve almost any problem.
The other important thing here is the file system.
Most systems rely on data to provide any value.
It turns out if you put data in the right place in a file system, meaning folders on a computer, and you organize that well, the agent can just use tools to find what it needs.
So basically, agents come down to tools and file systems.
But those things can be assembled in so many different ways that you can solve incredibly hard problems that truly weren’t solvable before

Hope you found this interesting. There are obviously many other details around what agents are. But I really wanted to capture the core essence as I see it in a way that’s accessible to folks. Let me know how I did and how you think about it.

We Are Here

2026-02-17T00:00:00+00:00

This post is different than my other posts. I’ve found myself trying to write down all of the thoughts I’ve had about what AI is doing, particularly for how we build software. So much has changed and things are moving so fast yet it almost feels like there’s no time to even reflect on it all. And there are so many angles to take and perspectives to have. Meanwhile, things are changing so fast that even those perspectives shift rapidly. So this is my attempt to just get some words down. Not in a narrative forum or with a story or really any coherency, but just a list of thoughts I’ve been having and experiences I’ve been having as it relates to building software.

So why am I writing this down? Partly as a means of reflection. Partly as a way for people who are also thinking about this to feel seen. I don’t think there’s any particular reason or need to feel seen, but just acknowledging that I think there are more and more people who are having these types of thoughts about software, some of which are thrilling and some of which are discomforting. And I’m writing mine down. It’s also for folks who might not have had the time to explore these tools or understand just where we are at. If you read the following and you’re writing software and this all feels just very alien to you, I think it’s worth your time to explore - and I think there is an urgency to explore. My candid advice is that you need to do it now actually.

None of this is meant as a brag or anything of that nature. I’m just stating things as they are happening to me and what I’m seeing. Like the title says: We Are Here.

If you distill it all down to the core essence, It’s that things have changed in software so dramatically over the last six months that it’s truly a completely different thing. If you would have showed me this list three years ago, it would have been completely incomprehensible. I would have told you that there’s no way that these things are true.

So here’s the list and I’m not going to use AI to modify this or to make it sound better. Or anything like that. These are the raw notes.

I have not written a single line of code myself for at least the past four months. Zero.
I am the most productive I have ever been in my career. And I’m astounded by what I can accomplish on an almost daily basis.
I’ve never had more fun building.
My productivity in terms of what I can build has likely 10x’ed compared to two years ago.
Some things I can build in minutes, where it literally would have taken hours or days before.
I’ve never felt busier with work, but most of the times that aspect is energizing. I actually have a hard time putting it down, which is a feeling I haven’t had in a number of years. The last time I felt like this was probably when iOS came out and you could build iPhone apps in the late 2000s.
I rarely use my keyboard anymore, particularly at my home office. It’s just me speaking to my computer using Mac Whisper.
I find myself reading code less and less. Yes, I am still reading it and I’m not just vibe coding, but I’m finding other ways to ensure the system works as expected without reading all the details. There’s never been a better time to have these tools and still know what good looks like from previous experience.
I am building software that self-updates. Meaning that it emits information as it runs, and then is able to look at that information after it runs to make improvements to itself.
I often run “pre-determined commands” (e.g. skills or slash commands) that accomplish ‘operational’ work which would have two years ago, taken me several hours.
I can multi-task on extremely complex projects in disparate areas without feeling overwhelmed. In fact, sometimes this feels almost necessary with how tools like Claude Code and planning mode work. It’s like the code compiling days all over again.
I can record myself on a run talking through something I want to do, whether it be a document I want to produce or even a large code change that I want to make. When I get back to my desk, I can use this transcript with a predetermined command. And it almost always is able to one-shot the changes correctly or get very close.
I spend a lot of my time answering questions that an AI asks me. In fact, one of the most valuable ways to make sure these tools produce what I want them to produce is to allow them to ask me questions exhaustively.
The tools feel like a superpower now. And they continue to get rapidly better. There are truly step changes that have happened in the last couple months, and there is no sign that this is going to stop.
I am still surprised by software engineers who don’t see it or who don’t get it who are not doing it. I’m obviously deep in the rabbit hole, but it is so incredibly obvious to me that things have changed forever. Classical software engineering is over.
There is still an immense need for software engineering talent and specifically systems thinking. There’s never been a better time to apply systems thinking than right now. It feels like a cheat code to have been able to build software classically for the last 20 years and then be able to use these tools now.
There are still lots of quirks with how to use these tools. Many of the people in this space that I have high regard for are using the tools similarly - but there’s still a a lot of differentiation. The only way to figure this out is to get your hands dirty and do the thing.
I’m finding myself to be more reliant on AI to do specific things, and less reliant on it to do other things over time. As a specific example, planning mode is absolutely critical now for these tools, and I will spend a ton of time thinking through the plans and trying to build the plans without AI to help me as I first cut. Because that first step that the model takes and the direction it starts heading is enormously important. So it’s worth the time to think critically here.
So many things in this industry have changed so rapidly, particularly over the last six months. I intellectually believe it’s only going to accelerate but I still don’t fully think I’ve internalized what that means.
I have no idea where this is all going and definitely have my stretches of anxiety about it all. But I am here for it and I’m going to lean into it and I’m going to try to help others that want to do that too.

One more that’s useful to add – and this is the thing that I struggle with the most. I actually feel more and more behind every day. I know I am not, but honestly that is how fast things are moving – and it’s only getting faster.

If You Want to Play Games, You Have to Make Games

2026-02-16T00:00:00+00:00

This post is going to be different than my normal posts in a couple ways. Let me give you the TLDR first in case you want that:

Me and my kids built some games that you can play on the web. They spoke to my computer to build them, and Claude Code built them.
You can go play them now at confettigalaxy.com.
I was stunned at how well this works. I am stunned almost every day by the things I am able to do with AI.
It made me think more deeply about AI, education, and the skills that matter to be successful in society - something that, upon reflection, I’ve been avoiding thinking about.

The Story

My oldest daughter loves to play this online game platform. It’s web-based and has all these different options to play. Some of the games are fun and educational, but some of them are just mindless. I get it, it’s a fun way to use your time and it’s interactive. But there’s been a bunch of friction about her using it and wanting to use it all the time, and a lot of arguing about when she can use it.

I’ve been meaning to think more deeply about what my true view is on AI as it relates to kids. There’s no question that the entire way kids learn is going to have to fundamentally change. There’s also no question in my mind that the current educational systems will always be behind in figuring out how to integrate these new capabilities into learning experiences. AI is accelerating so fast at this point that the gap is just going to continue to get wider.

As someone who’s incredibly deep in this rabbit hole, I took a step back and tried to ask myself - why haven’t I thought more about this? I think it’s because I don’t have a good solution, and I’m worried about how this change takes shape, particularly for younger people. So it’s avoidance.

AI has made a step change in capabilities in the last three months. Things that have taken me weeks in the past now take me hours or even minutes. And it keeps compounding as you build systems to better utilize these tools.

So taking all of that into account - yesterday I had an idea. I told my daughter that if she wanted to play games, then she had to make games. As an eight-year-old who has no idea what AI is, that concept didn’t mean anything. So I had to show her what I meant. I opened up Claude Code with her. I know how to orchestrate this tool very well at this point because I’m in it every day, so we weren’t starting from scratch and I knew the scaffolding we had to set up. I worked with her and then with my other daughter to build games. I used the system in a way that had it ask us questions about what we’re trying to achieve.

It was super interesting because kids are so creative. They have ideas and oftentimes they have trouble describing them exactly, but as you keep asking them questions, they can really clarify their thinking. We worked through this for a while. I set up a little Q&A workflow using a Claude Code skill, but I also had my kids speak to the computer using Mac Whisper and tell their own ideas about the games. Then we had it build the games. I helped with some of the polish and I knew how to make them a little bit more interactive, but most of these games were the kids’ ideas with me asking them questions about what they wanted them to do.

Within an hour they were playing their own games. Then I told them that their friends can play their games too. I spent a little bit more time with Claude Code working on a plan to deploy these games so that anyone can play them. And here they are on confettigalaxy.com.

This was so fun that I built my own game afterwards. I built a game called Go Out For The Pros, which is a game I used to play with my dad at the playground. I described the experience using the skill workflow - what we used to do, how it worked - and it built the game. It’s honestly exactly how I remembered it. Absolutely magical.

My daughter wanted to build a dolphin swimming game where you swim around and collect candy. She described the whole thing herself and we built it together.

The Jumble of Thoughts

After this exercise, a lot of thoughts came to the surface that have been lingering for a while. I haven’t fully clarified them all, but it felt useful to just write them down. Here they are in their raw form.

Education is going to change

There’s a lot of uncertainty about how AI changes education. Educational systems will always be behind the leading edge, and the gap is only going to get wider. But I think the failure mode is not leaning into it. If you want your kids to understand AI and how it fits into society, I think you have to experiment and do that yourself right now.

The skills that matter aren’t new

A lot of the skills you need to use AI well aren’t new skills. They’re skills that need to be amplified to take full advantage of this new capability. The ones that come to mind most for me:

Agency. Do you believe that you can create something yourself, and will you be assertive in taking that initiative? This skill has always been rare and incredibly useful. People who know what agency can do and are assertive about it are in an incredible position to take advantage of these new tools.
Curiosity. Are you willing to think about things and make connections across various topics? Are you learning about things well outside of technology that can impact your viewpoints? Curiosity and making connections has never been more valuable.
Clear thinking and communication. Can you think through what you’re trying to do and communicate that with clarity? There is a vast difference in outcomes from using these AI tools based on your specificity and your clarity of thought.
Willingness to iterate. Do you treat things as things that can improve over time? Do you try to nudge things to improve in an incremental fashion? Utilizing these tools effectively truly requires a feedback loop and iteration, and that’s where you get the compounding. You have to be able to analyze what you’re getting back as the result and figure out how to improve it.
Comfort with discomfort. Can you push through the discomfort of trying something new? Even when it feels odd, can you deal with the change and adapt? Things are moving fast and the people who have the mental models to push through that discomfort instead of retreating from it are going to be in a much better position.

There are other skills that are useful for AI, but those are the top ones that come to mind. None of them are technical in nature. People have been trying to develop these types of skills for years, even before AI. If you want your kids to be able to thrive in what’s coming with this wave of AI, you need to figure out how to help them learn these skills.

These skills are learnable

Building the games yesterday showed me these skills are learnable with the right environment and the right person helping. My probing questions forced the kids to think more clearly about what they wanted, and they rose to it. And one of the big advantages of learning these skills with these tools is the feedback loop is so tight. You describe something, it builds it, and you can see the progress immediately. The kids were amazed by how fast we could build these games and it gave them a ton of energy to think about what else they could do.

These tools are absolutely remarkable

I intellectually know at this point that these things are extremely capable. My mental model has always been to try to throw your most ambitious project at these tools because they will continue to surprise you. But even with that knowledge, I am continually surprised.

This is a portal my kids can use instead of the previous games they were playing. We can build new games together in about 10 to 15 minutes. They can create their own adventure and then play it. We can build games they want to play instead of the mindless games they’ve been playing. And it forces them to utilize the skills I talked about above.

I think people still truly underestimate how remarkable these tools are. They can do stunning things. As someone who’s deep in this rabbit hole every day, I am stunned on a daily basis. I really think people need to see this for themselves. Dig in, try these things. It’s worth carving off time to do so because it is absolutely astounding what they can do.

Equity

Education is going to drastically shift given how these tools can be integrated. Some people are starting to figure this out and take action on it. But the folks who have figured it out already have the means, and the technology and the environment to use it are readily available to them. If you look at something like Alpha School, the results speak for themselves - the improvements in testing outcomes they’re seeing by integrating AI into how kids learn are significant. I think it’s pointing in the right direction. But it’s not accessible to everyone yet, and won’t be for some time.

I’ll be honest - I haven’t spent enough time thinking through how this all plays out at scale. But the opportunity is real. AI enables individually tuned lesson plans in a way that was never before possible, and research on personalized instruction consistently shows it’s more effective than one-size-fits-all approaches. If we don’t figure this out, the disparity in skills and understanding has the possibility to be orders of magnitude greater than anything we’ve ever seen before. How do we democratize access to what I’m talking about in this post in a way that works? How do we ensure as a society that this is available to everyone? I don’t have the answer, but it’s a question we need to be asking.

Where I Landed

I know this is a jumble of thoughts and I haven’t truly clarified them all. But they felt important enough to at least begin trying to clarify my thinking on them. The thread running through all of this for me is just acknowledging how much the world is going to change for our youth because of this technology. There are going to be a lot of decisions that need to be made and things that need to change to make sure we’re taking advantage of these remarkable tools, but doing it in a way that’s fair, equitable, and helps people thrive in society and the economy. I don’t have the answers to all of this. Short term, it’s on me to help my kids adapt, and the earlier I can do that the better off they are. But I’ve committed myself to spend more time thinking about this and figuring out what my role in it is beyond my own family.

On a lighter note, definitely check out the games at confettigalaxy.com because they are pretty fun. And if you’re curious about how you can build these games with your kids, I’ll do a follow-up post that is more technical in nature to show exactly what I did.

I Published My Portfolio Analysis Workflow - Try It Yourself

2026-02-12T00:00:00+00:00

I recently wrote about using Claude Code to build a portfolio optimization plan - I fed it our brokerage CSV exports, described our goals, and iterated over several sessions until it produced a phased action plan with tax impact analysis and fund recommendations. That post ended with a section called “If You Want to Try This” with tips on how to do it yourself from scratch. After it went up, a few people asked if I could just share the workflow so they didn’t have to start from zero.

So I did: MattStockton/portfolio-analysis.

If you haven’t used Claude Code skills before - a skill is basically a set of instructions you give Claude so it knows how to do something specific. It’s not code or a traditional app. It’s markdown files that describe a workflow, and Claude follows them when you ask it to do that thing. Because it’s just text, it’s not locked to Claude Code either. You could drop these files into a Claude project on claude.ai, use them with another AI tool, or just read them and adapt the approach yourself.

What I Changed to Make It Reusable

My original project was built around my specific situation. The parsers only handled the brokerage formats I happened to use. The fund classifications were built up one session at a time as Claude encountered my specific holdings. My tax calculations were hardcoded to my bracket. My allocation targets were baked into the project context. All of that worked great for me but wasn’t useful to anyone else.

To generalize it, I had Claude help me work through each of those pieces. The parsing logic now reads whatever headers are in your file and writes a parser on the fly instead of expecting my specific column layouts. I pulled the fund classifications into a reference database of 190+ funds across eight brokerages, with keyword matching for anything not in the database. Tax calculations got their own reference file covering federal and state brackets. And the allocation targets are now a set of seven templates (or custom) you can pick from or ignore entirely and define your own.

The other big thing was the workflow itself. In my original project, I wrote a long goal prompt and then iterated with Claude over several sessions. That worked because I was willing to put in the time and I knew what I was looking for. For someone picking this up cold, that’s a lot to figure out. So the skill breaks it into six steps with structured questions at each one - you’re picking from options instead of writing freeform prompts.

The goal was to get people 80% of what I got without having to do all the upfront work I did. The other 20% comes from the back-and-forth - clarifying your specific constraints, pushing back on recommendations, iterating until the plan fits your situation.

What You Get Out of It

You provide CSV exports from your brokerage accounts and the skill handles parsing them - Fidelity, Schwab, Vanguard, E*TRADE, Merrill, or whatever else. It reads the headers and figures out the format.

From there it classifies your holdings, runs a gap analysis against your targets, checks for tax-inefficient placements (like bonds in taxable accounts or cash sitting in Roth space), and looks at whether your recurring contributions are closing gaps or making them worse.

Then it puts together a phased action plan. Free moves in retirement accounts first, then low-tax fixes, then contribution changes, then long-term hold-and-dilute strategies. Each recommendation includes the specific fund, amount, account, tax cost, and rationale.

Try It

You need CSV exports from your brokerage accounts - most brokerages have a “Download” or “Export” button on the positions page. When you export, look for options to include extra fields like cost basis, Morningstar category, expense ratio, and fund ratings. The skill can figure a lot of this out from the fund symbol alone, but having it in the export gives it better data to work with. The repo README has setup instructions. I built it as a Claude Code skill, but you don’t need Claude Code specifically. The skill files are just markdown - if you upload them to claude.ai, ChatGPT, or any other AI tool that lets you provide reference files, it should pick up the workflow and do its best with it.

Privacy: The skill is just markdown files - it doesn’t collect or send anything anywhere. Your brokerage data goes to your LLM provider as part of the conversation, same as anything else you share in these tools.

Limitations: US tax system only. Tax brackets are based on 2025 law. Default templates lean toward index funds but it supports active and blended approaches. Not financial advice.

Last Thing

Your situation is different from mine and you’ll probably want to modify things. The skill is just markdown files - you can change the instructions, add your own constraints, or take it somewhere I didn’t. How far you get depends on how much effort you put into the back-and-forth.

Beyond the portfolio stuff, I think the more interesting idea is that this pattern works for any domain. If you’ve built up a good workflow with an AI tool over multiple sessions, you can probably extract it into a skill that other people can pick up and run with.

If you try it, let me know how it goes. Open an issue on the repo if something doesn’t work or if your brokerage format trips it up.

How I Built a Portfolio Optimization Plan with Claude Code

2026-02-10T00:00:00+00:00

I saw the below tweet recently and I did exactly this. Over multiple sessions with Claude Code, I fed it our brokerage CSVs, described our goals and constraints, and worked with it to produce a portfolio optimization plan - phased actions, tax impact analysis, fund recommendations, before-and-after allocation grids. The kind of plan you’d pay a financial advisor to put together, except I built it with an AI.

A year ago, I wrote about building a personal finance tool in a day with AI. That project required software development - I used AI to help me build an application with a UI, data pipelines, and visualization components. Models have gotten flexible enough that you don’t need to build an app anymore. An agent like Claude Code can read your files, write scripts, run them, and iterate on the output directly.

Overview of the workflow - from raw data and goal prompt through iteration to the final plan

Setting Up the Workspace

I used Claude Code, which runs in a terminal and can read and write files on your machine. I pointed it at a directory with all my financial data. Portfolio analysis is a good fit for this because there’s a lot of messy data in different formats that the model can just dig through directly.

I started with three inputs:

CSV exports from my various financial accounts - holdings, amounts, cost basis, fund fees, fund ratings, and a ton of other data the model could use
A supplementary text file - there was other information I knew the model would need but no good way to export it, so I typed up a scratchpad with a bulleted list of things like bank balances, 529s, employer retirement account details, and automated investment settings. Not structured, but it had the raw numbers.
A goal prompt describing what I wanted

The goal prompt was the most important piece. I spent a lot of time on it and wrote it mostly by hand before having the AI review it - I wanted to force myself to think through all of our constraints and goals. I included:

What data exists and where - a narrative of what files are in the directory and what the model should be considering
The desired output format - I brainstormed the kinds of things I wanted to see: allocation tables, style-box grids, holdings charts, different ways to slice the data. And ultimately, a concrete plan for how to modify the portfolio to get to a target allocation.
Our preferences - I want the portfolio to be tax efficient, I’d rather find the lowest fee ETFs that are good than chase performance, and I don’t want to overcomplicate things. I’m well-versed in this stuff but complexity isn’t the goal.
Known constraints - everyone has their own nuances. For us, we’d decided not to touch the allocations or ongoing contributions for a specific account, so I called that out as a constraint.
What we suspected was wrong - I had a view on things that could be better, so I included it, but I caveated it with “if you have a different view, I want to hear it.” I didn’t want to oversteer the model toward a specific outcome. If you have insight that can nudge it in the right direction, it’s worth providing - just make sure you give it room to look outside your ideas too.
Strawman proposals - I included our own rough ideas for target allocations and investment changes. Even if they’re wrong, it gives the model a starting point to react to rather than building from scratch.
“Ask me questions exhaustively” - I put this at the end of the prompt. It gets the AI to ask clarifying questions before jumping straight to conclusions, which makes the first plan way better.

I also built up a CLAUDE.md file - project instructions that Claude reads at the start of every session. I had Claude set it up initially after asking me some questions, but after a couple sessions I started having Claude update it with any learnings from each session. Over time it became the project’s memory - data quirks, strategy decisions, target allocations. When I came back days later, Claude picked up right where we left off.

Parsing the Mess

Once I had the goal prompt and data ready, I handed the wheels over. Claude (Opus 4.6) worked for a while on its own - writing Python scripts to parse all the CSVs, classify every holding into target categories, handle encoding issues, deduplicate overlapping exports, and deal with non-standard line items that would get misclassified. It iterated through problems as it hit them and ended up in a really good place. The baseline it produced - a full gap analysis across every account and category - was impressive enough to start working from immediately.

Iterating as Thought Partners

The initial plan identified overweight categories, found tax-location violations, recommended specific fund swaps with expense ratio comparisons and Morningstar ratings, and calculated one-time tax costs versus ongoing savings. It even recommended tax-loss harvesting partner funds for every new position.

From there, it was like working with an analyst who has all the data but hasn’t lived with the accounts. I’d clarify something or reframe a constraint, and Claude would update every calculation, table, and recommendation in the plan. A few examples:

I refined our international allocation with Claude. The initial plan had a single “international” bucket, but I wanted more nuance - specific targets across value vs. growth, small vs. large. I worked back and forth with Claude to figure out what that breakdown should look like and which funds would get us there. That kind of allocation decision is hard to think through alone because every change affects the rest of the portfolio.

I asked Claude to self-critique - “rate this plan 1-10 and tell me what you’d change.” It gave itself a 7.5 and identified seven improvements. The biggest was hiding in plain sight: we had a large overweight position sitting in tax-free retirement accounts, where selling costs literally zero in taxes. The original plan had left it untouched. It also caught that its own retirement account deployment was putting money into a category that was already overweight. I approved the changes but asked for a phased approach rather than all-at-once.

I reframed which accounts count toward targets. We decided to treat my wife’s employer retirement account as static for now - we wouldn’t touch it or change its allocations. This is an artificial constraint, but it simplifies the plan, and once we lock in the other changes we can always run the process again with new constraints if we want to modify that account later. The initial analysis left it out entirely. But it still holds real money in specific categories, and that affects what the rest of the portfolio needs to do.

Once I pointed this out, Claude built a combined framework quickly: define targets for the full portfolio, subtract what the static account provides, and optimize the rest to fill the gaps. Some overweight categories got worse, some underweight ones became more urgent, and the recommended purchases changed.

What It Produced

I ended up with a ten-section plan document with appendices - and several sections went beyond what I asked for:

Complete portfolio snapshot - Every account, every holding, with cost basis, embedded gains, expense ratios, and tax treatment. All parsed automatically from the raw CSV exports.
Gap analysis across ten fund categories - Current allocation versus targets, with the dollar gap and priority level for each. Included a visual bar chart showing the drift and a nine-box style grid (value/blend/growth vs large/mid/small) showing the portfolio tilt before and after.
Combined portfolio framework - A table showing how an account we’d decided not to touch still changes the effective targets for the rest of the portfolio.
Four-phase sequenced action plan - Concrete steps organized from “execute this week” through “ongoing quarterly.” Each action specified the fund, the dollar amount, the account, the tax cost, and the rationale. Separate actions for retirement accounts (zero tax) versus taxable (calculated tax impact).
Tax impact summary with payback periods - A table for every recommended trade showing the gain, tax cost, annual tax savings, and the breakeven timeline. A separate table for positions explicitly not recommended to sell - calculating the tax cost if you did sell and explaining why the math didn’t work.
Fund recommendations with deep research - An active buy list, a stop-buying list, and a hold-and-dilute list. Claude compared candidates on factor loadings, expense ratios, Morningstar ratings, AUM, and whether they screen out low-quality companies. Each recommendation included a tax-loss harvesting partner: a similar-but-not-identical fund from a different provider, so if a position drops you can capture the loss and immediately buy the partner to maintain exposure without triggering wash sale rules.
Tax location matrix - Which asset type belongs in which account type (pre-tax, Roth, taxable) and why, with a list of current violations and the action to fix each one.
Retirement contribution structure optimization - Claude looked at my wife’s employer retirement contributions unprompted and found that switching one component from after-tax to pre-tax treatment would save thousands annually, with the ability to convert at much lower rates during early retirement.
Natural dilution timeline - Projections for how long each overweight category takes to reach target through new money allocation alone, accounting for recurring contributions and dividend flows.

Anonymized excerpt from the plan - before/after allocation shift across categories

Anonymized excerpt from the plan - nine-box style grid showing portfolio tilt

Anonymized excerpt from the plan - tax impact and payback period for a recommended trade

To be clear, this wasn’t a science experiment. I’m actively using this plan and I’m almost done executing it. I had Claude remix the full document into a shorter checklist so I can scan it and check things off as I go.

If You Want to Try This

Export everything you can into files. Download CSV holdings from every brokerage account. Create a text file for anything without an export - bank account balances, employer retirement plan details (holdings, contribution amounts, fund options), automated investment settings, any pending changes you’re considering.
Write a detailed goal prompt. State your tax situation, risk tolerance, and time horizon. List your constraints - accounts you don’t want to touch, positions you can’t sell, contribution limits. Include what you already suspect is wrong. Add a strawman target allocation if you have one - it gives the AI something concrete to react to. Ask for a sequenced action plan, not general advice. And end with “ask me questions exhaustively” - the AI will ask clarifying questions instead of filling in the blanks, so you’re not leaving anything ambiguous about what you actually want.
Iterate like you would with a human advisor. The first plan will be solid but will miss things only you know about your accounts. Each clarification is fast; the AI updates every calculation in the plan instantly. Ask it to rate its own plan 1-10 - it’ll find things it missed.
Build project context that persists. If you’re using Claude Code, a CLAUDE.md file in your project directory carries across sessions. Add key decisions, data quirks, and what you’ve figured out as you go. Every future session builds on everything that came before.

The Work Compounds

This blog post was written with the help of Claude Code. I pointed it at the same project directory, told it to read the session history and plan documents, and asked it to write a post about the process. The work you do with these tools compounds - you can distill it into other artifacts or remix it with other content. The prompt that kicked off this session:

Because the project context was already in the filesystem - the optimization plan, the session summaries, the strategy decisions - Claude had everything it needed to draft the post. I iterated on it from there, but the starting point was strong because all the context was already there from previous sessions.

A year ago this would have been a week of spreadsheet work or a few thousand dollars to a financial advisor. Instead it was a few evenings of back-and-forth with an AI, and the plan it produced is the one we’re executing.

What I Tell People Getting Started with Claude Code

2026-01-29T00:00:00+00:00

I recently spent an hour walking a friend through how I use Claude Code. He’s not a developer - he manages a portfolio of 90+ companies and does a lot of knowledge work: newsletters, performance reports, meeting notes. He’d heard me talk about these tools and wanted to see what was possible.

It was a good conversation. Pairing on this together - actually trying things instead of just describing them - worked better than I expected. He could see things in action and ask questions as we went. Afterwards I wanted to write down some of the takeaways, some of which I’ve written about before.

Most of these are habits rather than features - ways of working with the tool that make everything else easier.

Plan Mode

Claude Code has a “plan mode” - a thinking mode before execution. When you put Claude in plan mode, it won’t take any actions. It thinks through what you’re asking for and asks clarifying questions. Type /plan or press shift-tab before describing what you want.

I told my friend: always be in plan mode for anything non-trivial. Plan mode lets you explore without committing to anything. You can describe a vague idea, let Claude ask questions, refine your thinking, and only then decide whether to proceed. If you jump straight to execution, Claude might start creating files or making changes before you’ve fully figured out what you want. When planning is complete, Claude will ask if you want to clear the context and execute. Clear the context - it prevents the model from getting confused by all the back-and-forth exploration that happened during planning. You keep the plan, lose the noise.

You’re never committing to anything until you’ve seen Claude’s proposed approach and explicitly approved it.

The Interrogation Pattern

You don’t need 100% clarity on what you want before you start. You can use Claude to pull it out of you.

Tell Claude to “ask me questions exhaustively” or “use AskUserQuestion exhaustively to understand what I want.” Claude will keep asking clarifying questions until you tell it to stop. It pulls information out of you that you didn’t know you needed to provide. It asks about edge cases you hadn’t considered. It catches assumptions you didn’t realize you were making.

You don’t need to be precise upfront. You don’t need to know the right terminology or anticipate what Claude needs to know. Just describe your goal and let Claude figure out what questions to ask. It helps to know how to instruct the model, but you don’t need to be perfect at it - you can iterate, and answering questions is easier than crafting the perfect prompt.

Work Logging and Compounding

Claude Code loves files. It can read them, search them, reference them later. So store information about your work in files - work logs, commit messages, meeting notes, project summaries. Be verbose. You can always trim it down later, but you can’t recover context you didn’t capture.

When Claude can read what you’ve done before, it produces better outputs. You’re not starting from zero every session. I can ask Claude to look at my git commits and summarize what I worked on last week. I can point it at meeting notes and have it draft a follow-up. None of this happens if you’re using Claude as a chat interface that forgets everything between sessions.

Automate the capture where you can. You can get Claude Code to log its own work by setting up good CLAUDE.md files and skills - and you can use plan mode and the interrogation pattern to build those. Tell Claude what you want to track, let it ask questions, and have it create the instructions for itself. You’re not going to get it right the first time. But as you figure out patterns for storing what you’ve learned, things just keep getting better. That’s how I put it to my friend: “things just magically get better” as context accumulates.

Git Repository Backing

This is probably the biggest technical hurdle for folks who aren’t technical. But I do think it’s necessary, and it’s achievable - Claude Code can help you get it set up.

A git repository is just a folder where changes are tracked over time. Every time you save a checkpoint (called a “commit”), git remembers what changed and lets you add a note about why.

Files need history - not just for version control, but so the system itself can reference what changed and when. I have a skill that looks at uncommitted changes, figures out what I did based on the changes and session context, adds an entry to my work log, and creates a detailed git commit. When I need to know what I worked on, I ask Claude to look at the git commits since a certain date and give me a summary. It looks at the changes in each commit and tells me what I did.

History gives Claude something to work with beyond just the current state of your files.

The Slot Machine Mentality

Bias toward action and iteration rather than reviewing everything upfront. Work happens fast enough now that you can throw things away - if something isn’t working or heads in the wrong direction, scrap it and start over.

I’ve gotten more comfortable not reading every detail of what Claude plans to do. I just let it execute. Because first, it’s probably right. And second, if it’s not right - pull the slot machine again. It’s lower cost to iterate than to review everything upfront.

You don’t need to understand every line of what Claude produces. You need to understand the output and whether it matches what you wanted. If it doesn’t, try again. Describe what’s wrong and let Claude fix it. If a session goes off the rails, type /rewind to backtrack to an earlier point in the conversation.

Plan mode, git, and /rewind all let you back out of mistakes. Don’t let the desire for certainty slow you down.

Building Skills Through Doing

Skills are saved instructions that tell Claude how to do a specific task. They’re markdown files that describe what Claude should do, what questions to ask, and how to format the output. I have skills for summarizing meetings, committing changes with work log updates, turning transcripts into blog posts. Once you have a skill, you type a command and Claude follows those instructions.

I wouldn’t try to create skills before you’ve done the task manually at least once. Work through a real task with Claude - post a transcript, describe a document you need, iterate until you get something you like. Then tell Claude to turn that conversation into a skill you can run next time.

You don’t even need to read the skill Claude creates. Just trust that it captured what worked. Next time you have a similar task, invoke the skill. If the situation is slightly different, just tell Claude - it adapts. You figure out what works through doing, then codify it.

Getting Started

People are often hesitant to start because they’re not sure how. But there’s not much downside here. You can explore without committing, revert when things go wrong, start without knowing exactly what you want. If something breaks, you try again.

The skills that matter most for using Claude Code are non-technical: curiosity, persistence, clear thinking, confidence. It helps to understand concepts like git and file organization, but Claude Code can help you learn those too - if you keep trying things. After our session, my friend said what helped most was actually seeing what this looks like in practice. That’s hard to get from reading. You have to try it.

If you’re already using Claude Code and have patterns that work for you, I’d like to hear about them.

The Biggest AI Opportunity Isn’t Better Models

2026-01-10T00:00:00+00:00

Six months ago, an old colleague, Martin, reached out to me after reading some of my writing about AI. He’s an expert in zeolites - microporous crystalline minerals used in catalysis and filtration. Only about 250 zeolite structures have been synthesized, but millions are theoretically possible. We sat down and he showed me his website, Hypothetical Zeolites, and software he’d built to computationally generate and test potential new structures. People from industry were using it. He’d been at this for years. I was fascinated by this domain I had no idea existed - the next day I spent an hour on ChatGPT voice going back and forth learning more about zeolites. It made me realize just how many different things you can be an expert on in this world.

He had ideas for improving his platform, so I nudged him to start experimenting with AI tools. Claude Code existed but was still nascent - not yet widely adopted. He started using various AI tools and had some success with them. Now, six months later, thinking about what he wanted to build - ways to make his research faster - I keep imagining what would happen if he got proficient with today’s tools. Really proficient, not just dabbling. He might be able to move 10x faster on problems he’s been chipping away at for years.

That got me thinking. How many other people like Martin are out there? By “like Martin” I mean: deep expertise in a narrow field, probably building tools or workflows to support their work, but not deep in the rabbit hole of understanding the best ways to use leading edge AI tools.

Here’s what happens when someone like that figures it out. Andrew Hall, a political economist at Stanford, recently shared on Twitter how he used Claude Code to replicate and extend one of his old papers on vote-by-mail and election turnout. It downloaded his original repo, translated Stata code to Python, crawled the web for updated election and census data, ran new analyses through 2024, created tables and figures, performed a lit review, wrote a new paper, and pushed everything to GitHub. The whole thing took about an hour. His take: “This is an insane paradigm shift in how empirical work is done.” What he did is impressive - and once you know what these tools can do, it’s not surprising at all.

My Hypothesis

Martin’s work is unique, but this pattern isn’t. There are specialists in every field - rare diseases, educational methods, obscure industrial processes. They’ve built tools and workflows for years. Many are using AI tools, but there’s a gap between that and knowing what the leading edge can actually do.

AI coding tools can now write working software from a description of what you want. That’s what Andrew Hall did - and it’s what Martin could be doing too.

The tools exist. The experts exist. But the experts don’t know what the tools can do, and the people who know the tools don’t understand what the experts are working on.

If This Is True, What Do You Do About It?

So how do you fix this? My first thought was scale - build a platform, write guides, create content that reaches lots of people.

But I don’t think that’s how this works. You need to show people the capabilities in the context of what they’re already doing - that’s what creates the “aha” moment. A generic guide or tutorial just can’t do that.

I keep hearing this narrative: if AI progress stopped today, we’d have 10 years of adoption work ahead of us just to integrate what already exists. I actually think that’s true based on what I’ve seen. It’s frustrating - we have these superpowers available and it’s going to take a while for most people to use them. That’s probably always going to be true. But is there a way to shortcut this for people like Martin?

I think there is. Find people like Martin - people doing work where I look at it and think “I have no idea what this is, but it seems important.” People where I can see a pattern for how they could use these tools more effectively. Sit down with them, understand what they’re trying to do, and figure out what’s possible.

Two Asks

If you’re a specialist doing deep work in a narrow field: I want to hear what you’re working on. What’s tedious? What would you build if you could? I’m not selling anything - I just want to understand where you’re stuck and show you what might be possible.

If you’re already deep in AI like me: Think about finding a Martin in your network. The most useful thing you can do right now probably isn’t building another demo. It’s sitting down with someone who knows their field cold and showing them what these tools can do for their specific work.

If this sounds like you, you have a view on this hypothesis, you know someone like Martin, or you know someone who’s already in the weeds using these tools to accelerate their work - send me a message on LinkedIn or shoot me an email. I’d love to hear more examples.

A Data Engineering Lesson for AI Agent Builders

2026-01-09T00:00:00+00:00

Nobody can agree on what an agent is, but most definitions share a common thread: it’s software that uses an LLM to take actions, not just generate text. It can read files, call APIs, run commands, and make decisions based on what it finds. I wrote more about this in my post on tool calling and file system access.

Foundation Capital’s article on context graphs has been circulating widely, and NLW covered it on his podcast. The core idea: as agents orchestrate decisions across companies, they can capture the reasoning and context that led to specific outcomes. This information currently lives in Slack threads, people’s heads, or nowhere. Agents are in the execution path. They can record it.

One piece of their argument that I think deserves more attention: don’t over-constrain the format of what you capture. Let the agent figure out what belongs in the trace. This connected to something I learned years ago in data engineering.

The Web Scraping Lesson

At CircleUp, we were building an authoritative data set for understanding consumer packaged goods - what products existed, what brands were out there, where they were sold, how they changed over time. A core part of this was pulling in unstructured data from various sources. One example: scrapers that collected product information from brand websites and retailers - prices, descriptions, SKU numbers, ingredients. We cared about this data over time, not just a snapshot. We wanted to track how products changed month to month.

The MVP version of these scrapers transformed data inline. We’d find the HTML elements for the attributes we wanted, extract those values, and store them in a database.

Then someone asked about ingredient changes. We could add ingredient tracking going forward, but what about the historical data? Products that had reformulated - we couldn’t see what their ingredients used to be. We’d thrown away everything except the fields we thought we needed.

We caught this early and fixed it. The fix: store the entire HTML page, even if you only need three fields right now. Storage is cheap. You can’t recreate the ability to extract new information from historical data. The lesson stuck.

Once you learn this, your brain gets tuned to ask: what should I be storing that I’m not? API responses where you only need one field? Store the full payload. Event streams you’re filtering? Keep the raw stream somewhere. You can always add structure later. You can’t go back and capture what you didn’t store.

Decision Traces Are the Same Problem

Agent reasoning works the same way. Consider an agent handling subscription renewals. A customer asks for a discount. The agent checks their support ticket history - three escalations in the past quarter, two unresolved for over a week. It looks at usage patterns - engagement dropped 40% after the last outage. It reviews similar customers who churned and spots warning signs. Based on all this, it recommends a 15% discount to retain the account.

Most systems would only capture the outcome - a field in the CRM that says “discount: 15%”. The reasoning - the support history it reviewed, the usage patterns it analyzed, the churn signals it identified, the similar cases it compared against - gets thrown away.

Systems of record capture state. They don’t capture decision lineage. The Foundation Capital article calls the accumulated decision lineage a “context graph” - traces stitched together over time so precedent becomes searchable.

The connection to my document generation agents post is direct. I called it “context tracing” - asking the model to generate a decision log alongside its output. What was considered? What was rejected? Why did it rate this evidence as strong vs. weak?

But I prescribed a specific format for those traces because I was targeting specific document types. Reading about context graphs made me reconsider: should I loosen those constraints?

Why Unstructured Traces Are Better

The natural approach is to define a schema for decision traces upfront. But if you lock in a structure too early, you limit what you can extract later.

Models are good at turning unstructured data into structured data - assuming the information exists in the unstructured source. If it’s not there, no post-processing will create it.

Verbose, unstructured traces let you change your mind. You can extract different structures later for different purposes. You can ask the model to create different views on the same data.

You can also discover patterns you didn’t anticipate. If you’re capturing verbose reasoning and later notice your agents consistently make similar exceptions for a certain type of customer, that’s a pattern you can codify - or at least understand. Structured traces with predefined fields would never surface it.

This is the data engineering pattern: store raw, transform on read. Store the full HTML. Later run Spark or DuckDB to extract whatever structure you need. If your transformation logic has bugs, fix it and rerun. The raw data is your source of truth.

With decision traces, the transformation uses an LLM instead of traditional ETL. More expensive per run, but negligible compared to not having the data at all.

Context Engineering for Tracing

Unstructured doesn’t mean effortless. You still have to invest in the prompts and context that tell the agent how to trace its reasoning.

You can’t just say “capture a decision trace.” You have to define what that means - but in terms that don’t constrain what the model can include. The instructions should describe what kinds of reasoning to document, not a rigid schema.

Something like: “For each decision, document what information you considered, what alternatives you evaluated, what made you choose this option over others, and any uncertainty you have about your choice.” Specific enough to be useful. Doesn’t prescribe fields or formats.

The decision trace is almost as important as the decision itself. Once you’ve collected enough traces, you can use an LLM to distill learnings from them. Ask it what patterns it notices. Take a contrarian approach: “What should we be doing differently based on these traces?” Having that information gives you more to work with as you iterate. The traces feed back into improving the agent.

The One-Way Door

Some decisions are hard or impossible to reverse. Choosing not to capture decision traces is one of them. Each agent execution generates reasoning that, if not recorded, is sealed away.

Most people building agent systems focus on getting to the right decision. What’s the correct discount rate? Did the agent extract the right claims? That matters. But if you’re only optimizing for the immediate output, you’re walking through one-way doors with every execution.

The clearest path to keep this door two-way: capture unconstrained, verbose decision traces. Give real thought to how you instruct the agent to do this. It’s a core part of your system design.

Store the raw trace. Structure it later.