<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.2">Jekyll</generator><link href="https://www.cesarsotovalero.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.cesarsotovalero.net/" rel="alternate" type="text/html" /><updated>2026-04-02T06:00:24-07:00</updated><id>https://www.cesarsotovalero.net/feed.xml</id><title type="html">César Soto Valero</title><subtitle>César - Computer Scientist</subtitle><author><name>César Soto Valero</name><email>NaN</email></author><entry><title type="html">SDD and the Future of Software Development</title><link href="https://www.cesarsotovalero.net/blog/sdd-and-the-future-of-software-development.html" rel="alternate" type="text/html" title="SDD and the Future of Software Development" /><published>2026-02-01T00:00:00-08:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/sdd-and-the-future-of-software-development</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/sdd-and-the-future-of-software-development.html"><![CDATA[<aside class="youtube">
        <a href="https://www.youtube.com/watch?v=maIBlxGubeI"><div class="box">
        <img src="https://i.ytimg.com/vi/maIBlxGubeI/mqdefault.jpg" alt="YouTube video #maIBlxGubeI" />
        <div class="play">
          <img src="/img/icons/youtube-play20px.svg" alt="Play Video" class="youtube-logo" />
        </div>
        </div></a>
        <div>Is Spec-Driven Development the Future of Software Development?; 17 January 2026.</div></aside>

<p>The next massive software failure will probably not come from a missing line of code.
It will come from a missing sentence.</p>

<p>Right now, developers are using AI agents to write code faster than they can actually read it.
I’m doing it too!
The latest models can generate 1,000+ LoC in seconds.
But that’s not the real problem.
The problem is that speed removes <em>friction</em>.
And friction used to be where <em>thinking</em> happened.</p>

<p>Recently, GitHub ran controlled experiments where developers using Copilot finished tasks faster (with numbers often quoted around “up to 55%”).<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>
And that sounds like a dream. Until you realize that speed without direction is just a faster way to reach the wrong place.
In other studies, experienced developers got slower when AI increased review load, coordination cost, and rework.<sup id="fnref:11"><a href="#fn:11" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>
So the real question is not “Can AI write production-ready code?”
It definitely can!</p>

<p>The question is:</p>

<blockquote>
  <p>“Can we keep AI speed without turning our systems into a high-throughput confusion factory?”</p>
</blockquote>

<p>That is where <a href="https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html">Spec-Driven Development (SDD)</a> shows up.
The idea is simple: <strong>write a clear, structured specification (“spec”) before the code</strong>, so the AI is pointed in the right direction and stays within well-defined constraints instead of running on “vibes.”</p>

<p>You may wonder:</p>

<blockquote>
  <p>Are we just reinventing <a href="https://en.wikipedia.org/wiki/Waterfall_model">1990s Waterfall</a>, but this time with an AI-powered engine on top?</p>
</blockquote>

<p>In this post, I’ll cover 3 things:</p>

<ol>
  <li><strong>What SDD actually is</strong> and why vague thinking is now your biggest technical liability.</li>
  <li><strong>The three levels of SDD</strong> and the specific one that quietly traps most teams.</li>
  <li><strong>A practical checklist</strong> to decide which features deserve a real spec and which ones you should just build fast and move on.</li>
</ol>

<p>By the end, you’ll know whether SDD is a real shift in how software gets built, or just another heavyweight idea that sounds smart and wastes everyone’s time.</p>

<h1 id="what-is-sdd">What is SDD?</h1>

<p>In SDD, the <em>specification is the primary artifact</em>.</p>

<p>Not the code. Not the framework. Not the Jira tickets.
The spec is the source of truth and everything else is downstream.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p>

<p>And no, the spec is not a vague paragraph or a wish list.
A real SDD spec is a structured, testable description of behavior.</p>

<p>The spec defines:</p>

<ul>
  <li>What the system must do (and what it must not do)</li>
  <li>Edge cases</li>
  <li>Business rules</li>
  <li>Failure modes</li>
  <li>An explicit <a href="https://en.wikipedia.org/wiki/Scrum_(project_management)#Definition_of_Done_(DoD)">definition of done (DoD)</a></li>
</ul>

<p>The goal is to guarantee unambiguous behavior from the AI coding agent.</p>
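<p>One way to keep such a spec honest is to treat it as structured data rather than prose. Here is a minimal sketch; the field names and the completeness rule are illustrative assumptions, not a standard:</p>

```python
from dataclasses import dataclass


@dataclass
class Spec:
    """A hypothetical, minimal shape for an SDD spec as structured data."""
    must: list[str]                # required behaviors
    must_not: list[str]            # explicit prohibitions
    edge_cases: list[str]
    failure_modes: list[str]
    definition_of_done: list[str]  # explicit DoD

    def is_complete(self) -> bool:
        # A spec with an empty section is still ambiguous.
        sections = (self.must, self.must_not, self.edge_cases,
                    self.failure_modes, self.definition_of_done)
        return all(len(s) > 0 for s in sections)
```

<p>The point is not the data structure itself; it’s that an empty “failure modes” section becomes a visible gap instead of an unspoken assumption.</p>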

<h1 id="why-is-sdd-relevant-now">Why is SDD Relevant Now?</h1>

<p>SDD is not new.
What’s new is the environment it now lives in.</p>

<p>In the past, we used to say “code is the bottleneck.”
Research in requirements engineering has been blunt for years: a big chunk of effort ends up as rework, and unclear or shifting requirements are a major driver.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<aside class="quote">
    <em>“In a world of infinite code, clarity is the scarce resource.”</em>
</aside>

<p>But now code is cheap, and ambiguity is expensive.
AI raises the stakes because it will happily translate fuzzy intent into crisp code without asking clarifying questions.
Think of an <a href="https://en.wikipedia.org/wiki/AI_agent">AI agent</a> like a brilliant intern who never says “I don’t understand.”
If your requirements are fuzzy, it won’t stop.
It will confidently generate a clean, professional-looking solution that could be catastrophically wrong.</p>

<p>And there’s a second issue that feels almost philosophical: Humans transmit information slowly.
We type maybe 40 words per minute on a good day.
That speed forces reflection.
But AI can generate 2,000 lines in minutes.
It removes the pause where thinking used to happen.</p>

<p>When thinking disappears, ambiguity leaks straight into production.
This is why some researchers now treat prompts as a form of requirements, and argue that classic requirements methods will become even more valuable in the generative era.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">5</a></sup></p>

<p>Same problem → Higher speed → Bigger blast radius.</p>

<h1 id="sdd-vs-waterfall">SDD vs Waterfall</h1>

<p>Yes, SDD feels like Waterfall; you are not crazy.
They both say <em>“think before you build.”</em>
But they diverge on the most important thing:
Waterfall assumes you can predict the future.
SDD assumes you cannot, so it builds a fast loop where the spec, tests, and code evolve together.
Modern SDD discussions emphasize tight iteration and executable checks, not month-long requirements phases.<sup id="fnref:9"><a href="#fn:9" class="footnote" rel="footnote" role="doc-noteref">6</a></sup>
So yes, it looks like Waterfall if you only look at the order of steps.
It behaves like something else if you look at the feedback loop.</p>

<h1 id="a-matter-of-intent">A Matter of Intent</h1>

<p>Let’s be honest: most bugs are not code bugs.
They are argument bugs.</p>

<p>You did not write incorrect logic.
You postponed a decision until it exploded in production.</p>

<p>Example: a payment gateway requirement says:</p>

<blockquote>
  <p>“The system must block <em>risky</em> transactions.”</p>
</blockquote>

<p>One person interprets “risky” as “high-risk country.”
Another interprets it as “amount above $1,000.”
Both are reasonable.
But both can be wrong in different ways.</p>

<p>Research shows this is not just anecdotal.
Practitioners interpret conditionals in requirements inconsistently, even when they believe they are being precise.<sup id="fnref:10"><a href="#fn:10" class="footnote" rel="footnote" role="doc-noteref">7</a></sup></p>

<p>SDD forces that decision to happen early.
When stakeholders are calm, when assumptions are visible, and when the codebase still has exactly zero lines of code.</p>

<h1 id="sdd-as-a-pipeline">SDD as a Pipeline</h1>

<p>Spec-Driven Development is not about writing more documentation.
It’s about removing ambiguity before the agent starts making decisions for you.</p>

<p>A practical workflow looks like this:</p>

<pre><code class="language-mermaid">%%{init: {'theme':'base'}}%%
flowchart TB;

S["Spec (requirements + constraints + edge cases)"] --&gt; P["Plan (tasks)"]
P --&gt; I["Implementation (AI agent + humans)"]
I --&gt; T["Verification (tests, properties, gates)"]
T --&gt;|pass| M["Merge + Deploy"]
T --&gt;|fail| S

subgraph "What changes with AI"
S
P
T
end
</code></pre>

<h2 id="requirements">Requirements</h2>

<p>Describe what the user experiences from the outside (not how the system works).</p>

<p>Example:</p>

<blockquote>
  <p>“When a user submits a support ticket, they receive a clear and empathetic response within five seconds. The response must reference their issue and suggest a concrete next step.”</p>
</blockquote>

<h2 id="design">Design</h2>

<p>Define structure and boundaries.
What information can the agent see?
What tools can it use?
What is it not allowed to do?</p>

<p>Example:</p>

<blockquote>
  <p>“The agent can read the ticket text and query the internal FAQ. It cannot invent policies, promise refunds, or escalate issues on its own.”</p>
</blockquote>

<h2 id="tasks">Tasks</h2>

<p>Tasks must be explicit, sequential, and boring.</p>

<p>Example:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">1.</span> Extract the user’s core problem in one sentence.
<span class="p">2.</span> Match it against FAQ categories.
<span class="p">3.</span> Select the most relevant solution.
<span class="p">4.</span> Generate a response using approved tone guidelines.
</code></pre></div></div>

<p>No “think carefully.”
No “use best judgment.”
Agents do exactly what you specify.
Nothing more.</p>
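<p>Those four tasks can be sketched as plain functions, with a toy FAQ standing in for the real knowledge base; every name and the keyword matching below are illustrative assumptions:</p>

```python
def extract_problem(ticket_text: str) -> str:
    # Task 1: reduce the ticket to one sentence (first sentence as a naive stand-in).
    return ticket_text.strip().split(".")[0]

FAQ = {  # hypothetical FAQ categories mapped to approved solutions
    "password": "Reset your password from the account settings page.",
    "billing": "Check the invoice section for a detailed breakdown.",
}

def match_category(problem: str) -> str:
    # Task 2: match against FAQ categories by keyword, not "best judgment."
    for keyword in FAQ:
        if keyword in problem.lower():
            return keyword
    return "unknown"

def generate_response(ticket_text: str) -> str:
    # Tasks 3-4: select the most relevant solution and wrap it in an approved tone.
    problem = extract_problem(ticket_text)
    solution = FAQ.get(match_category(problem), "A support agent will follow up shortly.")
    return f"Thanks for reaching out about: {problem}. {solution}"
```

<p>Boring on purpose: each step is explicit, testable, and leaves the agent nothing to improvise.</p>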

<h2 id="build">Build</h2>

<p>Now you turn the spec into checks.
Does the response reference the user’s issue?
Is the response under 120 words?
If it fails, the build fails.
That’s the whole loop.
It is intentionally simple.
Because when working with non-deterministic tools, simplicity is not a style choice.
It’s a reliability strategy.</p>

<h1 id="the-three-levels-of-sdd">The Three Levels of SDD</h1>

<p>Not all SDD looks the same.
In practice, there are three levels:</p>

<h2 id="level-1-spec-first">Level 1: Spec-First</h2>

<p>You write a spec to clarify your own thinking, use it to guide the agent, then move on.
Perfect for MVPs and experiments.
The spec is useful even if it dies tomorrow.</p>

<h2 id="level-2-spec-anchored">Level 2: Spec-Anchored</h2>

<p>The spec is a living artifact.
If the code changes, the spec must change.
This is where high-performing teams tend to land because it balances flexibility with discipline.
GitHub’s Spec Kit explicitly pushes the idea of specs that can drive implementation and verification, not just explanation.<sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">8</a></sup>
This is also where most teams quietly fail.
Not because they cannot write specs.
Because they cannot keep them alive.</p>

<h2 id="level-3-spec-as-source">Level 3: Spec-as-Source</h2>

<p>You edit the spec, and tooling regenerates the implementation.
This is the low-code dream.
Some newer tools and platforms point in this direction by making the spec the center of gravity.<sup id="fnref:8"><a href="#fn:8" class="footnote" rel="footnote" role="doc-noteref">9</a></sup></p>

<blockquote>
  <p>⚠️ My take: Level 3 is dangerous right now for complex systems. When the spec becomes the source of truth, the spec becomes the codebase.
Meaning if you cannot express intent precisely (and natural language is notoriously ambiguous), you are moving bugs upstream and making them look like prose. At that point, you reinvented a programming language. But fuzzier. And some of the loudest skepticism you’ll hear about SDD is basically this argument, stated more rudely.<sup id="fnref:12"><a href="#fn:12" class="footnote" rel="footnote" role="doc-noteref">10</a></sup></p>
</blockquote>

<h1 id="reality-check">Reality Check</h1>

<h2 id="-green-light">🟢 Green light</h2>

<p>SDD excels at:</p>

<ul>
  <li>New features</li>
  <li>Greenfield systems</li>
  <li>Modernization efforts</li>
</ul>

<p>When the slate is clean, a clear spec is a superpower.</p>

<h2 id="-red-light">🔴 Red light</h2>

<p>SDD struggles with deeply entangled legacy systems.
If logic is spread across five services, three cron jobs, and one senior engineer’s memory, a “clean spec” is not a document problem.
It’s an archaeology problem.
AI agents will suggest elegant refactors that ignore the decade of hacks quietly keeping the business alive.</p>

<p>Also, a personal confession:</p>

<blockquote>
  <p>Sometimes I can understand code faster than I can understand a folder full of Markdown.
That might be because we have decades of tooling to navigate code, and far less mature tooling to navigate living specs.</p>
</blockquote>

<p>So yes, SDD needs tooling support, not moral superiority.
And there’s another reason SDD is having a moment:
AI-assisted development can increase duplication, churn, and architectural drift when you do not constrain the agent with explicit rules and boundaries.<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">11</a></sup><sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">12</a></sup></p>

<p>Here is an example of a simple spec verification in Python:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
</pre></td><td class="code"><pre><span class="kn">import</span> <span class="n">re</span>

<span class="k">def</span> <span class="nf">validate_support_reply</span><span class="p">(</span><span class="n">ticket_text</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">reply</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">
    Spec constraint: keep responses short and reference the user</span><span class="sh">'</span><span class="s">s issue.
    This is not </span><span class="sh">"</span><span class="s">AI evaluation.</span><span class="sh">"</span><span class="s"> This is build gating.
    </span><span class="sh">"""</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">reply</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">reply</span><span class="p">.</span><span class="nf">strip</span><span class="p">():</span>
        <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">Reply must not be empty</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">word_count</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="n">reply</span><span class="p">.</span><span class="nf">split</span><span class="p">())</span>
    <span class="k">if</span> <span class="n">word_count</span> <span class="o">&gt;</span> <span class="mi">120</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Reply must be &lt;= 120 words (was </span><span class="si">{</span><span class="n">word_count</span><span class="si">}</span><span class="s">)</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># Naive overlap check (good enough to catch obvious failures).
</span>    <span class="n">ticket_words</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sa">r</span><span class="sh">'</span><span class="s">\W+</span><span class="sh">'</span><span class="p">,</span> <span class="n">ticket_text</span><span class="p">.</span><span class="nf">lower</span><span class="p">())</span>
    <span class="n">reply_lower</span> <span class="o">=</span> <span class="n">reply</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span>

    <span class="n">overlaps</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">ticket_words</span> <span class="k">if</span> <span class="nf">len</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">5</span> <span class="ow">and</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">reply_lower</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">overlaps</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">Reply does not reference the user</span><span class="sh">'</span><span class="s">s issue enough</span><span class="sh">"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<h1 id="when-to-use-sdd">When to Use SDD</h1>

<p>Run this three-point check:</p>

<ol>
  <li><strong>Risk:</strong> Does this touch money, security, or permissions? Use SDD.</li>
  <li><strong>Longevity:</strong> Will someone else maintain this in a year? Use SDD.</li>
  <li><strong>Exploration:</strong> Are you probing an API or testing an idea? Skip SDD. Just code.</li>
</ol>
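<p>The three-point check fits in a few lines of code. One caveat: the precedence here (risk wins over exploration) is my own reading of the list, not something the checklist states:</p>

```python
def needs_spec(touches_risk: bool, long_lived: bool, exploratory: bool) -> bool:
    """Hypothetical helper encoding the three-point check."""
    if touches_risk:
        # Money, security, or permissions deserve a spec even in a prototype.
        return True
    if exploratory:
        # Probing an API or testing an idea: just code.
        return False
    # Someone else maintains it in a year: write the spec.
    return long_lived
```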

<p>In my opinion, the future isn’t “always spec-driven.”
The future is spec-driven when being wrong is expensive.</p>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>GitHub, <a href="https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/"><em>Research: quantifying GitHub Copilot’s impact on developer productivity and happiness</em></a> (2022). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:11">
      <p>METR, <a href="https://www.metr.org/blog/2025-02-27-measuring-the-impact-of-ai-on-experienced-developers/"><em>Measuring the impact of AI on experienced developers</em></a> (2025). <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p>Thoughtworks, <a href="https://www.thoughtworks.com/radar/techniques/spec-driven-development"><em>Spec-driven development</em></a> (Technology Radar technique entry, 5 Nov 2025). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>Lars-Ola Damm, Lars Lundberg, and Claes Wohlin, <a href="https://fileadmin.cs.lth.se/cs/Personal/Lars_Ola_Damm/Lars-OlaDamm_JSS2008.pdf"><em>A Model for Software Rework Reduction through Improved Requirements Engineering Practice</em></a> (Journal of Systems and Software, 2008). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>Andreas Vogelsang, <a href="https://arxiv.org/abs/2408.09127"><em>From Specifications to Prompts: On the Future of Generative LLMs in Requirements Engineering</em></a> (IEEE Software column preprint, 2024). <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:9">
      <p>Thoughtworks, <a href="https://www.thoughtworks.com/insights/blog/agile-engineering-practices/spec-driven-development-unpacking-2025-new-engineering-practices"><em>Spec-driven development: Unpacking one of 2025’s key new AI-assisted engineering practices</em></a> (4 Dec 2025). <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:10">
      <p>Jannik Fischbach et al., <a href="https://arxiv.org/abs/2109.02063"><em>How Do Practitioners Interpret Conditionals in Requirements?</em></a> (arXiv, 2021). <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7">
      <p>GitHub, <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/"><em>Spec-driven development with AI: Get started with a new open source toolkit</em></a> (2 Sept 2025). <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8">
      <p>Tessl, <a href="https://tessl.io/blog/tessl-launches-spec-driven-framework-and-registry/"><em>Tessl launches spec-driven framework and registry</em></a> (23 Sept 2025). <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:12">
      <p>Discussion thread capturing practitioner skepticism about SDD-as-source and ambiguity in natural language, <a href="https://news.ycombinator.com/item?id=45610996"><em>Hacker News thread: “Spec-Driven Development (SDD)”</em></a> (Sept 2025). <a href="#fnref:12" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5">
      <p>GitClear, <a href="https://gitclear-public.s3.us-west-2.amazonaws.com/GitClear-AI-Copilot-Code-Quality-2025.pdf"><em>AI Copilot Code Quality: 2025 Look Back at 12 Months of Data</em></a> (report, 2025). <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6">
      <p>Thoughtworks, <a href="https://www.thoughtworks.com/en-gb/radar/techniques/complacency-with-ai-generated-code"><em>Complacency with AI-generated code</em></a> (Technology Radar technique, 23 Oct 2024). <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="coding" /><summary type="html"><![CDATA[AI can now generate code faster than you can humanly read it. But that is not the problem. The problem is ambiguity moving at machine speed. Spec-Driven Development (SDD) makes the specification (a.k.a, the "spec") the primary artifact so AI agents get constraints instead of vague requirements. This post explains what SDD is, why it is rising in popularity now, the three levels of SDD, where it works, where it breaks, and a practical checklist you can use right away.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2026/2026-02-01/red-letters-plus-engine-machine.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2026/2026-02-01/red-letters-plus-engine-machine.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Projections for Data-Related Roles in The AI Era</title><link href="https://www.cesarsotovalero.net/blog/projections-for-data-related-roles-in-the-ai-era.html" rel="alternate" type="text/html" title="Projections for Data-Related Roles in The AI Era" /><published>2026-01-15T00:00:00-08:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/projections-for-data-related-roles-in-the-ai-era</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/projections-for-data-related-roles-in-the-ai-era.html"><![CDATA[<aside class="youtube">
        <a href="https://www.youtube.com/watch?v=U06hExVnVbw"><div class="box">
        <img src="https://i.ytimg.com/vi/U06hExVnVbw/mqdefault.jpg" alt="YouTube video #U06hExVnVbw" />
        <div class="play">
          <img src="/img/icons/youtube-play20px.svg" alt="Play Video" class="youtube-logo" />
        </div>
        </div></a>
        <div>Data Science in 2026: From Building Models to Owning Systems; 11 January 2026.</div></aside>

<p>I worked as a data scientist in 2025.
From what I can see, data-related roles might quietly mutate this year without an official memo.</p>

<p>Today, ML models are faster to build and easier to replicate than ever before.
We have cloud infrastructure, mature libraries, and powerful AI code assistants.
However, keeping models running in production (and tethered to real decisions) is harder than ever.
Which means the skill companies will pay for in 2026 is owning production systems end to end.</p>

<p>The shift is already underway.</p>

<p>So, my aim with this post is threefold:</p>

<ol>
  <li>To map how data roles are drifting from “model builder” to “system owner.”</li>
  <li>To define this kind of ownership in practical terms (pipelines, monitoring, governance, and business outcomes).</li>
  <li>To outline a career path that increases leverage for data professionals without chasing shiny tools.</li>
</ol>

<p>Let’s dive in!</p>

<h1 id="from-artifacts-to-outcomes">From Artifacts to Outcomes</h1>

<p>You probably won’t get an email this year saying “<em>Tell me what system you own or start packing your desk</em>,” but you might already feel the shift in expectations.
The interview questions are changing, the job descriptions are changing, and performance metrics are no longer what they used to be.</p>

<p>Today, you deliver a model, then get asked about latency, drift, real-world performance, and whether the metric moved or misled.</p>

<h1 id="from-models-to-systems">From Models to Systems</h1>

<aside class="quote">
    <em>“Eventually, you realize the model artifact was never the deliverable.”</em>
</aside>

<p>In real-world ML, the model is a small part of the system.
Most production cost comes from the surrounding machinery.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>
The first version ships fast, but subsequent versions become a maintenance job.
The scoreboard is not the model’s accuracy, but whether <em>business outcomes</em> improved or not.</p>

<p>So the winning move has changed.
Impact beats elegance and reliability beats cleverness.
From artifacts to outcomes.
A model you can’t operate is just a well-formatted opinion.
An outcome you deliver is a career booster.</p>

<h1 id="which-model-is-the-best-now-matters-the-less">Which Model Is Best Now Matters Less</h1>

<p>Model training is no longer the bottleneck.
Better defaults, <a href="https://www.automl.org/automl/">AutoML</a>, <a href="https://github.com/lukasmasuch/best-of-ml-python">awesome ML Python libraries</a>, and cheaper compute make “getting a model working” the easy part.
Even when modeling is actually hard, the path is increasingly well-paved.</p>

<p>But the truth is: most real failures aren’t from weak models but from systems that lie because the input data shifted or the business process changed.</p>

<p>Production systems often see performance drops not because the architecture was wrong, but because the system around it failed to react to distribution changes.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></p>

<p>So the job will evolve:</p>

<ul>
  <li>From “<em>Do you know the latest DL architecture?</em>” to “<em>Can you run this in production, from end to end?</em>”</li>
  <li>From “<em>Do you know the newest library?</em>” to “<em>Do you know what breaks first?</em>”</li>
  <li>From “<em>Do you know how to tune hyperparameters?</em>” to “<em>Do you know how to monitor drift?</em>”</li>
</ul>

<p>The most valuable skill is the ability to own the whole chain. That’s ownership.</p>

<h1 id="production-system-ownership">Production System Ownership</h1>

<p>Owning a system means you run a decision-making business process, not a Jupyter notebook.</p>

<p>Ownership forces you to think beyond code and offline metrics.
It forces you to care about failure modes, feedback loops, and incentives.
To accept that data is not stable and users are not rational.
That acceptance is a career booster.</p>

<p>In practice, ownership spans five surfaces:</p>

<ol>
  <li><strong>Data:</strong> quality, contracts, lineage.</li>
  <li><strong>Model:</strong> evaluation, documentation, intended use.</li>
  <li><strong>Delivery:</strong> deployment, latency, rollback.</li>
  <li><strong>Monitoring:</strong> drift, performance, cost.</li>
  <li><strong>Governance:</strong> risk, logs, compliance.</li>
</ol>

<p>If you ignore any one of them, the system will eventually ignore you back.
Because for most real-world systems the model is a component, not the whole product.</p>

<p>For example, if you build <a href="./blog/from-classical-ml-to-dnns-and-gnns-for-real-time-financial-fraud-detection">a financial fraud detection model</a> but ignore data quality, you’ll get garbage in and garbage out.
If you ignore monitoring, you’ll miss drift and degrade <a href="./blog/evaluation-metrics-for-real-time-financial-fraud-detection-ml-models">key evaluation metrics</a> silently.
This could lead to lost revenue, regulatory fines, or customer churn.</p>

<p>So the overall system is the product.
The business outcome is the scoreboard.
And it’s my belief that the best way to stay relevant in the coming years is <em>to own that scoreboard</em>.</p>

<h2 id="skills-metrics-and-problem-framing">Skills, Metrics and Problem Framing</h2>

<p>You don’t need to learn everything.
You need to learn the parts that keep working when tools change (i.e., the parts that compound across projects and years).</p>

<p>For example, if you’re tasked with improving customer churn prediction, you dig in and find:</p>

<ul>
  <li>The training data excluded users who churned in the first week (survivorship bias).</li>
  <li>The model predicts churn probability, but no one defined what action to take at which threshold.</li>
  <li>Predictions aren’t logged, so you can’t measure whether high-risk users were actually contacted.</li>
  <li>The feature pipeline breaks silently when a new product launches, because no schema validation exists.</li>
</ul>

<p>You fix the system:</p>

<ol>
  <li>Add schema checks and drift monitoring on input features.</li>
  <li>Work with the product team to define intervention thresholds (e.g., contact users with &gt;70% churn risk).</li>
  <li>Log predictions and outcomes to measure intervention effectiveness.</li>
  <li>Set up a weekly dashboard tracking both model performance and business impact (retention lift).</li>
</ol>
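<p>Step 1 can start as small as a hand-rolled contract check that fails loudly instead of letting the pipeline break silently. The feature names, types, and rules below are hypothetical:</p>

```python
import math

EXPECTED_SCHEMA = {  # hypothetical feature contract for the churn model
    "tenure_days": float,
    "weekly_logins": float,
    "support_tickets": float,
}

def check_schema(row: dict) -> None:
    # Fail loudly: a raised error beats a silently degraded model.
    missing = set(EXPECTED_SCHEMA) - set(row)
    if missing:
        raise ValueError(f"Missing features: {sorted(missing)}")
    for name, expected_type in EXPECTED_SCHEMA.items():
        if not isinstance(row[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")
        if math.isnan(row[name]):
            raise ValueError(f"{name} is NaN")
```

<p>Wire a check like this into the feature pipeline and a new product launch that changes the schema becomes a loud failure at ingestion time, not a quiet drop in retention lift.</p>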

<p>Six months later, retention improves by 8%, and you can prove causality. That’s ownership. The model didn’t change, the system did!</p>

<h2 id="success-definition">Success Definition</h2>

<p>A metric that is observable and attributable is worth more than a sophisticated loss function.
If you can’t articulate how your metric ties to a decision, you’re optimizing a shadow.
Make the metric boring, then make it true.</p>

<p>Here’s how:</p>

<ul>
  <li>Ask this question relentlessly: “<em>What decision will this change?</em>”</li>
  <li>Then ask the second question: <em>“How will we know it changed for the right reason?</em>”</li>
</ul>

<p>That is problem framing!</p>

<h2 id="data-as-a-product">Data as a Product</h2>

<p>In mature data organizations, data is treated like an interface, with clear ownership, contracts, and checks.</p>

<p>Modern <a href="https://en.wikipedia.org/wiki/Data_mesh">data mesh thinking</a> emphasizes delivering analytical data as a product with interfaces and ownership, not as a byproduct of engineering work.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>
In my experience, better engineering around data often beats endless model tweaking.</p>

<p>For me, it works well to start small: add schema contracts, simple checks that fail loudly, and lineage so “what changed?” becomes a quick answer, not a week-long hunt.
That’s what professionals call <em>reliability</em>, and it’s a career booster.</p>
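<p>As a minimal sketch of such a contract (assuming a pandas pipeline; the column names and dtypes here are illustrative, not from any real system):</p>

```python
import pandas as pd

# Illustrative schema contract: column names and dtypes are assumptions.
SCHEMA = {
    "user_id": "int64",
    "plan": "object",
    "monthly_spend": "float64",
}

def validate(df: pd.DataFrame, schema: dict = SCHEMA) -> pd.DataFrame:
    """Fail loudly on missing columns or unexpected dtypes."""
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    for col, dtype in schema.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
    return df
```

<p>Run this at pipeline boundaries so a new product launch breaks the build, not the model silently.</p>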

<h2 id="testing-and-production-readiness">Testing and Production Readiness</h2>

<p>Many ML teams under-test because they overestimate the capabilities of their models.</p>

<p>A principle from production ML readiness: <a href="https://en.wikipedia.org/wiki/List_of_publications_in_data_science">“it works locally” is not a phase of software development</a>.
Production-ready models have tests, monitoring, rollback plans, and observability.</p>

<p>In the eyes of stakeholders, a model without monitoring is a demo, a pipeline without tests is a risk, and a system without rollback is an incident waiting for its calendar invite.
As a data professional, be the person who prevents that invite.</p>

<h2 id="real-world-ml">Real World ML</h2>

<p>Reality changes, then your model follows, then your dashboard lies, and the business takes a hit.
<a href="https://en.wikipedia.org/wiki/Concept_drift">Concept drift</a> is the default state of deployed systems.
You need monitoring for input distributions, outputs, and business KPIs—and a playbook for what to do when drift hits.</p>

<p>Here’s a tiny practical pattern: log predictions and the features that drove them.
Track a drift signal per feature (e.g., <a href="https://www.geeksforgeeks.org/data-science/population-stability-index-psi/">PSI</a>) with thresholds.
Alert on change and define a response.</p>

<p>Here’s a simple PSI implementation in Python:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">psi</span><span class="p">(</span><span class="n">expected</span><span class="p">,</span> <span class="n">actual</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-6</span><span class="p">):</span>
    <span class="n">quantiles</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">quantile</span><span class="p">(</span><span class="n">expected</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="nf">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">bins</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    <span class="n">quantiles</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">quantiles</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="n">inf</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">inf</span>

    <span class="k">def</span> <span class="nf">hist</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
        <span class="n">counts</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">histogram</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">quantiles</span><span class="p">)</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">counts</span> <span class="o">/</span> <span class="nf">max</span><span class="p">(</span><span class="n">counts</span><span class="p">.</span><span class="nf">sum</span><span class="p">(),</span> <span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nf">clip</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">eps</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>

    <span class="n">e</span> <span class="o">=</span> <span class="nf">hist</span><span class="p">(</span><span class="n">expected</span><span class="p">)</span>
    <span class="n">a</span> <span class="o">=</span> <span class="nf">hist</span><span class="p">(</span><span class="n">actual</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nf">sum</span><span class="p">((</span><span class="n">a</span> <span class="o">-</span> <span class="n">e</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="n">a</span> <span class="o">/</span> <span class="n">e</span><span class="p">))</span>
</code></pre></div></div>

<p>The point: measurable signal, threshold, and response, so that when drift hits (and it will), you’re ready to act.</p>
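<p>Putting that together, here is a self-contained sketch of a drift playbook. It repeats the <code>psi</code> function from above and applies the common rule-of-thumb thresholds (below 0.10 stable, 0.10 to 0.25 moderate, above 0.25 major); the synthetic data is illustrative only:</p>

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    # Quantile-based bins from the baseline, open-ended at the extremes.
    quantiles = np.quantile(expected, np.linspace(0, 1, bins + 1))
    quantiles[0], quantiles[-1] = -np.inf, np.inf

    def hist(x):
        counts, _ = np.histogram(x, bins=quantiles)
        p = counts / max(counts.sum(), 1)
        return np.clip(p, eps, 1)

    e, a = hist(expected), hist(actual)
    return np.sum((a - e) * np.log(a / e))

def drift_response(score):
    # Rule-of-thumb thresholds; tune them to your own system.
    if score > 0.25:
        return "major shift: investigate, consider retraining"
    if score > 0.10:
        return "moderate shift: monitor closely"
    return "stable: no action"

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 10_000)      # training-time feature values
production = rng.normal(0.5, 1, 10_000)  # shifted production values

print(drift_response(psi(baseline, production)))
```

<p>Wire the response string into your alerting, and the playbook stops being tribal knowledge.</p>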

<h2 id="llms-add-a-new-failure-mode-non-determinism">LLMs Add a New Failure Mode (Non-Determinism)</h2>

<p>LLMs make capability cheap and behavior slippery.
Nondeterminism is baked into the architecture: sampling settings like <code class="language-plaintext highlighter-rouge">temperature</code> deliberately introduce randomness, so outputs vary across runs, even with identical inputs.
Your system becomes probabilistic even when your code isn’t.</p>
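<p>A toy illustration of why: temperature-scaled softmax sampling over made-up “next token” scores. Nothing here is a real model’s code; it just shows that identical inputs plus sampling yield varying outputs:</p>

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample an index from temperature-scaled softmax probabilities."""
    if rng is None:
        rng = np.random.default_rng()
    # Lower temperature sharpens the distribution toward the argmax;
    # higher temperature flattens it, so repeated runs diverge more.
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5]  # pretend "next token" scores
runs = [sample_token(logits, temperature=1.5) for _ in range(5)]
# Identical inputs, yet `runs` can differ from one execution to the next.
```

<p>At temperature near zero the sampler collapses to the argmax; every real deployment sits somewhere on that dial.</p>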

<p>Once integrated, new failure modes appear:</p>

<ul>
  <li><strong>Prompt regressions:</strong> A small wording change can flip the output.</li>
  <li><strong>Silent degradation:</strong> The model changes under you (version updates, temperature drift), and you don’t notice until users complain.</li>
  <li><strong>Cost spikes:</strong> Token counts explode because the model got chattier or a prompt was rewritten poorly.</li>
</ul>

<p>Here’s what helps:</p>

<ol>
  <li><strong>Version prompts like code.</strong> Store them in Git (not a Notion doc). Track changes, review diffs, and tie versions to deployments.</li>
  <li><strong>Test prompts systematically.</strong> Use a fixed eval set with known inputs and expected outputs. Measure pass rate, not vibes.</li>
  <li><strong>Monitor output distributions.</strong> Track length, sentiment, keyword frequency, any signal that catches behavioral drift before it cascades.</li>
  <li><strong>Track cost per call.</strong> Set alerts when average token counts spike. LLMs are elastic infrastructure; cost is a signal, not just a bill.</li>
</ol>
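<p>For point 2, a minimal eval harness can be surprisingly small. Here’s a sketch, assuming a hypothetical <code>generate(prompt)</code> wrapper around your LLM call and an illustrative eval set:</p>

```python
# Hypothetical eval set: fixed inputs with a required keyword each.
EVAL_SET = [
    {"input": "Refund policy for damaged items?", "must_contain": "refund"},
    {"input": "How do I reset my password?", "must_contain": "password"},
]

def pass_rate(generate, eval_set):
    """Fraction of eval cases whose output contains the required keyword."""
    passed = 0
    for case in eval_set:
        output = generate(case["input"]).lower()
        if case["must_contain"] in output:
            passed += 1
    return passed / len(eval_set)

# Gate deployments on a threshold, e.g. fail CI when pass_rate drops below 0.95.
```

<p>Keyword checks are crude, but even crude checks turn “vibes” into a number you can watch across prompt versions.</p>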

<p>The pattern is the same as classic ML: make behavior observable, drift detectable, and regressions catchable.
The difference is that LLMs fail in new ways: silent, stochastic, and expensive.
So you add new defenses: prompt tests, output monitoring, and cost tracking.</p>

<p>That’s ownership in the LLM era.</p>

<h2 id="governance-is-part-of-the-job">Governance Is Part of the Job</h2>

<p>If your system influences people, money, or risk, governance shows up.</p>

<p>The <a href="https://en.wikipedia.org/wiki/Artificial_Intelligence_Act">European Union’s AI Act</a> mandates quality, transparency, human oversight, and safety obligations for high-risk AI systems.
Deployers are expected to monitor and log behavior, assign human oversight, and ensure input data relevance.<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>
Compliance is insurance: auditability, explainability of intent, and clarity when things go wrong.</p>

<h1 id="role-shifts-to-expect">Role Shifts To Expect</h1>

<p>Here’s how I see data-related roles evolving in 2026:</p>

<ul>
  <li>
    <p><strong>Analytics roles</strong> are becoming more decision-shaped. Dashboards are cheap once data is available, so value moves to interpretation and “what should we do next?” If you can’t recommend action, you’re doing archaeology.</p>
  </li>
  <li>
    <p><strong>Applied ML roles</strong> are becoming more end-to-end. The center of gravity shifts from model building to operating the full loop: definition, deployment, monitoring, and iteration. Accountability becomes the differentiator.</p>
  </li>
  <li>
    <p><strong>Research roles</strong> still exist, but they aren’t the default. Only a handful of organizations push the frontier; most companies are trying to turn capabilities into reliable products. That gap is widening, so pick your lane intentionally.</p>
  </li>
</ul>

<h1 id="career-advice">Career Advice</h1>

<p>Optimize for ownership.
Pick one production system and own the loop: define the metric, ship the pipeline, add monitoring, write a runbook, and measure impact.
Ownership is the fastest teacher.</p>

<p>Learn “just enough engineering.”
You don’t need to become a backend polyglot.
You do need deployment basics, monitoring, versioning, and incident response.
Knowing how work ships turns you from “the data person” into “the person who makes things happen.”</p>

<p>Translate your work into business outcomes.
Accuracy is interesting; outcomes are persuasive.
Speak in revenue, cost, risk, retention, and customer experience.
Then decision-makers stop seeing you as technical support, and start seeing you as a partner.</p>

<h1 id="summary">Summary</h1>

<p>In 2026, the folks who struggle in data roles aren’t the ones missing the newest tool.
They’re the ones who stop at analysis and hand off reality.
They ship a model, deliver a slide deck, and say “I’m done.”
But reality laughs and shifts the distribution.</p>

<p>The people who thrive do something simpler and rarer.
They stay curious when systems get messy.
They lean in when production contradicts the notebook.
They care about outcomes, not artifacts.</p>

<p>When building models becomes easier, shifting toward complexity becomes the right move.
And it’s my belief that owning systems end to end is that move.</p>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>D. Sculley et al., <a href="https://papers.neurips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf"><em>Hidden Technical Debt in Machine Learning Systems</em></a>, NeurIPS 2015. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>Aria Khademi et al., <a href="https://arxiv.org/abs/2302.00775"><em>Model Monitoring and Robustness of In-Use ML Models: Quantifying Data Distribution Shifts Using Population Stability Index</em></a>, arXiv 2023. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5">
      <p>Article 26, <a href="https://artificialintelligenceact.eu/article/26/"><em>Obligations of Deployers of High-Risk AI Systems</em></a> (AI Act detail). <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="career" /><summary type="html"><![CDATA[Models are getting easier to train and easier to copy. The scarce skill is owning the whole system: data, deployment, monitoring, governance, and the business outcomes that fall out of it.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2026/2026-01-15/fatbursparken.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2026/2026-01-15/fatbursparken.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Why Creativity Just Became the Most Practical Skill</title><link href="https://www.cesarsotovalero.net/blog/creativity-just-became-the-most-practical-skill.html" rel="alternate" type="text/html" title="Why Creativity Just Became the Most Practical Skill" /><published>2026-01-10T00:00:00-08:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/creativity-just-became-the-most-practical-skill</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/creativity-just-became-the-most-practical-skill.html"><![CDATA[<aside class="youtube">
        <a href="https://www.youtube.com/watch?v=pHqoIPAlOcw"><div class="box">
        <img src="https://i.ytimg.com/vi/pHqoIPAlOcw/mqdefault.jpg" alt="YouTube video #pHqoIPAlOcw" />
        <div class="play">
          <img src="/img/icons/youtube-play20px.svg" alt="Play Video" class="youtube-logo" />
        </div>
        </div></a>
        <div>Why Creativity Just Became the Most Practical Skill; 8 January 2026.</div></aside>

<p>In 1996, Larry Page was a PhD student at Stanford.
Just the right time and place to be a protagonist in the rise of the nascent internet.
Back then, most search engines were trying to figure out how to provide better ranking results automatically.
They were all reading the content of webpages, using keywords, text parsing, in-page signals, and so on.
But Larry tried something different.</p>

<p>He asked a different question:</p>

<p>“<em>What if a page’s value is not about the information it contains, but in how many other pages point to it?</em>”</p>

<p>That reframing became the <a href="https://en.wikipedia.org/wiki/PageRank">PageRank algorithm</a>, a ranking system built on the web’s hyperlink graph (not just its text).<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>
The rest is history. PageRank helped shape what later became the dominant search engine and one of the most prominent companies in the world: Google.</p>

<p>But this post is not about the history of web or search engines (I wrote <a href="./blog/the-evolution-of-the-web-from-html-to-webassembly">another post</a> about it).
It’s about another kind of engine.
The engine behind human breakthroughs, something we call creativity as perspective-shift.</p>

<p>We are now in the AI era.
Right now, LLMs can generate thousands of <del>surprisingly</del> “good” outputs.
However, they still struggle to choose what is worth building, what is actually relevant (and <em>why</em>).
It’s my belief that this selection function (taste, judgment, values) is where humans still keep an edge over AI.</p>

<p>So, in this post, I’ll cover three things:</p>

<ol>
  <li>Why creativity just became the most practical skill for humans.</li>
  <li>How creativity works according to what research suggests happens in our minds.</li>
  <li>How you can grow your creativity with simple, repeatable practices.</li>
</ol>

<p>Let’s dive in!</p>

<h1 id="creativity--productivity">Creativity ≠ Productivity</h1>

<p>Recently, I put a lot of effort into bullying my brain into being creative.
I treated creativity like productivity and assumed intensity would eventually pay off.
So I tried to focus harder, longer, and with more discipline than ever.
The result was more effort and fewer breakthroughs.</p>

<p>Creativity doesn’t reward effort the way execution does.</p>

<p>Creativity prefers slack, not strain.
Neuroscientists describe a relevant system called the <em>Default Mode Network</em> (DMN), which becomes more active when you’re at rest and thinking internally.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>
The DMN is linked to mind-wandering, memory, and our ability to recombine unrelated things into new patterns.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p>

<p>In plain terms, the DMN is the brain’s background “merge request” process.
When you never give it idle time, nothing new gets merged.</p>

<p>So instead of focusing harder, what I do is switch states.</p>

<p>One well-known study found that an undemanding task that allows the mind to wander can improve creative incubation compared to staying locked in.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<p>But this doesn’t mean boredom is magical.
It means boredom is permissive, and your brain takes the permission seriously.
If you want originality, the best way is to stop supervising every thought.</p>

<figure class="badge"><a href="https://a.co/d/28tGFRS"><img src="/img/badges/daniel.png" style="width:140px;max-width:100%;" alt="badge" /></a></figure>

<p>Now when I’m stuck, I stop trying harder and start making space.</p>

<ul>
  <li>I walk without inputs (no phone, no podcast).</li>
  <li>I do dishes, clean the floor, or do something else.</li>
  <li>I take the boring route home.</li>
</ul>

<p>Then, I let the brain do what it does when nobody is watching: wander.</p>

<h1 id="feed-the-blender">Feed the Blender</h1>

<p>Here’s the thing.
Your brain is not a laser.
It’s more like a blender.
What comes out depends on what goes in.
If your inputs are all the same creators, the same feeds, the same opinions, your creative output becomes high-resolution <em>déjà vu</em>.</p>

<p>A smarter strategy is to diversify the ingredients, not increase the volume.</p>

<figure class="badge"><a href="https://a.co/d/hgrcMMN"><img src="/img/badges/good-ideas.png" style="width:140px;max-width:100%;" alt="badge" /></a></figure>

<p><a href="https://en.wikipedia.org/wiki/Steven_Johnson_(author)">Steven Johnson</a> describes breakthrough environments as <em>liquid networks</em>: idea ecosystems where half-formed hunches collide, combine, and evolve.<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup></p>

<p>This matters because creativity is often the product of collisions of new ideas, not contemplation of what’s already known.
If you only consume polished conclusions, you train your mind to reproduce conclusions.
If you collect fragments, you give your mind raw material.
This is how random insights become repeatable.</p>

<p>So build a liquid network on purpose.</p>

<p>For example:</p>

<ul>
  <li>Watch a YouTube video with under 1K views (there are many of them on my <a href="https://youtube.com/@cesarsotovalero">YouTube channel</a>).</li>
  <li>Read an old essay from last century.</li>
  <li>Open a weird old book from another field (history, biology, architecture).</li>
  <li>Talk to someone outside your network and ask what they’re obsessing over.</li>
</ul>

<p>Most of it will be useless (that’s the price of finding the one useful oddity).
But one strange idea can spark ten better ones.</p>

<p><a href="https://www.highagency.com/">George Mack</a> argues for hunting “anti-social proof” inputs (i.e., good ideas that haven’t been blessed by the algorithm yet).<sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">6</a></sup></p>

<p>I believe the post-2022 internet is increasingly saturated with synthetic sameness, so “new” is not the same thing as “novel”.</p>

<p>So the goal is definitely not to consume more; the goal is to consume with rarity as the intent.</p>

<p>Because sometimes the best way to think differently is to stop drinking from the same firehose.</p>

<h1 id="translate-aggressively">Translate Aggressively</h1>

<p>If you want a new idea, change the format.</p>

<ul>
  <li>If your idea is written, read it out loud.</li>
  <li>If your idea is visual, write it as prose.</li>
  <li>If your idea is very technical, explain it to a five-year-old.</li>
</ul>

<p>Each translation forces the brain to rebuild the idea, not just rephrase it.</p>

<p>Cognitive scientists call this <em>conceptual blending</em>:
Combining elements from different mental spaces into a new structure with emergent meaning.<sup id="fnref:8"><a href="#fn:8" class="footnote" rel="footnote" role="doc-noteref">7</a></sup></p>

<p>This is why changing the medium often changes the idea.
With this technique, you’re not just changing representation, you’re changing which associations become available.</p>

<p>So when you’re stuck, don’t think harder.
Think sideways.
Switch the communication channel.
Say it, sketch it, storyboard it, teach it.
Your brain will do the connecting if you give it a new surface to connect on.</p>

<h1 id="walk-away-to-win">Walk Away to Win</h1>

<p>Some solutions only show up after you stop chasing them.</p>

<p>The <em>incubation effect</em> is the benefit you can get by setting a problem aside and returning later with improved insight or originality.<sup id="fnref:9"><a href="#fn:9" class="footnote" rel="footnote" role="doc-noteref">8</a></sup>
Stepping away can help because attention relaxes and alternative associations have room to form.</p>

<p>This connects to the earlier undemanding task finding.
Mind-wandering can facilitate creative incubation, especially after you’ve already done focused work.</p>

<p>So yes, “shower thoughts” are a meme, but the mechanism is real enough to take seriously.</p>

<p>Your best ideas often arrive when your brain stops auditioning for them:</p>

<ul>
  <li>Take short walks.</li>
  <li>Do idle chores.</li>
  <li>Create small pockets of low-stimulation downtime.</li>
</ul>

<p>Then come back and do one concrete next step.</p>

<p>Incubation is not procrastination if you return with intent.</p>

<h1 id="fight-your-evil-twin">Fight Your Evil Twin</h1>

<p>Here’s an exercise that is both effective and mildly ridiculous.</p>

<p>Imagine an evil twin version of you.
Same skills, same knowledge.
Their only job is to outthink you.
What would they try that you’re too cautious to attempt?</p>

<p>This is <em>counterfactual thinking</em> with a costume on.
By blaming risky ideas on your “evil twin,” you remove your ego from the room.
No embarrassment, no fear of looking stupid, just experiments.</p>

<p>George Mack describes this “evil twin” technique as a way to bypass your internal status police.<sup id="fnref:7:1"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">6</a></sup>
And once ideas start flowing, you can switch back to being a responsible adult.</p>

<p>You can try to:</p>

<ul>
  <li>Write down ten things your twin would do.</li>
  <li>Steal the best two.</li>
  <li>Run tiny experiments instead of grand commitments.</li>
</ul>

<p>Creativity hates drama and loves iterations.</p>

<p>If you can make it small, you can make it real.</p>

<h1 id="turn-the-faucet-on">Turn the Faucet On</h1>

<p>People kill creativity by demanding quality too early.</p>

<p>At the start, the output is supposed to be bad, and that’s OK.</p>

<p>This is <em>divergent thinking</em>: generating many possibilities (including obvious and flawed ones) to escape the first ideas your brain serves you.<sup id="fnref:10"><a href="#fn:10" class="footnote" rel="footnote" role="doc-noteref">9</a></sup></p>

<p>Quantity is not the goal, it’s the tool.</p>

<p>A practical execution order is the following:</p>

<ol>
  <li>Volume.</li>
  <li>Selection.</li>
  <li>Refinement.</li>
</ol>

<p>If you judge too early, you stop the faucet before the clean water arrives.</p>

<p>Try this once, seriously.</p>

<p>Set a timer for ten minutes.
Generate twenty ideas and do not evaluate.
Circle the three least stupid ideas.
Improve one of them.</p>

<p>Creativity is less a light bulb and more a faucet, and the first water coming out is always dirty.</p>

<h1 id="constraints-are-rocket-fuel">Constraints Are Rocket Fuel</h1>

<p>Unlimited freedom is overrated.</p>

<p>Constraints do two useful things: they exclude unhelpful options and they focus attention where novelty can happen.</p>

<p>Patricia Stokes argues that constraints can be a reliable source of breakthrough because they force you out of default patterns.<sup id="fnref:11"><a href="#fn:11" class="footnote" rel="footnote" role="doc-noteref">10</a></sup></p>

<p>Limits are not cages, they are search spaces.
Use constraints deliberately:</p>

<ul>
  <li>Write a full outline using only twelve sentences.</li>
  <li>Explain your idea without using your favorite buzzwords.</li>
  <li>Build the concept with one example, not ten.</li>
  <li>Draft in twenty minutes, then stop.</li>
</ul>

<p>Constraints make you choose, and choice is where originality starts.</p>

<p>AI makes it tempting to expand forever.
But expansion is not invention.
When every option is cheap, commitment becomes the differentiator.
Constraints create commitment by design.</p>

<p>If you want a sharper mind, impose sharper edges.</p>

<h1 id="creative-fasting">Creative Fasting</h1>

<p>Your mind can’t have its own thoughts if it’s always renting someone else’s.</p>

<p>Once or twice a year, consider a creative fast.
No news, no social media, no podcasts, minimal algorithmic feeds.
Just enough quiet to notice what your brain does when it isn’t being constantly recruited.
Think of it as intermittent fasting for attention.</p>

<p>People often cite Japan’s <a href="https://en.wikipedia.org/wiki/Sakoku">sakoku</a> policy as a metaphor for creative isolation (even though the historical reality was more nuanced than total isolation).
The point of the metaphor is not that isolation is good, but that silence reveals signal.</p>

<p>There’s also a motivation angle: evaluation pressure and extrinsic incentives can undermine originality.</p>

<p>Teresa Amabile’s classic piece explains how environments accidentally kill creativity by squeezing intrinsic motivation.<sup id="fnref:13"><a href="#fn:13" class="footnote" rel="footnote" role="doc-noteref">11</a></sup>
When nobody’s watching, your inner voice gets louder.</p>

<p>Creative fasting is not an anti-technology mindset.</p>

<p>It’s pro-agency.</p>

<p>It’s how you stop the algorithm from drafting your personality.</p>

<p>If you feel behind when you unplug, good.
That sensation is your dependency showing itself.
Detach long enough to remember what you actually care about.</p>

<h1 id="a-15-minutes-protocol">A 15-Minute Protocol</h1>

<p>Creativity becomes practical when it becomes scheduled:</p>

<ul>
  <li>Write the problem as a question (2 minutes).</li>
  <li>Generate ten bad ideas fast (5 minutes).</li>
  <li>Translate the best idea into a different medium (3 minutes).</li>
  <li>Stop and do nothing useful (5 minutes).</li>
  <li>Return and choose one next action.</li>
</ul>

<p>This way, you’re training two muscles: generation and judgment.
AI can help with generation, but you must own the judgment.</p>

<p>Do this daily for a week.
That way, your brain learns the pattern: create, step away, return, choose.
And over time, the “creative state” will start becoming a practiced state.</p>

<h1 id="final-thoughts">Final Thoughts</h1>

<p>Creativity is not a talent.</p>

<p>It’s a <em>state</em> you can reliably enter by managing your inputs, your boredom, and your fear of bad ideas.</p>

<p>AI can now produce what is statistically likely.
But creativity often means producing what is structurally and fundamentally new.</p>

<p>Margaret Boden calls this <em>transformational creativity</em> (changing the rules of the game, not just the pieces).<sup id="fnref:15"><a href="#fn:15" class="footnote" rel="footnote" role="doc-noteref">12</a></sup></p>

<p>For now, that gap between “likely” and “new” is where humans still have an edge.</p>

<p>So when everyone else asks:</p>

<p>“<em>What can I automate?</em>”</p>

<p>Ask instead:</p>

<p>“<em>What can I create that is worth automating toward?</em>”</p>

<p>Because in an AI-saturated world, relevance does not come from output.
It comes from taste, direction, and the courage to reframe the question again, and again, and again.</p>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>Lawrence Page et al., <a href="https://ilpubs.stanford.edu/422/"><em>The PageRank Citation Ranking: Bringing Order to the Web</em> (Stanford)</a>, and Brin &amp; Page, <a href="https://research.google/pubs/the-anatomy-of-a-large-scale-hypertextual-web-search-engine/"><em>The Anatomy of a Large-Scale Hypertextual Web Search Engine</em> (Google Research page)</a> (PDF mirror: <a href="https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf">PDF</a>). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p>Raichle et al. (2001), <a href="https://www.pnas.org/doi/10.1073/pnas.98.2.676"><em>A Default Mode of Brain Function</em> (PNAS)</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>Buckner, Andrews-Hanna, &amp; Schacter (2008), <a href="https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1196/annals.1440.011"><em>The Brain’s Default Network: Anatomy, Function, and Relevance to Disease</em> (Annals of the NY Academy of Sciences, abstract)</a>. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>Baird et al. (2012), <a href="https://www.cmhp.ucsb.edu/sites/default/files/2018-12/Baird%20et%20al.%20%282012%29%20Inspired%20by%20distraction.pdf"><em>Inspired by Distraction: Mind Wandering Facilitates Creative Incubation</em> (PDF)</a> (PubMed record: <a href="https://pubmed.ncbi.nlm.nih.gov/22941876/">PubMed</a>). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5">
      <p>Steven Johnson, <a href="https://ed.ted.com/best_of_web/EaMi6IZF">“<em>Where Good Ideas Come From</em>” (TED Talk)</a>. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7">
      <p>George Mack, <a href="https://essays.highagency.com/p/how-to-be-creative-without-taking"><em>How to be creative (without taking drugs)</em></a>. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:7:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:8">
      <p>Fauconnier &amp; Turner (2003), <a href="https://tecfa.unige.ch/tecfa/maltt/cofor-1/textes/Fauconnier-Turner03.pdf"><em>Conceptual Blending, Form and Meaning</em> (PDF)</a>. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:9">
      <p>Sio &amp; Ormerod (2009), <a href="https://gwern.net/doc/psychology/writing/2009-sio.pdf"><em>Does incubation enhance problem solving? A meta-analytic review</em> (PDF mirror)</a>. <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:10">
      <p>APA Monitor, <a href="https://www.apa.org/monitor/2022/04/cover-science-creativity">“The science behind creativity” (includes Guilford’s role and divergent thinking context)</a>. <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:11">
      <p>Patricia D. Stokes, <a href="https://api.pageplace.de/preview/DT0400.9780826197856_A26559201/preview-9780826197856_A26559201.pdf"><em>Creativity from Constraints: The Psychology of Breakthrough</em> (preview PDF)</a>. <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:13">
      <p>Teresa Amabile, <a href="https://hbr.org/1998/09/how-to-kill-creativity"><em>How to Kill Creativity</em> (Harvard Business Review)</a> (PDF copy: <a href="https://www.pickardlaws.com/myleadership/myfiles/rtdocs/hbr/HowToKillCreativityHBR0998.pdf">PDF</a>). <a href="#fnref:13" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:15">
      <p>Margaret A. Boden, <a href="https://www.taylorfrancis.com/books/mono/10.4324/9780203508527/creative-mind-margaret-boden"><em>The Creative Mind: Myths and Mechanisms</em> (Taylor &amp; Francis page)</a>. <a href="#fnref:15" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="career" /><summary type="html"><![CDATA[In an AI-saturated world where execution is cheap, the scarce resource is original framing. This blog post explores how to treat creativity as a practical, trainable skill and how to build it with simple daily habits.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2026/2026-01-10/art-ribbon.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2026/2026-01-10/art-ribbon.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">When Answers Get Cheaper, Questions Are Gold</title><link href="https://www.cesarsotovalero.net/blog/when-answers-get-cheaper-questions-are-gold.html" rel="alternate" type="text/html" title="When Answers Get Cheaper, Questions Are Gold" /><published>2025-08-14T00:00:00-07:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/when-answers-get-cheaper-questions-are-gold</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/when-answers-get-cheaper-questions-are-gold.html"><![CDATA[<p>I remember, in my infancy, I had access to only a few old books at home.
The new ones either weren’t available or were too expensive to acquire.
If I wanted to read good content back then, I had to walk to the local library and do it on-site.</p>

<p>Later on at university, I observed that professors were constantly reading research papers.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>
They were respected for the sheer amount of knowledge they had accumulated over a lifetime.
We valued them because their knowledge and expertise were targeted and scarce.</p>

<p>Now, in my 30s, reading books is not as trendy as it once was; knowledge and papers are available from anywhere.
Indeed, getting access to information is cheaper than ever, and even the latest scientific advances move at lightspeed.
A curious 10-year-old today can access cutting-edge tools and learn from the latest scientific discoveries, right from home!</p>

<p>More remarkably, AI has drastically changed the way we value knowledge in the last few years.
Everyone can now prompt an AI assistant (like ChatGPT) for free and get almost instant answers to practically any imaginable question.
I believe AI represents an optimization milestone in the way we search for human knowledge.
The impact could be even greater than what the Google search engine achieved in the early days of the internet.</p>

<p>AI has made answers abundant and <a href="https://en.wikipedia.org/wiki/General_knowledge">general knowledge</a> intuitively accessible.
However, current AI models still struggle to figure out how to apply, synthesize, and adapt existing knowledge to create something genuinely new.
This limitation is a consequence of the way current Large Language Models (LLMs) operate.</p>

<p>With the rise of more advanced reasoning models, the ability to formulate the right questions to drive AI in the right direction is a skill we need to incorporate into our everyday lives.
AI has shifted our focus from trying to find answers to thinking about <em>what</em> and <em>how</em> to formulate the right questions.</p>

<p>I’d argue that as the value of getting answers goes down, the value of formulating new original questions that drive actual action should go up.
So, coming up with original and relevant questions represents a real competitive edge for everyday tasks, i.e., a real differentiator.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></p>

<p>What’s more, I believe that the perceived value of those who have mastered the “art of asking questions” will increase accordingly.
Just look at the rise of professions that rely heavily on questioning skills, such as podcasting and interviewing.
These professionals have mastered the ability to ask good questions and extract value for their audience.</p>

<p>Today, I’m making a conscious effort to get better at the art of asking questions.
This post is about the techniques, patterns, and anti-patterns I’ve learned over time from this practice.
Let’s dive in!</p>

<p>👉 Check out my compilation of <a href="../files/posts/2025/2025-08-14/tech-interview-questions.md">Job Interview Questions</a>.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p>

<h1 id="when-to-ask-questions">When to Ask Questions?</h1>

<!-- Video from Veritasium: "The Blender Question Everyone Gets Wrong"  -->

<aside class="youtube">
        <a href="https://www.youtube.com/watch?v=dFVrncgIvos"><div class="box">
        <img src="https://i.ytimg.com/vi/dFVrncgIvos/mqdefault.jpg" alt="YouTube video #dFVrncgIvos" />
        <div class="play">
          <img src="/img/icons/youtube-play20px.svg" alt="Play Video" class="youtube-logo" />
        </div>
        </div></a>
        <div>The Blender Question Everyone Gets Wrong; 18 February 2025.</div></aside>

<p>Short answer: every time you can!</p>

<p>Back in the day, asking frequent and awkward questions was often considered an annoying practice.
Why?
I’d argue that our inherent human laziness had something to do with it.
Leaving a question hanging around meant somebody had to squeeze their brain to find an answer (i.e., to do mental work).
Otherwise, an uncomfortable void of uncertainty threatened to emerge.</p>

<p>But this is no longer the case.</p>

<p>AI has completely removed the fear of “feeling stupid” for both sides: the asker and the responder.
The more we use AI, the more comfortable we become with the idea of questioning everything.
In meetings or events, asking is a signal of attention, and in an era where AI can respond fast, it also signals the ability to challenge the <em>status quo</em> and express a personal opinion, which are human traits with increasing value.</p>

<p>The more questions you ask, the better prepared you will be for whatever is coming next.</p>

<h1 id="techniques">Techniques</h1>

<aside class="quote">
    <em>“A problem well stated is a problem half-resolved.”</em>
    <cite><br /> ― <a href="https://en.wikipedia.org/wiki/Charles_F._Kettering">CF Kettering</a></cite>
</aside>

<p>Good questions have purpose and “make sense” (in general).<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>
They are unambiguous, give just enough context, and set limits.</p>

<p>Reduce the scope so that the answer feels manageable.
The question itself must strip away everything irrelevant.
As a rule of thumb, if a part of the question makes no difference to the answer, cut it or sharpen it.</p>

<p>There are three techniques to improve question formulation:</p>

<ul>
  <li>Aim, scope, and payoff (ASP)</li>
  <li>Clarity, context, and constraints (3C)</li>
  <li>Falsifiability and measurability (FM)</li>
</ul>

<h2 id="aim-scope-payoff-asp">Aim, Scope, Payoff (ASP)</h2>

<p>Without aim you wander.
Without scope you boil the ocean.
Without payoff you can’t act.
A good question invokes change.
The faster the change, the more effective the question.
Ideally, you want to change something in the next 10 minutes.</p>

<p>So, the core idea of this technique is to state what you’re trying to achieve (aim), how far you’ll look (scope), and what you’ll do with the answer (payoff).</p>

<table>
  <thead>
    <tr>
      <th>Element</th>
      <th>Description</th>
      <th>Template</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Aim</strong></td>
      <td>What you want to achieve.</td>
      <td>“We want to __.”</td>
    </tr>
    <tr>
      <td><strong>Scope</strong></td>
      <td>Boundaries of your inquiry (data, time, users, tools).</td>
      <td>“Within __ (data/time/users/tools).”</td>
    </tr>
    <tr>
      <td><strong>Payoff</strong></td>
      <td>What you’ll do with the answer.</td>
      <td>“So we can __ (decision/action/experiment).”</td>
    </tr>
  </tbody>
</table>

<p>Examples:</p>

<table>
  <thead>
    <tr>
      <th>❌ Before (Vague)</th>
      <th>✅ After (ASP Applied)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Product:</strong> “How do we improve onboarding?”</td>
      <td><strong>Aim:</strong> Increase week-1 activation.<br /><strong>Scope:</strong> Mobile iOS, signup flow only.<br /><strong>Payoff:</strong> Decide which of 3 changes to A/B test.<br /><strong>Question:</strong> “What single change in the iOS signup flow would most increase week-1 activation, and which 3 options should we A/B test first?”</td>
    </tr>
    <tr>
      <td><strong>Research:</strong> “Is our model good?”</td>
      <td><strong>Aim:</strong> Decide to ship or retrain.<br /><strong>Scope:</strong> Fraud model v2, last 30 days.<br /><strong>Payoff:</strong> Go/no-go.<br /><strong>Question:</strong> “Given the last 30 days, does <code class="language-plaintext highlighter-rouge">fraud-model-v2</code> beat <code class="language-plaintext highlighter-rouge">fraud-model-v1</code> on precision by ≥2pp at equal recall, so we ship or retrain?”</td>
    </tr>
    <tr>
      <td><strong>Personal:</strong> “How can I get healthier?”</td>
      <td><strong>Aim:</strong> Improve sleep.<br /><strong>Scope:</strong> Next 14 days, bedtime routine only.<br /><strong>Payoff:</strong> Adopt one habit.<br /><strong>Question:</strong> “Which single bedtime habit should I try for 14 days to raise average sleep by 30 minutes?”</td>
    </tr>
  </tbody>
</table>

<h2 id="clarity-context-constraints-3c">Clarity, Context, Constraints (3C)</h2>

<p>Clarity prevents misreads, context prevents wheel-reinvention, constraints prevent infinite “it depends.”
You want to add just enough context to make the question sound, while still forcing a direct answer to it.</p>

<p>The core idea of the 3C technique is to make the question unambiguous, include the background that matters, and set limits that force trade-offs.</p>

<table>
  <thead>
    <tr>
      <th>Element</th>
      <th>Description</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Clarity</td>
      <td>Define terms, units, and success. Prefer nouns/verbs over adjectives.</td>
      <td>“Write-heavy database workload (~5k writes/sec), 99.9% latency &lt;10ms”</td>
    </tr>
    <tr>
      <td>Context</td>
      <td>Minimum viable backstory: objective, prior attempts, relevant data.</td>
      <td>“Client churn up 3% in SMB last quarter”</td>
    </tr>
    <tr>
      <td>Constraints</td>
      <td>Time, budget, tools, risk tolerance, guardrails.</td>
      <td>“EU-only, managed service, must decide by Friday”</td>
    </tr>
  </tbody>
</table>

<p>Quick rewrites:</p>

<table>
  <thead>
    <tr>
      <th>❌ Before (Vague)</th>
      <th>✅ After (3C Applied)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“What’s the best database?”</td>
      <td>“For a write-heavy workload (~5k writes/sec), 99.9% latency &lt;10ms, EU-only, managed service, what database should we evaluate first and why?”</td>
    </tr>
    <tr>
      <td>“How should I learn LLMs?”</td>
      <td>“With 5 hours/week for 6 weeks and access to GCP, what learning plan gets me from zero to fine-tuning a small model on our support tickets?”</td>
    </tr>
    <tr>
      <td>“Can we migrate quickly?”</td>
      <td>“With a 2-month deadline, 3 engineers, $30k budget, and zero downtime tolerance, can we migrate the existing pipeline from on-prem to BigQuery?”</td>
    </tr>
  </tbody>
</table>

<h2 id="falsifiability-and-measurability-fm">Falsifiability and Measurability (FM)</h2>

<p>Decisions stick when they survive attempts to disprove them.
Good questions drive rapid decisions.
It’s always easier to give a boolean answer when enough context is provided.
Measurement is one of the best ways to provide that context.
With measures, you can turn opinion into actionable feedback.</p>

<p>The core idea of the FM technique is to phrase questions so answers can be tested. If it can’t be wrong, it can’t be right.</p>

<table>
  <thead>
    <tr>
      <th>Pattern</th>
      <th>Description</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hypothesis form</td>
      <td>State belief, expected outcome, magnitude, audience, and timeframe.</td>
      <td>“We believe shortening the signup form from 7 to 4 fields will raise iOS activation by 3–5% for new users within 14 days.”</td>
    </tr>
    <tr>
      <td>Acceptance criteria</td>
      <td>Define clear pass/fail metrics.</td>
      <td>“Ship the model to production if precision ≥ 0.92 with recall ≥ 0.55 on June data.”</td>
    </tr>
    <tr>
      <td>Disconfirmers first</td>
      <td>Identify what would prove the belief false.</td>
      <td>“If activation does not increase by ≥1% after 7 days with 95% CI, abandon the change.”</td>
    </tr>
  </tbody>
</table>

<p>Quick Rewrite:</p>

<table>
  <thead>
    <tr>
      <th>❌ Before (Vague)</th>
      <th>✅ After (FM Applied)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“Will the new pricing work?”</td>
      <td>“In a 50/50 geo split for 21 days, does new pricing increase gross margin per user by ≥4% with no more than a 1pp drop in conversion?”</td>
    </tr>
    <tr>
      <td>“Is the model good enough?”</td>
      <td>“On the August dataset, does the model achieve F1 ≥ 0.82 and AUC ≥ 0.9, with inference latency ≤ 120ms on 95% of requests?”</td>
    </tr>
    <tr>
      <td>“Should we improve onboarding?”</td>
      <td>“For new users signing up in September, does reducing onboarding steps from 5 to 3 increase 7-day retention by ≥6% without lowering NPS?”</td>
    </tr>
  </tbody>
</table>
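<p>The acceptance-criteria pattern above maps naturally to an automated check. Here is a minimal sketch in Python; the metric names and thresholds are illustrative assumptions, not from any real pipeline:</p>

```python
# Hypothetical ship/no-ship gate encoding FM-style acceptance criteria.
# Metric names and thresholds below are illustrative assumptions.

def should_ship(metrics: dict[str, float]) -> bool:
    """Return True only if every acceptance criterion passes."""
    criteria = {
        "precision": lambda v: v >= 0.92,  # ship only if precision >= 0.92
        "recall": lambda v: v >= 0.55,     # ...at recall >= 0.55
    }
    return all(check(metrics[name]) for name, check in criteria.items())

# A falsifiable question yields a boolean answer:
print(should_ship({"precision": 0.93, "recall": 0.57}))  # True
print(should_ship({"precision": 0.95, "recall": 0.40}))  # False
```

<p>Once the question is phrased this way, the “is the model good enough?” debate collapses into a single testable predicate.</p>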

<h1 id="patterns">Patterns</h1>

<p>Reusable question patterns help you think more clearly, spot blind spots, and get better answers, faster.</p>

<p>Here are eight patterns you can borrow and use every day.</p>

<ol>
  <li>
    <p><strong>First-principles</strong> (“What are the primitives?”): Strip away jargon and complexity until you get to the most basic truths. From there, rebuild your understanding. This works because reality is easier to reason about than opinions.</p>
  </li>
  <li>
    <p><strong>Inversion</strong> (“How could this fail?”): Instead of only asking how to succeed, flip the question and look for ways it could go wrong. Anticipating failure is often the fastest way to avoid it.</p>
  </li>
  <li>
    <p><strong>Disconfirming evidence</strong> (“What would prove me wrong?”): Ask what would make your current belief false. This keeps you from cherry-picking facts that only confirm your view.</p>
  </li>
  <li>
    <p><strong>Assumption audit</strong> (“What am I taking for granted?”): List the things you believe are true without checking.
Questioning assumptions often reveals the weakest part of your thinking.</p>
  </li>
  <li>
    <p><strong>Constraint lens</strong> (“What if we had half the time/budget?”): Imagine having fewer resources. Constraints force creativity, sharpen priorities, and surface shortcuts you might have missed.</p>
  </li>
  <li>
    <p><strong>Comparative calibration</strong> (“Compared to what?”): Numbers and claims mean little without context.
Always anchor them to a baseline, a competitor, or a past result.</p>
  </li>
  <li>
    <p><strong>Decomposition</strong> (“Can we split this into 3 parts?”): Break a big problem into smaller, more manageable chunks.
Solving each piece separately is often faster and less overwhelming.</p>
  </li>
  <li>
    <p><strong>Time travel</strong> (pre-mortem/post-mortem): Jump forward in time. In a pre-mortem, imagine the project has failed and ask why. In a post-mortem, imagine it has succeeded and trace back the steps that led there.</p>
  </li>
</ol>

<h1 id="anti-patterns">Anti-Patterns</h1>

<p>Some questions don’t just fail to help but actually backfire.
They actively distort the truth or shut down useful discussion.</p>

<p>Here are four common traps to avoid:</p>

<ol>
  <li>
    <p><strong>Loaded and leading questions:</strong> These questions sneak in assumptions or push the respondent toward a certain answer. For example, “Why is our onboarding so bad?” assumes it <em>is</em> bad. Instead, ask neutrally: “How does our onboarding compare to expectations?”</p>
  </li>
  <li>
    <p><strong>Double-barreled and vague scope:</strong> Two questions in one confuse people and muddy the answer. “How do we improve onboarding and reduce churn?” is really two separate discussions. Similarly, asking with no clear scope, like “What’s the best database?”, leads to endless “it depends.” Split them and define the boundaries.</p>
  </li>
  <li>
    <p><strong>Why-blame vs. how-fix framing:</strong> Asking “Who messed this up?” shifts focus to defending reputations instead of solving problems. “How can we prevent this next time?” keeps the discussion forward-looking and solution-oriented.</p>
  </li>
  <li>
    <p><strong>Scope creep:</strong> Starting with a narrow question and letting it balloon mid-discussion. “Can we fix the login bug?” quietly turns into a debate about rewriting the whole auth system. Park the bigger questions and finish answering the one you actually asked.</p>
  </li>
</ol>

<h1 id="how-to-get-better">How to Get Better</h1>

<aside class="quote">
    <em>“Good questions turn meetings into decisions, incidents into root causes, and hunches into hypotheses.”</em>
</aside>

<p>For me, one of the best ways to learn the art of asking great questions is by listening to the most popular podcasters out there.
Think about it for a second, their entire job revolves around asking the right questions to the right people.</p>

<p>What I usually do is keep track of the best questions they ask (I have a <a href="https://www.notion.com/">Notion</a> template ready for this!).
I note down the exact wording they use, the follow-up questions they make, and even the flow of their thought process when “connecting the dots.”
Over time, this has helped me sharpen my own questioning skills (a lot).</p>

<p>Here are some of my favorite podcasters who have an interview-first and long-content format:</p>

<ol>
  <li><a href="https://www.youtube.com/@TheDiaryofACEO">🎙 The Diary Of A CEO</a></li>
  <li><a href="https://open.spotify.com/show/5qSUyCrk9KR69lEiXbjwXM">🎙 The Tim Ferriss Show</a></li>
  <li><a href="http://www.youtube.com/@lexfridman">🎙 The Lex Fridman Podcast</a></li>
  <li><a href="https://www.youtube.com/@joerogan">🎙 The Joe Rogan Experience</a></li>
</ol>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:3">
      <p>Turns out my childhood instincts were correct: reading <em>was</em> important! <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>We tend to tolerate lower-quality answers from humans than from AI. In many cases, it’s less about the <em>exact</em> answer and more about the mental mechanism you used to get there. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>I read <a href="https://amzn.eu/d/3J43pVi">“Who”</a> by <a href="https://www.linkedin.com/in/drgeoffsmart/">Geoff Smart</a> a few moons ago. It’s based on 1,300 hours of CEO interviews about hiring. NYT bestseller. Great questions and mental models. Recommended! <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1">
      <p>I know, whether something “makes sense” or not is wildly subjective… but let’s at least agree it should make sense to the interviewee. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="career" /><summary type="html"><![CDATA[Knowledge is abundant now. The ability to ask insightful questions is becoming increasingly valuable. I believe that mastering the art of questioning can unlock new opportunities and get the most of the current AI landscape. This post dives into techniques, patterns, and anti-patterns I've learned over time from top questioners. It covers real-world examples and best practices targeting both humans and AI alike.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2025/2025-08-14/artpiece-triangle-in-metro-station_cover.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2025/2025-08-14/artpiece-triangle-in-metro-station_cover.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Consistently Measure Your Consistency to Beat Talent</title><link href="https://www.cesarsotovalero.net/blog/consistently-measure-consistency-to-beat-talent.html" rel="alternate" type="text/html" title="Consistently Measure Your Consistency to Beat Talent" /><published>2025-07-31T00:00:00-07:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/consistently-measure-consistency-to-beat-talent</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/consistently-measure-consistency-to-beat-talent.html"><![CDATA[<p>I’ve never considered myself particularly talented at anything.
Ever.
Since primary school, there were always other kids performing better than me.
They ran faster, got higher grades, looked better, and so on…</p>

<p>If we had followed the predictions, today I should be a complete social loser,<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> and my talented early peers should be doing pretty well.
But a couple of decades later, I’ve noticed that only a few of these kids are actually successful today (according to most common success metrics), and I don’t consider myself a loser, at all.</p>

<p>Why is success non-linear?</p>

<p>I came to the conclusion that I’ve just outworked my <del>now lazy</del> popular early fellows with consistent hard work.
That’s all.
Period.</p>

<aside class="quote">
    <em>“Raw talent is overrated as much as persistence is underrated.”</em>
</aside>

<p>From my experience, I can confidently say that consistency beats talent.
Every, single, time.
No matter how hard it seems, or how good they are.
The question is not “if” you can win but rather “when” you will do it.</p>

<p>The fundamental problem is that not everybody has what it takes to remain consistent over time.
Willpower fades, motivation fluctuates, and discipline alone rarely holds up when things get tough.
And let’s face it, constantly failing at anything really sucks.</p>

<p>So, how exactly can anyone achieve consistency?</p>

<p>I don’t know about you, but for me, it’s pretty straightforward: I consistently measure how consistent I am, and that’s the fuel I need to keep going.</p>

<p>Let me explain.</p>

<h1 id="consistency-beats-talent">Consistency Beats Talent</h1>

<p>You have probably already heard that consistency is what separates top achievers from the rest of us.
This is actually not that obvious.
Indeed, many years ago, I had serious doubts about this claim.</p>

<p>I heavily overestimated raw talent and luck, while underestimating my own capabilities.</p>

<p>Just three quick examples:</p>

<ul>
  <li>I believed it would be impossible to learn English on my own without a proper teacher.</li>
  <li>I thought I wasn’t smart enough to do a PhD because only “brilliant” people with special math thinking could go that path.</li>
  <li>I was convinced I couldn’t get in good physical shape because of my Latin American genes.</li>
</ul>

<p>I was wrong.</p>

<p>These were actually convenient excuses I made up to avoid the hard work.</p>

<p>Over and over, I’ve witnessed that although talent only gives a head start, it is consistency that determines how far you actually go.</p>

<p>Even the most talented person in the room won’t win if they only show up occasionally.
But those who show up over a long period of time, even with average skills, will eventually surpass the talented but flaky competitors.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></p>

<blockquote>
  <p>“The only thing that I see that is distinctly different about me is I’m not afraid to die on a treadmill. I will not be out-worked, period. You might have more talent than me, you might be smarter than me, you might be sexier than me, you might be all of those things you got it on me in nine categories. But if we get on the treadmill together, there’s two things: You’re getting off first, or I’m going to die. It’s really that simple, right? You’re not going to out-work me. It’s such a simple, basic concept. The guy who is willing to hustle the most is going to be the guy that just gets that loose ball. The majority of people who aren’t getting the places they want or aren’t achieving the things that they want in this business is strictly based on hustle. It’s strictly based on being out-worked; it’s strictly based on missing crucial opportunities. I say all the time if you stay ready, you ain’t gotta get ready.”
― <cite><a href="https://www.goodreads.com/quotes/281801-the-only-thing-that-i-see-that-is-distinctly-different">Will Smith</a></cite></p>
</blockquote>

<p>To achieve anything meaningful in life (whether it’s getting in the top 1% at something, building a career, or improving health) consistency is the real differentiator.</p>

<p>It’s all about refusing to be outworked.
But, like anything truly valuable, becoming consistently consistent requires significant effort and dedication.</p>

<h1 id="consistency-is-hard-to-achieve">Consistency is Hard to Achieve</h1>

<p>If I told you that you’d need to write <a href="./blog/all">69 blog posts over 9 years</a> <em>before</em> getting <a href="./blog/i-am-switching-to-python-and-actually-liking-it">one of them</a> <a href="https://news.ycombinator.com/item?id=44579717">featured at the top of Hacker News</a>, would you still write blog posts with that goal in mind?<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></p>

<aside class="quote">
    <em>“Most people overestimate what they can do in one year and underestimate what they can do in ten years.”</em>
    <cite><br /> ― <a href="https://www.goodreads.com/quotes/302999-most-people-overestimate-what-they-can-do-in-one-year">Bill Gates</a></cite>
</aside>

<p>If I told you that you’d need to learn research methods, write papers, and get some published <em>before</em> landing a PhD offer <a href="./blog/seven-reasons-to-go-for-a-phd-in-computer-science">that would change your life</a>, would you still try to become a PhD student?</p>

<p>Or what if you have to make 100 YouTube videos until you get your first 1000 subscribers, would you still try to become a successful YouTuber?<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<p>That’s a lot of work, and most people wouldn’t even begin.</p>

<p>The fact that we have so many distractions and information flows nowadays makes truly consistent people very rare.
Admittedly, it’s very easy to get distracted by all kinds of shiny opportunities (just ask memecoin traders).
That’s why becoming <a href="https://a.co/d/5F3bVnM">indistractable</a> is such a superpower these days.</p>

<p>Modern society is making us addicted to easy gratification; when we don’t get it immediately, we tend to give up.
Most people will never have the consistency to overcome tedious repetition for that long.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">5</a></sup></p>

<p>If you can learn to cultivate consistency in your work, you’ll eventually beat most people.
No matter how much talent, luck, or skill they have, just by being consistent.
But you need to put in the effort.</p>

<p>All the time you’ll spend working hard and failing will make you think you’re a loser.
You’ll feel like a nobody who sucks and isn’t good for anything and should just quit because you’ll never be good enough.
This is how I feel most of the time, btw.
But I’m aware of it.</p>

<p>Consistency is fundamentally a lonely, monotonous pursuit.
Long-term commitment often feels isolating, boring, and repetitive.
Ask any successful entrepreneur, and they’ll confirm they worked quietly, for years, before gaining traction and hitting their marks.</p>

<p>So, consistency is hard.</p>

<p>How do I sustain it?</p>

<p>I trick my brain into it.</p>

<h1 id="consistency-needs-to-be-measured">Consistency Needs to Be Measured</h1>

<p>How can you make sure you’re on the right track?
You measure it, automatically, with dashboards.</p>

<p>I came up with a system to track how consistent I’m being over time.</p>

<p>What I do:</p>

<ol>
  <li>Set <del>easy</del> realistic goals.</li>
  <li>Track my progress (automatically).</li>
  <li>Reward myself when hitting milestones.</li>
</ol>

<h2 id="setting-realistic-goals">Setting Realistic Goals</h2>

<p>Every year, I set <a href="https://www.linkedin.com/posts/cesarsotovalero_just-before-starting-every-new-year-i-do-activity-7280462841755631617-zaoS">a few goals that I want to achieve</a>.
I try to keep them small (no more than 5), so I can actually achieve them.</p>

<p>Once these goals are set, I lower the barrier to starting by making daily targets incredibly small.
So small, in fact, that saying “no” feels unreasonable.</p>

<p>For example:</p>

<ul>
  <li>Want to read more? Set a goal of just 10 pages daily. By year’s end, you’ve read around 10 books.</li>
  <li>Aiming to write? Just commit to 200 words daily, and you’ll have enough for a full-length book in a year.</li>
</ul>

<p>The key isn’t doing a lot in one sitting, but making some little progress every single day.</p>
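<p>The arithmetic behind these tiny daily targets is easy to verify. A quick sketch, where the pages and words per book are rough assumptions:</p>

```python
# Rough estimates (assumptions): an average book has ~350 pages
# and ~70,000 words.
PAGES_PER_BOOK = 350
WORDS_PER_BOOK = 70_000

books_read = 10 * 365 / PAGES_PER_BOOK      # 10 pages a day, every day
books_drafted = 200 * 365 / WORDS_PER_BOOK  # 200 words a day, every day

print(f"~{books_read:.1f} books read per year")     # ~10.4
print(f"~{books_drafted:.1f} books drafted per year")  # ~1.0
```

<p>Ten pages a day really does compound into about ten books a year, and 200 words a day into a full-length draft.</p>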

<h2 id="tracking-the-progress">Tracking the Progress</h2>

<p>Today, I don’t use a habit tracker, spreadsheet, or calendar (tried in the past but didn’t stick).
Instead, I rely on 1) public exposure of my work, and 2) automation systems that keep track of my progress.
This approach allows me to check the status of my projects in a data-driven manner (i.e., with dashboards and analytics).</p>

<figure class="badge"><a href="https://amzn.eu/d/179RT62"><img src="/img/badges/show-your-work.jpg" style="width:140px;max-width:100%;" alt="badge" /></a></figure>

<p>Nobody wants to look like a fool in public, right?
When creating in public, the mere fact of exposure provides a sense of accountability.
For me, it also raises the quality bar quite a bit because then I know that <em>other</em> people will see my work.</p>

<p>Keeping track of my progress helps me see how I’m doing at a glance and make adjustments as needed.
Also, looking at how the numbers change over time signals progress, which makes me feel motivated.<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup>
But I don’t want to get too obsessed and waste time on vanity metrics; this is where automation comes in.</p>

<p>Automating the tracking process is a necessary step forward.
For example, I’ve <a href="https://github.com/cesarsotovalero/cesarsotovalero.github.io/blob/f5cea42089dad8df514aed0a305dfa98b5dab3f8/.github/workflows/compile-resume-to-pdf.yml">automated</a> the way I build <a href="https://www.cesarsotovalero.net/files/CV/cesar-resume.pdf">my resume</a> using \(\LaTeX\) and GitHub Actions.
For that, I <a href="https://github.com/cesarsotovalero/cesarsotovalero.github.io/tree/master/scripts">created Python scripts</a> that scrape my number of subscribers on YouTube, the number of followers on LinkedIn, the number of citations on Google Scholar, etc.
I even keep track of <a href="./races">the timings</a> for the races I run.</p>
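<p>To give a flavor of what those scripts do, here’s a minimal, illustrative sketch (not my actual code; the <code class="language-plaintext highlighter-rouge">parse_count</code> name is hypothetical) of the parsing step that turns display counts like “12.3K” into plain integers:</p>

```python
import re


def parse_count(text: str) -> int:
    """Convert a display count like '1.2K' or '3.4M' into an integer."""
    match = re.fullmatch(r"([\d.]+)\s*([KM]?)", text.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"Unrecognized count: {text!r}")
    number, suffix = float(match.group(1)), match.group(2).upper()
    multiplier = {"": 1, "K": 1_000, "M": 1_000_000}[suffix]
    return int(number * multiplier)


print(parse_count("12.3K"))  # 12300
```

Once the numbers are plain integers, dumping them into a JSON file for later plotting is trivial.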

<p>This allows me to see my progress over time, and it also gives me a sense of accountability.
At the end of each week or month, I look at my numbers to see if I hit my targets, and where I fell short.</p>

<h2 id="celebrating-achievements">Celebrating Achievements</h2>

<p>Celebrating is important.
Every time I hit a milestone, I reward myself by doing something I enjoy but know I shouldn’t do too often (like spending money at a restaurant or buying a new gadget).</p>

<p>My PhD supervisor was very serious about celebrating successes.
Every time we submitted a new paper, we always celebrated the submission.
Yes, we celebrated not the acceptance, but <em>the submission</em>.
This is because he knew that the process of writing a paper is long and tedious.
He wanted to acknowledge the effort it took us to get to that point.</p>

<p>So, don’t wait for the big wins or the numbers to celebrate.
Celebrate every time a planned task is done.</p>

<figure class="jb_picture">
  



<img width="60%" style="border: 1px solid #808080;" src="/assets/resized/the-cult-of-done-manifesto-640x828.webp" alt="The Cult of Done Manifesto" data-srcset="/assets/resized/the-cult-of-done-manifesto-640x828.webp 640w,/assets/resized/the-cult-of-done-manifesto-768x994.webp 768w,/assets/resized/the-cult-of-done-manifesto-1024x1325.webp 1024w,/assets/resized/the-cult-of-done-manifesto-1366x1768.webp 1366w,/assets/resized/the-cult-of-done-manifesto-1600x2071.webp 1600w,/assets/resized/the-cult-of-done-manifesto-1920x2485.webp 1920w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    &#169; The Cult of Done Manifesto. <a href="https://medium.com/@bre/the-cult-of-done-manifesto-724ca1c2ff13">Source</a>.
  </figcaption>
</figure>

<h1 id="final-thoughts">Final Thoughts</h1>

<aside class="quote">
    <em>“Happiness, for me, is nothing more than consistently achieving my goals.”</em>
</aside>

<p>Consistency is the ultimate differentiator.
But only a few people have it because we all have finite energy, time, and willpower.
If you want to beat talent, luck, or even your own past self, start measuring your consistency.
Put numbers on it.
Because what gets measured gets improved.
Rely on systems, and keep in mind that everything is difficult before it becomes easy.
Reward yourself for the small wins, and don’t wait for the big ones.
Try to surround yourself with consistent people.
And no matter what, just don’t give up.</p>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:5">
      <p>Or something worse, like somebody who spends his nights on the sofa watching <a href="https://en.wikipedia.org/wiki/Love_Is_Blind_(TV_series)">“Love Is Blind”</a> on Netflix. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>The idea that simple repetition alone will not bring you very far doesn’t hold true today: AI solves the problem of lacking smart guidance and deliberate learning. Anyone can now get the right guidance to learn the right things. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>BTW, I believe that writing is one of the toughest things to do consistently. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1">
      <p>I’m on my way to becoming a successful YouTuber (i.e., &gt; 100K subs). I still have a long way to go. If you want to follow my journey, you can <a href="https://www.youtube.com/@cesarsotovalero">subscribe to my channel</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p>I have a friend who is a professional athlete, and he told me that the hardest part of his job is to stay consistent over time. He has to train every day, even when he doesn’t feel like it, and that takes a lot of mental strength. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6">
      <p>Looking at numbers and seeing them grow over time is somewhat addictive, just like how video game characters <a href="./blog/building-and-leveling-up-a-computer-scientist-resume">level up over time</a>. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="career" /><summary type="html"><![CDATA[I've never considered myself particularly talented at anything. Ever. But I have consistently outworked my peers, and I believe that anyone can outperform their toughest competitors by just being consistent. So, consistency beats talent every time. The secret? Measure your consistency, not just your results. In this post, I share why tracking your consistency is the real key to outperforming even the most talented people.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2025/2025-07-31/the-iron-child_cover.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2025/2025-07-31/the-iron-child_cover.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">I’m Switching to Python and Actually Liking It</title><link href="https://www.cesarsotovalero.net/blog/i-am-switching-to-python-and-actually-liking-it.html" rel="alternate" type="text/html" title="I’m Switching to Python and Actually Liking It" /><published>2025-07-15T00:00:00-07:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/i-am-switching-to-python-and-actually-liking-it</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/i-am-switching-to-python-and-actually-liking-it.html"><![CDATA[<aside class="youtube">
        <a href="https://www.youtube.com/watch?v=SIVhrtQzknQ"><div class="box">
        <img src="https://i.ytimg.com/vi/SIVhrtQzknQ/mqdefault.jpg" alt="YouTube video #SIVhrtQzknQ" />
        <div class="play">
          <img src="/img/icons/youtube-play20px.svg" alt="Play Video" class="youtube-logo" />
        </div>
        </div></a>
        <div>From Ideas to APIs: Delivering Fast with Modern Python; 9 January 2026.</div></aside>

<p>I started to code more in <a href="https://www.python.org/">Python</a> around 6 months ago.
Why?
Because of AI, obviously.
It’s clear (to me) that big <del>money</del> opportunities are all over AI these days.
And guess what’s the <em>de facto</em> programming language for AI?
Yep, that sneaky one.</p>

<p>I had used Python before, but only for small scripts.
For example, <a href="https://github.com/cesarsotovalero/cesarsotovalero.github.io/blob/1fb2efe0577719a72fdf7d5bdf2a8d4d51ee58c5/scripts/fetch_all_youtube_videos.py">this script</a> scrapes metadata from all videos on <a href="https://www.youtube.com/channel/UCR4rI98w6-MqYoCS6jR9LGg">my YouTube channel</a>.
The metadata is dumped as <a href="https://github.com/cesarsotovalero/cesarsotovalero.github.io/blob/1fb2efe0577719a72fdf7d5bdf2a8d4d51ee58c5/_data/youtube-videos.json">a JSON file</a> that I use to nicely display statistics of the videos <a href="https://www.cesarsotovalero.net/youtube">on this static page</a>.
As you can <a href="https://github.com/cesarsotovalero/cesarsotovalero.github.io/blob/1fb2efe0577719a72fdf7d5bdf2a8d4d51ee58c5/.github/workflows/update-youtube-videos.yml">see here</a>, this little script runs in solo mode every Monday via GitHub Actions.
Doing this kind of thing in Python is just way more convenient than, say, using Batch.
Not only because the syntax is more human-friendly, but also because a Python interpreter comes preinstalled on most Unix distros.
Isn’t that cool?</p>

<p>So yeah, Python is powerful, and it couples very well with the now ubiquitous <a href="https://code.visualstudio.com/">VSCode</a> editor.
But I didn’t take it seriously until recently.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> It was only when I wanted to build AI applications (RAG, agents, GenAI tools, etc.) for the “real world” that I realized that, whether you like it or not, Python is the language of choice for those matters.</p>

<p>So I decided to give it a serious try, and to my great surprise, I’ve found that Python, and everything around it, has really improved a lot over the last decade.</p>

<p>Here are just three examples:</p>

<ol>
  <li>Python has grown a very complete ecosystem of libraries and tools for processing and analyzing data.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></li>
  <li>Python has gotten faster with optimized static compilers like <a href="https://cython.org/">Cython</a>.</li>
  <li>Python has done a good job of hiding its legacy ugliness (such as <code class="language-plaintext highlighter-rouge">__init__</code>, <code class="language-plaintext highlighter-rouge">__new__</code>, and similar aberrations), sweetening its syntax to accommodate developers <del>with good taste</del>.</li>
</ol>

<p>Thanks to this and many other things, I’m now feeling a particular joy for the language.</p>

<p>However, during this time, I’ve found that there’s still a big gap between using Python for “production-ready”<sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> apps vs. the usual Jupyter notebook or script-based workflow.</p>

<p>So in this post, I share the tools, libraries, configs, and other integrations that bring me joy, and that I now use for building my Python applications.</p>

<p>⚠️ This post is highly biased toward the tools I personally use today, and if you think I’m missing some gem, please let me/us know (preferably in the comment section below).</p>

<p><strong>NOTE:</strong> Somehow this article got <a href="https://news.ycombinator.com/item?id=44579717">680+ comments</a> on Hacker News (just another proof that you never really know).</p>

<h1 id="project-structure">Project Structure</h1>

<p>I prefer to use a <a href="https://en.wikipedia.org/wiki/Monorepo">monorepo</a> structure (backend and frontend) for my Python projects.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<p>Why?</p>

<ol>
  <li>Because of my bad memory: I don’t like code parts scattered across multiple repositories (it’s definitely not search-friendly).</li>
  <li>Because having multiple repos is mostly unnecessary: I’m just one guy, and I believe that if a project grows to the point that it needs to be split into multiple repositories, then it’s a sign of over-engineering.</li>
  <li>Because I’m lazy: I like to keep things as simple as possible, compile, test, containerize, and deploy from a single location.<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup></li>
</ol>

<p>I would like to have a tool that generates the project structure for me, but I haven’t found one that fits me yet.
In the past, I used <a href="https://cookiecutter-data-science.drivendata.org/">CCDS</a>, a project initialization tool mostly for Data Science projects.
It’s very good, but it’s not targeting full-stack developers as its core users.</p>

<p>Here’s the typical structure of a project with a frontend-backend architecture (I’ll go through each subpart later in this post):</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">project/
│
├── .github/ <span class="c"># GitHub Actions workflows for CI/CD pipelines</span>
│ ├── workflows/ <span class="c"># Directory containing YAML files for automated workflows</span>
│ └── dependabot.yml <span class="c"># Configuration for Dependabot to manage dependencies</span>
│
├── .vscode/ <span class="c"># VSCode configuration for the project</span>
│ ├── launch.json <span class="c"># Debugging configurations for VSCode</span>
│ └── settings.json <span class="c"># Project-specific settings for VSCode</span>
│
├── docs/ <span class="c"># Website and docs (a static SPA with MkDocs)</span>
│
├── project-api/ <span class="c"># Backend API for handling business logic and heavy processing</span>
│ ├── data/ <span class="c"># Directory for storing datasets or other static files</span>
│ ├── notebooks/ <span class="c"># Jupyter notebooks for quick (and dirty) experimentation and prototyping</span>
│ ├── tools/ <span class="c"># Utility scripts and tools for development or deployment</span>
│ ├── src/ <span class="c"># Source code for the backend application</span>
│ │ ├── app/ <span class="c"># Main application code</span>
│ │ └── tests/ <span class="c"># Unit tests for the backend</span>
│ │
│ ├── .dockerignore <span class="c"># Specifies files to exclude from Docker builds</span>
│ ├── .python-version <span class="c"># Python version specification for pyenv</span>
│ ├── Dockerfile <span class="c"># Docker configuration for containerizing the backend</span>
│ ├── Makefile <span class="c"># Automation tasks for building, testing, and deploying</span>
│ ├── pyproject.toml <span class="c"># Python project configuration file</span>
│ ├── README.md <span class="c"># Documentation for the backend API</span>
│ └── uv.lock <span class="c"># Lock file for dependencies managed by UV</span>
│
├── project-ui/ <span class="c"># Frontend UI for the project (Next.js, React, etc.)</span>
│
├── .gitignore <span class="c"># Global Git ignore file for the repository</span>
├── .pre-commit-config.yaml <span class="c"># Configuration for pre-commit hooks</span>
├── CONTRIBUTING.md <span class="c"># Guidelines for contributing to the project</span>
├── docker-compose.yml <span class="c"># Docker Compose configuration for multi-container setups</span>
├── LICENSE <span class="c"># License information for the project (I always choose MIT)</span>
├── Makefile <span class="c"># Automation tasks for building, testing, and deploying</span>
└── README.md <span class="c"># Main documentation for the project (main features, installation, and usage)</span></code></pre></figure>

<p>My <code class="language-plaintext highlighter-rouge">project</code> is the root directory and the name of my GitHub repo.
I like short names for projects, ideally less than 10 characters. No <code class="language-plaintext highlighter-rouge">snake_case</code>; separating words with hyphens is fine with me.
Note that the project should be self-contained, meaning it includes documentation, build/deployment infrastructure, and any other necessary files to run it standalone.</p>

<p>It’s important not to do any heavy data processing steps in the <code class="language-plaintext highlighter-rouge">project-ui</code>, as I opted to separate frontend logic from backend responsibilities.
Instead, I choose to make HTTP requests to the <code class="language-plaintext highlighter-rouge">project-api</code> server that contains the Python code.
This way, we keep the browser application light while delegating the heavy lifting and business logic to the server.</p>

<p>There’s an <code class="language-plaintext highlighter-rouge">__init__.py</code> file in <code class="language-plaintext highlighter-rouge">project-api/src/app</code> to indicate that <code class="language-plaintext highlighter-rouge">app</code> is a Python package (so it can be imported from other modules).</p>

<h1 id="python-toolbox">Python Toolbox</h1>

<h2 id="uv">uv</h2>

<p>I use <a href="https://github.com/astral-sh/uv">uv</a> as my Python package manager and build tool.
It’s all I need to install and manage dependencies.</p>

<p>Here are the core commands to set it up:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
</pre></td><td class="code"><pre><span class="c"># Install uv globally if not already installed</span>

curl <span class="nt">-LsSf</span> https://astral.sh/uv/install.sh | sh

<span class="c"># Initialize a new project (adds .gitignore, .python-version, pyproject.toml, etc.)</span>

uv init project-api

<span class="c"># Add some dependencies into the project and update pyproject.toml</span>

uv add <span class="nt">--dev</span> pytest ruff pre-commit mkdocs gitleaks fastapi pydantic

<span class="c"># Update the lock file with the latest versions of the dependencies</span>

<span class="c"># (this will also create a .venv in your project directory if it doesn't exist)</span>

uv <span class="nb">sync</span>

<span class="c"># (Opt‑in) explicitly create or recreate a venv in a custom path:</span>

uv venv .venv

<span class="c"># Activate the .venv yourself</span>

<span class="c"># ──────────────────────────────────────────────</span>

<span class="c"># On macOS/Linux:</span>

<span class="nb">source</span> .venv/bin/activate

<span class="c"># On Windows (PowerShell):</span>

.<span class="se">\.</span>venv<span class="se">\S</span>cripts<span class="se">\A</span>ctivate.ps1

<span class="c"># On Windows (cmd.exe):</span>

.<span class="se">\.</span>venv<span class="se">\S</span>cripts<span class="se">\a</span>ctivate.bat
</pre></td></tr></tbody></table></code></pre></figure>

<p>Note that the most important file for <code class="language-plaintext highlighter-rouge">uv</code> is <code class="language-plaintext highlighter-rouge">pyproject.toml</code>.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">6</a></sup>
This file <a href="https://packaging.python.org/en/latest/guides/writing-pyproject-toml/">contains</a> metadata and the list of dependencies required to build and run the project.</p>
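<p>For reference, a minimal <code class="language-plaintext highlighter-rouge">pyproject.toml</code> for the <code class="language-plaintext highlighter-rouge">project-api</code> sketched above might look like the fragment below; names and version pins are illustrative, not prescriptive:</p>

```toml
[project]
name = "project-api"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "fastapi",
    "pydantic",
]

# Dev-only tools land here when added with `uv add --dev`
[dependency-groups]
dev = [
    "pytest",
    "ruff",
]
```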

<h2 id="ruff">ruff</h2>

<p>I really like <a href="https://github.com/astral-sh/ruff">ruff</a>.
It’s a super-fast Python linter and code formatter, designed to help lazy developers like me keep our codebases clean and maintainable.
Ruff combines <code class="language-plaintext highlighter-rouge">isort</code>, <code class="language-plaintext highlighter-rouge">flake8</code>, <code class="language-plaintext highlighter-rouge">autoflake</code>, and similar tools into a single command-line interface:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="c"># Lint all files in `/path/to/code` (and any subdirectories)</span>

ruff check path/to/code/

<span class="c"># Format all files in `/path/to/code` (and any subdirectories)</span>

ruff format path/to/code/
</pre></td></tr></tbody></table></code></pre></figure>

<p>Ruff supports the <a href="https://pep8.org/">PEP 8</a> style guide out of the box.</p>

<h2 id="ty">ty</h2>

<p><a href="https://github.com/astral-sh/ty">ty</a> is a type checker for Python.
It pairs well with <a href="https://docs.python.org/3/library/typing">typing</a>, the standard library module for adding type hints.
Type hints really help me catch type errors early in the development process. I don’t mind writing the extra annotations; in fact, I prefer them if they improve code quality and reduce the likelihood of runtime errors.</p>
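<p>As a tiny illustration (the function is made up for this example), a type checker flags the mismatched call below before the code ever runs:</p>

```python
def total_price(price: float, quantity: int) -> float:
    """Compute a line-item total; annotations let a type checker verify callers."""
    return price * quantity


total_price(9.99, 3)        # OK: matches the annotated signature
# total_price(9.99, "3")    # a type checker reports: str is not assignable to int
```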

<p><strong>NOTE:</strong> At the time of writing, <code class="language-plaintext highlighter-rouge">ty</code> is still in early development by Astral (the same company behind <code class="language-plaintext highlighter-rouge">uv</code> and <code class="language-plaintext highlighter-rouge">ruff</code>), but I’ve been using it and haven’t found any noticeable flaws so far.</p>

<h2 id="pytest">pytest</h2>

<p><a href="https://docs.pytest.org/en/stable/">pytest</a> is the most popular testing library for Python.
Writing simple and scalable test cases with it is just super easy.
It supports fixtures, parameterized tests, and has a rich ecosystem of plugins.
Just create a file named <code class="language-plaintext highlighter-rouge">test_&lt;unit_or_module&gt;.py</code> in <code class="language-plaintext highlighter-rouge">project-api/src/app/tests/</code>, and run:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">uv run pytest</code></pre></figure>

<p>That’s it!</p>
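<p>As a (hypothetical) example, a file like <code class="language-plaintext highlighter-rouge">test_slugify.py</code> only needs plain functions whose names start with <code class="language-plaintext highlighter-rouge">test_</code>; pytest discovers and runs them automatically:</p>

```python
def slugify(title: str) -> str:
    """Turn a post title into a URL-friendly slug (example function under test)."""
    return "-".join(title.lower().split())


def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


def test_slugify_collapses_whitespace():
    assert slugify("  Spaces   Everywhere ") == "spaces-everywhere"
```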

<h2 id="pydantic">Pydantic</h2>

<p><a href="https://docs.pydantic.dev/">Pydantic</a> is a data validation and settings management library for Python.
It helps manage all kinds of configuration settings, such as API keys, database connection details, or model parameters (hardcoding these values is a very bad practice, btw).</p>

<p>In particular, <a href="https://docs.pydantic.dev/latest/concepts/pydantic_settings/">Pydantic Settings</a> allows you to define application configurations using Pydantic models.
It can automatically load settings from environment variables or special <code class="language-plaintext highlighter-rouge">.env</code> files, validate their types, and make them easily accessible in your code.</p>

<p>Here’s an illustrative example:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code"><pre><span class="kn">from</span> <span class="n">pydantic</span> <span class="kn">import</span> <span class="n">BaseSettings</span>

<span class="k">class</span> <span class="nc">Settings</span><span class="p">(</span><span class="n">BaseSettings</span><span class="p">):</span>
<span class="n">api_key</span><span class="p">:</span> <span class="nb">str</span>
<span class="n">db_url</span><span class="p">:</span> <span class="nb">str</span>

    <span class="k">class</span> <span class="nc">Config</span><span class="p">:</span>
        <span class="n">env_file</span> <span class="o">=</span> <span class="sh">"</span><span class="s">.env</span><span class="sh">"</span>

<span class="n">settings</span> <span class="o">=</span> <span class="nc">Settings</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>Now, when you run this code, Pydantic will automatically load the values of <code class="language-plaintext highlighter-rouge">api_key</code> and <code class="language-plaintext highlighter-rouge">db_url</code> from the <code class="language-plaintext highlighter-rouge">.env</code> file or environment variables.
These values will be accessible and validated according to the types defined in the <code class="language-plaintext highlighter-rouge">Settings</code> model.
Just great!</p>

<h2 id="mkdocs">MkDocs</h2>

<p>I use <a href="https://www.mkdocs.org/">MkDocs</a> for documentation and static generation of the website for the project.<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">7</a></sup>
I’m not a designer, so I prefer to just copy an aesthetically pleasing design from another similar open-source project and make some simple modifications to the CSS (like changing fonts and colors).</p>
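<p>For context, a minimal <code class="language-plaintext highlighter-rouge">mkdocs.yml</code> for the <code class="language-plaintext highlighter-rouge">docs/</code> directory could look like the sketch below; the page names are placeholders, and the <code class="language-plaintext highlighter-rouge">material</code> theme is an assumption (it ships in the separate <code class="language-plaintext highlighter-rouge">mkdocs-material</code> package):</p>

```yaml
site_name: project
docs_dir: docs
nav:
  - Home: index.md
  - API: api.md
theme:
  name: material  # from the separate mkdocs-material package
```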

<h2 id="fastapi">FastAPI</h2>

<p>I use <a href="https://fastapi.tiangolo.com/">FastAPI</a> for building APIs.
It has been a game changer for me: it makes creating RESTful APIs with automatic validation, serialization, and documentation easy.
FastAPI is built on top of Starlette and Pydantic, so it provides excellent performance and type safety, and it integrates seamlessly with Pydantic models for data validation.</p>

<h2 id="dataclasses">Dataclasses</h2>

<p><a href="https://docs.python.org/3/library/dataclasses">Dataclasses</a> are not a library but a built-in Python feature for defining classes that are primarily used to store data.
They offer a simple syntax for creating classes that automatically generate special methods like <code class="language-plaintext highlighter-rouge">__init__()</code>, <code class="language-plaintext highlighter-rouge">__repr__()</code>, and <code class="language-plaintext highlighter-rouge">__eq__()</code>.</p>

<p>This greatly reduces boilerplate when creating data containers.</p>

<p>Here’s an example:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="kn">from</span> <span class="n">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>

<span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">Point</span><span class="p">:</span>
    <span class="n">x</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">y</span><span class="p">:</span> <span class="nb">int</span>

<span class="n">p</span> <span class="o">=</span> <span class="nc">Point</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="c1"># Output: Point(x=1, y=2)</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>So goodbye boilerplate and cryptic code!</p>
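<p>Two more dataclass niceties I find handy, shown here as a small sketch (the <code class="language-plaintext highlighter-rouge">Playlist</code> class is made up): <code class="language-plaintext highlighter-rouge">field(default_factory=...)</code> for safe mutable defaults, and <code class="language-plaintext highlighter-rouge">asdict()</code> for quick serialization:</p>

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Playlist:
    name: str
    videos: list[str] = field(default_factory=list)  # safe mutable default


pl = Playlist("tutorials")
pl.videos.append("intro.mp4")
print(asdict(pl))  # {'name': 'tutorials', 'videos': ['intro.mp4']}
```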

<h1 id="version-control">Version Control</h1>

<h2 id="github-actions">GitHub Actions</h2>

<p>I’m a fanboy of <a href="https://github.com/features/actions">GitHub Actions</a>, especially for CI across different OSs.
I recommend using it for both API and UI pipelines.</p>

<p>A typical workflow for <code class="language-plaintext highlighter-rouge">project-api</code> looks like this:</p>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code"><pre>name: CI-API

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Build Docker image
        run: docker build -t project-api:ci ./project-api
      - name: Run tests
        run: docker run --rm project-api:ci pytest
</pre></td></tr></tbody></table></code></pre></figure>

<p>Note that this workflow uses Docker to run the tests in an isolated environment.<sup id="fnref:8"><a href="#fn:8" class="footnote" rel="footnote" role="doc-noteref">8</a></sup>
You can change the OS by setting the <code class="language-plaintext highlighter-rouge">runs-on</code> parameter to <code class="language-plaintext highlighter-rouge">windows-latest</code> or <code class="language-plaintext highlighter-rouge">macos-latest</code>.</p>

<h2 id="dependabot">Dependabot</h2>

<p>Handling dependencies is a pain, but <a href="https://dependabot.com/">Dependabot</a> makes it easier.
It automatically checks for outdated dependencies and creates pull requests to update them.</p>

<p>Here’s a sample configuration for Dependabot in the <code class="language-plaintext highlighter-rouge">.github/dependabot.yml</code> file:</p>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre>version: 2
updates:
  - package-ecosystem: "uv"
    directory: "/"
    schedule:
      interval: "weekly"
</pre></td></tr></tbody></table></code></pre></figure>

<h2 id="gitleaks">Gitleaks</h2>

<p>If there’s something that could hurt our reputation, it’s committing sensitive information, like API keys or passwords, directly to a repository.
Fortunately, <a href="https://github.com/gitleaks/gitleaks">Gitleaks</a> helps prevent this from happening.
There’s just no reason not to use it.</p>

<h2 id="pre-commit-hooks">Pre-commit Hooks</h2>

<p>I use <a href="https://pre-commit.com/">pre-commit</a> to run checks and format code before committing.
It helps ensure that the code is always in a good state and follows the project’s coding standards.
For example, I use it to run <a href="https://github.com/astral-sh/ruff-pre-commit">ruff-pre-commit</a> and <code class="language-plaintext highlighter-rouge">gitleaks</code> before committing my code.</p>

<p>Here’s a sample <code class="language-plaintext highlighter-rouge">.pre-commit-config.yaml</code> file that I use:</p>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.12.3  # Ruff version.
    hooks:
      - id: ruff-check  # Run the linter.
        args: [--fix]
      - id: ruff-format  # Run the formatter.
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.27.2
    hooks:
      - id: gitleaks
</pre></td></tr></tbody></table></code></pre></figure>

<h1 id="infrastructure-management">Infrastructure Management</h1>

<h2 id="make">Make</h2>

<p><a href="https://www.gnu.org/software/make/">Make</a> is a Swiss Army knife, a classic utility for automating tasks.
I use it to create simple shortcuts for common development commands.
Instead of remembering and typing out long CLI incantations to run tests, build Docker images, or start services, I define these tasks in a <code class="language-plaintext highlighter-rouge">Makefile</code>.
Then I just run commands like <code class="language-plaintext highlighter-rouge">make test</code> or <code class="language-plaintext highlighter-rouge">make infrastructure-up</code>.</p>

<p>As you might have noticed, there is a <code class="language-plaintext highlighter-rouge">Makefile</code> in both the <code class="language-plaintext highlighter-rouge">project-api</code> and the global <code class="language-plaintext highlighter-rouge">project</code> directories:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">project/project-api/Makefile</code>: For linting, testing, and running the API.</li>
  <li><code class="language-plaintext highlighter-rouge">project/Makefile</code>: For building and running the infrastructure (via <code class="language-plaintext highlighter-rouge">docker-compose</code>).</li>
</ol>

<p>Here’s an extremely simple example of the <code class="language-plaintext highlighter-rouge">project-api</code> Makefile:</p>

<figure class="highlight"><pre><code class="language-plaintext" data-lang="plaintext"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>DIR := . # project/project-api/Makefile

test:
	uv run pytest

format-fix:
	uv run ruff format $(DIR)
	uv run ruff check --select I --fix

lint-fix:
	uv run ruff check --fix
</pre></td></tr></tbody></table></code></pre></figure>

<p>Now, if I want to run the tests, I just run <code class="language-plaintext highlighter-rouge">make test</code>, and it executes <code class="language-plaintext highlighter-rouge">uv run pytest</code> in the current directory.</p>

<p>For the global project, I use the following <code class="language-plaintext highlighter-rouge">Makefile</code>:</p>

<figure class="highlight"><pre><code class="language-plaintext" data-lang="plaintext"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>infrastructure-build:
	docker compose build

infrastructure-up:
	docker compose up --build -d

infrastructure-stop:
	docker compose stop
</pre></td></tr></tbody></table></code></pre></figure>

<p><code class="language-plaintext highlighter-rouge">make</code> is a powerful tool that can help you automate almost anything in your development workflow.
Although the examples above are very simple, you can add more complex tasks as your workflow grows.</p>

<h2 id="docker">Docker</h2>

<p><a href="https://www.docker.com/">Docker</a> is a tool that packages your application and its dependencies into a container, including everything it needs to run: libraries, system tools, code, and runtime.
When working locally, I use <a href="https://docs.docker.com/compose/">Docker Compose</a> to connect all the containers on the same network.
Just as Docker encapsulates a single application, Docker Compose encapsulates the whole application stack and isolates it from the rest of your local environment.</p>

<p>To fully grasp this concept, let’s take a look at a simple <code class="language-plaintext highlighter-rouge">docker-compose.yml</code> file:</p>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="code"><pre><span class="na">version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">3.8'</span>
<span class="na">services</span><span class="pi">:</span>
  <span class="na">project-api</span><span class="pi">:</span>
    <span class="na">build</span><span class="pi">:</span>
      <span class="na">context</span><span class="pi">:</span> <span class="s">./project-api</span>
      <span class="na">dockerfile</span><span class="pi">:</span> <span class="s">Dockerfile</span>
    <span class="na">ports</span><span class="pi">:</span> <span class="pi">[</span><span class="s">"8000:8000"</span><span class="pi">]</span>
    <span class="na">volumes</span><span class="pi">:</span> <span class="pi">[</span><span class="s">"./project-api:/app"</span><span class="pi">]</span>
    <span class="na">environment</span><span class="pi">:</span> <span class="pi">[</span><span class="s">ENV_VAR=value</span><span class="pi">]</span>
    <span class="na">networks</span><span class="pi">:</span> <span class="pi">[</span><span class="s">project-network</span><span class="pi">]</span>

  <span class="na">project-ui</span><span class="pi">:</span>
    <span class="na">build</span><span class="pi">:</span>
      <span class="na">context</span><span class="pi">:</span> <span class="s">./project-ui</span>
      <span class="na">dockerfile</span><span class="pi">:</span> <span class="s">Dockerfile</span>
    <span class="na">ports</span><span class="pi">:</span> <span class="pi">[</span><span class="s">"3000:3000"</span><span class="pi">]</span>
    <span class="na">networks</span><span class="pi">:</span> <span class="pi">[</span><span class="s">project-network</span><span class="pi">]</span>

<span class="na">networks</span><span class="pi">:</span>
  <span class="na">project-network</span><span class="pi">:</span>
    <span class="na">driver</span><span class="pi">:</span> <span class="s">bridge</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>In this file, we define two services: <code class="language-plaintext highlighter-rouge">project-api</code> and <code class="language-plaintext highlighter-rouge">project-ui</code>.
Each service has its own build context (<code class="language-plaintext highlighter-rouge">Dockerfile</code>), ports, volumes, and environment variables.</p>

<p>Here’s a sample <code class="language-plaintext highlighter-rouge">Dockerfile</code> for the <code class="language-plaintext highlighter-rouge">project-api</code> service:</p>

<figure class="highlight"><pre><code class="language-dockerfile" data-lang="dockerfile"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="code"><pre><span class="k">FROM</span><span class="s"> python:3.11-slim</span>

<span class="c"># Install uv, the package manager</span>

<span class="k">COPY</span><span class="s"> --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/</span>

<span class="k">WORKDIR</span><span class="s"> /app</span>

<span class="k">COPY</span><span class="s"> uv.lock pyproject.toml README.md ./</span>
<span class="k">RUN </span>uv <span class="nb">sync</span> <span class="nt">--frozen</span> <span class="nt">--no-cache</span>

<span class="c"># Bring in the actual application code</span>

<span class="k">COPY</span><span class="s"> src/app app/</span>
<span class="k">COPY</span><span class="s"> tools tools/</span>

<span class="c"># Define a command to run the application</span>

<span class="k">CMD</span><span class="s"> ["/app/.venv/bin/fastapi", "run", "project/infrastructure/api.py", "--port", "8000", "--host", "0.0.0.0"]</span>
</pre></td></tr></tbody></table></code></pre></figure>

<p>As you can see, the Dockerfile starts from a Python base image, installs dependencies, copies the project files, and defines the command to run the FastAPI application.</p>

<p>This way, you can run the entire application stack with a single command:</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">docker compose up <span class="nt">--build</span> <span class="nt">-d</span></code></pre></figure>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:3">
      <p>If you know me, you know I used to be mostly a Java/JavaScript/R kind of guy. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>For example, today <a href="https://jupyter.org/">Jupyter</a> is bundled by almost every major cloud provider as the de facto tool for interactive data science and scientific computing. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7">
      <p>“Production-ready,” for me, means I can deploy the app to the cloud as-is, without needing to make a lot of infrastructure changes. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1">
<p>Don’t get me wrong, I understand there are cases where a multi-repo structure is necessary, like when multiple teams work on different parts of the project or when dependencies need to be shared across projects. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5">
<p>I believe that avoiding premature decomposition is a good idea. If a codebase is less than, say, 1/2 million LoC, then putting a network layer (like API calls) over it would only make maintenance a pain for <del>non-Amazon</del> rational developers. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>A <code class="language-plaintext highlighter-rouge">pyproject.toml</code> file is similar to <code class="language-plaintext highlighter-rouge">package.json</code> in Node.js or <code class="language-plaintext highlighter-rouge">pom.xml</code> in Java. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6">
      <p>By the way, I think every single project on GitHub should have its own website (it’s extremely easy via <a href="https://pages.github.com/">GitHub Pages</a>), so no excuses, sorry. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8">
      <p>Using Docker for CI ensures parity with production environments, but it might add some cold-start overhead. You know… compromises, life is full of them. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="tools" /><summary type="html"><![CDATA[I've started writing more Python code lately (because of... AI, you know). In this post, I share the tools, libraries, configs, and other integrations I use for building production-grade Python applications following a frontend-backend architecture.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2025/2025-07-15/wall-stones_cover.JPG" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2025/2025-07-15/wall-stones_cover.JPG" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Evaluation Metrics for Real-Time Financial Fraud Detection ML Models</title><link href="https://www.cesarsotovalero.net/blog/evaluation-metrics-for-real-time-financial-fraud-detection-ml-models.html" rel="alternate" type="text/html" title="Evaluation Metrics for Real-Time Financial Fraud Detection ML Models" /><published>2025-05-08T00:00:00-07:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/evaluation-metrics-for-real-time-financial-fraud-detection-ml-models</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/evaluation-metrics-for-real-time-financial-fraud-detection-ml-models.html"><![CDATA[<p>After training a machine learning model for <a href="./blog/real-time-financial-fraud-detection">real-time financial fraud detection</a>, the next step is to evaluate its performance.
Fraud detection models face unique challenges during evaluation.
Two examples are (1) class imbalance and (2) the different costs of false positives vs. false negatives.
Class imbalance means that the model doesn’t have enough fraudulent transactions to learn from, as they are rare compared to legitimate ones.
False positives, which flag legitimate transactions as fraudulent, can lead to customer dissatisfaction, while false negatives, where fraudulent transactions go undetected, can result in significant financial losses.
In this post, I cover the most common metrics and considerations for evaluating fraud detection models while keeping these unique challenges in mind.</p>

<h1 id="confusion-matrix">Confusion Matrix</h1>

<p>Most financial fraud detection systems generate a binary output: a prediction indicating whether a transaction (txn) is fraudulent (1) or genuine (0).</p>

<p>By leveraging this universal approach to binary classification, we can establish standard evaluation methodologies to assess the performance of models of this type.</p>

<p>The confusion matrix is a widely used tool for summarizing and visualizing the performance of a classification model in a tabular format.
It provides a clear breakdown of predictions vs. actual outcomes.</p>

<p>In a confusion matrix:</p>

<ul>
  <li>The <em>x</em>-axis represents the ground-truth labels (actual outcomes).</li>
  <li>The <em>y</em>-axis represents the predictions made by the classification model.</li>
</ul>

<p>Both axes are divided into two categories: positive (fraudulent txn) and negative (genuine txn).
The positive class corresponds to the minority class (fraud), while the negative class corresponds to the majority class (genuine).</p>

<p><img src="/img/posts/2025/2025-05-08/confusion-matrix.svg" alt="Confusion Matrix" /></p>

<p>This representation allows us to visualize the model’s performance in terms of true positives, true negatives, false positives, and false negatives.</p>

<ul>
  <li><strong>True Positives (TP):</strong> The number of fraudulent transactions correctly identified as fraud.</li>
  <li><strong>True Negatives (TN):</strong> The number of genuine transactions correctly identified as genuine.</li>
  <li><strong>False Positives (FP):</strong> The number of genuine transactions incorrectly flagged as fraudulent.</li>
  <li><strong>False Negatives (FN):</strong> The number of fraudulent transactions missed by the system.</li>
</ul>

<p>By analyzing these metrics, we gain a comprehensive view of the model’s strengths and weaknesses, enabling informed decisions for further optimization.</p>
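<p>The four counts can be tallied directly from ground-truth labels and predictions. Here is a minimal plain-Python sketch (the labels and predictions below are toy data, made up for illustration; in practice you would reach for <code class="language-plaintext highlighter-rouge">sklearn.metrics.confusion_matrix</code>):</p>

```python
# Tally confusion-matrix counts from ground-truth labels and predictions
# (1 = fraud, 0 = genuine). Toy data for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 3 3 1 1
```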

<h1 id="precision">Precision</h1>

<p><a href="https://en.wikipedia.org/wiki/Precision_and_recall">Precision</a> is the fraction of transactions that were actually fraud among all transactions the model flagged as fraud (predicted positive).</p>

<p>For example, if a model flagged 100 transactions as fraud, and 80 of those were indeed fraudulent, then:</p>

\[\text{Precision} = \frac{TP}{TP + FP} = \frac{80}{80 + 20} = 0.8\]

<p>High Precision means few false positives, so it is crucial for operational efficiency.</p>

<p>Low Precision means investigators waste time on many false alarms, and customers suffer unnecessary transaction declines.</p>

<p>Many top systems aim for very high Precision (e.g., 0.9+) at low fraud rates, but there’s a trade-off with Recall.
For example, if we lower the threshold to catch more fraud, Precision may drop.
Therefore, Precision is often reported at a certain operating point or as an average if multiple thresholds are considered.</p>

<p>An example interpretation: “Of the transactions our system blocked, 95% were indeed fraudulent.” That’s a Precision of 95%.</p>

<h1 id="recall">Recall</h1>

<p><a href="https://en.wikipedia.org/wiki/Precision_and_recall">Recall</a> is the fraction of actual fraud cases that the model correctly predicts as fraud (true positives) out of all actual fraud cases.</p>

<p>For example, if there were 100 actual fraud cases and the model caught 80 of them, then:</p>

\[\text{Recall} = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = 0.8\]

<p>A Recall of 0.80 means 80% of fraud instances were detected (20% missed).</p>

<p>High Recall means few false negatives, which is critical in fraud detection because missing a fraudulent transaction can lead to financial losses.</p>

<p>Low Recall means many frauds slip through and cause losses.</p>

<p>We can usually increase Recall by lowering the detection threshold at the cost of Precision.
In practice, businesses may set a Recall target like “catch at least 70% of fraud” and then maximize Precision under that constraint.</p>

<h1 id="f1-score">F1-Score</h1>

<p><a href="https://en.wikipedia.org/wiki/F-score">F1-Score</a>, or F1 for short, is the harmonic mean of Precision and Recall.
It gives a single-figure balance of both metrics, which is useful when we want a combined score for model selection or when class distribution is skewed.</p>

<p>For example, if Precision is 0.8 and Recall is 0.6, then:</p>

\[F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 0.8 \times 0.6}{0.8 + 0.6} \approx 0.686\]

<p>High F1 means both Precision and Recall are reasonably high.
Low F1 means either Precision or Recall is low, which is undesirable in fraud detection.</p>

<p>Overall, F1 is a good metric to assess a fraud detection model.
It is also a popular metric in Kaggle competitions and papers to compare models, ensuring they are not just optimizing one at the expense of the other.</p>
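<p>The three metrics compose mechanically. A quick sketch reproducing the numbers from the examples above (Precision = 0.8, Recall = 0.6):</p>

```python
def precision(tp, fp):
    # Fraction of flagged transactions that were actually fraud.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual frauds that were flagged.
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of Precision and Recall.
    return 2 * p * r / (p + r)

print(precision(80, 20))       # 0.8
print(round(f1(0.8, 0.6), 3))  # 0.686
```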

<h1 id="fpr">FPR</h1>

<p>False Positive Rate (FPR) is the share of legitimate transactions that get incorrectly flagged as fraud.</p>

<p>For example, if there were 100 legitimate transactions and the model flagged 5 of them as fraud, then:</p>

\[\text{FPR} = \frac{FP}{TN + FP} = \frac{5}{95 + 5} = 0.05\]

<p>FPR is important because it directly impacts customer experience.
High FPR means many legitimate transactions are blocked, leading to customer dissatisfaction and potential loss of business trust.</p>

<p>Sometimes businesses set FPR requirements to control false alarms.
For example: “We can only tolerate reviewing 0.5% of transactions, so FPR must be ≤ 0.005.”</p>

<h1 id="fnr">FNR</h1>

<p>False Negative Rate (FNR) is the share of fraudulent transactions the model misses.</p>

<p>For example, if there were 100 actual fraud cases and the model missed 2 of them, then:</p>

\[\text{FNR} = \frac{FN}{TP + FN} = \frac{2}{98 + 2} = 0.02\]

<p>Some businesses set FNR requirements to control missed fraud.
For example: “We cannot tolerate missing more than 10% of fraud, so FNR ≤ 0.1” which implies Recall ≥ 0.9.</p>

<h1 id="tnr">TNR</h1>

<p>True Negative Rate (TNR) or Specificity measures how well the system correctly identifies legitimate transactions as non-fraud.</p>

<p>For example, if there were 1000 legitimate transactions and the model flagged 50 of them incorrectly as fraud (FP), the calculation would be:</p>

\[\text{TNR} = \frac{TN}{TN + FP} = \frac{950}{950 + 50} = 0.95\]

<p>TNR is often overlooked in fraud detection because it’s essentially the complement of the False Positive Rate (FPR):</p>

\[\text{TNR} = 1 - \text{FPR}\]

<p>TNR is typically very high in fraud detection systems because the number of legitimate transactions (TN) is much larger than the number of frauds or false positives.</p>

<p>Since Precision already focuses on avoiding false positives, and we typically assume we want to approve as many legitimate transactions as possible, TNR doesn’t usually take center stage.</p>

<p>However, in some contexts, like regulatory requirements or customer experience, it’s important to keep FPR below a certain threshold, such as “FPR ≤ 0.1%,” which directly relates to maintaining high TNR.</p>
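<p>The three rates, and the TNR/FPR complement identity, in a quick sketch (the counts reuse the examples above: 1,000 legitimate transactions with 50 false positives, and 100 frauds with 2 misses):</p>

```python
def rates(tp, tn, fp, fn):
    fpr = fp / (fp + tn)  # legitimate txns wrongly flagged
    fnr = fn / (fn + tp)  # frauds the model misses
    tnr = tn / (tn + fp)  # legitimate txns correctly approved
    return fpr, fnr, tnr

fpr, fnr, tnr = rates(tp=98, tn=950, fp=50, fn=2)
print(fpr, fnr, tnr)  # 0.05 0.02 0.95
assert abs(tnr - (1 - fpr)) < 1e-12  # TNR is the complement of FPR
```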

<h1 id="auc-roc">AUC-ROC</h1>

<p>The Area Under the ROC Curve<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> (AUC-ROC) measures a model’s ability to distinguish fraud from non-fraud across all possible thresholds.
In essence, it plots Recall against FPR.</p>

<p>The AUC is the area under this curve:</p>

<ul>
  <li>AUC = 0.5 means random guessing.</li>
  <li>AUC = 1.0 means perfect discrimination.</li>
</ul>

<p>This area is computed as follows:</p>

\[\text{AUC} = \int_0^1 \text{Recall}(FPR) \, dFPR\]

<p>AUC is threshold-independent: it summarizes performance across all thresholds, and it’s less sensitive to class imbalance than accuracy.</p>

<p>An intuitive interpretation: “If I randomly pick a fraud and a legitimate transaction, AUC is the chance the fraud gets a higher risk score.”</p>
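<p>That probabilistic interpretation can be computed directly. A brute-force sketch over all fraud/legit pairs (toy scores; ties count as half a win; quadratic time, fine for illustration):</p>

```python
def auc_roc(fraud_scores, legit_scores):
    # Probability that a randomly picked fraud outscores a randomly
    # picked legitimate transaction (ties count as half a win).
    wins = 0.0
    for f in fraud_scores:
        for l in legit_scores:
            if f > l:
                wins += 1.0
            elif f == l:
                wins += 0.5
    return wins / (len(fraud_scores) * len(legit_scores))

# Toy risk scores: frauds mostly, but not always, score higher.
print(round(auc_roc([0.9, 0.8, 0.4], [0.3, 0.2, 0.8]), 3))  # 0.833
```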

<h1 id="auc-pr">AUC-PR</h1>

<p>The Area Under Precision-Recall Curve (AUC-PR) plots Precision vs. Recall and focuses squarely on the minority class (fraud), so it tells us how well the model catches fraud while keeping false positives low.</p>

<p>In highly imbalanced data such as fraud detection, AUC-PR is more informative than AUC-ROC because it answers how well the model balances Precision and Recall where it matters.</p>

<p>For instance, a model could have AUC-ROC = 0.98 and still have AUC-PR = 0.10. This means the model almost always ranks fraud above non-fraud, yet at operating points with high Recall its Precision is still poor for real-world detection.</p>

<p>AUC-PR is the go-to metric when fraud cases are rare, and we care about catching as many as possible without overwhelming the system with false alarms.</p>

<h2 id="threshold-for-auc-pr">Threshold for AUC-PR</h2>

<p>Once we have chosen the best model as per AUC-PR, we need to decide a threshold, denoted as \(\tau\), to convert this model’s fraud probability output into a concrete binary decision (fraud or not fraud).</p>

<p>If we look at the Precision-Recall (PR) curve in the figure below, different values of \(\tau\) correspond to different trade-offs between Precision and Recall.</p>

<figure class="jb_picture">
  



<img width="100%" style="border: 1px solid #808080;" src="/assets/resized/AUPRC-640x477.png" alt="Sample Precision-Recall curves for two models A and B." data-srcset="/assets/resized/AUPRC-640x477.png 640w,/assets/resized/AUPRC-768x573.png 768w,/assets/resized/AUPRC-1024x763.png 1024w,/assets/resized/AUPRC-1366x1018.png 1366w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Sample Precision-Recall curves for two models A and B. Model B is superior to model A as is reflected in the AUC-PR values of the two models. Different points on the PR curve represent different threshold values and different trade-offs between Precision and Recall metrics.
  </figcaption>
</figure>

<p>We ultimately need to choose the right trade-off that suits our use case.
The threshold \(\tau\) determines the decision boundary for classifying transactions as fraudulent or genuine. Mathematically, this can be expressed as:</p>

\[\text{Decision: Fraud if } P(x) &gt; \tau\]

<p>Where:</p>

<ul>
  <li>\(P(x)\) is the predicted probability of fraud for transaction \(x\).</li>
  <li>\(\tau\) is the threshold value.</li>
</ul>

<p>With the above framework in mind, we can decide the threshold value \(\tau\) based on the value of \(k\), where \(k\) represents the minimum Precision we want to maintain.</p>

<p>For example, if we want to maintain a minimum Precision of 90%, then \(k = 90\). Using the Precision-Recall curve, we can derive the threshold value \(\tau\) as well as the equivalent Recall value at 90% Precision.</p>

<p>This strategy allows us to calculate the optimal threshold \(\tau\) while evaluating our trained model on the test set. Once the threshold is determined, it can be used to classify transactions during deployment:</p>

\[\text{Fraud if } P(x) &gt; \tau, \text{ otherwise Genuine.}\]

<p>By adjusting \(\tau\), we can balance Precision and Recall to meet specific business objectives and constraints.</p>
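<p>This threshold search is easy to sketch in plain Python: scan candidate values of \(\tau\) and keep the one that maximizes Recall subject to the Precision floor (the scores and labels below are made up for illustration):</p>

```python
def pick_threshold(scores, labels, min_precision=0.9):
    # Return (recall, tau) maximizing Recall s.t. Precision >= min_precision.
    total_fraud = sum(labels)
    best = None
    for tau in sorted(set(scores)):
        preds = [1 if s > tau else 0 for s in scores]  # Fraud if P(x) > tau
        tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
        fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / total_fraud
        if prec >= min_precision and (best is None or rec > best[0]):
            best = (rec, tau)
    return best  # None if no threshold meets the Precision floor

scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]  # model fraud probabilities
labels = [1, 1, 1, 0, 1, 0]               # ground truth
print(pick_threshold(scores, labels))      # (0.75, 0.6)
```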

<h1 id="latency">Latency</h1>

<p>Latency is the time it takes for the system to process a transaction and make an inference.
Keeping latency low is crucial for real-time systems.
Fraud models not only need to have good statistical performance but also operate quickly enough to be used in practice.</p>

<p>Latency and complexity matter in payment systems.
In the example below, the <code class="language-plaintext highlighter-rouge">Payment Server</code> dispatches parallel calls to <code class="language-plaintext highlighter-rouge">KYC Service</code>, <code class="language-plaintext highlighter-rouge">Fraud Check</code>, and <code class="language-plaintext highlighter-rouge">Payment Rail</code>, but the transaction can only complete once the slowest of these services responds (the <code class="language-plaintext highlighter-rouge">KYC Service</code> in this example).
Even though <code class="language-plaintext highlighter-rouge">Fraud Check</code> takes just 25 ms, any fluctuation (like a network hiccup or a slow third-party response) can bottleneck the entire flow.
That’s why latency is a system-wide risk amplifier.</p>

<pre><code class="language-mermaid">flowchart LR
  %% define the Payment App node
  PaymentApp[Payment App]

  %% container for backend services
  subgraph container [" "]
    direction LR
    Server[Payment Server]
    KYC[KYC Service]
    Fraud[Fraud Check]
    Rail[Payment Rail]
  end

  %% draw the edges with labels
  PaymentApp --&gt;|50ms| Server
  Server     --&gt;|50ms| KYC
  Server     --&gt;|25ms| Fraud
  Server     --&gt;|25ms| Rail

  %% style all four links purple
  linkStyle 0,1,2,3 stroke:#800080,stroke-width:2px

  %% highlight Fraud Check in red
  style Fraud fill:#ffe6e6,stroke:#ff0000,stroke-width:2px
</code></pre>

<p>Real-time fraud detection latency is commonly measured in two ways:</p>

<ol>
  <li><strong>Online decision latency (ODL):</strong> How long it takes to score a single transaction and respond (which affects user experience and fraud blocking effectiveness).</li>
  <li><strong>Time-to-detection for fraud patterns (TTD):</strong> If an attack starts at a certain time, how long before the system detects and flags it.</li>
</ol>

<p>ODL is usually measured in milliseconds.
For example, a payment system might have an end-to-end latency budget of 200ms for authorization, out of which fraud checks get 20–30ms.
Modern systems often aim for fraud model inference under ~50ms.
In practice, we can look at 99th percentile latency (e.g., 99% of transactions scored in &lt;500ms), to ensure worst-case delays are bounded.</p>
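<p>The 99th-percentile figure is straightforward to compute from a sample of per-transaction scoring latencies. A nearest-rank sketch (the latency values are synthetic):</p>

```python
import math

def p99(latencies_ms):
    # 99th-percentile latency via the nearest-rank method:
    # the smallest value with at least 99% of samples at or below it.
    s = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(s))  # 1-based rank
    return s[rank - 1]

# 100 synthetic scoring latencies: 1 ms .. 100 ms.
print(p99(list(range(1, 101))))  # 99
```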

<p>TTD is more about monitoring and measuring the resilience of the system to detect an emerging fraud <em>modus operandi</em>.
For example: “Did we catch the new fraud ring the first day it appeared, or did it go undetected for weeks?”
This is harder to quantify but important in evaluating adaptive systems.</p>

<h1 id="summary">Summary</h1>

<p>In practice, evaluating a fraud detection model involves:</p>

<ol>
  <li>Analyzing the confusion matrix at the operating point.</li>
  <li>Reviewing Recall, F1, and AUC-PR.</li>
  <li>Choosing a threshold that satisfies business constraints (e.g., maximum number of tolerable false positives).</li>
</ol>

<p>But evaluation doesn’t stop at metrics.
Weighting fraud by transaction amount matters: catching a 10,000 USD fraud is more impactful than catching five 1,000 USD cases.
Moreover, metrics on static test sets aren’t enough.
We also need to perform <a href="https://en.wikipedia.org/wiki/Backtesting">backtesting</a> (simulate past performance) and <a href="https://en.wikipedia.org/wiki/Sandbox_(computer_security)">sandbox testing</a> (simulate deployment), and monitor the model in production.</p>

<p>Observe how fraud patterns change: Do attackers evolve? Do false positives creep up?</p>

<p>Or even better: run <a href="https://en.wikipedia.org/wiki/A/B_testing">A/B tests</a>.</p>

<p>Put the new model in production in <a href="https://en.wikipedia.org/wiki/Shadowing_(computing)">shadow mode</a> and compare it to the previous version.</p>

<p>But that’s content for another post.</p>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p><a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC Curve</a> stands for “Receiver Operating Characteristic”, a very weird name, indeed. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="ai" /><summary type="html"><![CDATA[After training a real-time financial fraud detection model, the next step is to evaluate its performance. This post provides an overview of the most common evaluation metrics and considerations for fraud detection models, including confusion matrix, precision, recall, F1-score, AUC-ROC, AUC-PR, and more.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2025/2025-05-08/kungstradgarden_cover.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2025/2025-05-08/kungstradgarden_cover.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">From Classical ML to DNNs and GNNs for Real-Time Financial Fraud Detection</title><link href="https://www.cesarsotovalero.net/blog/from-classical-ml-to-dnns-and-gnns-for-real-time-financial-fraud-detection.html" rel="alternate" type="text/html" title="From Classical ML to DNNs and GNNs for Real-Time Financial Fraud Detection" /><published>2025-04-03T00:00:00-07:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/from-classical-ml-to-dnns-and-gnns-for-real-time-financial-fraud-detection</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/from-classical-ml-to-dnns-and-gnns-for-real-time-financial-fraud-detection.html"><![CDATA[<aside class="youtube">
        <a href="https://www.youtube.com/watch?v=nEg8T8doy64"><div class="box">
        <img src="https://i.ytimg.com/vi/nEg8T8doy64/mqdefault.jpg" alt="YouTube video #nEg8T8doy64" />
        <div class="play">
          <img src="/img/icons/youtube-play20px.svg" alt="Play Video" class="youtube-logo" />
        </div>
        </div></a>
        <div>César Soto Valero - Realtime Financial Fraud Detection with Modern Python-PyData Global 2025; 9 January 2026.</div></aside>

<p>Financial fraud is a pervasive problem costing institutions and customers billions annually.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>
The best-known examples include credit card fraud, fraudulent online payments, and money laundering.
Banks worldwide faced an estimated \(\$442\) billion in fraud-related losses in 2023 alone.
In particular, credit card transactional fraud is projected to reach \(\$43\) billion in annual losses by 2026.
Beyond direct losses, fraud undermines customer trust and damages banks’ reputation.
For example, it leads to false positives where legitimate transactions are wrongly blocked.
Consequently, financial fraud detection systems (a.k.a. fraud scoring) must not only catch as much fraud as possible but also minimize false positives.</p>

<p>Fraudsters’ tactics evolve rapidly.
Traditional rule-based systems (or simple statistical methods) have proven inadequate against today’s adaptive fraud schemes.
On one hand, fraudsters form complex schemes and exploit networks of accounts.
On the other hand, legitimate transaction volumes continue to grow due to the rise of e-commerce and digital payments.</p>

<p>This situation has driven a shift toward Machine Learning (ML) and AI-based approaches that can learn subtle patterns and adapt over time.
Critically, financial fraud detection must happen in real-time (or near-real time) to intervene before fraudsters can complete illicit transactions.
Catching fraud “closer to the time of fraud occurrence is key” so that suspicious transactions can be blocked or flagged immediately.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></p>

<p>This article deep dives into the current state-of-the-art of real-time transactional fraud detection, spanning both academic research and current industry practices.</p>

<p>I cover the major model families used today:</p>

<ol>
  <li><strong>Classical ML models:</strong> Logistic regression, decision trees, random forests, and SVMs.</li>
  <li><strong>Deep Learning models:</strong> ANNs, CNNs, RNNs/LSTMs, autoencoders, and GANs.</li>
  <li><strong>Graph-based models:</strong> GNNs and graph algorithms that leverage transaction relationships.</li>
  <li><strong>Transformer-based and foundation models:</strong> Large pre-trained models like Stripe’s payments foundation model.</li>
</ol>

<p>For each category, I discuss representative use cases or studies, highlight strengths and weaknesses, and comment on their suitability for real-time fraud detection.</p>

<h1 id="classical-ml-models">Classical ML Models</h1>

<p>Classical ML algorithms have long been used in fraud detection and remain strong baselines in both research and production systems.
These include linear models like <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a>, distance-based classifiers like <a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">k-Nearest Neighbors</a>, and tree-based models such as <a href="https://en.wikipedia.org/wiki/Random_forest">random forest</a> and <a href="https://en.wikipedia.org/wiki/Gradient_boosting">gradient boosted trees</a> (e.g., <a href="https://xgboost.readthedocs.io/">XGBoost</a>).
These approaches operate on hand-crafted features derived from transaction data (e.g., <code class="language-plaintext highlighter-rouge">transaction_amount</code>, <code class="language-plaintext highlighter-rouge">location</code>, <code class="language-plaintext highlighter-rouge">device_id</code>, <code class="language-plaintext highlighter-rouge">time_of_day</code>, etc.), often requiring substantial feature engineering by domain experts.</p>

<p><strong>Logistic regression</strong> is a foundational model in fraud detection.
Banks and financial institutions have historically relied on it due to its simplicity and interpretability (each coefficient \(w_i\) has a direct and intuitive meaning): a positive coefficient means the feature increases the log-odds of fraud, while a negative coefficient decreases them.</p>

\[P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}\]

<ul>
  <li>\(\mathbf{x}\): Feature vector (e.g., transaction amount, time of day, merchant category)</li>
  <li>\(\mathbf{w}\): Coefficients (risk factors)</li>
  <li>\(b\): Bias or intercept</li>
</ul>
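<p>As a concrete illustration, here is a minimal Python sketch of this scoring equation. The coefficients, bias, and feature choices below are purely hypothetical, not a trained model:</p>

```python
import numpy as np

def fraud_probability(x, w, b):
    """Logistic model: P(fraud | x) = sigmoid(w·x + b)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# Hypothetical coefficients for three features: normalized amount,
# night-time flag, and IP/card country-mismatch flag.
w = np.array([0.8, 0.5, 2.1])
b = -4.0  # intercept: the baseline log-odds of fraud are low

routine = np.array([0.1, 0.0, 0.0])  # small daytime purchase, no mismatch
risky = np.array([3.0, 1.0, 1.0])    # large night-time cross-border purchase

print(f"routine: {fraud_probability(routine, w, b):.3f}")
print(f"risky:   {fraud_probability(risky, w, b):.3f}")
```

<p>Because scoring is a single dot-product followed by a sigmoid, this also explains why logistic regression is so cheap to run in real-time pipelines.</p>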

<p>Even today, logistic models serve as interpretable baseline detectors and are sometimes combined with a <a href="https://en.wikipedia.org/wiki/Business_rule_management_system">Business Rule Management Systems</a>.
However, linear models struggle to capture complex non-linear patterns in large transaction datasets.</p>

<aside class="quote">
    <em>“XGBoost builds trees sequentially, where each tree learns from the mistakes of the previous ones.”</em>
</aside>

<p><strong>Decision trees</strong> and ensemble forests address this by automatically learning non-linear splits and interactions.
In fact, boosted decision tree ensembles (like XGBoost) became popular in fraud detection competitions and industry solutions due to their high accuracy on tabular data.<sup id="fnref:34"><a href="#fn:34" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>
These models can capture anomalous combinations of features in individual transactions effectively, learning complex, non-linear interactions between features.
For example, <a href="https://github.com/VedangW/ieee-cis-fraud-detection">the winning solutions</a> of the <a href="https://www.kaggle.com/c/ieee-fraud-detection/overview">IEEE-CIS fraud detection Kaggle challenge</a> (2019) heavily used engineered features fed into gradient boosting models, achieving strong performance (AUC ≈ 0.91).</p>

<p><strong>Support Vector Machines</strong> (<a href="https://en.wikipedia.org/wiki/Support_vector_machine">SVMs</a>) have also been explored in academic studies.
However, while they can model non-linear boundaries (with kernels), they tend to be computationally heavy for large datasets and offer limited interpretability.
Therefore, the industry has gravitated more to tree ensembles for complex models.</p>

<h2 id="strengths">Strengths</h2>

<p>Classical ML models are typically fast to train and infer, and many (especially logistic regression and decision trees) are relatively easy to interpret.
For instance, a logistic regression might directly quantify how much a mismatched billing address raises fraud probability, and a decision tree might provide a rule-like structure (e.g., “if IP country ≠ card country and amount &gt; $1000 ⇒ flag fraud”).
More complex models like XGBoost still allow some interpretability through <a href="https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/">feature importance scores</a>, <a href="https://medium.com/@lfoster49203/interpretable-machine-learning-models-using-shap-and-lime-for-complex-data-6f65e1224209">SHAP values</a>, or partial dependence plots.</p>

<p>Classical ML models can be deployed in real-time with minimal latency.
A logistic regression is essentially a dot-product of features, and even a large XGBoost ensemble can score a transaction in tens of milliseconds or less on modern hardware.</p>

<p>They also perform well with limited data.
With careful feature engineering, a simple model can already catch a large fraction of fraud patterns.
Consequently, industry adoption is widespread: many banks initially deploy logistic or tree-based models in production, and even today XGBoost remains a common choice in fraud ML pipelines.</p>

<h2 id="weaknesses">Weaknesses</h2>

<p>A key limitation of classical ML models is the reliance on manual feature engineering.
In other words, they cannot automatically invent new abstractions beyond the input features given.
These models may miss complex patterns such as sequential spending behavior or R-ring collusion<sup id="fnref:28"><a href="#fn:28" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> between groups of accounts unless analysts explicitly code such features (e.g., number of purchases in the last hour, or count of accounts sharing an email domain).</p>

<p>They may also struggle with high-dimensional data like raw event logs or image data (this is where deep learning excels).
However, this is less an issue for structured transaction records.</p>

<p>Another challenge is class imbalance.
The occurrence of fraud is typically rare (often less than 1% of transactions), which can bias models to predict the majority “non-fraud” class.
Techniques like <a href="https://medium.com/@ravi.abhinav4/improving-class-imbalance-with-class-weights-in-machine-learning-af072fdd4aa4">balanced class weighting</a>, <a href="https://www.kaggle.com/code/residentmario/undersampling-and-oversampling-imbalanced-data">undersampling</a>, or <a href="https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/">SMOTE</a> are often needed to train classical models effectively on imbalanced fraud data.<sup id="fnref:16"><a href="#fn:16" class="footnote" rel="footnote" role="doc-noteref">5</a></sup></p>
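<p>As a minimal sketch of the class-weighting technique, the snippet below compares a plain and a balanced logistic regression on synthetic data with roughly 1% positives (the exact recall numbers depend on the random data):</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic transactions with roughly 1% fraud.
X, y = make_classification(n_samples=20_000, n_features=10, n_informative=5,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Without weighting, the model is biased toward the majority class;
# class_weight="balanced" reweights errors by inverse class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"Recall plain: {r_plain:.2f}, balanced: {r_weighted:.2f}")
```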

<p>Finally, while faster than deep neural networks, complex ensembles (hundreds of trees) can become memory-intensive and may require optimization for ultra-low latency at high transaction volumes.</p>

<h2 id="real-time-suitability">Real-Time Suitability</h2>

<p>Classical models are generally well-suited to real-time fraud scoring.
They have low latency inference and modest resource requirements.</p>

<p>For example, a bank’s fraud engine might run a logistic regression and a few decision tree rules in under 10ms per transaction on a CPU.
Even a sophisticated random forest or gradient boosting model can be served via highly optimized C++ libraries or cloud ML endpoints to meet sub-hundred-millisecond SLAs.<sup id="fnref:29"><a href="#fn:29" class="footnote" rel="footnote" role="doc-noteref">6</a></sup></p>

<p>The straightforward nature of these models also simplifies transaction monitoring and model updates.
New data can be used to frequently retrain or update coefficients (even via <a href="https://en.wikipedia.org/wiki/Online_machine_learning">online learning</a> for logistic regression).
The main caution is that if fraud patterns shift significantly (<a href="https://en.wikipedia.org/wiki/Concept_drift">concept drift</a>), purely static classical models will need frequent retraining to keep up.</p>

<p>In practice, many organizations retrain or fine-tune their fraud models on recent data weekly or even daily to adapt to new fraud tactics.
So, while classical models are fast to deploy and iterate on, they do require ongoing maintenance to remain effective.</p>

<h2 id="examples">Examples</h2>

<p>Representative research and use-cases for classical methods include:</p>

<ul>
  <li><strong>Logistic regression and decision trees as baseline models:</strong> Many banks have deployed logistic regression for real-time credit card fraud scoring due to its interpretability.</li>
  <li><strong>Ensemble methods in academic studies:</strong> Research has focused on evaluating logistic regression vs. decision trees vs. random forests on a credit card dataset (often finding tree ensembles outperform linear models in Recall).<sup id="fnref:17"><a href="#fn:17" class="footnote" rel="footnote" role="doc-noteref">7</a></sup></li>
  <li><strong>Kaggle competitions:</strong> XGBoost was heavily used in the <a href="https://www.kaggle.com/c/ieee-fraud-detection">Kaggle IEEE-CIS 2019 competition</a>, leveraging high accuracy on tabular features.</li>
  <li><strong>Hybrid systems:</strong> Many production systems combine manual business rules for known high-risk patterns with an ML model for subtler patterns, using the rules for immediate high-precision flags and the ML model for broad coverage.</li>
</ul>

<h1 id="deep-learning-models">Deep Learning Models</h1>

<p>In recent years, <a href="https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks">Deep Neural Networks</a> (DNNs) have been <a href="https://opencv.org/blog/online-transaction-fraud-detection-using-deep-learning">applied to transaction fraud detection</a> with promising results.
DNNs can automatically learn complex feature representations from raw data, potentially capturing patterns that are hard to manually engineer or find with classical ML models.</p>

<h2 id="deep-learning-architectures">Deep Learning Architectures</h2>

<p>Several deep architectures have been explored for fraud detection.
Below, I summarize the most common types.</p>

<h3 id="feed-forward-neural-networks-anns">Feed-Forward Neural Networks (ANNs)</h3>

<p><a href="https://en.wikipedia.org/wiki/Feedforward_neural_network">ANNs</a> are multi-layer perceptrons that treat each transaction’s features as input neurons.
These can model non-linear combinations of features beyond what logistic regression can capture.
In practice, simple feed-forward networks have been used as a baseline deep model for fraud (e.g., a 3-layer network on credit card data).
They often perform similarly to tree ensembles if ample data is available but are harder to interpret.
They also don’t inherently handle sequential or time-based information beyond what the input features provide.</p>
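<p>A baseline feed-forward network of this kind can be sketched with scikit-learn on synthetic imbalanced data; a single hidden layer mirrors the 3-layer (input, hidden, output) setup mentioned above:</p>

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic transactions with roughly 3% fraud.
X, y = make_classification(n_samples=10_000, n_features=10, n_informative=5,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# One hidden layer of 16 units: a small multi-layer perceptron that can
# learn non-linear combinations of the input features.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
mlp.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.3f}")
```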

<h3 id="convolutional-neural-networks-cnns">Convolutional Neural Networks (CNNs)</h3>

<p><a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">CNNs</a> are most famous for image-related tasks.
However, they have also been applied to fraud by treating transaction data as temporal or spatial sequences.
For example, a CNN can slide over a sequence of past transactions for a user to detect local patterns or use 1D convolution on time-series of transaction amounts.</p>

<p>CNNs excel at automatic feature extraction of localized patterns.
Some research reformats transaction histories into a 2D “image” (e.g., time vs. feature dimension) so that CNNs can detect anomalous shapes.</p>

<p>CNNs for detecting fraud have seen limited but growing use.
One recent study reported ~99% detection accuracy with a CNN on a credit card dataset.<sup id="fnref:19"><a href="#fn:19" class="footnote" rel="footnote" role="doc-noteref">8</a></sup>
However, such high accuracy is likely inflated by the dataset’s heavy class imbalance (AUC or F1 are more meaningful metrics).</p>

<h3 id="recurrent-neural-networks-rnns">Recurrent Neural Networks (RNNs)</h3>

<p><a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">RNNs</a>, including <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTM</a> and <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> networks, are well-suited for sequential transactional data.
They maintain a memory of past events, making them ideal for modeling an account’s behavior over time.</p>

<p>For example, an LSTM can consume a customer’s sequence of transactions (with timestamps) and detect if the latest transaction is anomalous given the recent pattern.
This temporal modeling is very powerful for fraud because many fraud patterns only make sense in context (e.g., a sudden spending spike, or a purchase in a new country right after another far-away purchase).</p>

<p>Research has shown LSTM-based models can effectively distinguish fraudulent vs. legitimate sequences.
In one case, an LSTM achieved significantly higher Recall than static models by catching subtle temporal shifts in user behavior.<sup id="fnref:13"><a href="#fn:13" class="footnote" rel="footnote" role="doc-noteref">9</a></sup>
RNNs do require sequential data, so for one-off transactions without history they are less applicable (unless modeling at the merchant or account aggregate level).</p>

<h3 id="autoencoders">Autoencoders</h3>

<p><a href="https://en.wikipedia.org/wiki/Autoencoder">Autoencoders</a> are unsupervised anomaly detection models that learn to compress and reconstruct data.
When trained on predominantly legitimate transactions, an autoencoder captures the underlying structure of normal behavior (a.k.a. the “normal manifold”).
As a result, it can reconstruct typical transactions with very low error, but struggles with atypical or anomalous ones.
A transaction that doesn’t conform to the learned normal pattern will produce a higher reconstruction error.
By setting a threshold, we can flag transactions with unusually high reconstruction error as potential fraud.</p>
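<p>The reconstruction-error mechanism can be sketched as follows. For brevity this uses a linear autoencoder (identity activations, 2-unit bottleneck) trained on synthetic “normal” data; the thresholding logic is identical for deeper non-linear autoencoders:</p>

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# "Normal" transactions lie near a 2-D pattern embedded in 8 features.
latent = rng.normal(size=(5000, 2))
mixing = rng.normal(size=(2, 8))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(5000, 8))

# A linear autoencoder: the network is trained to reproduce its own
# input through a 2-unit bottleneck (input = target).
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                  max_iter=1000, random_state=0)
ae.fit(X_normal, X_normal)

def reconstruction_error(X):
    return np.mean((ae.predict(X) - X) ** 2, axis=1)

# Flag anything above the 99th percentile of errors on normal data.
threshold = np.percentile(reconstruction_error(X_normal), 99)

X_anomalous = 3.0 * rng.normal(size=(100, 8))  # off the normal manifold
flagged = reconstruction_error(X_anomalous) > threshold
print(f"{flagged.mean():.0%} of anomalies flagged as potential fraud")
```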

<p>Autoencoders shine in fraud detection, particularly when labeled fraud data is scarce or nonexistent.<sup id="fnref:33"><a href="#fn:33" class="footnote" rel="footnote" role="doc-noteref">10</a></sup>
Their strength lies in identifying transactions that deviate from the learned “normal” without requiring explicit fraud labels during training.
For example, an autoencoder trained on millions of legitimate transactions will likely assign high reconstruction error to fraudulent ones it’s never seen before.
<a href="https://en.wikipedia.org/wiki/Variational_autoencoder">Variational Autoencoder</a>s (VAEs), which introduce probabilistic modeling and latent-space regularization, have also been explored for fraud detection, offering potentially richer representations of normal transaction behavior.<sup id="fnref:21"><a href="#fn:21" class="footnote" rel="footnote" role="doc-noteref">11</a></sup></p>

<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>

<p><a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">GANs</a> consist of a generator and discriminator.
The generator creates synthetic data, while the discriminator tries to distinguish real from fake data.</p>

<p>There are two main applications of GANs in fraud detection:</p>

<ol>
  <li>
    <p><strong>Generate realistic synthetic fraud examples:</strong> GANs can augment training data to address class imbalance. The generator is trained to produce fake transactions that the discriminator (trained to distinguish real vs. fake) finds plausible. By adding these synthetic frauds to the training set, models (including non-deep models) can learn a broader decision boundary.</p>
  </li>
  <li>
    <p><strong>Serve as anomaly detectors:</strong> The generator tries to model the distribution of legitimate transactions, and the discriminator’s output can highlight outliers.</p>
  </li>
</ol>

<p>Some financial institutions have experimented with GANs.
For example, <a href="https://developer.nvidia.com/blog/detecting-financial-fraud-using-gans-at-swedbank-with-hopsworks-and-gpus/">Swedbank reportedly used GANs</a> to generate additional fraudulent examples for training their models.
However, GAN training can be unstable, and GANs remain less common in production.
Still, in research, GAN-based methods have shown improved Recall by expanding the fraud training sample space.<sup id="fnref:22"><a href="#fn:22" class="footnote" rel="footnote" role="doc-noteref">12</a></sup></p>

<h3 id="hybrid-deep-learning-models">Hybrid Deep Learning Models</h3>

<p>There are also custom DNN architectures combining elements of the above, or combining deep models with classical ones.</p>

<p>For example, a “wide and deep model” might have a linear (wide) component for memorizing known risk patterns and a neural network (deep) component for generalization.
Another example is combining an LSTM for sequence modeling with a feed-forward network for static features (“dual-stream” models).</p>

<p>Ensembles of deep and non-deep models have also been used (e.g., using an autoencoder’s anomaly score as an input feature to a random forest).
Recent research explores stacking deep models with tree models to improve robustness and interpretability.</p>

<h2 id="strengths-1">Strengths</h2>

<p>The biggest advantage of DNNs is automated feature learning.
These types of models can uncover intricate, non-linear relationships and subtle correlations within massive datasets that older methods miss.
They can digest raw inputs (including unstructured data) and find patterns without explicit human-designed features.
For instance, an RNN can learn the notion of “rapid spending spree” or “geographical inconsistency” from raw sequences, which would be hard to capture with handcrafted features.</p>

<p>In fraud detection, large payment companies have millions of transactions which deep models can leverage to potentially exceed the accuracy of simpler models.
DNNs also tend to improve with more data, whereas classical models may saturate in performance.</p>

<p>Another strength is handling complex data types.
For example, if one incorporates additional signals like device fingerprints, text (e.g., product names), or network information, deep networks can combine these modalities more seamlessly (e.g., an embedding layer for device ID, an LSTM for text description, etc.).</p>

<p>In practice, DNNs have shown higher Recall at a given false-positive rate than classical models in several cases.<sup id="fnref:13:1"><a href="#fn:13" class="footnote" rel="footnote" role="doc-noteref">9</a></sup>
They are also adaptive: architectures like RNNs, or models trained with online learning, can update as new data comes in, enabling continuous learning as fraud scenarios evolve.</p>

<h2 id="weaknesses-1">Weaknesses</h2>

<p>The primary downsides of DNNs are complexity and lack of interpretability.</p>

<p>Deep networks are considered “black boxes”, meaning that it’s non-trivial to explain why a certain transaction was flagged as fraudulent.
This is problematic for financial institutions that need to justify decisions to customers or regulators.
Techniques like <a href="https://shap.readthedocs.io/">SHapley Additive exPlanations</a> (SHAP) or <a href="https://github.com/marcotcr/lime">Local Interpretable Model-Agnostic Explanations</a> (LIME) can help interpret feature importance for deep models, <a href="https://www.milliman.com/en/insight/Explainable-AI-in-fraud-detection">but it’s still harder</a> compared to a linear model or decision tree.</p>

<aside class="quote">
    <em>“DNNs can really shine only when there are huge datasets or additional unlabeled data to pre-train on.”</em>
</aside>

<p>Another issue is data and compute requirements.
Training large DNNs may require GPUs and extensive hyperparameter tuning, which can be overkill for some fraud datasets, especially if data is limited or highly imbalanced.<sup id="fnref:32"><a href="#fn:32" class="footnote" rel="footnote" role="doc-noteref">13</a></sup>
In fact, many academic studies on the popular <a href="https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud">Kaggle credit card dataset</a> (284,807 transactions) found that simpler models can match DNN performance, likely because the dataset is small and mostly numeric.</p>

<p>Overfitting is a risk too: fraud datasets are skewed and sometimes consist of static snapshots in time.
If not carefully regularized, a DNN might memorize past fraud patterns that fraudsters no longer use.</p>

<p>Finally, latency can be a concern.
A large CNN or LSTM might take longer to evaluate than a logistic regression.
However, many deep models used for fraud are not excessively large (e.g., an LSTM with a few hundred units), and with optimized inference (batching, quantization, etc.) they can often still meet real-time requirements.
I discuss latency more later, but suffice it to say that deploying deep models at scale might necessitate GPU acceleration or model optimizations in high-throughput environments.</p>

<h2 id="real-time-suitability-1">Real-Time Suitability</h2>

<p>DNN models can be deployed for real-time fraud scoring, but this requires more care than classical models.
Simpler networks (small MLPs) are no issue in real-time.
However, RNNs or CNNs might introduce slight latency (tens of milliseconds).
Nevertheless, modern inference servers and even FPGAs/TPUs can handle thousands of inferences per second.
For instance, Visa reportedly targets fraud model evaluations in under ~25ms as part of their payment authorization pipeline.
It’s feasible to achieve this with a moderately sized neural network and good infrastructure.</p>

<p>Scaling to high transaction volumes is another aspect.
Deep models may consume more CPU/GPU resources, so a cloud deployment might need to autoscale instances or use GPU inference for peak loads.</p>

<p>A potential strategy for real-time use is a two-stage system: a fast classical model first filters obvious cases (either definitely legitimate or obviously fraudulent), and a slower deep model only analyzes the ambiguous middle chunk of transactions.
This way, the heavy model is used on a fraction of traffic to keep overall throughput high.</p>
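<p>The routing logic of such a two-stage system fits in a few lines of Python; the models and thresholds below are hypothetical placeholders:</p>

```python
def two_stage_score(tx, fast_model, deep_model, low=0.05, high=0.95):
    """Route a transaction through a fast filter, escalating only
    ambiguous cases to the expensive deep model."""
    p = fast_model(tx)
    if p <= low:
        return p, "approve (fast path)"
    if p >= high:
        return p, "block (fast path)"
    return deep_model(tx), "deep model"

# Hypothetical stand-ins: the fast score is a toy amount-based rule,
# the deep model a placeholder for an expensive neural network.
fast = lambda tx: min(tx["amount"] / 10_000, 1.0)
deep = lambda tx: 0.5

print(two_stage_score({"amount": 100}, fast, deep))    # fast approve
print(two_stage_score({"amount": 9_900}, fast, deep))  # fast block
print(two_stage_score({"amount": 4_000}, fast, deep))  # escalated
```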

<p>Additionally, organizations often maintain a feedback loop.
Flagged predictions are first reviewed by analysts or via outcomes like chargebacks, and then a DNN model is retrained frequently to incorporate the latest data.</p>

<p>Some deep models can be updated via online learning.
For example, an RNN can continuously update its hidden state, or a streaming network can periodically retrain on a rolling window of data, which helps keep models current with concept drift.</p>

<h2 id="examples-1">Examples</h2>

<p>Notable examples of deep learning in fraud detection:</p>

<ul>
  <li><strong>Feedforward DNNs:</strong> PayPal in the mid-2010s <a href="https://www.paypal.com/us/brc/article/payment-fraud-detection-machine-learning?utm_source=chatgpt.com">applied neural networks to fraud detection</a>; fintech companies like Feedzai have further advanced this methodology by combining DNNs with tree-based models.<sup id="fnref:24"><a href="#fn:24" class="footnote" rel="footnote" role="doc-noteref">14</a></sup></li>
  <li><strong>RNNs and LSTMs:</strong> Multiple studies have shown that LSTM networks can detect sequential fraud behavior that static models miss, improving Recall by capturing temporal patterns. Large merchants have employed LSTM-based models to analyze user event streams, enabling the detection of account takeovers and session-based fraud in real-time.</li>
  <li><strong>Autoencoder-based anomaly detection:</strong> Unsupervised autoencoders have been used by banks to flag new types of fraud. For instance, an autoencoder trained on normal mobile transactions flagged anomalies that turned out to be new fraud rings exploiting a loophole (detected via high reconstruction error).</li>
  <li><strong>Hybrid models:</strong> Recent trends include using DNNs to generate features for a gradient boosted tree. One effective approach is to use deep learning models, such as autoencoders or embedding networks, to learn rich feature representations from transaction data. These learned embeddings are then fed into XGBoost, combining the deep models’ ability to capture complex patterns with the interpretability and efficiency of tree-based methods.</li>
</ul>

<h1 id="graph-based-models">Graph-Based Models</h1>

<p>Groups of fraudsters might share information (e.g., using the same stolen cards or devices), or a single fraudster might operate many accounts that transact with each other.
A powerful class of methods treats the financial system as a graph, linking entities like users, accounts, devices, IP addresses, merchants, etc.
<a href="https://github.com/safe-graph/graph-fraud-detection-papers">Graph-based fraud detection models</a> aim to exploit these relational structures to detect fraud patterns that single-transaction models might miss.
Classical graph algorithms can then be applied, such as community detection<sup id="fnref:25"><a href="#fn:25" class="footnote" rel="footnote" role="doc-noteref">15</a></sup> and link analysis (e.g., <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a> on the fraud graph).</p>

<figure class="jb_picture">
  



<img width="100%" style="border: 1px solid #808080;" src="/assets/resized/suspicious-subgraphs-640x378.png" alt="Illustration of entity linkages in transaction fraud" data-srcset="/assets/resized/suspicious-subgraphs-640x378.png 640w,/assets/resized/suspicious-subgraphs-768x454.png 768w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Illustration of entity linkages in transaction fraud: Shared devices, phone numbers, and locations connect different users. Fraudsters (devil icons) may create many accounts that all link through common data points (phone, IP, geo), forming <b>suspicious</b> subgraphs that graph-based methods can detect.
  </figcaption>
</figure>

<p>For example, in a bipartite graph of credit card transactions, one set of nodes represents cardholders, the other represents merchants, and an edge connects a cardholder to a merchant for each transaction.
Fraudulent cards might cluster via merchant edges (e.g., a fraud ring testing many stolen cards at one merchant), or vice versa.<sup id="fnref:35"><a href="#fn:35" class="footnote" rel="footnote" role="doc-noteref">16</a></sup>
Similarly, for online payments we can create nodes for user accounts, email addresses, IP addresses, device IDs, etc., and connect nodes that are observed together in a transaction or account registration.
This yields a rich heterogeneous graph of entities.</p>
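<p>Building such a heterogeneous entity graph from a transaction stream can be sketched with plain Python dictionaries (field names and data are illustrative):</p>

```python
from collections import defaultdict

# Toy stream: each transaction links an account to the device and IP seen.
transactions = [
    {"account": "A1", "device": "D1", "ip": "IP1"},
    {"account": "A2", "device": "D1", "ip": "IP2"},  # shares device with A1
    {"account": "A3", "device": "D2", "ip": "IP3"},
]

# Undirected adjacency over typed (kind, id) nodes.
graph = defaultdict(set)
for tx in transactions:
    acct = ("account", tx["account"])
    for entity in (("device", tx["device"]), ("ip", tx["ip"])):
        graph[acct].add(entity)
        graph[entity].add(acct)

def linked_accounts(account):
    """Accounts reachable via any shared entity (two-hop neighborhood)."""
    hops = set()
    for entity in graph[("account", account)]:
        hops |= {name for (kind, name) in graph[entity] if kind == "account"}
    hops.discard(account)
    return hops

print(linked_accounts("A1"))  # {'A2'} via the shared device D1
print(linked_accounts("A3"))  # set(): no shared entities
```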

<p>The rise of <a href="https://en.wikipedia.org/wiki/Graph_neural_network">Graph Neural Networks</a> (GNNs) in recent years has led to many applications of this technology in fraud detection.<sup id="fnref:23"><a href="#fn:23" class="footnote" rel="footnote" role="doc-noteref">17</a></sup> <sup id="fnref:30"><a href="#fn:30" class="footnote" rel="footnote" role="doc-noteref">18</a></sup> <sup id="fnref:31"><a href="#fn:31" class="footnote" rel="footnote" role="doc-noteref">19</a></sup>
GNNs are deep learning models designed for graph-structured data.
They propagate information along edges, allowing each node to aggregate features from its neighbors.
In fraud terms, a GNN can learn to identify suspicious nodes (e.g., users or transactions) by looking at their connected partners.
For instance, if a particular device ID node connects to many user accounts that were later flagged as fraud, a GNN can learn to embed that device node as high-risk, which in turn raises the risk of any new account connected to it.</p>

<aside class="quote">
    <em>“Fraud is rarely a problem of isolated events… fraudsters operate within complex networks.”</em>
</aside>

<p>GNNs consider connections between accounts and transactions to reveal patterns of suspicious activity across the network.
By incorporating relational context, GNNs have demonstrated higher fraud detection accuracy and fewer false positives than models that ignore graph structure.
For example, combining GNN features with an XGBoost classifier has caught fraud that would otherwise go undetected while reducing false alarms, thanks to the added network context.
A GNN approach might catch a seemingly normal transaction if the card, device, or IP involved has connections to known frauds that a non-graph model wouldn’t see.</p>

<p>Several GNN architectures have been used.
Notably, <a href="https://paperswithcode.com/method/gcn">Graph Convolutional Networks</a> (GCN), <a href="https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/graph-sage/">GraphSAGE</a>, heterogeneous GNNs for multi-type node graphs, and even <a href="https://paperswithcode.com/method/graph-transformer">Graph Transformers</a>.</p>

<p>A popular benchmark for GNNs is the <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.EllipticBitcoinDataset">Elliptic dataset</a>, a Bitcoin transaction graph where GNNs have been applied to identify illicit transactions by classifying nodes in a large transaction graph.
GNNs have also been applied to credit card networks: e.g., researchers have built graphs linking credit card numbers, merchants, and phone numbers, and used a heterogeneous GNN to detect fraud cases involving synthetic identities and collusive merchants.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">20</a></sup></p>

<h2 id="strengths-2">Strengths</h2>

<p>Graph-based methods can detect patterns of collusion and linkage that purely feature-based models miss.
They effectively augment each transaction with context.
Rather than evaluating an event in isolation, the model considers the broader network (device usage graph, money flow graph, etc.).
This is crucial for catching fraud rings: for example, multiple accounts controlled by one entity, or chains of transactions moving funds, which appear normal individually but are anomalous in aggregate.
GNNs in particular combine the best of both worlds: they leverage graph structure + attribute features together, learning meaningful representations of nodes/edges.<sup id="fnref:3:1"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">20</a></sup>
This is important when fraudsters deliberately make individual transactions look innocuous but cannot hide the relationships (e.g., reusing the same phone or IP address across many accounts).</p>

<p>Another advantage is in reducing false positives by providing context.
For example, a transaction with a new device might normally seem risky, but if that device has a long history with the same user and no links to bad accounts, a graph model can recognize it as low risk, avoiding a false alarm.
Industry reports indicate that adding graph features or GNN outputs has improved the Precision of fraud systems by filtering out cases that looked suspicious in isolation but were safe in context.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">21</a></sup></p>

<h2 id="weaknesses-2">Weaknesses</h2>

<aside class="quote">
    <em>“Current GNNs solutions mainly rely on offline batch training and inference, predicting fraudsters in real-time is crucial but challenging.”</em>
</aside>

<p>The biggest challenge is complexity in implementation and deployment.
Building and maintaining the graph data (a.k.a. the “graph pipeline”) is non-trivial.
Transactions arrive in a stream and must update the graph in real-time (e.g., adding new nodes, new edges).
Querying the graph for each new transaction’s neighborhood can be slow if not engineered well.
The inference itself can be heavy.
Running a GNN means loading a subgraph and doing matrix operations that are costlier than a simple ML model.
Consequently, many current GNN solutions operate in batch mode (offline).
There are limited reference architectures for real-time GNN serving, though this is an active development area.</p>

<p>Another issue is scalability.
Graphs of financial transactions or users can be enormous (millions of nodes, tens of millions of edges).
Training a full GNN on such a graph might not fit in memory or might be extremely slow without sampling techniques.
Some approaches use graph sampling or partitioning to handle this, or use a GNN only to generate features offline.</p>
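<p>Neighbor sampling is one such technique. Here is a plain-Python sketch (with a hypothetical adjacency map) that caps how many neighbors are expanded per hop, so the subgraph stays bounded regardless of node degree:</p>

```python
import random

def sample_subgraph(adj, seed_node, fanout=2, hops=2, rng=None):
    """k-hop neighborhood with at most `fanout` neighbors sampled per node per hop."""
    rng = rng or random.Random(0)
    visited = {seed_node}
    frontier = [seed_node]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            neighbors = [n for n in adj.get(node, []) if n not in visited]
            for n in rng.sample(neighbors, min(fanout, len(neighbors))):
                visited.add(n)
                next_frontier.append(n)
        frontier = next_frontier
    return visited

# Toy adjacency map; in practice this would be a graph store or database.
adj = {"a": ["b", "c", "d"], "b": ["e"], "c": ["f", "g"],
       "d": [], "e": [], "f": [], "g": []}
sub = sample_subgraph(adj, "a")
```

<p>With <code class="language-plaintext highlighter-rouge">fanout=2</code> and <code class="language-plaintext highlighter-rouge">hops=2</code>, at most 1 + 2 + 4 = 7 nodes are ever loaded, no matter how dense the full graph is; production GNN libraries implement the same idea at scale.</p>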

<p>GNNs can be hard to interpret (even more so than regular deep nets) since the features are aggregate of neighbors.
It can be challenging to explain to an analyst why a certain account was flagged: the reason might be “it’s connected to three other accounts that had chargebacks,” which is somewhat understandable, but the GNN’s learned weights on those connections are not human-interpretable beyond that concept.</p>

<h2 id="real-time-suitability-2">Real-Time Suitability</h2>

<p>Real-time deployment of graph-based models is at the cutting edge.
It is being done in industry but often with approximations.
One pragmatic solution is to use graph analytics to create additional features for a traditional model.
For example, compute features like “number of accounts sharing this card’s IP address that were fraud” or “average fraud score of neighbors” and update these in real-time, then let a gradient boosting model or neural network consume them.
This doesn’t require full GNN inference online, but it captures some of the graph’s insights.
Truly deploying a GNN in production for each event, however, requires a fast graph database or in-memory graph store.</p>
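<p>Sketched in plain Python (names hypothetical), such streaming graph features are just incrementally maintained counters: each event updates them, and the latest values are handed to the online model:</p>

```python
from collections import defaultdict

# ip_address -> {account: was_fraud}; kept up to date as events stream in.
ip_accounts = defaultdict(dict)

def update(ip, account, is_fraud):
    """Called once per transaction (and again when a fraud label arrives later)."""
    ip_accounts[ip][account] = is_fraud or ip_accounts[ip].get(account, False)

def graph_features(ip):
    """Cheap, always-fresh features for a gradient boosting model to consume."""
    accounts = ip_accounts[ip]
    return {
        "accounts_sharing_ip": len(accounts),
        "fraud_accounts_sharing_ip": sum(accounts.values()),
    }

update("10.0.0.1", "acct_1", False)
update("10.0.0.1", "acct_2", True)
feats = graph_features("10.0.0.1")
```

<p>In a real system these maps would live in a low-latency key-value store (with eviction and time windows), but the contract is the same: constant-time updates and lookups per event.</p>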

<p>AWS demonstrated a prototype using Amazon Neptune (graph DB) + DGL (Deep Graph Library) to serve GNNs predictions in real-time by querying a subgraph around the target node for each inference.<sup id="fnref:3:2"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">20</a></sup>
This kind of pipeline can risk score a transaction within seconds, which may be acceptable for certain use cases (e.g., online account opening fraud).
However, for high-frequency card transactions that need sub-second decisions, a full GNN might still be too slow today unless heavily optimized.</p>

<p>An alternative is what Nvidia suggests: use GNNs offline to produce node embeddings or risk scores, then feed those into a superfast inference system (like an XGBoost model or a rules engine) for the real-time decision.<sup id="fnref:4:1"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">21</a></sup>
This hybrid approach was shown to work at large scale, where GNN-based features improved detection by even a small percent (say 1% AUC gain), which for big banks translates to millions saved.</p>

<p>Lastly, maintaining graph models demands continuous updates as the graph evolves.
This is still manageable, as new data can be incrementally added, but one must watch for concept drift in graph structure.
For example, fraud rings forming new connectivity patterns.</p>

<h2 id="examples-2">Examples</h2>

<p>Representative examples of graph-based fraud detection:</p>

<ul>
  <li><strong>Blockchain networks:</strong> The <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.EllipticBitcoinDataset">Elliptic Bitcoin Dataset</a> is a graph of 203,769 transactions (nodes) with known illicit vs. licit labels. GNN models on this dataset achieved strong results, showing that analyzing the transaction network is effective for detecting illicit cryptocurrency flows.</li>
  <li><strong>Credit card networks:</strong> Researchers built a graph of credit card transactions and applied a GNN that outperformed a baseline MLP by leveraging connections (e.g., a card linked to a fraudulent merchant receives a higher fraud probability).</li>
  <li><strong>E-commerce networks:</strong> Companies like Alibaba and PayPal have internal systems modeling user networks. For example, accounts connected via a shared device or IP can indicate <a href="https://en.wikipedia.org/wiki/Sybil_attack">sybil attacks</a> or mule accounts. Graph algorithms identified clusters of accounts that share many attributes (forming fraud communities) which were then taken down as a whole.</li>
  <li><strong>Telecom identity fraud:</strong> Graphs connecting phone numbers, IDs, and addresses have been used to catch identity fraud rings. A famous case is detecting “bust-out fraud” in which a group of credit card accounts all max out and default: the accounts often share phone or address; linking them in a graph helps catch the ring before the bust-out completes.</li>
  <li><strong>Social networks:</strong> In social finance platforms or peer-to-peer payments, graph methods are used to detect money laundering or collusion by analyzing the network of transactions among users (e.g., unusually interconnected payment groups).</li>
</ul>

<p>Overall, graph-based methods, especially GNNs, represent a cutting-edge approach that can significantly enhance fraud detection by considering relational data.
As tooling and infrastructure improve (graph databases, streaming graph processing), I expect to see more real-time GNN deployments for fraud in the coming years.</p>

<h1 id="transformer-models">Transformer Models</h1>

<h2 id="transformers">Transformers</h2>

<p><a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)">Transformers</a> (originally developed for language processing) have revolutionized many domains, and they are now making inroads in fraud detection.
The key innovation of transformers is the <a href="https://www.ibm.com/think/topics/self-attention">self-attention mechanism</a>, which allows modeling long-range dependencies in sequences.
In the context of transaction data, transformers can analyze transaction sequences or sets of features in flexible ways.</p>

<p>Large pre-trained foundation models (akin to GPT or BERT, but for payments) are emerging.
In this case, a model is pre-trained on massive amounts of transaction data to learn general patterns, then fine-tuned for specific fraud tasks.
In effect, these models learn to “speak” transactional data.</p>

<blockquote>
  <p>“One of the most notable recent developments comes from Stripe’s <a href="https://www.linkedin.com/posts/gautam-kedia-8a275730_tldr-we-built-a-transformer-based-payments-activity-7325973745292980224-vCPR/">transformer-based payments foundation model.</a>
This is a large-scale self-supervised model trained on tens of billions of transactions to create embeddings of each transaction.
The idea is analogous to how LLMs work: to learn a high-dimensional embedding for a transaction that captures its essential characteristics and context.
Transactions with similar patterns end up with similar embeddings, e.g., transactions from the same bank or the same email domain cluster together in embedding space.
These embeddings serve as a universal representation that can be used for various tasks: fraud detection, risk scoring, identifying businesses in trouble, etc.
For the fraud use-case, Stripe reports a dramatic improvement: by feeding sequences of these transaction embeddings into a downstream classifier, they achieved an increase in detection rate for certain fraud attacks from 59% to 97% overnight.
In particular, they targeted “card testing fraud” (i.e., fraudsters testing stolen card credentials with small purchases), something that often hides in high-volume data.
The transformer foundation model was able to spot subtle sequential patterns of card testing that previous feature-engineered models missed, blocking attacks in real-time before they could do damage.”</p>
</blockquote>

<p>Researchers have applied Transformer encoders to tabular data as well.<sup id="fnref:18"><a href="#fn:18" class="footnote" rel="footnote" role="doc-noteref">22</a></sup>
For example, models such as <a href="https://github.com/lucidrains/tab-transformer-pytorch">TabTransformer</a> apply self-attention directly to structured (tabular) features.
They reported improved accuracy over MLPs, and even over tree models in some cases.<sup id="fnref:26"><a href="#fn:26" class="footnote" rel="footnote" role="doc-noteref">23</a></sup></p>

<p>The ability of transformers to focus attention on important features or interactions could be beneficial for high-dimensional transaction data.
For example, a transformer might learn to put high attention on the <code class="language-plaintext highlighter-rouge">device_id</code> feature when the <code class="language-plaintext highlighter-rouge">ip_address_country</code> is different from the <code class="language-plaintext highlighter-rouge">billing country</code>, effectively learning a rule-like interaction that would be hard for a linear model.</p>

<p>Transformers can also model cross-item sequences: one can feed a sequence of past transactions as a “sentence” into a transformer, where each transaction is like a token embedding (comprising attributes like amount, merchant category, etc.).
The transformer can then output a representation of the sequence or of the next transaction’s risk.
This is similar to an RNN’s use but with the advantage of attention capturing long-range dependencies (e.g., a pattern that repeats after 20 transactions).
There have been experiments where a transformer outperformed an LSTM on fraud sequence classification, thanks to its parallel processing and its ability to consider the relations among all transactions at once.</p>
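<p>To make the mechanism concrete, here is a toy single-head self-attention over a short “sentence” of transaction embeddings, in plain Python (identity Q/K/V projections; purely illustrative, not a production implementation):</p>

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """tokens: list of equal-length vectors, one per transaction in the sequence."""
    d = len(tokens[0])
    outputs = []
    for query in tokens:
        # Every transaction attends to every other, no matter how far apart.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        # Output is a weighted mix of all transaction embeddings.
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens))
                        for i in range(d)])
    return outputs

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy transaction embeddings
out = self_attention(seq)
```

<p>The key point versus an RNN is visible in the inner loop: the score between transaction 1 and transaction 20 is computed directly, not propagated through 19 intermediate steps.</p>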

<p>Another angle is using transformer models for entity resolution and representation in fraud. For instance, a transformer can be trained on the corpus of all descriptions or merchant names that a user has transacted with, thereby learning a “profile” of the user’s spending habits and detecting an out-of-profile transaction (similar to how language models detect an odd word in a sentence).
Additionally, <a href="https://en.wikipedia.org/wiki/BERT_(language_model)">BERT</a>-like models can be used on event logs or customer support chats to detect social engineering fraud attempts, though that’s adjacent to transaction fraud.</p>

<h2 id="foundation-models">Foundation models</h2>

<p><a href="https://en.wikipedia.org/wiki/Foundation_model">Foundation models</a> in fraud detection refer to large models trained on broad data that can then be adapted.
Besides Stripe’s payments model, other financial institutions are likely developing similar pre-trained embeddings.
For example, a consortium of banks could train a model on pooled transaction data (in a privacy-preserving way, or via <a href="https://en.wikipedia.org/wiki/Federated_learning">federated learning</a>) to get universal fraud features.</p>

<p>These large models may use transformers or other architectures, but the common theme is self-supervised learning: e.g., predicting a masked field of a transaction (<code class="language-plaintext highlighter-rouge">merchant_category</code>, or <code class="language-plaintext highlighter-rouge">amount</code>) from other fields, or predicting the next transaction given past ones.
Through such tasks, the model gains a deep understanding of normal transactional patterns.
When fine-tuned on a specific fraud dataset, it starts with a rich feature space and thus performs better with less training data than a model from scratch.
This is analogous to how image models pre-trained on <a href="https://www.image-net.org/">ImageNet</a> are fine-tuned for medical images with small datasets.</p>
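<p>Here is a sketch of how one such self-supervised training example could be constructed (field names hypothetical): mask a random field of a transaction record and keep its true value as the prediction target, analogous to masked-token prediction in BERT:</p>

```python
import random

FIELDS = ["amount", "merchant_category", "country", "hour"]

def make_masked_example(txn, rng=None):
    """Return (input with one field masked, masked field name, true value)."""
    rng = rng or random.Random(42)
    target_field = rng.choice(FIELDS)
    masked = dict(txn)
    masked[target_field] = "[MASK]"
    return masked, target_field, txn[target_field]

txn = {"amount": 12.5, "merchant_category": "grocery", "country": "SE", "hour": 14}
masked, field, label = make_masked_example(txn)
```

<p>No fraud labels are needed to generate millions of such examples, which is what lets the pre-training corpus be so large; the fraud labels only enter later, during fine-tuning.</p>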

<h2 id="strengths-3">Strengths</h2>

<p>Transformers and foundation models bring state-of-the-art pattern recognition to fraud.
They particularly shine in capturing complex interactions and sequential/temporal patterns.
The attention mechanism allows the model to focus on the most relevant parts of the input for each decision.
For fraud, this could mean focusing on certain past transactions or specific features that are indicative of risk in context.
This yields high detection performance, especially for “hard fraud” that evades simpler models.</p>

<p>Another strength is multitasking capabilities.
A large foundation model can be trained once and then used for various related tasks such as fraud, credit risk, or marketing predictions simply by fine-tuning or prompting, rather than maintaining separate models for each.
This “one model, many tasks” approach can simplify the system and leverage cross-task learning (e.g., learning what a risky transaction looks like might also help predict chargebacks or customer churn).</p>

<p>Moreover, transformers can handle heterogeneous data relatively easily.
One can concatenate different feature types and the self-attention will figure out which parts to emphasize.
For example, Stripe’s model encodes each transaction as a dense vector capturing numeric fields, categorical fields, etc., all in one embedding.</p>

<p>Finally, foundation models can enable few-shot or zero-shot fraud detection.
Imagine detecting a new fraud pattern that wasn’t in the training data.
A pre-trained model that has generally learned “how transactions usually look” might pick up the anomaly better than a model trained only on past known frauds.</p>

<h2 id="weaknesses-3">Weaknesses</h2>

<p>The obvious downsides are resource intensity and complexity.
Training a transformer on billions of transactions is a monumental effort, requiring distributed training, specialized hardware (TPUs/GPUs), and careful tuning.
This is typically only within reach of large organizations or collaborations.
In production, serving a large transformer in real-time can be challenging due to model size and latency.
Transformers can have millions of parameters, and even if each inference is 50-100ms on a GPU, at very high transaction volumes (thousands per second) this could be costly or slow without scaling out.
Techniques like <a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization">model quantization</a>, <a href="https://www.ibm.com/think/topics/knowledge-distillation">knowledge distillation</a>, or efficient transformer variants (e.g., <a href="https://huggingface.co/papers/2403.20041">Transformer Lite</a>) might be needed.</p>

<p>Another concern is explainability.
Even more so than a standard deep network, a giant foundation model is a black box.
Understanding its decisions requires advanced explainable AI methods, like interpreting attention weights or using SHAP on the embedding features, which is an active research area.
For regulated industries, one might still use a simpler surrogate model to justify decisions externally, while the transformer works under the hood.</p>

<p>Overfitting and concept drift are also concerns.
A foundation model might capture a lot of patterns, including some that are spurious or not causally related to fraud.
If fraudsters adapt, the model might need periodic re-training or fine-tuning with fresh data to unlearn outdated correlations.
For example, the Stripe model is self-supervised (no fraud labels in pre-training) which helps it generalize, but any discriminative fine-tuning on fraud labels will still need updating as fraud evolves.</p>

<h2 id="real-time-suitability-3">Real-Time Suitability</h2>

<p>Surprisingly, with the right engineering, even large transformers can be used in or near real-time.
For example, optimizing the embedding generation via GPU inference or caching mechanisms.
One strategy is to pre-compute embeddings for entities (like a card or user) so that only incremental computation is needed per new transaction.
Another strategy is two-stage scoring: use a smaller model to thin out events, then apply the heavy model to the most suspicious subset.
If real-time means sub-second (say &lt;500ms), a moderately sized transformer model on modern inference servers can fit that window, especially if batch processing a few transactions together to amortize overhead.
Cloud providers also offer accelerated inference endpoints (like AWS Inferentia chips or Azure’s ONNX runtime with GPU) to deploy large models with low latency.</p>
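<p>The two-stage idea can be sketched as a simple cascade (both scorers here are hypothetical stand-ins): a cheap model clears the bulk of traffic, and only events above its gate pay the latency cost of the heavy model:</p>

```python
def cheap_score(txn):
    # Stand-in for a small, fast model (e.g., gradient boosting on a few features).
    return 0.9 if txn["amount"] > 1000 else 0.1

def heavy_score(txn):
    # Stand-in for the large transformer; invoked only for suspicious events.
    return 0.95 if txn["new_device"] else 0.2

def score(txn, gate=0.5):
    s = cheap_score(txn)
    if s < gate:
        return s             # the vast majority of traffic stops here
    return heavy_score(txn)  # expensive path for the suspicious minority

low = score({"amount": 30, "new_device": False})
high = score({"amount": 5000, "new_device": True})
```

<p>The gate threshold trades recall for cost: lowering it sends more traffic to the heavy model, which is exactly the latency/accuracy dial operations teams want to control explicitly.</p>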

<p>That said, not every company will want to deploy a 100M+ parameter model for each transaction if a simpler model would do.
There is a trade-off between maximum accuracy and infrastructure cost/complexity.
In many cases, a foundation model could be used to periodically score accounts offline (to detect emerging fraud rings) while a simpler online model handles immediate decisions, combining their outputs.</p>

<h2 id="examples-3">Examples</h2>

<p>Use cases and research for transformers in fraud:</p>

<ul>
  <li><strong>Stripe’s Payments Foundation Model:</strong> A transformer-based model trained on billions of transactions, now used to embed transactions and feed into Stripe’s real-time fraud systems. It improved certain fraud detection rates from 59% to 97% and enabled detection of subtle sequential fraud patterns that were previously missed.</li>
  <li><strong>Tabular transformers:</strong> Studies like Yu et al.<sup id="fnref:18:1"><a href="#fn:18" class="footnote" rel="footnote" role="doc-noteref">22</a></sup> applied a transformer to the Kaggle credit card dataset and compared it to SVM, Random Forest, XGBoost, etc. The transformer achieved comparable or superior Precision/Recall, demonstrating that even on tabular data a transformer can learn effectively.</li>
  <li><strong>Sequence anomaly detection:</strong> Some works use transformers to model time series of transactions per account. A transformer may be trained to predict the next transaction features; if the actual next transaction diverges significantly, it could flag an anomaly. This is analogous to language model use (predict next word).</li>
  <li><strong>Cross-entity sequence modeling:</strong> Transformers can also encode sequences of transactions across entities, e.g., tracing a chain of transactions through intermediary accounts (useful in money laundering detection). The recent FraudGT model<sup id="fnref:27"><a href="#fn:27" class="footnote" rel="footnote" role="doc-noteref">24</a></sup> combines ideas from GNNs and transformers to handle transaction graphs with sequential relations.</li>
  <li><strong>Foundation models for documents and text in fraud:</strong> While not the focus here, note that transformers (BERT, GPT) are heavily used to detect fraud in textual data (e.g., scam emails, fraudulent insurance claim narratives, etc.). In a holistic fraud system, a foundation model might take into account not just the structured transaction info but also any unstructured data, like customer input or messages, to make a decision.</li>
</ul>

<p>Transformer-based models and foundation models represent the frontier of fraud detection modeling.
They offer unparalleled modeling capacity and flexibility, at the cost of high complexity.
Early results, especially from industry leaders, indicate they can substantially raise the bar on fraud detection performance when deployed thoughtfully.
As these models become more accessible (with open-source frameworks and possibly smaller specialized versions), more fraud teams will likely adopt them, particularly for large-scale, multi-faceted fraud problems where simpler models hit a ceiling.</p>

<h1 id="appendix">Appendix</h1>

<h2 id="public-datasets">Public Datasets</h2>

<p>Research in fraud detection often relies on a few key <strong>public datasets</strong> to evaluate models, given that real financial data is usually proprietary.</p>

<p>Below I summarize some of the most commonly used datasets, along with their characteristics:</p>

<ul>
  <li>
    <p><a href="https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud">:globe_with_meridians: Credit Card Fraud Detection (Kaggle, 2013)</a>: A classic dataset containing real European credit card transactions over two days. Its key characteristics are its extreme class imbalance (0.172% fraud) and anonymized features (28 PCA components), making it a standard benchmark for testing algorithms on imbalanced data.</p>
  </li>
  <li>
    <p><a href="https://www.kaggle.com/c/ieee-fraud-detection">:globe_with_meridians: IEEE-CIS Fraud Detection (Kaggle, 2019)</a>: A large, rich dataset from an e-commerce provider, released for a Kaggle competition. It features ~300 raw features (device info, card details, etc.), missing values, and a moderate imbalance (3.5% fraud). It is ideal for evaluating complex feature engineering and ensemble models for card-not-present fraud.</p>
  </li>
  <li>
    <p><a href="https://www.kaggle.com/datasets/ealaxi/paysim1">:globe_with_meridians: PaySim (Kaggle, 2016)</a>: A large-scale synthetic dataset that simulates mobile money transactions. It contains over 6 million transactions and is useful for testing model scalability in a controlled environment. Because it is synthetic, models may achieve unrealistically high performance.</p>
  </li>
  <li>
    <p><a href="https://www.kaggle.com/datasets/ellipticco/elliptic-data-set">:globe_with_meridians: Elliptic Bitcoin Dataset (Kaggle, 2019)</a>: A temporal graph of over 200,000 Bitcoin transactions, where nodes are transactions and edges represent fund flows. It is a key benchmark for evaluating graph-based fraud detection methods like GNNs. Only a small fraction of nodes are labeled as illicit, presenting a challenge.</p>
  </li>
</ul>

<p>⚠️ Due to high imbalance, accuracy is not informative (e.g., the credit card dataset has 99.8% non-fraud, so a trivial model gets 99.8% accuracy by predicting all non-fraud!). Hence, papers report metrics like AUC-ROC, Precision/Recall, or F1-score. For instance, on the Kaggle credit card data, an AUC-ROC around 0.95+ is achievable by top models, and PR AUC is much lower (since base fraud rate is 0.172%). In IEEE-CIS data, top models achieved about 0.92–0.94 AUC-ROC in the competition. PaySim being synthetic often yields extremely high AUC (sometimes &gt;0.99 for simple models) since patterns might be easier to learn. When evaluating on these sets, it’s crucial to use proper cross-validation or the given train/test splits to avoid overfitting (particularly an issue with the small Kaggle credit card data).</p>
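<p>The warning about accuracy is easy to verify numerically. With illustrative numbers close to the Kaggle class ratio, a model that predicts “non-fraud” for everything scores 99.8% accuracy yet catches zero fraud:</p>

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 1,000 transactions, 2 fraudulent (~0.2%, roughly the Kaggle class ratio).
y_true = [True] * 2 + [False] * 998
y_trivial = [False] * 1000          # "predict non-fraud for everything"

acc = accuracy(y_true, y_trivial)
prec, rec = precision_recall(y_true, y_trivial)
```

<p>Precision and recall both collapse to zero for the trivial model, which is why papers on these datasets report PR AUC or F1 instead of raw accuracy.</p>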

<p>Overall, these datasets have driven a lot of research.
However, one should be cautious when extrapolating results from them to real-world performance.
Real production data can be more complex (concept drift, additional features, feedback loops).
Nonetheless, the above datasets provide valuable benchmarks to compare algorithms under controlled conditions.</p>

<h1 id="external-resources">External Resources</h1>

<ul>
  <li><a href="https://github.com/safe-graph/graph-fraud-detection-papers/"><i class="fab fa-github"></i></a> <a href="https://github.com/safe-graph/graph-fraud-detection-papers/">Awesome Graph Fraud Detection Papers</a></li>
  <li><a href="https://github.com/safe-graph/DGFraud"><i class="fab fa-github"></i></a> <a href="https://github.com/safe-graph/DGFraud">DGFraud: A Deep Graph-based Toolbox for Fraud Detection</a></li>
  <li><a href="https://github.com/junhongmit/FraudGT"><i class="fab fa-github"></i></a> <a href="https://github.com/junhongmit/FraudGT">FraudGT: A Simple, Effective, and Efficient Graph Transformer for Financial Fraud Detection</a></li>
</ul>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>Oztas, Berkan, et al. “<em><a href="https://www.sciencedirect.com/science/article/pii/S0167739X24002607">Transaction monitoring in anti-money laundering: A qualitative analysis and points of view from industry.</a></em>” Future Generation Computer Systems (2024). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>G. Praspaliauskas, V. Raman (2023). <em>“<a href="https://aws.amazon.com/blogs/machine-learning/real-time-fraud-detection-using-aws-serverless-and-machine-learning-services/">Real-time fraud detection using AWS serverless and machine learning services</a>.</em> AWS Machine Learning Blog – outlines a serverless architecture using Amazon Kinesis, Lambda, and Amazon Fraud Detector for near-real-time fraud prevention.” <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:34">
      <p>Desai, Ajit, Anneke Kosse, and Jacob Sharples. “<em><a href="https://www.sciencedirect.com/science/article/pii/S2405918825000157">Finding a needle in a haystack: a machine learning framework for anomaly detection in payment systems.</a></em>” The Journal of Finance and Data Science 11 (2025): 100163. <a href="#fnref:34" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:28">
      <p>Ring collusion is a form of coordinated behavior where multiple accounts, potentially belonging to different individuals or groups, engage in fraudulent activities that benefit each other. <a href="#fnref:28" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:16">
      <p>For a Python library dedicated to handling imbalanced datasets and techniques, see <em><a href="https://imbalanced-learn.org/stable/">imbalanced-learn</a></em>, which provides tools for oversampling, undersampling, and more. <a href="#fnref:16" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:29">
      <p>Service Level Agreement (SLA) is a commitment between a service provider and a client that outlines the expected level of service, including performance metrics and response times. <a href="#fnref:29" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:17">
      <p>Afriyie, Jonathan Kwaku, et al. <em>“<a href="https://doi.org/10.1016/j.dajour.2023.100163">A supervised machine learning algorithm for detecting and predicting fraud in credit card transactions.</a></em> Decision Analytics Journal 6 (2023): 100163.” <a href="#fnref:17" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:19">
      <p>Onyeoma, Chidinma Faith, et al. “<em><a href="https://ieeexplore.ieee.org/abstract/document/10838456">Credit Card Fraud Detection Using Deep Neural Network with Shapley Additive Explanations</a>.</em>” 2024 International Conference on Frontiers of Information Technology (FIT). IEEE, 2024. <a href="#fnref:19" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:13">
      <p>Kandi, Kianeh, and Antonio García-Dopico. “<em><a href="https://www.mdpi.com/2504-4990/7/1/20">Enhancing Performance of Credit Card Model by Utilizing LSTM Networks and XGBoost Algorithms.</a></em>” Machine Learning and Knowledge Extraction 7.1 (2025): 20. <a href="#fnref:13" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:13:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:33">
      <p>Cherif, Asma, et al. “<em><a href="https://www.sciencedirect.com/science/article/pii/S1319157824000922">Encoder–decoder graph neural network for credit card fraud detection.</a></em>” Journal of King Saud University-Computer and Information Sciences 36.3 (2024). <a href="#fnref:33" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:21">
      <p>Alshameri, Faleh, and Ran Xia. “<em><a href="https://www.sciopen.com/article/10.26599/BDMA.2023.9020035">An Evaluation of Variational Autoencoder in Credit Card Anomaly Detection</a>.</em>” Big Data Mining and Analytics (2024). <a href="#fnref:21" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:22">
      <p>Charitou, Charitos, Artur d’Avila Garcez, and Simo Dragicevic. “<em><a href="https://ieeexplore.ieee.org/document/9206844">Semi-supervised GANs for fraud detection</a>.</em>” 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020. <a href="#fnref:22" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:32">
      <p>Huang, Huajie, et al. “<em><a href="https://www.sciencedirect.com/science/article/abs/pii/S156849462400142X">Imbalanced credit card fraud detection data: A solution based on hybrid neural network and clustering-based undersampling technique.</a></em>” Applied Soft Computing 154 (2024). <a href="#fnref:32" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:24">
      <p>Branco, Bernardo, et al. “<em><a href="https://doi.org/10.1145/3394486.3403361">Interleaved sequence RNNs for fraud detection</a>.</em>” Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery &amp; data mining. 2020. <a href="#fnref:24" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:25">
      <p>Masihullah, Shaik, et al. “<em><a href="https://link.springer.com/chapter/10.1007/978-3-031-14463-9_10">Identifying fraud rings using domain aware weighted community detection</a>.</em>” International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Cham: Springer International Publishing, 2022. <a href="#fnref:25" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:35">
      <p>Boyapati, Mallika, and Ramazan Aygun. “<em><a href="https://www.sciencedirect.com/science/article/abs/pii/S0893608024008554">BalancerGNN: Balancer Graph Neural Networks for imbalanced datasets: A case study on fraud detection.</a></em>” Neural Networks 182 (2025): 106926. <a href="#fnref:35" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:23">
      <p>Motie, Soroor, and Bijan Raahemi. “<em><a href="https://doi.org/10.1016/j.eswa.2023.122156">Financial fraud detection using graph neural networks: A systematic review</a>.</em>” Expert Systems with Applications (2024). <a href="#fnref:23" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:30">
      <p>Shih, Yi-Cheng, et al. “<em><a href="https://www.sciencedirect.com/science/article/abs/pii/S0957417424020785">Fund transfer fraud detection: Analyzing irregular transactions and customer relationships with self-attention and graph neural networks.</a></em>” Expert Systems with Applications. 2025. <a href="#fnref:30" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:31">
      <p>Tong, Guoxiang, and Jieyu Shen. “<em><a href="https://www.sciencedirect.com/science/article/abs/pii/S1568494623010025">Financial transaction fraud detector based on imbalance learning and graph neural network.</a></em>” Applied Soft Computing 149 (2023): 110984. <a href="#fnref:31" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p>Jian Zhang et al. (2022). <em>“<a href="https://aws.amazon.com/blogs/machine-learning/build-a-gnn-based-real-time-fraud-detection-solution-using-amazon-sagemaker-amazon-neptune-and-the-deep-graph-library/">Build a GNN-based real-time fraud detection solution using Amazon SageMaker, Amazon Neptune, and DGL</a>.</em> AWS ML Blog – explains how graph neural networks can be served in real-time for fraud detection, noting challenges in moving from batch to real-time GNN inference.” <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:4">
      <p>Summer Liu et al. (2024). <em>“<a href="https://developer.nvidia.com/blog/supercharging-fraud-detection-in-financial-services-with-graph-neural-networks/">Supercharging Fraud Detection in Financial Services with GNNs</a>.</em> NVIDIA Technical Blog.” <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:18">
      <p>Yu, Chang, et al. “<em><a href="https://arxiv.org/pdf/2406.03733v2">Credit Card Fraud Detection Using Advanced Transformer Model</a>.</em>” 2024 IEEE International Conference on Metaverse Computing, Networking, and Applications (MetaCom). IEEE, 2024. <a href="#fnref:18" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:18:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:26">
      <p>Krutikov, Sergei, et al. “<em><a href="https://arxiv.org/html/2405.13692v1">Challenging Gradient Boosted Decision Trees with Tabular Transformers for Fraud Detection at Booking.com</a>.</em>” arXiv preprint arXiv:2405.13692 (2024). <a href="#fnref:26" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:27">
      <p>Lin, Junhong, et al. “<em><a href="https://dl.acm.org/doi/abs/10.1145/3677052.3698648">FraudGT: A Simple, Effective, and Efficient Graph Transformer for Financial Fraud Detection</a>.</em>” Proceedings of the 5th ACM International Conference on AI in Finance. 2024. <a href="#fnref:27" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="ai" /><summary type="html"><![CDATA[Financial transaction fraud is a pervasive problem costing institutions and customers billions annually. This survey reviews the current state-of-the-art in real-time transaction fraud detection, spanning both academic research and industry adopted solutions.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2025/2025-04-03/justitiabrunnen_cover.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2025/2025-04-03/justitiabrunnen_cover.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Building GenAI Applications Today</title><link href="https://www.cesarsotovalero.net/blog/building-genai-applications-today.html" rel="alternate" type="text/html" title="Building GenAI Applications Today" /><published>2024-11-17T00:00:00-08:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/building-genai-applications-today</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/building-genai-applications-today.html"><![CDATA[<figure class="badge"><a href="https://commons.wikimedia.org/wiki/File:NMA.0039209_Emigration._Svenskar_i_Amerika._Guldvaskare_vid_Black_Foots_River,_Montana.jpg"><img src="/img/badges/Svenskar_i_Amerika._Guldvaskare_vid_Black_Foots_River,_Montana.jpg" style="width:140px;max-width:100%;" alt="badge" /></a></figure>

<p>The AI fever has been around for a while now (four years, by my count).
It reminds me of the <a href="https://en.wikipedia.org/wiki/Gold_rush">gold rush</a> and subsequent <a href="https://en.wikipedia.org/wiki/California_Dream">Californian Dream</a> from the 19th century.
The new “happy idea” is that today any single individual can get rich almost instantly by leveraging the novel AI-based capabilities the right way.
The potential opportunities to apply Generative AI (GenAI) for profit span almost all areas of development, from pure arts to fundamental science, from medical diagnosis to engineering, and so on.
<a href="https://en.wikipedia.org/wiki/Intelligent_agent">AI agents</a> can now <a href="https://www.sciencedirect.com/science/article/pii/S2352847823001557">generate</a> new scientific hypotheses.
Yet, as history seems to repeat itself, every time OpenAI releases a new model offering more powerful capabilities, many solo entrepreneurs and small startups flounder and fade.
For instance, I’ve attended many hackathons where teams have tried to build AI-powered solutions for problems that could be solved with a simple rule-based system.
There are many reasons why GenAI projects fail, and (to me) far less clear reasons why some succeed.
So, if you’re a developer or entrepreneur itching to dive into the GenAI space for fun or profit, this post is my step back to reflect on what (I believe) are good fits for this technology and what <em>shouldn’t</em> be built with it.
Let’s dive in!</p>

<figure class="jb_picture">
  



<img width="100%" style="border: 0px solid #808080;" src="/assets/resized/twisted-revolver-640x373.jpg" alt="Time to stop shooting flies with AI-powered bullets" data-srcset="/assets/resized/twisted-revolver-640x373.jpg 640w,/assets/resized/twisted-revolver-768x448.jpg 768w,/assets/resized/twisted-revolver-1024x597.jpg 1024w,/assets/resized/twisted-revolver-1366x797.jpg 1366w,/assets/resized/twisted-revolver-1600x933.jpg 1600w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
   &#169; Stop shooting flies with AI-powered bullets; sometimes just use a flyswatter instead. Picture of a sculpture located in <a href="https://maps.app.goo.gl/9tWekqJTscsLmkgd9">Hötorgshallen 50</a>, in Stockholm city.
  </figcaption>
</figure>

<h1 id="what-seems-to-work">What Seems to Work</h1>

<p>I’m not an entrepreneur myself (yet), but I’m an enthusiast of the startup ecosystem overall. I’ve listened to the <a href="https://www.indiehackers.com/podcasts">Indie Hackers podcast</a>, read <a href="https://techcrunch.com/">TechCrunch</a>, and still check cool launches on <a href="https://gumroad.com">Gumroad</a> from time to time. Over the years, I’ve seen many startups succeed, and many fail, for diverse reasons: overhyped tech, unrealistic business models, or poor timing (e.g., Pets.com, Theranos, or Rdio).</p>

<p>Obviously, whenever there is a new technology, there is a wave of startups trying to profit from it.
Do you remember the blockchain hype 10 years ago?
I can’t forget about the many crypto millionaires that emerged from it (especially because I wasn’t one of them).</p>

<p>I’m now convinced that GenAI is a powerful tech (much broader than blockchain) that can <a href="./blog/surviving-the-ai-revolution-as-a-software-engineer">transform whole industries</a> and create new ones.
GenAI will first optimize existing processes and then transform them.
First, everything related to handling paperwork and repetitive tasks is going to be completely automated (and that’s actually good, because these boring tasks were never meant for humans in the first place).
Second, GenAI creates new opportunities for everyone, including artists and creators, to express themselves in ways that were not possible before.
Third, GenAI helps us understand the world better by analyzing data in an unprecedented manner, generating new hypotheses, and planning new experiments, thus propelling the scientific discoveries that ultimately transform our lives for the better.</p>

<h2 id="success-factors">Success Factors</h2>

<p>The success of any GenAI project is extremely hard to predict.
However, the ones that succeed share some common traits that set them apart from the noise.
Below are factors to keep in mind when evaluating whether an AI idea is worth your time and resources.</p>

<p><strong>Clear Value Proposition.</strong>
Before jumping into the development of any AI system, ask yourself: Does this app solve a real-world problem?
And is it doing so in a way that’s better, faster, or cheaper than existing solutions?
The best AI ideas are those that tackle well-defined issues, making people’s lives easier or helping businesses streamline operations.
Without a compelling value proposition, even the most advanced AI will struggle to gain traction.</p>

<p><strong>Scalability.</strong>
While niche markets can be tempting, they are often not the best starting point for an AI system unless you have a very clear, long-term vision.
Aim for ideas that can scale, reaching wider audiences or applying across various industries.
Scalability isn’t just about expanding your user base.
It’s about adapting the technology to different use cases, which will increase the likelihood of long-term success.</p>

<p><strong>Ethical Design.</strong>
Every AI system comes with a responsibility: to design it with ethics in mind. Consider the potential negative uses of your AI. While technologies like deepfake generators show immense creative potential, they also pose significant risks for misuse. It’s essential to build safeguards and establish ethical boundaries to prevent your AI from being used for harmful purposes.</p>

<h2 id="examples">Examples</h2>

<p>Below are three applications of GenAI today that I would like to see more of:</p>

<ol>
  <li>
    <p><strong>Personalized Financial Advisors:</strong> Apps that analyze spending habits, investment opportunities, and financial health using AI are gold. These tools cater to the rising demand for financial literacy and can scale personalized advice to millions (i.e., tangible ROI for users). For example, <a href="https://www.wealthfront.com/">Wealthfront</a>’s AI-driven financial planning features are transforming personal finance management.</p>
  </li>
  <li>
    <p><strong>Content Creation Tools:</strong> GenAI is a boon for creators. Platforms that assist with scriptwriting, graphic design, or even video editing are gaining more and more traction because they save time, amplify human creativity, and have clear market demand. For example, <a href="https://www.adobe.com/products/firefly">Adobe’s Firefly</a> enhances creativity by automating repetitive tasks.</p>
  </li>
  <li>
    <p><strong>AI in Healthcare:</strong> There’s a growing emphasis on preventive health solutions that are scalable and cost-effective. Tools like AI-powered symptom checkers or personalized fitness coaches empower users to manage their health better. For example, <a href="https://www.myfitnesspal.com/">MyFitnessPal</a> leverages AI for smarter diet recommendations.</p>
  </li>
</ol>

<p>While we are still in the early days of GenAI, it seems to me that what works best is still to focus on solving real-world problems in a scalable way.
If you’ve worked on a GenAI product, then you don’t need to be reminded that, no matter the tech, it’s <em>the product</em> that matters most.</p>

<h1 id="what-to-avoid">What to Avoid</h1>

<p>Most GenAI ideas are not worth pursuing.
Numerous AI-driven apps fail due to over-saturation and a lack of resources to compete against the tech giants that already have customer trust and a large user base.
Others fail due to technical glitches, poor product pivots, or simply because they don’t solve a real problem.
Yet another set fails because of premature obsolescence (e.g., a friend built a now-defunct startup four years ago around simplifying email through summarization).
Failure is hard to predict when one truly believes in a project.</p>

<h2 id="lessons-from-ai-failures">Lessons from AI Failures</h2>

<p>Building a successful AI system isn’t just about using the best foundational models and feeding them with data.
It’s about learning from the mistakes of others.
A few notable failures provide crucial insights into what can go wrong, and what to avoid in the development process.
Below is a small compilation.</p>

<p><strong>Bias Is The Biggest Enemy.</strong>
One of the most dangerous pitfalls in AI is <a href="https://en.wikipedia.org/wiki/Algorithmic_bias">bias</a>, which can unintentionally emerge through data or algorithmic design.
GenAI systems that perpetuate bias can quickly damage a company’s reputation, especially in sensitive industries.
A classic example of this is <a href="https://en.wikipedia.org/wiki/Tay_(chatbot)">Microsoft’s Tay chatbot</a>, which was launched in 2016 to interact with users on Twitter.
Unfortunately, it was quickly hijacked by biased and offensive content due to insufficient safeguards.
Current AI systems must be designed with bias mitigation in mind, ensuring that they are fair and ethical.</p>

<p><strong>Privacy Is Non-Negotiable.</strong>
When it comes to AI, privacy is not a feature, it’s a fundamental requirement.
Especially in sectors like healthcare and finance, mishandling sensitive user data can lead to significant regulatory penalties and, perhaps worse, the loss of customer trust.
Companies must prioritize data protection, making sure that systems are secure and that users’ privacy is never compromised.
This includes adhering to global privacy standards like <a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a> or <a href="https://oag.ca.gov/privacy/ccpa/regs">CCPA</a>, and being transparent about how user data is collected and used.</p>

<p><strong>Overpromising.</strong>
While the potential of AI can be captivating, it’s essential to remain grounded and transparent about the technology’s current capabilities.
Many companies have fallen into the trap of hyping up their AI solutions, only to disappoint users with results that fall short of expectations.
For example, <a href="https://www.businessinsider.com/healthcare-startup-forward-shutdown-carepod-adrian-aoun-2024-11">Forward Health</a> promised a futuristic healthcare experience with its CarePods but failed to deliver on its ambitious vision.
To avoid overpromising, companies should set realistic goals, communicate openly with users, and focus on incremental improvements rather than grandiose claims.</p>

<p>In addition to the primary lessons above, there are a few more pitfalls that commonly arise in AI development.
These may not always be immediately obvious but are equally critical in creating a successful AI application.</p>

<p><strong>Using GenAI When You Don’t Need It.</strong>
GenAI is incredibly powerful, but it’s not a one-size-fits-all solution. As the saying goes, “Not everything is a nail.” Sometimes, simpler solutions like linear programming or basic algorithms can solve the problem more efficiently and cost-effectively. For example, optimizing energy consumption with a basic schedule based on electricity prices is far more effective and cheaper than running the same data through a complex language model (as noted by AI expert Chip Huyen). Before opting for GenAI, carefully consider if it’s the best approach.</p>

<p><strong>Confusing Bad Product with Bad AI.</strong>
A common misconception is that poor user experiences or ineffective AI solutions are the result of faulty algorithms. In reality, many issues arise from poor product design or a lack of attention to the user interface (UI).
For example, a chatbot might function perfectly but fail to engage users simply because they don’t know how to interact with it.
A well-designed UX can transform even a mediocre AI into something genuinely useful.
<a href="https://www.intuit.com/intuitassist/">Intuit’s chatbot</a>, for instance, was able to enhance the overall experience through smart design choices, demonstrating that good design can elevate AI performance.</p>

<p><strong>Lack of Model Customization.</strong>
While pre-trained open-source models offer a quick and easy starting point, relying on them without tailoring them to your specific use case is a mistake.
Using these models without fine-tuning them is akin to trying to run a marathon in flip-flops.
It might work to some extent, but you’re not going to achieve optimal results.
Customizing and fine-tuning models allows them to meet the specific needs of your application, making the difference between an average app and one that delivers real value to users.</p>
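<p>To see mechanically what “tailoring a model” means, here is a toy stand-in (deliberately not a real LLM workflow): start from weights learned elsewhere and continue training on your own data. The perceptron, the “pre-trained” weights, and the four-example domain dataset below are all fabricated for illustration.</p>

```python
# Toy illustration of fine-tuning: a "pre-trained" linear classifier is
# further trained on a small domain-specific dataset. The workflow is the
# point, not the model: reuse learned weights, then adapt them.

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train(w, data, epochs=20, lr=0.1):
    # plain perceptron updates, standing in for gradient-based fine-tuning
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

def accuracy(w, data):
    return sum(predict(w, x) == y for x, y in data) / len(data)

pretrained = [0.5, -0.5, 0.1]  # weights "learned" on generic data
domain_data = [                # fabricated domain examples: ([bias, f1, f2], label)
    ([1, 1, 1], 1), ([1, 2, 0], 1), ([1, -1, 2], 0), ([1, 0, 3], 0),
]

before = accuracy(pretrained, domain_data)                           # generic: 0.25
after = accuracy(train(list(pretrained), domain_data), domain_data)  # adapted: 1.0
print(before, after)
```

<p>Real fine-tuning swaps the perceptron for a foundation model and the four examples for a curated domain corpus, but the principle is the same: the off-the-shelf weights are the starting point, not the product.</p>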

<h2 id="examples-1">Examples</h2>

<p>Here are three GenAI development directions that I would advise against pursuing:</p>

<ol>
  <li>
    <p><strong>Overhyped General-Purpose Chatbots:</strong> Chatbots are ubiquitous, and most fail to differentiate themselves. Unless your bot is solving a specific problem better than existing solutions, it’s just another chatbot. For example, <a href="https://www.analyticsvidhya.com/blog/2023/01/top-5-failures-of-ai-till-date-reasons-solution/">Microsoft’s Tay chatbot</a> famously spiraled out of control due to poor moderation and biased training data.</p>
  </li>
  <li>
    <p><strong>Poorly Thought-Out Healthcare Applications:</strong> While healthcare is promising, it’s also highly regulated. Products that don’t comply with data protection laws or fail to address ethical concerns will face backlash. For example, <a href="https://www.businessinsider.com/healthcare-startup-forward-shutdown-carepod-adrian-aoun-2024-11">Forward Health’s CarePods</a> failed due to technical glitches and poor location choices.</p>
  </li>
  <li>
    <p><strong>Tools Targeting Hyper-Niche Markets:</strong> Niche markets often lack the scale needed to make an app profitable. AI tools for ultra-specific tasks, like “AI for knitting pattern generation,” may not justify the investment. For example, <a href="https://selldone.com/blog/major-startup-failures-2024-824">Tally’s credit management platform</a> collapsed due to limited scalability and poor product pivots.</p>
  </li>
</ol>

<p>In summary, avoid building AI apps that don’t solve a real problem, are overhyped, or target markets that are too small to scale.</p>

<h1 id="final-thoughts">Final Thoughts</h1>

<p>The AI gold rush is far from over. Many GenAI startups are rushing headfirst into this space, hoping to strike gold, but failing to dig deep enough to find the right nuggets. In the future, the winners will be those who embrace simplicity, scale, and ethics, while staying grounded in real-world needs. To make AI a lasting success, focus on creating products that truly solve problems for the user, rather than just jumping on the bandwagon. Who knows? Maybe your next AI project will end up as the one everyone’s talking about. Or maybe it will be yet another “failed startup” story.</p>

<p>You will never know until you try.</p>

<h1 id="external-resources">External Resources</h1>

<ul>
  <li><a href="https://www.analyticsvidhya.com/blog/2023/01/top-5-failures-of-ai-till-date-reasons-solution/">:globe_with_meridians: Analytics Vidhya: Top 5 AI Failures</a></li>
  <li><a href="https://www.businessinsider.com/healthcare-startup-forward-shutdown-carepod-adrian-aoun-2024-11">:globe_with_meridians: Business Insider: Inside Forward’s Failure</a></li>
  <li><a href="https://selldone.com/blog/major-startup-failures-2024-824">:globe_with_meridians: Selldone: Major Startup Failures 2024</a></li>
  <li><a href="https://huyenchip.com/2025/01/16/ai-engineering-pitfalls">:globe_with_meridians: Chip Huyen: Common pitfalls when building generative AI applications</a></li>
  <li><a href="https://huyenchip.com/llama-police">:globe_with_meridians: Chip Huyen: List of Open Source LLM Tools</a></li>
  <li><a href="https://cloud.google.com/transform/101-real-world-generative-ai-use-cases-from-industry-leaders">:globe_with_meridians: 101 Real-World Generative AI Use Cases from Industry Leaders</a></li>
</ul>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="ai" /><summary type="html"><![CDATA[Generative AI has taken the world by storm, offering endless opportunities for innovation. But as with any new technology, there are plenty of pitfalls to avoid. In this post, I dive into the current state of AI startups, shedding light on what works, what doesn't, and why. If you're looking to build something with AI, let’s step back and reflect on where it makes sense to innovate and where it doesn't. From avoiding overhyped general-purpose chatbots to understanding the limitations of AI in niche markets, this post offers practical insights to help you navigate the AI hype more effectively.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2024/2024-11-17/twisted-revolver_cover.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2024/2024-11-17/twisted-revolver_cover.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Why GenAI Will NOT Replace Software Engineers</title><link href="https://www.cesarsotovalero.net/blog/why-genai-will-not-replace-software-engineers-just-yet.html" rel="alternate" type="text/html" title="Why GenAI Will NOT Replace Software Engineers" /><published>2024-08-19T00:00:00-07:00</published><updated>2026-04-02T05:57:52-07:00</updated><id>https://www.cesarsotovalero.net/blog/why-genai-will-not-replace-software-engineers-just-yet</id><content type="html" xml:base="https://www.cesarsotovalero.net/blog/why-genai-will-not-replace-software-engineers-just-yet.html"><![CDATA[<p>The field of Generative Artificial Intelligence (GenAI) has made incredible strides in recent years, particularly in <a href="https://ml4code.github.io/papers">source code analysis and generation</a>.
The excitement is well justified.
Today’s GenAI systems can perform common software development tasks more efficiently than many human engineers.</p>

<aside class="youtube">
        <a href="https://www.youtube.com/watch?v=kw7fvHf4rDw"><div class="box">
        <img src="https://i.ytimg.com/vi/kw7fvHf4rDw/mqdefault.jpg" alt="YouTube video #kw7fvHf4rDw" />
        <div class="play">
          <img src="/img/icons/youtube-play20px.svg" alt="Play Video" class="youtube-logo" />
        </div>
        </div></a>
        <div>Why AI will NOT replace Software Engineers (for now); 7 December 2024.</div></aside>

<p>I’ve experienced the power of this technology firsthand.
<a href="https://chatgpt.com/?model=gpt-4o">ChatGPT-4o</a> is truly impressive at fixing bugs, refactoring code, and even adding entirely new features to my software projects.
However, the excitement about speeding up software development isn’t universally shared across the industry.
Many developers are concerned that GenAI systems will soon replace them, rendering their skills obsolete.
With its powerful capabilities, GenAI seems poised to disrupt the software development job market, potentially taking over a wide range of engineering roles.</p>

<p>Over the past few months, I’ve attended several academic and <a href="https://www.linkedin.com/posts/cesarsotovalero_ai-sweden-activity-7255582747417571328-oGDN">industrial conferences</a> and read a plethora of <a href="https://ml4code.github.io/papers">research papers on this topic</a>.
Yet, even among world-class experts, there’s no clear consensus on where we’re headed.
Amid the hype, massive AI investments, and pervasive <a href="https://en.wikipedia.org/wiki/Fear_of_missing_out">fear of missing out</a>, confusion and uncertainty about AI’s current and future impact on the software industry abound.</p>

<p>In this blog post, I’ll share my perspective on the capabilities of today’s GenAI systems to see how they measure up to the demands of real-world software development.</p>

<p>Spoiler: I’m on the optimists’ side. So let’s dive in!</p>

<figure class="jb_picture">
  



<img width="100%" style="border: 0px solid #808080;" src="/assets/resized/armillary-sphere-640x373.jpg" alt="Photo of a garden sundial in the form of an armillary sphere in Skansen, Stockholm, Sweden." data-srcset="/assets/resized/armillary-sphere-640x373.jpg 640w,/assets/resized/armillary-sphere-768x448.jpg 768w,/assets/resized/armillary-sphere-1024x597.jpg 1024w,/assets/resized/armillary-sphere-1366x797.jpg 1366w,/assets/resized/armillary-sphere-1600x933.jpg 1600w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    &#169; Since ancient times, we've tried to anticipate the future with certain levels of pessimism, and (ironically) we've been proven wrong most of the time. Photo of a garden sundial in the form of an <a href="https://en.wikipedia.org/wiki/Armillary_sphere">armillary sphere</a> in <a href="https://maps.app.goo.gl/61UCZPas6UNWWWS67">Skansen</a>, Stockholm, Sweden.
  </figcaption>
</figure>

<h1 id="behind-the-sensational-headlines">Behind The Sensational Headlines</h1>

<p>If you follow the latest <a href="https://tldr.tech/">tech newsletters</a> (as I do), you’ll find that the media seems eager to push out sensational news declaring that GenAI is set to take over a wide range of software engineering jobs.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> For those who make a living <del>pretending to write</del> writing source code, headlines like <em>“<a href="https://brainhub.eu/library/software-developer-age-of-ai">Is There a Future for Software Engineers?</a>”</em> make it easy to feel like the ground is shifting beneath our feet.</p>

<p>The concern is understandable.
Advancements in <a href="https://link.springer.com/article/10.1007/s11042-022-13428-4">natural language processing (NLP)</a> have powered GenAI systems to tackle tasks once deemed uniquely human.
These tasks not only include writing code but also other hardcore software engineering activities, such as gathering requirements, creating documentation, and even identifying user needs.</p>

<p>Consequently, we’re starting to develop a deep fear of the possibility that our professional careers might soon be eclipsed by machines capable of churning out entire software applications faster than we can.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></p>

<p>So, if you work in tech, you’re probably wondering:</p>

<ol>
  <li>Is the career I’ve worked so hard to build at risk?</li>
  <li>Do GenAI tools truly grasp the complexities of a software project?</li>
  <li>Can GenAI tools interpret vague business requirements, understand user needs, and make informed trade-offs like a human engineer?</li>
</ol>

<h1 id="the-catch">The Catch</h1>

<p>The idea of AI pushing out software engineers is not new.
For instance, in 2022 an article in <em>Nature</em> <a href="https://www.nature.com/articles/d41586-022-04383-z">predicted this very phenomenon</a>.
The authors made it clear that <a href="https://openai.com/index/chatgpt/">ChatGPT</a> and <a href="https://alphacode.deepmind.com/">AlphaCode</a> would replace software engineers in the coming years.
But let’s be honest, many companies have tried to automate software development in the past, <a href="https://www.monperrus.net/martin/startups-machine-learning-code">and most of them have failed</a>.<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>
So, this kind of prediction is nothing new.</p>

<blockquote>
  <p>“Most AI startups and companies are primarily focused on creating <em>wrappers</em> around existing AI models without offering real value. I believe most of them will fail in the coming years.”</p>
</blockquote>

<p>Let’s consider the following scenario: a single person without any programming experience today can leverage an AI-powered IDE, like the popular <a href="https://www.cursor.com/">Cursor AI</a>, to develop a full mobile app for personal finance management.
All from scratch!</p>

<p>Impressive, right?</p>

<p>But there is a catch.
The developed app likely doesn’t introduce any groundbreaking innovation because its building blocks (libraries, APIs, and templates) already exist and are probably very well established in the market.
Current GenAI code assistants merely reassemble these components to fit a well-known specific use case.
While undeniably useful, tools like ChatGPT or GitHub Copilot still don’t have the ability to fully understand the context of a software project from the business, technical, and user perspectives.</p>

<p>This distinction is vital: GenAI excels at recombination of existing knowledge, but genuine <a href="https://en.wikipedia.org/wiki/Innovation">innovation</a> (i.e., the ability to transform abstract ideas into novel practical solutions that deliver unique value) requires more than that.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>
It demands a profound understanding of the problem domain, which includes grasping complex trade-offs, navigating edge cases, and adapting to evolving constraints.</p>

<p>Having diverse viewpoints and broad experience in managing a complex set of challenges is something that GenAI systems are still far from achieving.</p>

<h1 id="genai-isnt-replacing-engineers">GenAI Isn’t Replacing Engineers</h1>

<p>GenAI systems operate within well-defined boundaries set by the algorithms powering them and the data they’ve been trained on.
They’re astonishingly effective at recognizing patterns, generating text, and synthesizing existing information in seemingly novel ways.
However, when it comes to the messy, chaotic, and unpredictable world of software development, GenAI falls short in several key areas due to its inherent reasoning limitations.</p>

<figure class="jb_picture">
  



<img width="75%" style="border: 1px solid #808080;" src="/assets/resized/reciting-vs-reasoning-640x619.png" alt="Diagram contrasting reciting from memorized patterns with genuine reasoning" data-srcset="/assets/resized/reciting-vs-reasoning-640x619.png 640w,/assets/resized/reciting-vs-reasoning-768x742.png 768w,/assets/resized/reciting-vs-reasoning-1024x990.png 1024w,/assets/resized/reciting-vs-reasoning-1366x1320.png 1366w,/assets/resized/reciting-vs-reasoning-1600x1547.png 1600w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Reciting vs. Reasoning: GenAI can find patterns quickly, but it can't reason like a human. Source: <a href="https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711">MIT News</a>.
  </figcaption>
</figure>

<p>Here are some key limitations of GenAI in the context of software engineering:</p>

<ul>
  <li><strong>No Contextual Understanding:</strong> GenAI lacks the domain expertise and intuition required to assess edge cases or fully grasp the implications of design decisions.</li>
  <li><strong>No True Creativity:</strong> While GenAI can recombine existing elements in unexpected ways, its “creativity” stems from probabilistic reasoning, not from subjective experience, intentionality, or insight.</li>
  <li><strong>No Accountability:</strong> Software projects don’t live in isolation, they exist to solve real-world problems. An engineer must account for business needs, user behavior, and technical feasibility, which requires human judgment and responsibility.</li>
</ul>

<p>Let’s take software architecture as an example.
A GenAI model can suggest patterns or frameworks based on existing designs.
However, defining the architecture for a mission-critical system that has never been built before, and for which no training data is available, is a task that requires human extrapolation (e.g., balancing scalability, performance, and security).
This is because <a href="./blog/what-does-it-take-to-become-a-software-architect">architecture</a> is not just about choosing the right tools or patterns; it’s about understanding the problem space, anticipating future needs, and making informed trade-off decisions.</p>

<p>As I mentioned in a <a href="./blog/ai-doesnt-make-me-the-same-coffee">previous blog post</a>, GenAI-powered tools are not a silver bullet.
They are exceptional at <em>augmenting</em> human capabilities rather than replacing them outright.
They can speed up repetitive tasks, such as generating boilerplate code, fixing well-known bugs or code smells, or performing basic refactorings, but they lack the capacity for <em>strategic thinking</em> or <em>long-term planning</em>.
Besides, getting code written so easily feels somewhat like cheating, which decreases the <em>perceived value</em> of the resulting output.</p>

<h1 id="the-essence-of-software-engineering">The Essence of Software Engineering</h1>

<p>At its core, software engineering is about understanding real-world needs and finding software solutions to them.
It’s about translating chaotic, ambiguous problems into clean, structured systems.
Software engineering is a job of high cognitive complexity because it not only requires technical skills but also a deep understanding of human behavior, business goals, and the broader context in which software operates.</p>

<p>For example, let’s say the goal is to build a mobile application for booking fitness classes.
One solution might be to create a simple interface that allows users to choose a class and reserve a spot.
But is that really enough?
What if users also want to see instructor profiles, class reviews, or even receive reminders based on their schedule?
Should it integrate with their fitness tracker or offer personalized class recommendations?
The possibilities are endless.</p>

<p>As in any other software project, the search space for building a useful app like the one mentioned before is vast.
The key point is that software engineers don’t just dive into coding without fully understanding the deeper needs of the user.
Humans have the ability to grasp context, consider variables like user preferences, time, convenience, and even emotions.
This is something machines can’t do without a clear direction from human insight.</p>

<p>And this is precisely where GenAI falls short.
Sure, GenAI can generate code, but it doesn’t truly <em>understand</em> the problem at hand.
It doesn’t think <em>as a human would</em>; it lacks specific business context, and that makes it fundamentally different.
It cannot sit down with stakeholders and clarify their needs.
It can’t challenge assumptions or deal with conflicting requirements.
All it can do is guess based on patterns it’s been trained on.
And often, that guess is way off from the real needs of the users, because those needs are chaotic, and in my experience, most users don’t really know what they want.</p>

<figure class="jb_picture">
  



<img width="75%" style="border: 1px solid #808080;" src="/assets/resized/jobs-quote-640x301.png" alt="People don't know what they want until you show it to them." data-srcset="/assets/resized/jobs-quote-640x301.png 640w,/assets/resized/jobs-quote-768x361.png 768w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Steve Jobs understood the importance of human creativity in software development.
  </figcaption>
</figure>

<p>In my opinion, this is where human engineers shine.
We understand nuance.
We can think creatively and adapt to changing environments.
And why does this matter?
Because the world needs critical thinking.
As long as humans remain as complex, messy creatures with changing needs, there will always be a demand for someone who can turn those messy needs into clean software solutions.</p>

<figure class="jb_picture">
  



<img width="50%" style="border: 1px solid #808080;" src="/assets/resized/ai-dumbness-example-640x1131.jpg" alt="Example of a chatbot messing up human context." data-srcset="/assets/resized/ai-dumbness-example-640x1131.jpg 640w,/assets/resized/ai-dumbness-example-768x1357.jpg 768w,/assets/resized/ai-dumbness-example-1024x1809.jpg 1024w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    A dummy example showing the limitations of current GenAI in understanding context.
  </figcaption>
</figure>

<p>Even if we stop writing code ourselves and rely fully on natural language interfaces to generate it, the essence of software engineering will remain the same.
Source code is just the way we found to instruct computers so that they can solve human-centered problems.
Therefore, I believe we should focus on developing our problem-solving skills rather than on the act of writing code itself.
Our ability to define requirements and solve human problems (using AI systems) is going to be much more valuable in the coming years.</p>

<figure class="jb_picture">
  



<img width="100%" style="border: 0px solid #808080;" src="/assets/resized/requirements-engineering-in-the-age-of-ai-640x371.png" alt="A requirements engineering loop in the age of AI." data-srcset="/assets/resized/requirements-engineering-in-the-age-of-ai-640x371.png 640w,/assets/resized/requirements-engineering-in-the-age-of-ai-768x445.png 768w,/assets/resized/requirements-engineering-in-the-age-of-ai-1024x593.png 1024w,/assets/resized/requirements-engineering-in-the-age-of-ai-1366x791.png 1366w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Example of a requirements engineering loop in the age of AI. Requirements engineers, stakeholders, developers, and domain experts cooperate with AI agents to define the requirements of a software project.
  </figcaption>
</figure>

<p>Now you might be thinking: What about the software engineering tasks that are more straightforward and annoying for humans? Could GenAI take over there?</p>

<h1 id="engineering-value-is-in-the-details">Engineering Value Is in the Details</h1>

<p>One of the biggest challenges in software development is that it’s error-prone at almost every step.
If you misunderstand a customer need, you end up solving the wrong problem.
Misinterpret a functional requirement, and you will end up designing a system that’s overkill, or worse, one that under-delivers.</p>

<p>For example, imagine you’re tasked with building a real-time messaging system for a customer support team handling thousands of chats per day concurrently.
The natural assumption might be to scale up with advanced <a href="./blog/design-for-microservices">microservices</a> and expensive infrastructure to ensure immediate response times.
But in the prototyping phase, a more practical solution could involve implementing a simple queuing system to handle chat overflow during peak hours, reducing the need for costly infrastructure while keeping the user experience intact.</p>
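
<p>To make that queuing idea concrete, here is a minimal Python sketch (the names, queue capacity, and pool size are illustrative assumptions of mine, not a real design): a bounded in-memory queue absorbs bursts of chats so that a small, fixed pool of workers can drain them at a sustainable pace, instead of scaling out infrastructure for peak load.</p>

```python
import queue
import threading

# Illustrative sketch, not a production design: a bounded queue absorbs
# chat bursts so a small, fixed worker pool can drain them steadily.
chat_queue = queue.Queue(maxsize=100)  # the cap protects memory during spikes
handled = []  # stand-in for "chats answered"

def enqueue_chat(message):
    """Accept a chat if there is room; otherwise report overflow."""
    try:
        chat_queue.put_nowait(message)
        return True
    except queue.Full:
        return False  # the UI could show "high demand, please hold"

def worker():
    """Drain chats until a None sentinel arrives."""
    while True:
        message = chat_queue.get()
        if message is None:
            chat_queue.task_done()
            return
        handled.append("answered: " + message)  # real handling goes here
        chat_queue.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

for i in range(10):  # simulate a burst of incoming chats
    enqueue_chat("chat-%d" % i)

chat_queue.join()   # wait until the burst is fully drained
for _ in pool:      # then shut the workers down
    chat_queue.put(None)
for t in pool:
    t.join()
```

<p>The point of the sketch is the trade-off, not the code: a hard capacity cap plus a fixed worker pool keeps infrastructure costs predictable during peak hours, at the price of occasionally asking users to wait.</p>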

<figure class="jb_picture">
  



<img width="75%" style="border: 1px solid #808080;" src="/assets/resized/notification-system-architecture-640x475.jpg" alt="Example of a notification system architecture." data-srcset="/assets/resized/notification-system-architecture-640x475.jpg 640w,/assets/resized/notification-system-architecture-768x570.jpg 768w,/assets/resized/notification-system-architecture-1024x761.jpg 1024w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Example of a notification system architecture using a queuing system to handle peak hours. Source:
    <a href="https://www.cometchat.com/">CometChat</a>.
  </figcaption>
</figure>

<p>A GenAI-powered engineer, on the other hand, would probably go with the more “typical” solution, like adding servers and microservices, because that’s how the majority of documentation resources for building these kinds of systems explain how it’s done.
GenAI models don’t have the creativity to consider alternative options.
They follow the existing data, the information they have been trained on.
I foresee this as a potential problem in the future, because it could lead to a homogenization of software solutions, where everyone uses the same tools and the same approaches, and that’s not good for <a href="https://en.wikipedia.org/wiki/Software_diversity">software diversity</a> and innovation.</p>

<p>Personally, I’m seeing too much GenAI generated code in my own work and on GitHub these days.
The difference between a good solution and a great solution often comes down to human ingenuity, finding that less obvious, but more effective approach.
I believe the real magic in engineering happens in those transitions, from needs to requirements, from designs to code.
These steps are where human insight makes all the difference.</p>

<blockquote>
  <p>“My entire career working in AI/ML has shown me that domain knowledge is the highest leverage investment. Most of the tooling for AI/agents/etc. really is still in the <em>technical</em> realm, intended for AI/ML people or maybe engineers. But imagine a world where the domain expert had the agency to iterate and improve an AI, without having to go through an <em>AI engineer intermediary</em>.
That would open the door to create incredibly powerful AI systems quickly. Keep a close eye on the product they are building, because I think it will close this gap.” – <em><a href="https://www.linkedin.com/in/skylar-payne-766a1988/">Skylar Payne</a></em></p>
</blockquote>

<p>Of course, automating the boring parts is a great thing.
That’s what engineers have been doing since the early days, right?
However, today it is even more important to look for smarter ways to solve problems, because GenAI might offer a set of solutions, but it’s our job to consider all the possibilities and critically evaluate each of them.</p>

<p>So, what does this mean for the whole software industry, where things are constantly in flux? Can GenAI keep up with those changes?</p>

<h1 id="the-dynamic-nature-of-software-development">The Dynamic Nature of Software Development</h1>

<p>The concept of agile software development is all about flexibility.
This means adapting to change, whether that change comes from shifting market conditions, evolving user needs, or even new regulatory requirements.
It’s a fast-paced, ever-changing process, and that’s exactly where current GenAI systems based on Large Language Models (LLMs) struggle the most.</p>

<figure class="jb_picture">
  



<img width="75%" style="border: 1px solid #808080;" src="/assets/resized/agile-software-development-640x389.png" alt="Agile software development is all about adapting to change." data-srcset="/assets/resized/agile-software-development-640x389.png 640w,/assets/resized/agile-software-development-768x466.png 768w,/assets/resized/agile-software-development-1024x622.png 1024w,/assets/resized/agile-software-development-1366x829.png 1366w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Agile software development is all about adapting to change.
  </figcaption>
</figure>

<p>For example, imagine you are midway through a project when suddenly, the market shifts. The client wants to pivot.
You need to revise your priorities and adjust the entire architecture of your software. Can a GenAI system do that? Yes, it can!
But what if the conditions are completely new because of an event that has never ever happened before? For example, during a market crash or a natural disaster.
It’s hard for me to imagine a GenAI system that can adapt to those kinds of unpredictable situations.</p>

<p>AI operates based on past data and predefined assumptions.
It’s not going to sit in a meeting with you, hear the client’s concerns, and propose a totally new approach.
But part of our work is finding out what the client really needs, even when they don’t know it themselves!</p>

<p>Now, don’t get me wrong.
Current GenAI can assist in agile development, especially when it comes to routine tasks like generating boilerplate code or testing.
But when it comes to adapting to new, unexpected demands, they show some limitations.</p>

<p>AI will not replace programmers but will fundamentally change the development landscape, making human creativity and problem-solving capabilities more essential than ever.
Adaptability is one of the most valuable skills in software engineering.
As long as the world keeps changing, there will always be a need for people who can pivot quickly and come up with creative solutions.</p>

<p>Now, let’s take a closer look at the situations where GenAI actually excels in software engineering, and how we can leverage that.</p>

<h1 id="where-genai-can-actually-help">Where GenAI Can Actually Help</h1>

<p>Take code reviews, for example.
I think it’s a very tedious but necessary task.
You’ve got to sift through line after line of someone else’s work, looking for bugs, performance issues, or security vulnerabilities.
<a href="https://en.wikipedia.org/wiki/Intelligent_agent">GenAI agents</a>, like those integrated into GitHub and other platforms to review pull requests, can help automate parts of this process, flagging potential issues before a human ever gets involved.</p>
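
<p>As a toy illustration of what automated flagging looks like under the hood (a real review agent is vastly more sophisticated; the rule and the threshold below are assumptions of mine, purely for illustration), here is a Python sketch that parses a source file and flags functions that are missing a docstring or are suspiciously long:</p>

```python
import ast

MAX_BODY_STATEMENTS = 20  # illustrative threshold, not an industry standard

def review(source):
    """Return human-readable findings for one Python source file."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            if ast.get_docstring(node) is None:
                findings.append(node.name + ": missing docstring")
            if len(node.body) > MAX_BODY_STATEMENTS:
                findings.append(node.name + ": very long, consider splitting")
    return findings

sample = '''
def add(a, b):
    return a + b
'''
print(review(sample))  # → ['add: missing docstring']
```

<p>A human still has to decide whether each finding matters in context; the agent only narrows down where to look.</p>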

<p>AI can help with tasks that are repetitive, like generating boilerplate code, adding unit tests, or writing documentation.
It can analyze large datasets and suggest optimizations.
Indeed, many surveys suggest that developers are eager to hand off these tasks to GenAI assistants.</p>

<p>But when it comes to making the final call, the human touch is still essential.</p>

<blockquote>
  <p>“With LLMs, coding won’t be enough to differentiate as an engineer, you’ll need to think about the product, business KPIs, strategy etc.
You need to think about solutions to problems, not software tools.
And PMs are going to be expected to get more technical.
Nothing is stopping them now, LLMs help you code.
They will need to use their product knowledge and combine it with technicals to give engineers more tangible instructions.
Those sound pretty similar, and the result will be less fragmentation and miscommunication between PMs and engineers, something that’s far too common today.
They’ll start speaking more similar languages.”</p>
</blockquote>

<p>In my opinion, the future isn’t about GenAI taking over but about humans learning how to work alongside GenAI to be more efficient and effective. And the engineers who can collaborate with GenAI agents (or whatever we’ll call them) will be the ones who thrive in the future.</p>

<figure class="jb_picture">
  



<img width="100%" style="border: 1px solid #808080;" src="/assets/resized/stackoverflow-dev-survey-2024-ai-developer-tools-ai-complex-social-640x296.png" alt="Stack Overflow Developer Survey 2024 results on the perceived ability of AI tools to handle complex tasks." data-srcset="/assets/resized/stackoverflow-dev-survey-2024-ai-developer-tools-ai-complex-social-640x296.png 640w,/assets/resized/stackoverflow-dev-survey-2024-ai-developer-tools-ai-complex-social-768x355.png 768w,/assets/resized/stackoverflow-dev-survey-2024-ai-developer-tools-ai-complex-social-1024x474.png 1024w,/assets/resized/stackoverflow-dev-survey-2024-ai-developer-tools-ai-complex-social-1366x632.png 1366w,/assets/resized/stackoverflow-dev-survey-2024-ai-developer-tools-ai-complex-social-1600x740.png 1600w,/assets/resized/stackoverflow-dev-survey-2024-ai-developer-tools-ai-complex-social-1920x888.png 1920w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Stack Overflow Developer Survey 2024 results on the perceived ability of AI tools to handle complex tasks.
  </figcaption>
</figure>

<p>We should integrate GenAI into our workflows where it makes sense, use it to handle the boring stuff, so we can focus on the creative and complex tasks.</p>

<p>And I know what you may be thinking: if GenAI can assist in low-level tasks, doesn’t that mean it could eventually replace less experienced engineers?</p>

<h1 id="can-genai-replace-entry-level-engineers">Can GenAI Replace Entry Level Engineers?</h1>

<p>Well, the answer is… yes/maybe.</p>

<p>But…</p>

<p>Less experienced engineers are not just doing the easy tasks.
They’re learning, growing professionally, and building the experience they need to tackle more complex problems down the road.
GenAI might be able to take over some of the simpler tasks, but it can’t replace the learning process.</p>

<p>For example, if you have been around for a while in this industry, think about how much you learned from your first projects.
It wasn’t just about writing down the code (programmers are not typists).
It was about understanding the problem, communicating with other developers, figuring out how to structure the solution, and dealing with unexpected bugs along the way.</p>

<p>If we remove the possibility of learning that stuff and put GenAI systems in its place, it will be harder for young people to gain that kind of experience or develop the intuition needed to solve real-world problems.</p>

<p>What should we do then?
Well… if you’re an entry level engineer, your goal should not be to compete with AI.
It should be to delegate the tasks that GenAI can efficiently handle, and move on to the bigger, more interesting challenges.</p>

<p>So, keep pushing yourself.
Don’t get too comfortable with the easy tasks; delegate those to GenAI systems.
Focus on learning <a href="./blog/surviving-the-ai-revolution-as-a-software-engineer">the skills</a> that only humans can master.</p>

<h1 id="final-thoughts">Final Thoughts</h1>

<p>As software <del>developers</del> creators, the key is to think critically about where GenAI adds value and where it falls short.
Instead of viewing GenAI as a <em>competitor</em>, we should treat it as a <em>collaborator</em>.
By leveraging GenAI for grunt work, we can focus on higher-value tasks, such as crafting innovative designs, optimizing user experiences, and solving problems that GenAI simply cannot understand at a conceptual level (i.e., leveraging <em>the human experience</em>).</p>

<figure class="jb_picture">
  



<img width="100%" style="border: 1px solid #808080;" src="/assets/resized/stackoverflow-dev-survey-2024-ai-efficacy-and-ethics-ai-threat-social-640x248.png" alt="Stack Overflow Developer Survey 2024: Are AI tools a threat to software engineers?" data-srcset="/assets/resized/stackoverflow-dev-survey-2024-ai-efficacy-and-ethics-ai-threat-social-640x248.png 640w,/assets/resized/stackoverflow-dev-survey-2024-ai-efficacy-and-ethics-ai-threat-social-768x298.png 768w,/assets/resized/stackoverflow-dev-survey-2024-ai-efficacy-and-ethics-ai-threat-social-1024x397.png 1024w,/assets/resized/stackoverflow-dev-survey-2024-ai-efficacy-and-ethics-ai-threat-social-1366x529.png 1366w,/assets/resized/stackoverflow-dev-survey-2024-ai-efficacy-and-ethics-ai-threat-social-1600x620.png 1600w,/assets/resized/stackoverflow-dev-survey-2024-ai-efficacy-and-ethics-ai-threat-social-1920x744.png 1920w," class="blur-up lazyautosizes lazyload" />
  <figcaption class="stroke">
    Stack Overflow Developer Survey 2024 results about the perceived threat of AI tools for software engineers.
  </figcaption>
</figure>

<p>The truth is, your skills remain highly relevant, so long as you <em>adapt</em> to the new order.
The engineers who thrive in this new landscape will be those who augment their capabilities with GenAI while continuing to bring their unique human creativity, expertise, and empathy to the table.</p>

<p>In conclusion, I don’t think GenAI is going to replace software engineers (at least not anytime soon).
And sure, it can handle repetitive tasks and assist in certain areas, but the real value coming from software development still requires human intuition, creativity, and problem-solving.</p>

<p>Because, let me say it again, software development isn’t just about writing code, it’s about solving problems.
These problems are often really messy and poorly defined, and their solutions are deeply tied to human intuition and creativity.
The main task of engineers is to balance conflicting priorities, adapt to ever-changing environments, and navigate the ambiguous waters between business needs and technical solutions.</p>

<p>If there is one way to keep our pockets safe, I think it is to stay ahead of the curve.</p>

<h1 id="footnotes">Footnotes</h1>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>This is understandable once we realize that the traditional media is in the business of creating sensational headlines and monetizing our attention rather than providing accurate information. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>Not to mention the negative impact of AI-related layoffs on our <del>highly overvalued</del> software engineering pockets. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p>I still remember Dreamweaver from <a href="https://en.wikipedia.org/wiki/Macromedia">Macromedia</a> (later purchased by Adobe). It was a no-code app that promised to make web development just a matter of throwing components onto a canvas. It didn’t work out as expected, and today we are still writing HTML, CSS and JavaScript, ouch! <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>Billings, Jay Jay, et al. 2017. “Will humans even write code in 2040 and what would that mean for extreme heterogeneity in computing?”, <a href="https://arxiv.org/pdf/1712.00676">in arXiv</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>César Soto Valero</name><email>cesarsotovalero@gmail.com</email></author><category term="ai" /><summary type="html"><![CDATA[GenAI systems are becoming more and more capable of performing complex cognitive tasks that were once thought to be uniquely human. In particular, LLMs are proven to be very good at writing code. With all the buzz around GenAI replacing software engineers, are our jobs really at risk? In this article, I dive deep into AI's current capabilities and how they stack up against the demands of real-world software development. I discuss the actual potential of GenAI to assist in common development activities, and where it still falls short when it comes to creative problem solving and human intuition.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.cesarsotovalero.net/img/posts/2024/2024-08-19/armillary-sphere_cover.jpg" /><media:content medium="image" url="https://www.cesarsotovalero.net/img/posts/2024/2024-08-19/armillary-sphere_cover.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>