ML Affairs

When the Map Lies: Why AIS Trust Matters in Contested Waters

2026-05-21T00:00:00+00:00

Publication note: this article was first published through Vortexa on Medium as When the Map Lies: Why AIS Trust Matters in Contested Waters. This ML-Affairs version is my author republication of that Vortexa-led publication.

In calm periods, vessel tracking can look deceptively simple.

A ship emits AIS. A platform collects those positions. A map draws a line. An analyst sees where the vessel is going.

That mental model breaks the moment geopolitics enters the water.

I have spent enough time around streaming systems to know that bad data rarely fails politely. It usually arrives looking plausible enough to be trusted, and by the time someone notices, it has already shaped a downstream decision.

That is why this problem matters to me.

When the Strait of Hormuz becomes contested, when the Red Sea becomes a security problem, when sanctions pressure grows around opaque oil movements, the vessel track is no longer just a technical artifact. It becomes evidence. And evidence can be incomplete, manipulated, delayed, spoofed, or deliberately made ambiguous.

That is the world energy markets now operate in.

The Strait of Hormuz remains one of the world’s most important oil chokepoints. The U.S. Energy Information Administration, using Vortexa tanker-tracking data, estimates that oil flows through the strait averaged about 20 million barrels per day in 2024, while the International Energy Agency describes it as a route with limited bypass options and major consequences for world oil markets if disrupted. Vortexa has also publicly analysed how Red Sea risk changes oil, gas, and tanker-market behaviour, and how tanker diversions can alter voyage duration, freight, and route economics.

When conflict, attacks, sanctions pressure, or navigation interference touch those routes, the question is not abstract. It affects crude flows, LNG flows, sanctions exposure, freight risk, pricing, and the confidence with which traders and analysts can explain what is happening.

The hard part is not putting vessels on a map.

The hard part is knowing when the map deserves to be believed.

That is where Vortexa’s value becomes easiest to understand.

Not as a company that merely displays vessel data, but as a company that turns messy maritime signals into trusted intelligence about how energy moves through the world.

Visual note: the illustrative images in this article are AI-generated synthetic diagrams used only for explanation. They are not evidence of actual vessel activity, cargo movements, port calls, ownership structures, sanctions status, or wrongdoing by any real vessel, company, port, or state.

The New Vocabulary of Maritime Risk

If you work near maritime intelligence today, a few terms keep appearing.

A dark vessel is a vessel that disappears from normal visibility for some period of time. Sometimes that is benign: coverage gaps, equipment issues, or legitimate operational constraints. In higher-risk contexts, though, going dark can help hide port calls, route changes, ship-to-ship transfers, or cargo movements.

Synthetic illustration only: a dark period can increase uncertainty, but the image does not describe or allege any real vessel movement, cargo, or port activity.

Spoofing is more active. A vessel or its surrounding environment produces false position information. The ship may appear somewhere it is not. It may seem to move across land, sit inside a port it never entered, or trace physically absurd paths. Scientific American has described how GPS spoofing in the Strait of Hormuz can cause AIS-derived vessel locations to show impossible movements, while U.S. Treasury maritime guidance warns that deceptive shipping practices can include disabling or manipulating AIS.

Synthetic illustration only: spoofing examples help explain the signal pattern; they should not be read as evidence of a specific port call or vessel event.

A shadow fleet is the broader operating model. It usually refers to networks of vessels, owners, flags, insurers, intermediaries, and cargo movements designed to move sanctioned or opaque commodities while making attribution difficult. The U.S. Treasury’s maritime oil advisory explicitly warns that shadow-trade actors can conceal ownership and manipulate or disable AIS, and OFAC has continued to sanction vessels and managers linked to Iranian shadow-fleet activity.

Synthetic illustration only: the diagram summarizes risk factors and transparency problems; it does not identify any real company, vessel, insurer, broker, or buyer.

These are not exotic edge cases anymore.

They are part of the operating environment for energy intelligence.

Hormuz Makes The Problem Concrete

The Strait of Hormuz is a clean example because it connects the technical problem to a market consequence immediately.

When the region is calm, a vessel track through Hormuz is a line through a chokepoint.

When the region is under pressure, that same line becomes a claim about whether crude is moving, whether LNG is delayed, whether a sanctioned vessel crossed, whether a cargo is stuck, whether a tanker turned back, and whether the market should price disruption or resilience.

Public reporting in recent years has repeatedly surfaced categories of concern around attacked vessels, suspected spoofing, identity manipulation, opaque ownership, and sanctions-linked movements. We do not need to reproduce each vessel case inside this article to see the engineering lesson.

Each story asks a different operational question.

An attacked or disrupted vessel asks: where was it, what route did it take, what traffic was nearby, and what flows were affected?

A spoofing concern asks: where did the vessel say it was, where was it likely to be, and what evidence contradicts the reported position?

An identity concern asks: which identity is being used, whether that identity makes sense, and what physical vessel is actually moving cargo?

A shadow-fleet transit asks: whether a vessel crossed a chokepoint, changed behaviour, approached a terminal, performed a ship-to-ship transfer, or appeared in a suspicious location at the wrong time.

This is exactly where clean signal becomes commercial intelligence.

AIS Is A Claim, Not A Fact

The most important mental shift is simple:

Every incoming AIS position is a claim about reality.

Some claims are routine. Some are incomplete. Some are stale. Some are physically implausible. Some are probably generated by equipment or coverage problems. Some may be deliberately deceptive.

A serious maritime intelligence platform cannot treat all of those claims equally.

It has to ask questions before allowing a position to shape downstream interpretation:

Is this position plausible given the vessel’s recent movement?
Does the point sit somewhere a vessel can physically be?
Does the signal behaviour match the region’s known risk profile?
Does the position agree with other signals, or does it look isolated?
Does the source type, timing, and repetition pattern make the point more or less trustworthy?
If we reject this point, can we still preserve enough evidence for analysts and systems to understand why?

That last question matters.

The goal is not to hide uncertainty. The goal is to structure it.

A strange ping should not be allowed to rewrite the vessel story just because it arrived last.

What Vortexa Adds

At Vortexa, the challenge is not simply to ingest more maritime data.

The challenge is to make that data decision-grade.

Energy-market users do not care about raw pings for their own sake. Vortexa’s public tanker-tracking material frames the customer value in exactly these terms: flows, origins, destinations, cargo, route changes, and ship-to-ship transfers. They care about the questions those pings support:

Is oil moving through the Strait of Hormuz or waiting outside it?
Did a vessel actually call at a port, or did it only appear nearby?
Is a cargo delayed, rerouted, hidden, or transferred?
Is a flow disruption real, temporary, or an artifact of broken tracking data?
Are sanctions-linked vessels behaving differently from ordinary commercial traffic?
Can we explain the answer clearly enough for a trader, analyst, risk team, or customer workflow to trust it?

That is the commercial value of signal cleaning.

Not cleanliness for its own sake.

Cleaning because every bad vessel position can leak into a bad market story.

Here is the chain I worry about.

A spoofed position suggests a tanker crossed Hormuz. That false crossing feeds a port-call inference. The port-call inference feeds a cargo movement view. The cargo movement view feeds a flow estimate. By the time the error reaches the analyst, it no longer looks like a bad ping; it looks like a market fact.

That is the failure mode a serious trust layer has to prevent.

The Trust Layer

The safest public way to describe the work is this:

Vortexa builds a trajectory validation layer between raw maritime signals and the products that depend on them.

That layer combines several kinds of judgment.

First, there is immediate signal validation. Some positions are suspicious without needing much history. They may fall on land, appear in known problematic areas, repeat in ways that suggest stuck equipment, or come from patterns that should not be allowed to contaminate downstream analytics.

Second, there is trajectory-level validation. A vessel has a recent story: its speed, heading, timing, route, and prior positions. A new point should strengthen that story, challenge it, or start a competing explanation. It should not automatically become truth because it arrived most recently.

Third, there is geography-aware posture. The same kind of irregularity does not mean the same thing everywhere. Sparse coverage in one region may call for patience. Heavy spoofing risk in another may call for stricter confirmation. A global system has to adapt to the operating environment without exposing brittle manual logic to every downstream user.

Fourth, there is observability. If the system distrusts signals more often in a specific region, analysts and engineers need to see that. A spike in suspected spoofing or sparse AIS behaviour is not just a data-quality note. It is part of the geopolitical and operational picture.

Trust decisions need their own telemetry. A rising spoofing or sparse-AIS signal is not just a model output; it is an operational question the system should force into view.

This is where the work becomes more interesting than “filtering.”

A basic filter removes bad points.

A trust layer preserves the difference between a healthy track, a suspicious claim, a competing trajectory, a sparse-coverage region, and a spoofing-prone environment.

That distinction is where downstream intelligence gets stronger.

Why This Is An ML Systems Problem

It would be tempting to describe this as a set of rules.

That would be misleading.

Rules matter. Deterministic checks are useful. There are obvious cases where a point should be challenged immediately. A ping on land should not be treated casually. Terrestrial AIS reception has physical characteristics; a coastal receiver cannot plausibly receive a normal signal from arbitrarily far away. These simple constraints matter.

But the valuable part is not a pile of if-statements.

The valuable part is the system’s ability to maintain a stateful view of vessel movement under uncertainty.

In practice, that means the platform needs to classify emission quality, associate observations with the right trajectory, score anomalies, calibrate confidence by region, and learn from analyst-reviewed cases. A vessel may have one coherent main trajectory and another suspicious sequence of points that should not yet be trusted. The system has to decide when a new signal belongs to the trusted path, when it should be buffered, when it should be isolated as noise, and when enough evidence has accumulated to change the story.

That is not a cosmetic map problem.

That is an applied machine-learning and systems problem:

state needs to be maintained per vessel
uncertainty needs to be represented explicitly
signals need to be scored against physical plausibility
suspicious tracks need to be separated from trusted tracks
confidence needs to be calibrated for different regions and source behaviours
decisions need to be explainable enough for humans and downstream systems
the system needs to keep working at global scale, continuously

The useful thing is not the line itself. It is the system keeping enough state to decide whether that line still deserves trust.

The value is not that the system draws a smoother line.

The value is that it prevents a false line from becoming a false market conclusion.

What the Gulf Examples Show

The Gulf region makes this less theoretical.

In a dense and sensitive region, the raw signal can contain both the movement we care about and the noise that would mislead downstream systems. Some emissions belong to coherent vessel movement. Some should be challenged. Some should be kept visible as evidence without being allowed to become the trusted story.

That is why the distinction between displaying data and validating data matters.

In the Gulf region, signal quality is not a backend hygiene concern. It directly affects whether the platform can reason about flows, port calls, and disruption with confidence.

A sparse period in one region, an isolated jump in another, and a cluster of suspicious emissions near a chokepoint do not carry the same meaning. Treating them as identical would be operationally lazy.

The system has to preserve enough uncertainty to avoid overclaiming, while still being decisive enough to keep bad evidence from contaminating the product.

A cleaned track is useful because it protects the higher-level questions: where the vessel likely moved, what it likely did, and what confidence the downstream system should have.

Customers should not have to know every reason a ping was rejected. They should be able to trust that the platform has disciplined ways of separating movement, noise, uncertainty, and context before the answer reaches them.

Why Vortexa Matters Here

There is a reason this work belongs in an energy intelligence company, not just a vessel-tracking demo.

The end product is not the point on the map.

The end product is the interpretation built on top of that point:

flows by origin and destination
port calls
cargo movement
congestion and disruption
sanctions exposure
market balances
regional risk
trade-route behaviour

If the vessel track is wrong, those higher-level products inherit the error.

If the system cannot separate a spoofed track from a plausible one, the analyst gets a confident fiction.

If the platform cannot see the difference between sparse coverage, deliberate deception, and real movement, then the market story becomes fragile exactly when customers need it most.

That is the value Vortexa provides: a cleaner, more defensible view of reality when reality is being actively obscured.

Closing Thought

In contested waters, vessel tracking is not a map problem. It is a trust problem.

And trust is not created by one algorithm, one data source, or one clever filter.

It is created by a full intelligence stack: raw signal ingestion, validation, stateful trajectory reasoning, regional context, observability, analyst feedback, and market-aware interpretation.

That is why AIS cleaning matters.

Because in a world of spoofing, dark vessels, shadow fleets, and geopolitical disruption, the companies that understand oil and gas flows best will be the ones that know when the signal deserves to be believed.

A vessel track is not valuable because it is drawn.

It is valuable because it has earned the right to be trusted.

MCP Finally Clicked: It Is Plumbing. Trust Is The Product.

2026-05-18T00:00:00+00:00

I am joining the MCP party a little late.

Not because I ignored it completely, but because the first pass did not feel as obvious to me as the enthusiasm around it suggested. There was a lot of jargon. The setup path had more moving pieces than I wanted. The whole business of piping messages through local processes, gateways, containers, profiles, and JSON schemas felt slightly tedious before it felt useful.

That is usually a sign that I do not understand something existentially yet.

I do not just want to know what command to run. I want to understand why the thing had to exist in the first place. What problem was the ecosystem trying to solve? What else was happening around the same time? Which part is genuinely new, and which part is just an old integration problem with a better name?

So this post is mostly written for the version of me that needed the explanation to be digestible.

Hopefully it helps someone else too.

My conclusion after working through it is simple:

MCP is valuable exactly where it is boring.

It standardizes connector plumbing. It does not absolve the host application from trust, routing, approval, or evidence discipline.

This was the part I needed to demystify first: the assistant can look calm while the connector plumbing underneath still feels messy.

Why MCP Had To Happen

LLMs started as text systems.

You sent text in. You got text out. That was already useful, but it had an obvious ceiling. A model could explain how to search email, but it could not search your email. It could suggest a calendar event, but it could not inspect your calendar. It could tell you which shell command might help, but it could not safely inspect the repository unless the surrounding application gave it a controlled way to do so.

Then AI applications started adding tools.

The pattern itself was sensible. The model should not directly touch the outside world. It should ask for help, and the host application should mediate the action.

The Tool Loop

Model sees a gap

Host validates request

Host executes action

Evidence returns

Model writes answer

The important detail is ownership: the model proposes, but the host validates, executes, records, and shapes the evidence before the final answer is written.

The problem was not the loop. The problem was glue code.

Every serious AI application began needing connectors: Gmail, Slack, GitHub, files, calendars, databases, browsers, search, internal systems. Each connector needed provider-specific setup, credentials, scopes, pagination, rate-limit handling, argument schemas, error normalization, result shaping, and model-facing descriptions.

That glue code has an unpleasant habit: it looks small when you write one connector and becomes architectural weight when every host application repeats it differently.

Gmail needs one shape of OAuth, search, labels, snippets, and attachment handling. Slack needs another shape of channels, threads, users, bot permissions, and message posting. GitHub, calendars, browsers, databases, and files all bring their own little integration worlds. Then each AI host still has to translate those worlds into something a model can discover and call.

That is how you end up with a connector zoo, and then with glue code proliferating around the zoo.

The timing matters too. Around the same period, the conversation moved from chatbots toward agents, coding assistants, desktop assistants, local runtimes, and tools that could act on real systems. Models were getting better, but they were still isolated from the places where useful work actually happens. A coding assistant needs the repository. A personal assistant needs calendar and email. A business assistant needs internal documents, tickets, dashboards, and databases.

That world cannot scale on every app hand-rolling every connector and every connector contract.

Anthropic introduced the Model Context Protocol on November 25, 2024 as an open standard for connecting AI assistants to systems where data lives. The official MCP docs use the USB-C analogy: one standard connection shape instead of a different cable for every device. That analogy is imperfect, but useful enough.

The deeper point is this:

MCP exists because every AI app should not have to reinvent the same connector protocol differently.

Before MCP, connecting an assistant to external systems usually meant each host application had to invent its own integration language. MCP gives those integrations a common shape, so the glue code can move behind a more standard boundary instead of leaking into every product in a slightly different form.

The Jargon That Tripped Me

The words are part of the problem, so it is worth clearing them before drawing the system.

In MCP language, a callable operation is often called a tool. In product language, that can be confusing. A normal user does not think “Gmail has 17 tools.” They think “Gmail is a tool, and it can do several things.”

I now prefer this vocabulary:

AI application / MCP host: the product boundary that owns the assistant experience and coordinates one or more MCP server connections.
Harness / agent control layer: the host-side component that owns routing, validation, approvals, audit, and evidence shaping.
Model runtime or provider: where inference happens, such as LM Studio, Ollama, Claude, or another hosted model API.
MCP client: the per-server connection component the host uses to talk to an MCP server.
MCP server: the process across the protocol boundary that exposes external functions through MCP.
MCP Manager: software that helps install, run, group, configure, or authorize MCP servers.
Product tool: a user-recognizable capability such as Gmail, Calendar, Wikipedia, Search, or Slack.
Function: one executable operation inside that product tool, such as search_messages, list_events, or get_summary. In many cases, the function eventually becomes an ordinary API call.

That distinction sounds pedantic until you build the UI.

If an MCP Manager profile shows Gmail, Slack, and Wikipedia, that is not the same thing as telling the model it can call every function from every server. It only means those servers are visible or available through the manager.

Visibility is not execution.

Once that vocabulary is less slippery, the mental model becomes much easier.

A Tool Is Usually An API Call

This is the missing layer in many MCP explanations.

Before MCP, there were already APIs.

An API is a contract that lets one software system ask another software system to do something. In a REST API, that contract usually looks like HTTP endpoints, methods, parameters, authentication, and JSON responses. A human developer reads the documentation, understands the authentication model, writes client code, handles errors, and decides how the result should be used.

For example, a simple weather integration might eventually call an HTTP endpoint shaped roughly like this:

GET /weather/current?city=London&units=metric
Authorization: Bearer ...

That is not an AI concept. It is normal application integration.

The API exposes an endpoint. The application code owns the orchestration.

What MCP changes is the consumer of that contract. Instead of only giving a human developer an endpoint to wire manually, the MCP server exposes a capability in a form that an AI host can discover, describe to a model, validate, and invoke.

The same weather capability might become a model-readable tool description:

{
  "name": "get_current_weather",
  "description": "Get the current weather for a city.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "City name"
      },
      "units": {
        "type": "string",
        "enum": ["metric", "imperial"]
      }
    },
    "required": ["city"]
  }
}

Then, when the user asks “what is the weather in London?”, the model does not call the weather provider directly. The host gives the model a constrained tool contract. The model proposes a structured call:

{
  "tool": "get_current_weather",
  "arguments": {
    "city": "London",
    "units": "metric"
  }
}

The host-side harness validates that request. The MCP client sends it to the weather MCP server. The server translates the tool call into the provider-specific API request, handles the provider response, and returns structured data back across the MCP boundary.

That is the concrete shape of the idea.

API To Tool

Provider API endpoint

MCP server wraps it

Tool name and schema

Host exposes tool

Model proposes call

The model sees the constrained, structured tool contract. The server still deals with the ordinary provider API behind the boundary.

A tool is not magic agent intelligence.

It is usually an API capability wrapped in a model-readable contract.

This is also why MCP can feel underwhelming when inspected closely.

Under the hood, many MCP servers are wrappers around ordinary APIs. A GitHub server may call the GitHub API. A Slack server may call the Slack API. A Gmail server may call the Gmail API. The novelty is not that APIs suddenly exist. The novelty is that the assistant ecosystem gets a standard way to discover capabilities, see their schemas, call them with structured arguments, and receive structured results.

In other words:

API Versus MCP

REST API: "Here are endpoints. Developer, wire the integration and orchestration yourself."
MCP: "Here are capabilities in a form an AI host can expose to a model and invoke through a standard protocol."

That distinction matters because it prevents two bad interpretations.

The first is over-selling MCP as if it replaces APIs. It does not. It often sits on top of them.

The second is under-selling MCP as just an API wrapper. It is often a wrapper, but the wrapper is doing something specific: turning provider-specific operations into a common, discoverable, schema-backed tool interface for an LLM runtime.

The Simplest Mental Model

Here is the version that finally made it click for me.

An MCP server is the adapter that exposes those capabilities through the MCP protocol.

It is called a server because, from the assistant’s point of view, it serves capabilities over a protocol boundary. That does not mean it has to be a public web server running somewhere on the internet. It can be a local process, a Docker container, or a small service launched by the host. External here means outside the host boundary, not necessarily remote.

The practical pattern is usually one MCP server per capability provider.

A Gmail MCP server is the Gmail-side adapter. It can expose many callable functions:

search_messages
list_labels
get_thread
create_draft
send_message

A Slack MCP server is the Slack-side adapter. A filesystem MCP server is the local-files adapter. The server is the boundary around the provider; the functions inside it are the individual operations.

Most of the time, the server still wraps something ordinary:

a Gmail MCP server wraps the Gmail API
a Slack MCP server wraps the Slack API
a filesystem MCP server wraps local files
a Wikipedia MCP server wraps Wikipedia data

In MCP terms, the AI application is the MCP host.

For this post, think of the host as the product boundary: the assistant UI, conversation state, model interface, harness, and MCP clients all live on the host side. The actual model may be local or hosted; the host is the application that calls it. The harness is not a separate MCP role. It is the part of the host I care about because it owns routing, validation, approvals, audit, and evidence shaping.

An MCP client is the host-side protocol connection to one MCP server. If the host talks to Gmail, Slack, and the filesystem, it may maintain separate MCP client connections for each server. Across that protocol boundary sit the MCP servers.

AI application / MCP host

Assistant UI + conversation state

Harness / agent control layer
routing, policy, approvals, audit, evidence

Model interface
calls local runtime or hosted model API

MCP clients
host-side connections to servers

MCP protocol boundary

MCP servers

Gmail MCP server
search, labels, threads, drafts, send

Slack MCP server
channels, threads, messages

Filesystem MCP server
read, search, metadata

Provider boundary

Provider / resource

Gmail API

Slack API

Local files

That means MCP gives the host and server a repeatable handshake around capabilities that may ultimately be API calls:

The MCP Handshake

Discovery: the host asks which functions the server exposes.
Description: the server returns names, descriptions, and argument schemas.
Execution: the host calls one function with validated arguments.
Result: the server returns structured data for the host to shape into evidence.

That is the useful part.

But notice what is missing from that list.

MCP does not decide whether the function is safe. It does not decide whether the user approved it. It does not decide whether a Gmail result should be summarized, redacted, logged, cached, or shown back to the model. It does not decide whether send_email should be available just because search_messages is available.

Those are product and harness decisions.

The Context Window Is The Real Tool Problem

Once the vocabulary became clearer, the harder question was not “can MCP expose tools?”

It was this:

How does an LLM choose tools without overloading its context window with the full tooling universe?

This is one of the most important practical agent-engineering problems.

The lazy version of a chatbot application, especially before a clean server boundary exists, is to start every thread with a huge prompt:

You can search Gmail, list Gmail labels, fetch Gmail threads,
create drafts, search Slack, post to Slack, inspect files,
query calendars, search tickets, browse the web...

That is not an architecture. It is a context-window landfill.

MCP gives the host a standard way to discover tool schemas, but the host still has to decide which of those schemas should reach the model at this moment. Dumping every discovered server, function, argument definition, OAuth caveat, and connector detail into every conversation recreates the old problem in a new place.

The lighter abstraction is to expose intentions first:

{
  "available_intents": [
    { "intent": "search_emails", "policy": "read_only" },
    { "intent": "query_calendar", "policy": "read_only" },
    { "intent": "inspect_local_files", "policy": "read_only" },
    { "intent": "search_slack", "policy": "read_only" }
  ]
}

Now the LLM can answer the first routing question without seeing the full Gmail, Slack, Calendar, and filesystem manifests. For a user request like “show me the latest three emails”, the model can emit a narrow structured intention:

{
  "intent": "search_emails"
}

The host-side harness then maps that intention to the relevant MCP server and exposes only the small tool subset needed for the next model decision:

{
  "available_tools": [
    {
      "name": "gmail_list_messages",
      "arguments": {
        "max_results": "number",
        "query": "string"
      }
    },
    {
      "name": "gmail_get_thread",
      "arguments": {
        "thread_id": "string"
      }
    }
  ]
}

Only then does the model choose the exact callable function:

{
  "tool": "gmail_list_messages",
  "arguments": {
    "max_results": 3,
    "query": "newer_than:1d"
  }
}

That is the staged narrowing process. The LLM owns intention identification. The harness owns mapping, policy, approval, execution, audit, and evidence shaping. The MCP server owns the provider-specific call shape.

User request"Show me the latest three emails."

User turnConversation state reaches the harness.

Expose intentionsPrompt includes small intent menu, not every schema.

Structured intent{"intent":"search_emails"}

Expose Gmail toolsOnly the Gmail read subset is made visible.

Exact tool callgmail_list_messages({max_results:3})

Validate policySchema, scopes, read-only posture, approval rules.

Provider requestThe Gmail MCP server calls the Gmail API.

MCP resultMetadata returns to the harness as evidence.

Shape evidenceRedact, audit, compact, then send evidence to the model.

AnswerThe model writes the human-readable response.

If a write action appears in the middle of that flow, the harness should pause for preview and approval before execution. The point is not to make the model timid. The point is to keep the model’s search space small while keeping side effects under product control.

The trick is not to show the model every tool. Narrow the search space first, then ask for the exact structured call.

Some systems expose all tools directly to the model. For a small demo or a tiny tool ecosystem, that is reasonable. It is simpler, has fewer orchestration steps, and avoids another round trip.

But as the tool ecosystem grows, that simplicity stops being free.

If the model sees hundreds of tools and thousands of schema fields, the context window becomes a dumping ground. The practical answer is hierarchical exposure: choose intention from a small set, expose only the relevant tool subset, then generate the exact structured call.

Docker MCP Toolkit Is A Manager, Not The Trust Model

This is where my own work made the lesson concrete.

I have been working on a local AI-first solution, and one of the practical questions was how external tools should appear without making the user paste transport commands into a form like a punishment.

The first time the boundary became obvious, the UI could see more than the runtime was willing to use. A manager profile could show external capabilities. A tools page could display them. But that did not mean the model should immediately receive every function behind that profile.

That felt annoying at first, because it made the product look less “connected” than the setup technically was. But the annoyance was useful. It forced the distinction I had been missing.

Docker Desktop’s MCP Toolkit is useful here because it gives a manager-like UI around catalogs, profiles, containers, gateway behavior, and credential support. Docker’s own docs describe the catalog as a curated collection of MCP servers and the gateway as a proxy that handles server lifecycle, routing, and authentication across profiles.

That is useful plumbing.

But it is still plumbing.

The host-side harness still has to decide what enters the model-visible manifest.

For example, a Docker MCP profile may make Wikipedia, Gmail, and Slack visible. The host-side harness may still choose a much narrower runtime posture:

Available Versus Routed

Wikipedia: available and routed, because the enabled functions are read-only.
Gmail: available, but not routed until account authorization, scopes, and read-only policy are clear.
Slack: available, but write-capable functions stay blocked until approval flows exist.

This is the distinction I care about most:

Available is not the same as routed.

Routed is not the same as approved.

Once that clicked, MCP stopped looking like a magical agent feature and started looking like a sensible extension boundary.

MCP standardizes connection.

It does not standardize judgment.

Docker MCP Toolkit can make tools visible, but the host-side harness still decides which functions are routed into the model-visible manifest.

If you want the terminal version of creating a Docker MCP profile, assigning catalog servers, and registering the gateway with a client, I moved that into a small companion reference: Docker MCP Toolkit profile setup.

The product point remains the same: a profile registers servers with Docker MCP. It does not make every advertised function safe to route into the model.

Read-Only Is Not A Vibe

One of the fastest ways to make an assistant feel impressive is to let it touch personal tools.

One of the fastest ways to make it untrustworthy is to blur read and write behavior.

The same Gmail server can make the distinction obvious. “Summarize my unread email from this morning” is a read path. The MCP server knows how to talk to Gmail, but the host-side harness still decides whether the Gmail functions are visible, which result fields come back, and what gets recorded.

“Reply to Alex and say I will be ten minutes late” is a different class of action. The system should compose a draft, show the recipient and exact text, wait for approval, and only then send. If the same path treats search and send as merely two advertised functions, the architecture has already lost the important distinction.

I would start with read-only actions:

Read-Only First

Gmail: search messages, list labels, fetch snippets.
Calendar: list events, check free/busy, show calendar names.
Wikipedia: search, get summaries, fetch article metadata.

I would hold back actions that change the world:

Approval Required

Gmail: send email, delete email, forward attachment.
Calendar: create events, update events, invite attendees, cancel meetings.
Slack: post messages, react, invite users.

Those actions need previews, approvals, audit records, and revocation. They should not become available merely because a server advertises them.

The same applies to evidence.

If a Wikipedia connector returns a title and URL, that can be cited as Wikipedia evidence. If a Gmail connector returns a message subject, the system should not invent a public source URL because some generic normalizer once did that for Wikipedia. Helpful fallbacks become false provenance when they are applied globally.

This is where the boring engineering matters.

Fail closed when a function is ambiguous. Keep credentials out of prompts. Do not route write actions before the approval path exists. Do not turn private snippets into fake citations. Keep the audit trail local and explicit.

This is not fear. It is interface discipline.

This is also how I think about local AI-first systems. Local-first does not have to mean offline-only. Gmail, Slack, Calendar, and Search may still be external APIs. The important part is that credentials stay out of prompts and logs, approval state and audit records remain under the user’s control, and the host-side harness decides what evidence reaches the model.

Read-only is a product policy, not a hopeful interpretation of a function name. Visibility does not imply permission.

Where I Would Start

For a practical local assistant, I would not start by connecting every personal tool and hoping policy catches up.

I would start narrower:

connect one public, read-only server such as Wikipedia
show the discovered functions grouped under one user-facing tool
route only functions with explicit read-only metadata
record tool calls and compact evidence
add Gmail or Calendar read-only after authorization and scope display are clear
add write actions only after preview, approval, audit, and disable paths are real

That order is slower than a demo.

It is also much closer to something I would trust.

The uncomfortable part is that responsible MCP adoption can look unimpressive at first. A profile may show ten exciting tools, while the assistant only routes one public read-only connector. That looks cautious because it is cautious.

But the alternative is worse: a system that confuses discovery with permission, then discovers the trust model only after private data or side effects are already in the path.

This is the staged path I trust more than a flashy demo: read-only first, then identity, scopes, approval previews, and audit before write-capable functions become routine.

Closing Thoughts

MCP makes much more sense to me when I stop treating it as an agent feature and start treating it as an integration standard.

It is not the assistant. It is not the model runtime or provider. It is not the safety model. It is not the product UI. It is not the approval system.

It is the protocol that lets an AI host discover and call external capabilities in a more standard way.

That is already enough.

The real product work starts after discovery: deciding what should be visible, what should be routed, what should require approval, what evidence should return, and what must never enter the prompt in the first place.

That is why I now find MCP interesting.

Not because it removes the hard parts, but because it gives the hard parts a cleaner place to live.

Useful References

ML Engineering Needs A Taxonomy

2026-04-14T00:00:00+00:00

If you spend enough time reading job descriptions in this space, a pattern starts to feel impossible to ignore.

Everyone says they want an ML Engineer.

And honestly, the ambiguity is getting tiring.

That reaction is not theoretical for me. I started much closer to data science: notebooks, experiments, signal, modelling. Then the work pulled me toward software engineering, data engineering, pipelines, deployment, observability, and production support, because useful ML does not stop at the point where a model looks promising.

Over time, that is how I grew into ML Engineering: not as a clean title switch, but as a set of responsibilities that kept appearing at the boundary between learning systems and production systems.

So my issue is not that titles need to be perfect. They never will be. Real work is messy, teams evolve, and people grow across boundaries. Some overlap is not only healthy; it is necessary.

But there is a difference between healthy overlap and lazy role design. The market keeps blurring the difference, then acts surprised when hiring becomes noisy and delivery becomes uneven.

Part of this is normal. Almost law-governed, in the sense that it was always going to happen once a young field started expanding quickly. First the work appears. Then the titles arrive. Then everyone tries to sound a bit more complete than they really are.

That is how you end up with profiles that drift from useful description into theatre:

AI Strategist
Data Whisperer
Full-Stack ML Scientist
MLOps Architect Evangelist

The theatre is funny until it becomes the hiring spec.

Some of that is harmless branding. Some of it is people trying to survive a confusing market. But the same exaggeration that helps people market themselves also makes the field harder to describe clearly.

And when companies copy that ambiguity into job descriptions, it stops being funny.

What they often seem to mean is:

a data scientist who can productionise models
a software engineer who understands modelling
a data engineer who can own features and pipelines
an ML platform engineer who understands infrastructure, MLOps, serving, observability, reproducibility, and the surrounding ecosystem of tools and practices
if senior, then we are talking about someone who can also think about product, experimentation, and stakeholder communication
and finally, someone who keeps up with a research landscape that has recently exploded in both depth and width

In other words, a unicorn.

The unicorn appears when one role quietly absorbs five centres of gravity.

This is not just a wording issue.

It is a taxonomy failure, and taxonomy is how teams coordinate work.

Role language decides what hiring loops test for, what teams staff for, what people own, and what they are judged against. Once that language becomes vague enough, the problem stops being semantic and becomes operational.

The field has become too wide for vague labels to carry this much responsibility.

When taxonomy fails, organisations do not just misname work. They mis-coordinate it.

The Ambiguity Problem

Terms like data scientist, applied data scientist, data engineer, software engineer, and ML Engineer are all used inconsistently.

Sometimes that is understandable. Real teams evolve. Smaller companies need broader people. Titles drift.

But the ambiguity is now large enough to create real problems:

hiring managers ask for impossible overlap
candidates do not know what success in the role actually means
teams build fuzzy ownership boundaries
delivery slows down because responsibilities are unclear
people get judged against expectations nobody made explicit

When roles are unclear, ownership fragments. Work gets duplicated, responsibilities fall through gaps, and accountability becomes negotiable. The system does not just slow down; it becomes harder to reason about.

That is where this stops being semantics and starts becoming an organisational failure mode.

T-Shaped Is Good. Unlimited Width Is Not.

I am strongly in favour of people being T-shaped.

That matters. Breadth improves collaboration. It reduces blind spots. It helps teams understand each other’s constraints.

My own path depended on that kind of breadth. Moving from data science into software and data engineering was uncomfortable at times, but it made me a better engineer. It taught me why a beautiful model can still be useless if the features are unstable, the pipeline is fragile, the deployment path is unclear, or nobody can explain the production behaviour six months later.

Breadth helps. Unlimited width collapses into role design theatre.

But there is a difference between healthy breadth and role collapse.

There is a point where “be cross-functional” turns into “please cover four disciplines badly enough that we can pretend one headcount is enough.”

That is not maturity.

That is under-specification disguised as ambition.

A Working Taxonomy

I do not think these roles need perfectly rigid borders, but I do think we need clearer centres of gravity.

1. Software engineers

Their centre of gravity is:

application design
interfaces and boundaries
maintainability
testing discipline
deployment, reliability, and operational quality

They may or may not know much about modelling.

That is not the point of the role.

2. Data engineers

Their centre of gravity is:

data movement
storage and retrieval patterns
pipeline reliability
batch and streaming data infrastructure
data quality and availability

They make sure the data side of the system can actually support the work being asked of it.

3. Data scientists

Their centre of gravity is:

extracting signal from data
experimentation
hypothesis testing
feature and model development
evaluation and interpretation

They sit closest to learning from data and turning uncertainty into useful insight.

4. ML engineers

This is the ambiguous one, which is precisely why it needs more care.

To me, the ML Engineer owns the boundary between experimentation and production, where models, data, and systems must behave reliably under real-world constraints.

It is not “person who does some of all of the above.”

It is the role that cares about:

turning model-driven logic into production systems
managing the boundary between experimentation and serving
making inference, features, deployment, monitoring, rollback, and reproducibility actually work together
understanding enough of software, data, and modelling to make the whole thing operationally coherent

That is already a serious role.

It does not need to secretly absorb every adjacent discipline to be legitimate.

Where The Industry Gets It Wrong

The confusion starts when companies use ML Engineer as a placeholder for unfinished thinking.

Instead of deciding what the team is missing, they post a role that quietly asks for:

modelling depth
platform depth
data pipeline ownership
backend engineering
experimentation design
stakeholder fluency
production support

Sometimes one person can cover a surprising amount of that.

In many cases, though, this is not confusion. It is a team design shortcut, trying to compress multiple roles into one headcount.

But building a role definition around the best-case outlier is not a sound organisational strategy.

Why This Matters In Practice

This ambiguity creates at least three practical problems.

1. It distorts hiring

If the role is unclear, the interview loop becomes unclear too.

You end up testing fragments of four disciplines and then pretending the aggregate signal means something precise.

2. It creates unfair expectations

People join thinking they were hired for one centre of gravity and then discover they are being evaluated against three others.

That is bad management, not professional growth.

3. It weakens team design

When roles are vague, interfaces become vague too.

And when interfaces are vague, teams stop designing good collaboration boundaries and start relying on heroic overlap. That creates ownership gaps, blame diffusion, duplicated work, burnout, and systems that degrade because nobody clearly owns end-to-end quality.

That does not scale well.

What We Need Instead

This is not really an argument about job titles.

It is an argument about how work is partitioned in complex systems: where the interfaces are, where ownership sits, and where accountability lands.

We need a clearer taxonomy and a more honest way of describing overlap.

Not rigid boxes.

But explicit definitions, shared language, and a better sense of where one role’s centre of gravity ends and another begins.

The goal is not rigid boxes. The goal is honest interfaces.

The interesting part is not eliminating overlap. The interesting part is naming it properly.

That is where tools like Venn diagrams, capability maps, and role definitions actually help. They force teams to say:

what this role owns
what it touches
what it collaborates with
what it is not expected to carry alone

That is healthier for hiring and much healthier for delivery.

The Real Takeaway

I do not think the industry needs fewer broad engineers.

I think it needs more honesty about what breadth costs.

T-shaped growth is real.

But asking one person to fully cover software engineering, data engineering, data science, and ML Engineering is usually not ambition. It is taxonomy failure.

Clearer role definitions are not admin. They are part of how complex work gets partitioned.

Once role definitions become vague enough, organisations stop designing teams and start relying on exceptional individuals. That might work occasionally. It does not scale.

A taxonomy is not bureaucracy. It is how the work gets divided clearly enough for systems, teams, and people to survive contact with production.

Coding Got Cheap. Verification Did Not.

2026-04-08T00:00:00+00:00

Right now, the loudest claim around LLM coding tools is that coding is becoming a commodity.

I think that is directionally right. What I do not think follows automatically is the part people usually jump to next: that software delivery will therefore speed up by the same factor. The more I use these tools, the less convinced I am by that leap.

Yes, they can write routine code quickly; they can refactor at a pace that would have felt absurd not long ago. But one friction point keeps getting sharper every time:

We have increased write throughput.

We have not increased verification throughput at the same rate.

That is the part I think many teams are about to feel much more acutely: review friction.

At least, that was obvious in my own team within a week of all of us adopting LLM CLIs more seriously in our workflow. Code was appearing faster. Refactors were cheaper. Experiments were easier to try. But the moment those changes started piling up, the real constraint showed itself again: someone still had to understand them, review them, and decide whether they were safe to merge.

And while this is easiest to see with LLM CLIs and all the current code-vibing enthusiasm, I do think the point extends to agents too.

Agents do not have the agency they would need to make software delivery scale in a production environment.

They can generate code. They can propose plans. They can widen the search space. But they do not own production risk. They do not carry on-call duty. They do not defend the change in front of a customer. They do not absorb the cost of being wrong.

That responsibility is still human.

And because that responsibility is still human, the bottleneck has moved.

The new imbalance is simple: code generation is accelerating faster than review and verification.

From Writing To Verification

For a while, most of the conversation around coding agents was about output:

how many files they can touch
how quickly they can scaffold
how much code they can produce in one go
whether coding itself is becoming a commodity

That is no longer enough as a way of thinking.

If code generation gets ten times faster while review, integration, and verification stay roughly flat, the system does not become ten times faster. It becomes unstable.

What used to be scarce was code production. What is scarce now is trust. And trust is slower. It lives inside:

review bandwidth
change understanding
test quality
integration sequencing
rollback confidence
the ability to explain why a change is safe

That is why I do not find “these tools make engineers faster” a very useful claim on its own. Faster at producing diffs is not the same thing as faster at delivering software. Worse, if you leave the system unchanged, the imbalance compounds:

more code appears
reviewers get overloaded
review quality drops
defects move downstream
rollback frequency rises
trust in generated changes starts to erode

So no, the bottleneck did not disappear; it moved from writing code to trusting code.

The Wrong Fix: More Agents

I think many teams are still responding to this with the wrong instinct.

If generation is cheap, they assume the answer is to introduce even more agents, even more automatic change, even more output.

But more agents do not solve a trust bottleneck; they amplify it. Without strong engineering constraints, cheap generation gives you:

bigger pull requests because exploration is cheap
noisier pull requests because changing code is cheap
more speculative diffs because rewriting is cheap
slower reviews because understanding still costs the same

That is not scale; it is faster chaos. If teams do not build a stronger trust system around these tools, they will not really scale AI-assisted development. They will just generate more change than they can responsibly absorb.

The Better Framing: Verification Systems Design

This is why I think the right framing is not “how do we optimise the PR process?”

but:

How do we design a verification system that can keep up with generated change?

Smaller PRs matter. Merge queues matter. I believe that strongly. But they are not enough on their own. They improve the shape of change. They do not automatically make change trustworthy.

If you want AI-assisted development to scale, you need a system that turns fast code generation into verifiable, reviewable, bounded progress. That means moving from reviewing code to reviewing guarantees.

A verification system is not just a pile of checks. It is a structured way of turning change into bounded, testable, explainable units of risk.

Review Guarantees, Not Just Diffs

Right now, too many AI-assisted workflows still look like this:

Tool writes code

Human reviews diff

Human approves

Hope nothing subtle broke

That does not scale; it just shifts cognitive load onto the reviewer.

The better pattern is to require every serious change to state clearly:

what changed;
what must remain true;
how we know it works;
what failure modes were considered.

If that information is missing, the reviewer is being asked to reconstruct intent from the diff, infer risk from context, and simulate behaviour in their head.

That is expensive, and that is exactly the kind of review friction we should be trying to remove. The important part is to make those guarantees tangible. For example:

this transformation preserves ordering invariants;
this refactor is behaviorally equivalent under property tests;
this change cannot affect downstream state transitions because the boundary remains unchanged.

Once a reviewer sees that kind of claim backed by evidence, the whole exercise changes. They stop scanning raw volume and start checking bounded risk.

A better review model is not “read more diff.” It is “check stronger guarantees.”

Back To Fundamentals

This is the part I find slightly amusing. Once you follow the argument through, the answer starts sounding strangely old-fashioned. If review friction is the bottleneck, then we do not get out of it with more theatrical tooling.

We get out of it by returning to fundamentals:

smaller PRs;
clearer intent;
narrower scope;
better tests;
merge queues, and;
easier rollback.

That is not because these are fashionable process ideas. It is because they reduce the cost of review and verification.

Large PRs force reviewers into archaeology. They have to reverse-engineer intent, infer boundaries, and simulate outcomes in their head.

Small PRs let them ask a much narrower question:

Is this one change understandable, bounded, and safe to merge?

That is a real throughput advantage.

In an agent-assisted workflow, this matters even more. The natural temptation is to let the tool range widely and submit one impressive diff. That is exactly the wrong shape of change if trust is the bottleneck.

So yes, smaller PRs, stacked changes, narrow intent, and one decision per review unit, become a must! They are no longer, simply, about hygiene. It is part of the verification system.

This is also where a simple test-driven instinct helps a lot. For example, if someone wants to do a refactor, one very clean pattern is:

first PR: add tests and increase coverage
second PR: do the refactor

The separation matters.

In the first PR, the intent is obvious: we are improving confidence.
In the second PR, the tests stay fixed, which makes the claim much narrower: behaviour should stay the same.

That lowers cognitive load immediately.

The same principle generalises. If a change is behavioural, keep the scope small. If a feature is large, deliver it in steps. The hardest work is usually restructuring, and that is exactly where thinking hard about incremental delivery matters most.

If you want something practical to adapt for your own team, I put together a reusable reference here:

PR template for higher-trust AI-assisted delivery

Force Decomposition At Generation Time

This is where I would push the workflow harder. Do not wait until review time to discover that the diff is too large. Force decomposition earlier.

The correct shape is:

Task

Plan

Substeps

PR sequence

Not:

Task

Giant AI diff

Panic review

This is one of the most useful things these tools can do, by the way. They should not just write code. They should help propose the incremental delivery plan by which the code can be introduced safely.

That is a much better use of an agent than simply asking it for more implementation.

Small PRs are not tidiness theatre. They are one of the cleanest ways to lower review friction.

Shift Validation Left Into Machines

If humans remain the primary validators of AI-generated code, I do not think the model scales very far.

Humans should still own risk. But they should not be forced to simulate execution in their head for every meaningful change. That means stronger machine-side verification.

1. Property-based testing

I think property-based testing is one of the most underused tools here.

Why?

Well, because many AI-generated bugs are not obvious syntax bugs. They are edge-case bugs. Boundary bugs. “This looked correct for three examples and broke on the fourth” bugs.

Property-based testing helps because it checks invariants across many generated inputs instead of blessing one or two happy-path examples.

A few practical cases (skip these if you get the point):

a parser should round-trip valid inputs without losing structure
a serialization layer should preserve data after encode/decode
a ranking function should preserve ordering invariants you care about
a pricing or allocation function should never produce negative totals or violate conservation constraints
a stream transformation should preserve event counts when it is not supposed to drop or duplicate events
an aggregate that should only grow as more events arrive should remain monotonic
a pipeline that depends on arrival order should preserve event ordering where that contract is supposed to hold

That matters because it turns “I read the diff and it seemed fine” into “the core property stayed true under many cases.”

That is a better verification signal.

2. Static analysis gates

Static analysis is another place where teams should be more aggressive.

Not static analysis theatre. Not one more badge in CI. Real gates.

Practical examples:

type errors should fail fast;
nullability violations should fail fast;
unsafe imports or forbidden dependencies should fail fast;
obvious dead code or unhandled branches should fail fast, and;
insecure patterns or dangerous API usage should fail fast.

The more routine structural mistakes a machine can reject automatically, the less human energy gets wasted on basic hygiene.

That leaves humans freer to review the part that actually matters: design, guarantees, and risk.

3. Runtime assertions

I am much less enthusiastic about runtime assertions than about tests, validation, or stronger system boundaries.

Most of the time, if you need an assertion, it is worth asking whether the system should have prevented that state earlier through better design, clearer contracts, or stricter validation.

In other words, I would not treat assertions as a primary verification strategy.

They still have a narrow place, though, around internal invariants that should be impossible if the rest of the system is behaving correctly. For example:

a state machine reaches an illegal transition;
two mutually exclusive internal flags are both true;
an event-ordering assumption inside one component is suddenly broken, and;
an internal contract is violated in a way that risks silent corruption.

That is where a loud failure can be better than quietly propagating bad state.

So ok, assertions can help, but only as a last line of defence. I would much rather prevent bad states than merely notice them at runtime.

Add Risk Awareness To Review

Another thing I think teams need is a more explicit notion of change risk.

Not every AI-generated change should go through the same review path.

There is a difference between:

a local refactor;
a business-logic change;
a concurrency change;
a stateful systems change, or;
a distributed recovery or integration change.

Those should not all be treated as the same kind of review object.

What I would want is some form of confidence or risk scoring:

🟢 low-risk cosmetic or local changes get a lighter path
🟠 medium-risk logic changes get stronger automated evidence
🔴 high-risk stateful or distributed changes get narrower scope and deeper human scrutiny

Right now, most teams still treat this too uniformly:

Open PR

Assign reviewer

Hope for the best

That is not mature enough for the level of change velocity these tools can produce.

Trust Is What Makes Automation Scale

If there is one broader point underneath all of this, it is that:

Automation does not scale on capability alone; it scales on trust.

If an AI system is not trustworthy, people will hesitate to adopt it, hesitate to depend on it, and ultimately refuse to give it real responsibility. That is true whether we are talking about coding tools, agents, or any other form of automation.

And trust does not appear by magic. It comes from being able to explain what the system is doing, trace why it did it, bound the risk, and verify that it is behaving safely enough to rely on.

That is why verification matters so much. A strong verification system is how an organisation turns output into trust.

The self-driving cars example makes that point clear.

The problem with self-driving was never just whether people would emotionally accept the absence of a driver.

You can put a human in the driver’s seat and solve part of the problem for a while. That gives you supervision, and maybe enough trust to experiment. But it also shows the limit immediately: you still have not built enough trust into the system for automation to carry the responsibility on its own.

To unlock the real benefit, you need a validation system strong enough to make the absence of a driver trustworthy.

Simulation mattered.
Certification mattered.
Safety cases mattered.

Verification pipelines mattered.

We did not start trusting self-driving because models improved. We trusted it only to the extent that validation systems became industrial.

We do not need agents with mystical agency.

We need enough trust in their output that automation can carry more of the load without a human having to re-derive everything from scratch.

Automation starts to scale when trust is built into the delivery path itself, not when a human has to keep rescuing the system from the driver’s seat.

If I were designing for this bottleneck deliberately, I would want something closer to this:

A task is decomposed into a sequence of narrow changes before major implementation begins.
Each change states intent, invariants, and how correctness will be validated.
Automated checks do the first line of trust work: tests, static analysis, diff classification, CI.
Reviewers focus mostly on boundary decisions, guarantees, and system fit.
Merge queues and rollback paths keep integration disciplined and stop trust from being wasted in merge thrash.

That is a much more serious model than “AI writes, human skims, merge and pray.”

The practical takeaway is not to resist agents. It is to build an engineering system where review and verification can keep up with them.

The real unit of speed is not how quickly code appears in a branch. It is how quickly a team can move a change from idea to trusted production without losing control of the system.

That is the metric that matters. And once you define speed that way, the answer stops sounding futuristic. It becomes strangely familiar:

smaller PRs;
clearer intent;
stronger guarantees;
better tests;
static analysis gates;
selective runtime assertions;
merge queues, and;
low-friction rollback.

These are not bureaucratic leftovers from a slower era. They are what make faster tooling usable.

If LLM tooling keeps improving, the teams that win will not be the ones that generate the most code.

They will be the ones that turn trust into a system.

If coding is becoming a commodity, verification is not.

And if agents do not have agency, the burden of trust still sits with us.

Many teams are about to discover that the next productivity battle is not about writing code at all. It is about whether their engineering system can metabolise AI-generated change without losing control.

The best prompt in the world will not save a team that cannot review, verify, and integrate change with discipline.

That is a much less theatrical advantage. It is also the real one.

Kafka Streams vs Flink Is The Wrong Question

2026-04-01T00:00:00+00:00

I am not neutral about Flink.

I have spent years advocating for it, using it anywhere I could, organizing London meetups around it before COVID, and talking to anyone who would listen about why the dataflow model is such a good way to think. I still love that model. I love how naturally event-driven systems can align to a domain: a ship enters a port, this state changes, that downstream action happens next. Both Flink and Kafka Streams let you express stateful processes in a way that can stay close to business reality.

And that is exactly why this lesson was useful for me.

When I joined a later role, I found myself surrounded by repositories built with Kafka Streams. My first instinct was simple: replace them with Flink. Some of those repos were chaotic, under-loved, and far away from the kind of streaming architecture I like to build. I felt outside my waters. I wanted to modernize, refactor, migrate, clean the slate.

But over time, after giving those systems the attention they deserved, I learned something more valuable than another framework argument:

The useful question is not whether Flink is "better" than Kafka Streams.

The useful question is when your streaming problem stops being an application concern and becomes a platform concern.

That is still the line I care about most. But now I care about it with much more respect for both sides.

This is the real fork in the road: not which mascot wins, but whether the system is still application-shaped or is becoming a platform concern.

The Bias I Had To Correct

There is a recurring engineering mistake hiding in this topic: you inherit a system that feels old, untidy, or unfashionable, and you start reaching for the framework you know better.

I have had to relearn this lesson more than once in my career. It is almost embarrassing how often it comes back, which is probably proof of how important it is.

I originally wanted to replace those Kafka Streams solutions largely because I was more fluent in Flink. That fluency gave me clarity in one framework and discomfort in the other, and I briefly mistook that feeling for architecture.

That is a dangerous mistake.

Once I slowed down, cleaned up the code, made the domain model clearer, and brought more disciplined engineering practices to those codebases, I ended up with a much less dramatic conclusion:

if you give an existing streaming system enough love, enough structure, and enough respect for the underlying model, you can get very far without rewriting it.

That does not make Flink less good. It just makes engineering judgment less theatrical.

The urge to rewrite is strong. The better question is whether the system is structurally wrong or simply under-engineered.

The Lesson

Framework preference is not architecture. My first instinct was to rewrite messy Kafka Streams systems into Flink. The better answer was to clean the model first, then decide whether the runtime was actually the problem.

What I Still Love About Flink

Let me be clear: I am still a very strong Flink advocate.

I still think the Flink dataflow model is one of the cleanest ways to reason about stateful stream processing. Operator boundaries are explicit. State feels local to the operator that owns it. Checkpointing, recovery, repartitioning, and event-time semantics feel like first-class runtime concepts instead of side effects of a library attached to a broker.

That is a big deal to me, because I care a lot about how easily a streaming system can be explained.

When a framework makes the flow of state and events easy to communicate, it usually also makes the system easier to maintain.

But none of that comes for free.

Flink asks you to pay an upfront complexity tax in operations, onboarding, debugging, and platform maturity. Misconfigured jobs are not charming. They are expensive. The model feels cleaner once you have paid that tax, not before.

This is the part many framework comparisons skip: the platform is powerful, but you do pay for the privilege of operating it well.

This is why I still reach for Flink eagerly when the runtime itself needs to be a serious part of the design.

Where Kafka Streams Grew On Me

What changed for me was not that I stopped liking Flink. What changed is that I learned to appreciate where Kafka Streams is more enabling than I first allowed.

1. The State Model Is Different, Not Just Worse

One of the things that threw me off at first was the ergonomics of state in Kafka Streams.

Kafka Streams gives you state stores, changelog-backed recovery, and table-oriented patterns that can feel more globally available than Flink’s cleaner operator-local state style. The processor API is very explicit that processors interact with attached state stores, and those stores are fault-tolerant by default. In practice, the default persistent path is a local RocksDB store backed by a compacted changelog topic. On top of that, table abstractions and GlobalKTable-style patterns can make shared reference data or queryable state feel very convenient in the application model.

That convenience comes with real trade-offs:

local RocksDB state is fast and useful, but fault tolerance still depends on changelogs
restore times can still become painful at scale, especially when local state is lost and the store must rebuild from the changelog
the relationship between topology code and materialized state can become messy in under-disciplined repos
the convenience of reachable state can encourage poor habits if the model is not kept clear

But convenience is still convenience. There are use cases where having easier access to shared or queryable state is genuinely useful, and it would be dishonest to pretend otherwise.

My instinct, because of my Flink background, was to push Kafka Streams code toward a more operator-local way of thinking anyway: make state ownership clearer, keep logic close to the transform that really owns it, and avoid turning the topology into a stateful soup. That discipline improved those codebases a lot.

But that is exactly the point: bringing some Flink-style discipline into Kafka Streams made the code better. It did not prove that the whole system needed to become Flink.

2. Kafka-Native Integration Is A Real Strength

I am not even talking here about the obvious ecosystem point in a lazy way. Yes, Kafka Streams lives naturally inside the Kafka ecosystem. Yes, it works comfortably with keyed messages, schemas, topics, and the usual surrounding tooling. Yes, schema-registry-oriented flows often feel more straightforward there.

That matters. Not because Flink cannot do these things. It can. But because being native to the ecosystem reduces friction when the whole world around the application is already shaped like Kafka.

You should not dismiss that as a minor detail. It is part of the operating model.

Where Flink Still Pulls Away

This is where my original instincts still hold up.

1. Scaling Stops At The Broker Boundary Much Earlier In Kafka Streams

The scaling constraint in Kafka Streams is tightly tied to partitions, tasks, and instances. That is not a bug. It is the design. It is also why the system stays so close to Kafka itself.

But it has consequences.

There comes a point where adding more application instances does not really solve the problem because the partitioning boundary is already telling you how far you can go cleanly. You can absolutely scale Kafka Streams, but the broker topology keeps exerting a much stronger influence on the application topology.

At that point, scaling stops being primarily demand-driven and starts becoming topology-constrained.

Flink, by contrast, is still constrained at the source when consuming from Kafka, but once records are inside the runtime it has far more freedom to repartition, redistribute work, and run operators at a different parallelism from the source. I would not call that infinite scaling. I would call it a materially more flexible runtime.

That difference becomes major once traffic spikes, repartition pressure, or uneven workloads start shaping your architecture.

2. Checkpointing And Recovery Are In A Different League

This is still one of the clearest differentiators for me.

Flink’s checkpointing model is part of the platform. Recovery is an explicit runtime capability, not just the consequence of rebuilding local state from changelogs. The barrier-based snapshotting model, savepoints, and state redistribution semantics are exactly the kind of thing that make Flink feel like an engine rather than a library.

In Kafka Streams, the picture is a little more nuanced than “it always has to read the whole changelog again.” If the local state store still exists, the runtime can replay from the previously checkpointed offset and catch up from there. If local state is gone, it has to rebuild from the changelog from the beginning of the retained data. That is meaningfully better than a naive full replay every time, and it is one of the reasons the RocksDB path works as well as it does in practice.

But the deeper point still holds: fault tolerance and task migration are still anchored in changelog restoration, and on large stateful applications that can become one of the dominant operational pain points. Retention choices matter. Restore time matters. Recovery becomes less predictable under failure. Operational patience starts turning into architecture.

Running Streams applications and state restoration

At smaller scale this looks like an implementation detail. At larger scale it starts deciding how painful failure and recovery really feel in production.

That is the point where Flink stops being a nice architectural preference and starts becoming a serious operational advantage.

The Real Trade-Off

So, here is the trade in one sentence:

Kafka Streams is a very good way to build Kafka-native streaming applications.

Flink is a very good way to operate stateful dataflows as a platform concern.

Those are not the same problem, even if the diagrams sometimes look similar.

And this is why I do not buy generic advice like “use Flink if you need scale” or “use Kafka Streams if you want simplicity.”

Both statements are misleading. They sound practical, but they hide the real failure modes, encourage cargo-cult architecture, and make comfort-driven rewrites sound more principled than they are.

The better rule is this:

If your system is still primarily an application that processes Kafka topics, Kafka Streams is often the right engineering choice.

If your system is becoming a stateful processing layer that needs explicit control over time, state, replay, recovery, and heterogeneous I/O, Flink starts to justify its existence very quickly.

The Harder Lesson

This is the part I most wanted to say personally.

I am still a huge Flink proponent. That has not changed.

What has changed is that I now trust myself less when my first reaction is “we should rewrite this in the framework I prefer.”

That reaction is often just comfort seeking.

Sometimes you really should migrate. Sometimes the runtime boundary is wrong, recovery is too painful, scaling is too constrained, and Flink is the more honest architecture.

But sometimes the better engineering decision is to love the existing system properly: clarify the model, clean the state boundaries, improve the abstractions, respect the domain flow, and stop assuming that old means wrong.

That was the lesson here for me.

If I had followed my first instinct blindly, I would have replaced some systems for the wrong reason.

What I Would Actually Do

If I were starting with a Kafka-centric JVM team, modest operational requirements, and clean Kafka-in/Kafka-out topologies, I would still be very happy with Kafka Streams.

I would move toward Flink once one or more of these became persistently true:

stateful jobs became expensive to recover or rescale
I needed a broader processing platform rather than a library
event-time and replay behaviour started driving design choices
the system stopped being comfortably Kafka-shaped
operability and runtime visibility became a daily concern rather than an occasional debugging aid

That is the moment Flink stops being overkill and starts being the more honest architecture.

And that brings me back to where I started.

I still love Flink. I still think its model is easier to reason about once runtime concerns become serious. I still think it is the stronger platform when state, recovery, and rescaling dominate the design.

Many rewrites begin as comfort and only later get dressed up as architecture.

That is the part I understand better now, and it is probably the most useful thing this comparison taught me.

PyFlink In 2026: Better Than Its Reputation, Still Not Frictionless

2026-03-27T00:00:00+00:00

I do not think teams reach for PyFlink because Python feels nicer to type.

They reach for it when they have already paid the cost of splitting one ML system across two ecosystems.

I have seen that pain in the most annoying way possible: training and experimentation lived in Python, but the prediction path had to live in Java. On paper that sounds manageable. In practice it meant subtle differences in floating-point behavior, parsing choices, and even heading-angle calculations were enough to create inconsistent predictions. We lost months chasing what looked like model problems and turned out to be feature mismatches.

That is the part many architecture discussions understate. Once training is in Python and prediction is in Java, the real problem is no longer just inference. It becomes feature parity, interface parity, and the feedback loop between two runtimes that each have their own libraries, their own defaults, and their own ways of being almost the same.

This is the real tax of cross-language serving paths: not dramatic failure, but endless small mismatches that make the system harder to trust.

You can try to escape that with ONNX. You can rebuild parts of the feature logic in Java. You can expose the model behind a service boundary and call it remotely. All of these are reasonable patterns. None of them are free.

Four years ago, ONNX was not mature enough for the kinds of models and custom ops we cared about. The easy story broke precisely where real systems stop being toy examples. The fallback was the pattern most teams know well: deploy the model as a service and call it over REST. That works, but now your prediction pipeline owns an extra network hop, another SLA, another scaling surface, and one more place where raw features must remain perfectly aligned.

Model-as-a-service is often the sensible compromise. It is also where clean separation starts charging rent in latency, SLAs, and feature-parity work.

This is why I think the case for PyFlink should be stated more bluntly than it usually is:

If the real source of friction in your system is that your training, feature logic, and model-adjacent code live naturally in Python, then "just use Java Flink" is not a neutral suggestion.

It is an architectural trade, and often an expensive one.

That is the real driver for PyFlink adoption.

I went back to an older PyFlink review recently because I did not want to turn one painful period into a permanent opinion. Some of those frustrations had aged well. Some had not. And PyFlink is exactly the kind of technology people form a durable opinion about after one painful quarter and then never revisit.

That would have been lazy here, because the story has moved. PyFlink is in a better place now than many engineers assume. The official docs cover installation, packaging Python environments, debugging, a Python DataStream API, and connector examples. That is already a more serious platform story than the older dismissive take that it is simply immature.

But the core trade-off has not disappeared.

PyFlink is now real enough to take seriously, but it still does not let you forget that Flink is fundamentally a JVM-first distributed runtime. That is the part people need to hold in their head at the same time as the improvements.

What Has Improved Since The Older Evaluation

The first thing worth saying is that some of the older criticisms are now too blunt.

PyFlink is no longer just a thin curiosity around the Table API. The current docs cover installation, a Python DataStream API, debugging, dependency management, packaging Python environments for cluster execution, and connector examples:

That is already a materially better story than the one many engineers still carry around in their heads.

A few concrete improvements stand out:

1. The Python Story Is Better Documented

The installation docs now state clear Python version requirements. At the time of writing, PyFlink requires Python 3.9, 3.10, 3.11 or 3.12:

PyFlink installation

That sounds minor, but it is not. One of the easiest ways to waste time with cross-language frameworks is by discovering environment assumptions too late. The current docs at least acknowledge that this is a real part of the user experience.

2. The DataStream Story Is No Longer Hand-Wavy

One of the old reasons people dismissed PyFlink was that serious low-level streaming work still felt like Java territory.

That is less true now. The Python DataStream API is documented, examples exist, and the API surface is real enough that you can reason about it as a deliberate part of the platform rather than a side alley:

Intro to the Python DataStream API

I would still be careful not to confuse “documented” with “equally frictionless as the JVM path,” but the old complaint that PyFlink is barely there is no longer a fair description.

3. Debugging And Packaging Are Better Acknowledged

The older review spent a lot of energy on setup, environment pain, and debugging awkwardness.

Those pains have not disappeared, but the current docs are more honest about them. They cover packaging Python environments, adding JARs, client-side versus TaskManager-side logging, local debugging, remote debugging, and profiling:

This matters because it tells you something important about the maturity of the ecosystem: it now documents the pain instead of pretending it is not there.

That is progress, even if it is not magic.

Why PyFlink Is Genuinely Attractive

Despite the caveats, I do think PyFlink has a very real value proposition.

1. It Keeps The Streaming Layer Closer To The Actual ML Ecosystem

This is the point I think most comparisons understate, and it is the one that matters most to me.

The strongest argument for PyFlink is not merely “our team prefers Python.” The stronger argument is that the surrounding model ecosystem, experimentation culture, libraries, and iteration loops are still centered on Python.

This is why PyFlink remains attractive: not because the runtime becomes light, but because the surrounding Python ecosystem can stay closer to the streaming layer.

That matters when the alternative is forcing teams into one of these patterns:

re-implementing logic in Java
exporting models through formats like ONNX and accepting the translation burden
splitting the system so aggressively that the serving boundary becomes the architecture

None of these are invalid. But all of them are real costs, and in many teams they are the actual costs driving interest in PyFlink.

If the same raw features are calculated in one language for training and another for live prediction, you do not just inherit maintenance overhead. You inherit doubt. When a prediction looks wrong, is the model wrong, is the data wrong, or did one side normalise, round, parse, or order something differently? That uncertainty is corrosive, and it slows every feedback loop around the system.

2. It Meets Python-Heavy Teams Where They Already Work

If your data and ML teams already live in Python, PyFlink reduces one major source of organisational friction.

That does not mean everyone suddenly gets to ignore distributed systems. But it does mean:

feature logic can stay closer to the surrounding Python estate
model-adjacent transformations feel more natural
experimentation paths from notebook thinking to streaming execution become less culturally awkward

For some organisations, that is a very big deal.

The wrong reaction here is to sneer and say “just learn Java.” Sometimes that is the right answer. Often it is just a lazy one.

3. It Makes Flink More Reachable Without Hiding Flink

Good language bindings should not pretend the platform underneath does not exist.

PyFlink is useful when it gives Python teams access to Flink’s real strengths: state, checkpoints, event-time semantics, long-running streaming jobs, and broader dataflow capabilities. If that is what you are buying, then the Python layer can be a practical bridge.

That is especially true for teams whose work already mixes ETL, feature pipelines, and model-centric logic.

4. There Is A Real Connector Surface

This is another place where the older blanket criticism needs updating.

The current PyFlink docs and examples do show Kafka, Pulsar, and Elasticsearch examples in Python:

PyFlink connector examples

So it would be wrong to say that the connector story is absent.

But it would also be wrong to say that it feels like a pure Python ecosystem.

That brings me to the real downside.

Why PyFlink Is Still Not “Flink, But Easy”

The strongest criticism from the old evaluation still holds:

PyFlink reduces language friction, but it does not remove runtime friction.

1. You Still Have To Think In Two Worlds

The installation and FAQ pages make this clear if you read them carefully.

You have to think about:

Python interpreter version
Python packaging and archives
where Python executes
how dependencies are shipped
JAR dependencies for connectors or Java-side integration

That earlier review made this painfully concrete. Getting local execution into a sane state meant lining up:

the right Java version
the right Python version
the right connector JARs
the right Python dependencies

That list is not just setup trivia. It is the operating model announcing itself early.

That is not a small footnote. It is the day-to-day ergonomics of the platform:

This is why I would resist overselling PyFlink to a Python team as “just write Python and the rest disappears.”

It does not disappear.

It relocates.

2. The Connector Story Still Leaks JVM Reality

The connector examples are useful, but they also reveal the real shape of things: adding JARs, managing connector dependencies, and living with the fact that some integration points are still fundamentally JVM-shaped.

Even the current Kafka connector docs explicitly talk about bringing connector dependencies yourself for PyFlink jobs:

Flink Kafka connector docs

That is not a deal-breaker. It is just not the same experience as working inside a native Python framework whose extension model is Python all the way down.

It also shows up in deployment. In that earlier review, the easiest workable path for local standalone deployment was not “package a Python app and run it.” It was closer to:

start from a vanilla Flink image
add the Python dependencies
mount the repo or bundle the code carefully
run the Python entrypoint from inside the live container

That is a perfectly workable path. It is also a strong reminder that the deployment experience is still shaped by Flink’s runtime model, not by Python’s usual ergonomics.

3. Debugging Still Tells You What The System Really Is

The current debugging docs are better than before, but they are also revealing.

They distinguish between client-side logging and TaskManager-side logging. They discuss local debug, remote debug, and profiling Python UDFs. That is helpful, but it also tells you that when things go wrong, you are not debugging a simple Python program. You are debugging Python inside a distributed Flink runtime:

PyFlink debugging

In practice, that means some classes of issue still feel cross-boundary by nature:

packaging bugs
dependency mismatches
behavioural differences between local and cluster execution
performance bottlenecks around Python execution paths

This is not PyFlink being uniquely bad. It is just the cost of the abstraction being honest.

4. Native Python Models Are Not An Automatic Architectural Win

This was one of the more useful parts of the earlier review, because it is exactly the kind of point people skip when they are trying to justify a new stack.

Yes, being able to interact with model code directly inside a PyFlink job is a real plus. It can simplify some flows and avoid a network hop.

But that is not the same as saying it is always the better architecture.

Once the model is served behind a proper boundary, you often gain things that matter a lot in production:

safer zero-downtime upgrades
cleaner readiness and health semantics
independent model scaling behind a load balancer
a clearer separation between streaming orchestration and serving concerns

So, yes, native execution can save some overhead. But it can also collapse boundaries that were doing useful work for you.

The reason I still take the native path seriously is not hand-wavy elegance. It is that model-as-a-service also comes with a bill:

every prediction path now pays a network round trip
the serving tier becomes another system you need to scale for throughput and protect with its own SLA
raw feature generation has to stay perfectly aligned across the caller and the served model boundary

If demand is modest, teams can live with that for a long time. Once prediction volume rises, that architecture stops being an abstract diagram and starts showing up as latency, capacity planning, and operational drag.

5. The Performance Question Never Fully Goes Away

I would be very careful here not to pretend a benchmark I have not run.

But I am comfortable saying something narrower and more useful: if your workload is highly latency-sensitive, connector-heavy, or operationally unforgiving, the JVM path still deserves to be the default starting point.

PyFlink can absolutely be the right choice. I just would not choose it because I wanted to avoid understanding the Java side of Flink.

That is not how this platform works.

So When Would I Use It?

I would take PyFlink seriously when these conditions hold:

the team is materially more fluent in Python than in Java
the reason for adopting Flink is the runtime model, not fashion
the jobs are important, but not balanced on the sharpest latency edge
I am willing to own environment packaging and connector dependency management as part of the operating model

I would lean back toward Java Flink when:

connector maturity dominates the problem
the hot path is extremely performance-sensitive
the team already has strong JVM strength
I expect deep platform integration and want the least surprising execution path

If You Want To Try It

If this post pushed you toward experimenting rather than debating in the abstract, I put together a small starter page here:

PyFlink starter archetype and agent prompt

It is intentionally minimal. The goal is not to hand you a grand framework. The goal is to give you a sensible first project shape and an agent prompt that can get a small Python-first streaming scaffold off the ground without immediate chaos.

The Practical Takeaway

What matters here is not whether PyFlink is “good” or “bad.”

That is far too vague to help anyone.

The better question is this:

Do I want Python as the working language for a Flink system badly enough to own the extra operational boundary that comes with it?

If the answer is yes, PyFlink is now mature enough to be a serious option.

If the answer is no, then Java Flink is still the cleaner way to get the full benefits of Flink without pretending the JVM underneath is someone else’s problem.

That, at least, is the view I would hold today.

From Model Validation To Pipeline Validation

2024-07-15T00:00:00+00:00

Originally published on Medium on July 15, 2024. Lightly edited for the ML-Affairs archive.

Imagine making a decision today with the knowledge of tomorrow.

Sounds like an unfair advantage, right?

In machine learning, it is often a trap.

As an ML engineer at Vortexa, a lot of my work has lived in the space between abstract models and production tools that people can actually depend on. Over the years, my team and I have built and maintained data pipelines that feed downstream decisions in the energy domain. These systems do not just provide a snapshot of the market. They also provide signals that customers may use inside their own analysis, models, and decision workflows.

That creates a very natural retrospective question:

Had we incorporated Vortexa's predictions back in 2018, would the outcomes have been better?

That question is simple to ask and surprisingly easy to answer badly.

The business question is retrospective. The validation problem is temporal.

The Future Leakage Paradox

The usual temptation is to “travel back in time” by applying today’s model to historical scenarios.

That sounds reasonable until you notice the contradiction. If the model was trained using data that includes what happened after the period we are evaluating, then it is not really predicting the past. It is replaying the past with knowledge it should not have had.

Put differently:

A model should not be asked to predict an outcome from a past it has already learned.

That is not a small modelling detail. It changes the meaning of the whole evaluation. The model is no longer being tested as a prediction system. It is being tested as a memory system.

I started calling this the Future Leakage Paradox, or FLiP: a situation where future information seeps into a past prediction and makes the retrospective evaluation look more realistic than it really is.

Future leakage is subtle because the evaluation still looks technical. The problem is that the timeline is wrong.

Why This Matters In A Real Domain

Take vessel destination prediction as an example.

Suppose we want to evaluate how well a model would have predicted vessel destinations in 2018. The energy and shipping domains are volatile. Trade routes, demand patterns, sanctions, operational behaviour, and geopolitical constraints all change over time.

If a model trained after those changes is used to predict 2018, the retrospective result becomes misleading.

In a domain like shipping, time is not just an index column. It is part of the system.

Consider COVID-19. The lockdowns in 2020 triggered a major drop in oil demand and changed shipping behaviour. If this information leaks into a model used to retrospectively evaluate 2018 predictions, the model can assign importance to patterns that were not available in the pre-pandemic world.

The same applies to the war in Ukraine and the subsequent sanctions on Russia. Those events affected vessel movements and trade flows. A model trained after those changes may encode relationships that did not exist, or were not knowable, in 2018.

That is the practical danger. Future leakage can make retrospective predictions look strong for the wrong reason.

The model may look informed. The issue is that it is informed by events the historical model could not have known.

The Shift: Validate The Pipeline

This is where I think the conversation should move from model validation to pipeline validation.

Taken too literally, that may sound provocative. Of course model performance matters. But as an engineer, I do not only care about whether one model trained once looks good. I care about whether the training pipeline can repeatedly produce good models under the constraints of time, data freshness, and production reality.

That distinction matters because retrospective prediction should not usually be done with one model.

If we have shipping data from 2016 onward and we want to predict 2018, one sensible approach is:

train on 2016 and 2017
predict 2018
incorporate what actually happened in 2018
train a new model for 2019
repeat this process through later years

There are then two common strategies:

an expanding window, where the training data grows over time
a sliding window, where the model is trained on a fixed recent period

In both cases, the evaluation target has changed. We are no longer asking, “Is this one model good?” We are asking, “Can this pipeline keep producing reliable models as time moves forward?”

Rolling windows force the validation process to respect the timeline instead of flattening history into one training set.

Model drift and new data will always push teams toward retraining. That means the training pipeline deserves the same level of care we already give to production ETL pipelines.

This is not just a nuance. It changes the engineering standard.

The objective is not to produce one impeccable model in isolation. The objective is to prove that the ML pipeline can generate a sequence of useful, traceable, reproducible models.

A model is an output. The pipeline is the production capability.

What Pipeline Validation Needs To Prove

Once multiple models become the norm, several engineering properties become central.

Idempotence And Determinism

Given a specific data snapshot and configuration, the pipeline should produce the same model, or at least an equivalent one, every time.

This matters because data scientists and engineers need to separate the impact of a code change from the noise of an unstable training process. If the same input can produce meaningfully different outputs without explanation, debugging becomes guesswork.

Consistency

The models produced across different windows should be held to a consistent standard.

One strong year is not enough. If the pipeline performs well only when the data is favourable, then the system is fragile. Pipeline validation should expose that fragility instead of hiding it inside aggregate metrics.

Temporal Stability

Performance over time matters.

If recent windows behave very differently from older windows, that may reveal changes in the domain, gaps in the feature set, data quality issues, or a pipeline that no longer captures the right signal.

Temporal instability is not always bad. Sometimes the world really has changed. But the pipeline should make that visible.

The Quest For Temporal Stability

Temporal stability is influenced by both the domain and the computational setup.

Nature Of Data Changes

In the energy domain, the structure of the data can evolve. Geopolitical events, operational shifts, and changes in trade flows can all affect the patterns a model needs to learn.

If the world is changing quickly, a sliding window may be more appropriate because it gives more weight to recent data. If there are longer-term cyclic patterns, an expanding window may provide a clearer view.

Business Objectives

If the goal is to understand long-term patterns, an expanding window may be the better fit. If the goal is to respond quickly to market changes, a sliding window may be more useful.

This is not only a data science choice. It is a product and business choice as well.

Computational Costs

As the available data grows, training on all historical data becomes more expensive.

If resources are constrained, a sliding window may be more practical because the dataset size stays bounded. That trade-off is not purely technical either. It affects how often the pipeline can run and how quickly the team can iterate.

The Model’s Ability To Forget

Some model classes can retain old patterns even when newer data suggests the world has moved on.

In those cases, a sliding window can help force the model to shed outdated patterns. An expanding window, by contrast, may overemphasise history that is no longer representative.

Sliding Vs Expanding Windows

There is no universal answer. The right choice depends on the problem, the data-generating process, and the cost of being wrong.

1. Sliding Window

A sliding window trains on a fixed-size recent period. For example, train on 2017-2018 to predict 2019, then slide forward and train on 2018-2019 to predict 2020.

The main advantage is temporal relevance. The model is always trained on recent data, which is useful in fast-changing environments.

The drawbacks are also real:

it can miss longer-term patterns
it can produce more variable results across windows
it may discard useful historical context too aggressively

2. Expanding Window

An expanding window grows over time. For example, train on 2017-2018 to predict 2019, then train on 2017-2019 to predict 2020, and so on.

The main advantage is historical context. The model sees more of the past and may capture longer-term patterns.

The drawbacks are:

computational cost grows over time
old data may become less relevant
the model may become slower to adapt to structural change

3. Hybrid Approaches

In some systems, a hybrid approach is more appropriate.

For example, an expanding window can be used up to a certain point, after which a sliding window keeps the training set bounded. Another option is a weighted expanding window, where recent data carries more weight but older data is not fully discarded.

The windowing strategy is part of the system design. It encodes assumptions about how much the past should matter.

Measuring Pipeline Effectiveness

Once the pipeline is the target, the metrics also need to widen.

Aggregate Metrics

Evaluate models across multiple periods and then look at aggregate metrics such as accuracy, precision, recall, F1 score, median performance, and variance.

The variance matters. A high median with unstable windows may still be operationally risky. A lower but more stable model may sometimes be more useful, depending on the product.

Adaptability

Data sources change. Feature sets evolve. Domain conditions shift.

A strong ML pipeline should adapt to these changes without silently degrading. That means versioning, traceability, and clear ownership of feature logic are not optional.

Data Leakage Detection

Data leakage is a silent killer in retrospective analysis.

Performance that looks too good to be true often is. Suspicious correlations, unrealistic jumps in performance, or features that depend on future outcomes should trigger investigation.

Some practical safeguards:

Feature construction: features must not be calculated using future data.
External data alignment: external datasets must obey the same temporal restrictions as the primary data.
Shuffling care: random shuffling can destroy the meaning of time-series evaluation.
Time-aware cross-validation: conventional cross-validation is usually the wrong tool for sequential data.
Feature engineering per window: cleaning, normalisation, standardisation, and feature engineering should be re-executed for each data window.

The last point is easy to underestimate. If normalisation statistics are computed across the full dataset and then used inside older windows, future information has already leaked into the past.

Periodic Validation Applies To Live Models Too

The same principles apply to live models.

Retrospective validation makes the timeline problem obvious, but live models face the same pressure. Data changes, external conditions move, and the model’s assumptions age.

For neural networks, validation is often discussed around epochs. But the broader need for regular validation is not specific to neural networks. Any model that operates in a changing domain needs periodic checks that respect time.

Time-series cross-validation is useful because it tests performance across chronological splits. It helps expose overfitting, leakage, and temporal brittleness.

The goal is not only to keep a model fresh. The goal is to keep the validation story honest.

Efficiency And Traceability

Efficiency metrics are also part of the picture.

If training gets slower every time the data grows, the pipeline may become too expensive to run frequently enough. If traceability is weak, the team may not know which data, features, code, and hyperparameters produced a given model.

That lineage matters.

When multiple models are generated periodically, each one needs a clear record:

data snapshot
feature definitions
training code version
hyperparameters
evaluation window
output artefact

This is not bureaucracy. It is how teams make iteration explainable.

Without traceability, improvement becomes folklore. With traceability, each refinement builds on something the team can actually understand.

Last Words

Machine learning in the energy sector keeps evolving, as it does everywhere else. But the core lesson here is broader than one domain.

If the system needs to make claims about historical predictions, the validation process must respect history.

That means moving beyond a narrow question of whether one model performs well. The more useful question is whether the pipeline can repeatedly produce reliable, traceable, temporally honest models as the world changes around it.

In practice, that is the shift from model validation to pipeline validation.

And for production ML, that shift is not cosmetic. It is the difference between a model that looks good in retrospect and a system that could actually have made the prediction at the time.

Harmonizing Avro and Python: A Dance of Data Classes

2023-11-07T00:00:00+00:00

Reposting from the Vortexa medium blog

In the realm of data engineering, managing data types and schemas efficiently is of paramount importance. The crux of the matter? When data schemas are poorly managed, a myriad of issues arise, ranging from data incompatibility to runtime errors. What I am aiming for in this article is to introduce Apache Avro, a binary serialization format born from the Apache Hadoop project, through which I hope to highlight the significance of Avro schemas in data engineering. Finally, I will provide you with a hands-on guide on converting Avro files into Python data classes. By the end of this read, you’ll grasp the fundamentals of Avro schemas, understand the advantages of using them, and be equipped with a practical example of generating Python data classes from these schemas.

The Issue at Hand

Imagine the following scenario:

Your application’s new update starts crashing for a specific set of users.
Upon investigation, you discover the root cause: a mismatch between the expected data format and the actual data sent from the backend.
Such mismatches can occur due to several reasons — maybe a field was renamed, or its data type got changed without proper communication to all stakeholders.
These are real-world problems arising from the lack of efficient schema management.
So, how can Apache Avro and particularly Avro schemas help deal with these predicaments?

Avro… what now?

In the grand scheme of data engineering and big data, one might compare the efficient storage and transmission of data to the very lifeblood of the show. Now, if this show needed a backstage hero, it would be Apache Avro. This binary serialization format, conceived in the heart of the Apache Hadoop project, is swift, concise, and unparalleled in dealing with huge data loads. When the curtain rises for powerhouses like Data Lakes, Apache Kafka, and Apache Hadoop, it’s Avro that steals the limelight.

The Evolution of Data Serialization

Before diving into the tapestry of data’s history, let’s demystify a foundational concept here: serialization. At its core, serialization is the process of converting complex data structures or objects into a format that can be easily stored or transmitted and later reconstructed. Imagine packing for a trip; you organize and fold your clothes (data) into a suitcase (a serialized format) so that they fit neatly and can be effortlessly unpacked at your destination.

With that in mind, the story of data storage and transmission is a dynamic saga filled with innovation, challenges, and breakthroughs. Cast your mind back to the times of simple flat files–text files abiding to a specific structure. They were the humble beginning, like parchment scrolls in a digital era. But as data grew in complexity, our digital scrolls evolved into intricate relational databases, swift NoSQL solutions, and vast data lakes.

Now, imagine various systems, microservices, or extract-transform-load (ETL) pipelines, trying to communicate with one another by attempting to read unfamiliar data formats. It’s like trying to read a book when you don’t know the language it’s written in. To solve this, data had to be serialized–essentially translating complex data structures into a universally understood format. The early translators in this world were XML and JSON. Effective? Yes. Efficient? Not quite. They often felt like scribes painstakingly inking each letter, especially when handling vast amounts of data. The world needed a faster scribe; one that was both concise and precise.

Enter Avro. Inspired by the bustling highways of big data scenarios–from the lightning speed of Kafka to the vastness of Hadoop–Avro was born to ensure that data packets glided smoothly without unexpected stops. It became the guardian of data integrity and compatibility.

What’s in a POJO?

So, integrity is the keyword here, and in the context of this blog, we care about integrity breaches concerned with schema changes in a service that are not properly propagated to its consumers, rendering them unable to accommodate the new schema of the data they consume–like reading a book in a foreign language 😉.

The Dawn of the POJO Era

In the realm of programming, particularly within Java, a hero emerged named the Plain Old Java Object (POJO). This simple, unadorned object didn’t extend or implement any specific Java framework or class, allowing it to represent data without any preset behaviors or constraints. Imagine a Person POJO, detailing fields like name, age, and address without binding rules on how you should engage with these fields. Simple and elegant.

public class Person {

    private String name;
    private int age;
    private String address;

    // Default constructor
    public Person() {
    }

    // Constructor with parameters
    public Person(String name, int age, String address) {
        this.name = name;
        this.age = age;
        this.address = address;
    }

    // Getters and setters for each field

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    @Override
    public String toString() {
        return "Person{" +
               "name='" + name + '\'' +
               ", age=" + age +
               ", address='" + address + '\'' +
               '}';
    }
}

However, as data complexity increased and systems multiplied, ensuring that these straightforward representations, our POJOs, maintained their integrity when transmitted or stored across varying systems became a challenge. Manual serialization, translating each POJO for different systems, wasn’t just laborious — it was a minefield of potential errors.

Enter the need for an efficient and consistent serialization mechanism. One that could not only describe these POJOs but also seamlessly encode and decode them, ensuring data looked and felt the same everywhere.

Apache Avro & the Magic of Schemas

Amidst this backdrop, Apache Avro took centre stage. While the POJO painted the picture, Avro became the artist’s brush, allowing the artwork to be replicated without losing its original essence. Integral to Avro’s magic were its schemas. These files, with their unique .avsc extension, were a form of a blueprint, dictating the structure of an entity, data types, and nullable fields or default values. (see the Person.avsc as an example here).

{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "age",
      "type": "int"
    },
    {
      "name": "address",
      "type": "string"
    }
  ]
}

Pairing the intuitive design of POJOs with the precision of Avro schemas, developers had a formidable toolkit. Now, data could be managed, shuttled, and transformed without ever losing its core essence or structure. But what if these changes weren’t properly communicated amongst interacting systems?

Challenges in Schema Communication

Imagine two services: Service A (the Producer) that creates and sends data, and Service B (the Consumer) that receives and processes it. Service A updates its schema — perhaps it added a new field or modified an existing one. But if Service B is unaware of this change, it might end up expecting apples and receiving oranges.

The Domino Effect: Let’s say Service A, our producer, changes a field from being a number to a string. Service B, expecting a number, might crash or perform incorrect operations when it encounters a string. In a real-world scenario, this could mean misinterpretation of important metrics, corrupted databases, or application failures.
Versioning Nightmares: If every schema change requires updating the application logic in both the producer and consumer, this can quickly spiral into a versioning nightmare. How does one ensure that Service B is always compatible with Service A’s data, especially when they are updated at different intervals?
Enter the Schema Registry: A centralized Schema Registry can be the saviour in this scenario. Instead of letting every service decide how to send or interpret data, the Schema Registry sets the standard.
Registration & Validation: When Service A wishes to update its schema, it first registers the new schema with the registry. The registry validates this schema, ensuring backward compatibility with its previous versions.
Schema Sharing: Service B, before processing any data, checks with the registry to get the most recent schema. This ensures it knows exactly how to interpret the data it receives.
Library Generation: On successful registration, the producer can then trigger a script to create or update the corresponding POJO or Python data class. This automatically generated class can be used directly, ensuring that the code aligns with the latest schema.

Artifact Repository & Versioning

The generated data classes need a home. An Artifact Repository acts as this home. Whenever there’s a change, the updated class is given a new version and stored in this repository. Service B can then reference the specific version of the class it needs, ensuring data compatibility.

Producers, Consumers, and their Interaction: Once the schema changes are validated and registered, and the respective classes are updated, both the producer and consumer know exactly how to interact. They can reliably share data, knowing that both sides understand the data’s structure and meaning.

In essence, a centralised schema management system, paired with a robust registry and an efficient artifact repository, ensures that such data incompatibility issues are rendered not possible!

Generating Python Data Classes from `*.avsc` files

Avro, by its design and origin, has a strong affinity for the Java ecosystem. Apache Avro’s project comes with built-in tools and libraries tailored for Java, which makes generating POJOs straightforward. But when working with Python, things aren’t as easy.

Historically, it is worth noting that the introduction of data classes, which brought a feature similar to Java’s POJOs, came with Python 3.7. It, however, necessitated reliance on external libraries, such as dataclasses_avroschema, for schema-based generation. While these libraries are effective, their unofficial status can raise concerns about long-term reliability. Moreover, their utilization often depends on well-documented and clear examples, which might sometimes be ambiguous or lacking altogether. Furthermore, Python’s dynamic type system, though offering flexibility, poses challenges in maintaining data representation consistency when interfacing with Avro’s static schemas.

In this blog post, I hope to provide a clear example for data class-autogeneration, using an easy-to-understand script. So, let’s dive into an example.

Suppose, as we have already iterated, that we have the Person.avsc:

{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "age",
      "type": "int"
    },
    {
      "name": "address",
      "type": "string"
    }
  ]
}

Before providing the script, let’s discuss the sample project structure, which can help clarify why, later on, I state that the generated files must be read-only.

Sample Project Structure

Your project structure might look like this:

project/
│
├── resources/
│   └── schemas/
│       └── Person.avsc
├── src/
│   └── types/
│       └── Person.py
├── scripts/
│   └── generate_dataclasses.py
└── Makefile

resources/schemas/: This directory contains the Avro schema files (.avsc)
src/types/: This directory will contain the generated Python data classes (.py).
scripts/generate_dataclasses.py: This script generates the Python data classes from the Avro schemas
Makefile: This file contains the make command to run the script.
Now, you can use the following Python script to generate a Python data class from this Avro schema:

import json
import os
import subprocess

from dataclasses_avroschema.model_generator.generator import ModelGenerator

def main():
    print("Starting script...")
    model_generator = ModelGenerator()

    # Ensure the output directory exists
    output_dir = "../src/types"
    os.makedirs(output_dir, exist_ok=True)

    # Scan the directory for .avsc files
    for root, _, files in os.walk("../resources/schemas"):
        for _file in files:
            print(f"Generating DataClass for: {_file}")
            if _file.endswith(".avsc"):
                schema_file = os.path.join(root, _file)
                output_file = os.path.join(output_dir, _file.replace(".avsc", ".py"))

                # Load the schema
                with open(schema_file) as sf:
                    schema = json.load(sf)

                # Generate the python code for the schema
                result = model_generator.render(schema=schema)

                # Open the output file
                with open(output_file, mode="w") as f:
                    # Write a comment at the top of the file
                    f.write("# This is an autogenerated python class\n\n")
                    # Write the imports to the output file
                    f.write("from dataclasses_avroschema import AvroModel\n")
                    f.write("import dataclasses\n\n")

                    # Remove the imports from the result because we have already written them to the output file
                    result = result.replace("from dataclasses_avroschema import AvroModel\n", "")
                    result = result.replace("import dataclasses\n", "")

                    # Write the generated python code to the output file
                    f.write(result)

                # Format the output file using isort and black
                subprocess.run(["isort", output_file])
                subprocess.run(["black", output_file])

                # Make the file read-only
                os.chmod(output_file, 0o444)

                print(f"Generated {output_file} from {schema_file}")


if __name__ == "__main__":
    main()

This script will generate a Python file Person.py in the ../src/types directory with the following content:

# This is an autogenerated python class
from dataclasses_avroschema import AvroModel
import dataclasses

@dataclasses.dataclass
class Person(AvroModel):
name: str
age: int
address: str

    class Meta:
        namespace = "com.example"

Why Read-Only?

The generated Python files are made read-only to prevent accidental modifications. Since these files are autogenerated, any changes should be made in the Avro schema files, and then the Python files should be regenerated.

Conclusion

The integration of Avro files with Python data classes streamlines the complexities of data handling. It’s a union that empowers the data engineering toolkit, delivering precise type-checking, user-friendly code suggestions, rigorous validation, and crystal-clear readability. With the solid foundation provided by the schema registry, the integrity of your data remains uncompromised, no matter how intricate the data operations become. And while the magic lies in the technology and techniques discussed, the real art is in the consistent, reliable data flow it facilitates. As you delve deeper into the vast world of data, know that tools like these are pivotal in weaving the seamless narrative of your data story.

Stay tuned, as more insights await in follow-up discussions, where we’ll further dissect the intricacies of a comprehensive schema management ecosystem.

Remember to like my post and re-share it (if you really liked it)!

See you soon!

Agile In Action: Bridging Data Science and Engineering

2023-10-31T00:00:00+00:00

Picture taken from National Gallery, London

A few weeks ago, Bill Raymond invited me onto his Agile in Action podcast after reading an older post of mine on doing data science the Agile way.

I said yes because this topic has followed me through most of my career.

I started as a data scientist. Then I spent years watching perfectly respectable prototypes fail to become products. By the time I reached Vortexa, I was leading a team of data scientists and engineers and living right in the middle of the tension I had been talking about for years.

That is the version of Agile I wanted to discuss in the episode. Not the clean whiteboard version. The one that appears when a model has to leave a Python notebook, survive production, and still make sense to the people who have to operate it.

The real gap in ML teams is rarely enthusiasm. It is the distance between a model that works once and a system that can be trusted repeatedly.

Why This Topic Stayed With Me

Part of the reason this topic matters so much to me is that I learned it the frustrating way.

At Data Reply, I worked on one prototype after another. We would explore a problem, build something promising, show strong results, and then hit the same wall: the client liked the idea, but the system never really made it into production. Sometimes the missing piece was infrastructure. Sometimes it was culture. Sometimes it was simply that nobody owned the hard part after the demo.

That started to change for me at UBS.

For the first time, I heard the sentence I had wanted to hear for years: “Great. Now how do we put this into production?”

I was paired with an experienced engineer, and that changed the direction of my career. I stopped seeing engineering as the final packaging step after the interesting work was done. I started seeing it as part of the thinking itself.

That shift is still with me today.

The Real Gap Between Data Science And Engineering

When people talk about cross-functional ML teams, they often make the collaboration sound natural. In practice, it is not.

Data scientists are usually optimising for learning:

trying ideas quickly
testing hypotheses
moving fast through a messy search space

Engineers are usually optimising for control:

reproducibility
determinism
maintainability
safe change over time

Both instincts are valid.

The problem is that they are protecting the system from different failure modes.

The issue is not that data scientists are messy and engineers are rigid. The issue is that both are right about different kinds of breakage.

Take a simple pricing model. A data scientist can build a strong prototype in a notebook, engineer the features, train the model, and prove the concept. But once that model becomes part of a product, somebody has to make sure the production path transforms the raw input in exactly the same way. If the training pipeline and the prediction pipeline drift apart, the system lies even when the model itself is good.

That is why the gap matters so much.

It is not about user interfaces or wrapping code nicely. It is about making sure the system that predicts tomorrow behaves like the system that was validated yesterday.

What Agile Actually Helped With

When I say Agile helped here, I do not mean that Scrum ceremonies somehow solved the problem.

What helped was having a way to make uncertainty legible.

For me, that meant three things.

1. Making experiments explicit

In ML work, “we are exploring” is too vague.

An experiment becomes useful when the team can answer:

what assumption are we testing?
what would count as useful evidence?
what result would tell us to stop?

That sounds simple, but it changes the conversation completely. It stops research from turning into open-ended wandering and gives product and engineering a clearer way to understand what the team is actually learning.

2. Creating shared visibility

At Vortexa, one of the most useful habits we built was a regular data science catch-up where engineers and data scientists could present what they were doing, why they were doing it, and where the risks were.

This was not code review. It was not a status ritual either.

It was a way to keep everyone on the same mental map.

That mattered because a lot of problems in ML systems do not come from one catastrophic mistake. They come from small drifts in understanding. A feature is computed one way in training, another way in production. An assumption about data quality goes unchallenged. A result sounds promising, but nobody else can reproduce it.

Communication is not a soft add-on here.

It is part of the control surface of the system.

3. Putting discipline around handoffs

The teams I trust most are not the ones with the nicest process diagrams. They are the ones that make handoffs visible and expensive enough that people try to remove them.

If the data scientist can disappear after training a model and the engineer is left to guess the rest, the system will eventually reflect that fracture.

If the engineer is never exposed to how experimental the work really is, they will overestimate how stable the solution already is.

Agile helped when it forced us to confront those boundaries earlier.

What ML Teams Still Underestimate

One of the themes that came up in the podcast is that many teams still underestimate how much work starts after the model looks good.

You do not just need versioned code. You need versioned data and a credible way to tie the two together.

You do not just need a model in production. You need monitoring, drift detection, and a practical way to replace the model without breaking the product.

You do not just need experimentation. You need a path from experimentation to something deterministic enough to support.

This is why I often say that notebooks are wonderful research tools and terrible places to leave an idea if you want a system around it to survive.

The Lesson I Was Trying To Communicate

When Bill asked what Agile meant to me in this context, the answer I wanted to give was not especially fashionable.

It was this:

In ML, Agile is useful when it helps the team learn quickly without losing control of the system.

That is really the heart of it.

Not velocity in the abstract.

Not ceremony for its own sake.

Not pretending that uncertainty can be planned away.

Just a disciplined way to:

test assumptions early
expose the right risks
keep engineers and data scientists in sync
and make sure the thing you learned can actually survive contact with production

That was my view then, and I still think it was the right thing to say.

The Podcast

If you prefer the conversation version, the episode is below.

Dynamic(i/o) Why you should start your ML-Ops journey with wrapping your I/O

2022-05-31T00:00:00+00:00

If you call yourself an ML Engineer then you ‘ve been there–you ‘ve seen this before. To productionise your ML pipeline; well, that’s surely a challenge.

I have worked for many years as a Data Science consultant, and I can surely confirm the statement that “…more that 87% of Data Science projects never make it to production”. There is a reason why the first rule of doing Machine Learning is to really be sure you need to do ML! Surely, many reasons play into this challenge:

lack of the right leadership;
no or limited access to data in siloed organisations;
lack of the necessary tooling or infrastructure support, and even;
lack of a research-driven culture.

But there is one more beast to be tamed out there; the gap between Data Science and ML Engineering. And this is a gap you can perceive both in terms of the two practitioners in each field of work–data scientists and SWE–but also in terms of literally getting from a prototype to a production ready ML pipeline.

Simply put, to put a model into production is one thing; but to maintain that model, properly monitor it to identify possible drifts and streamline the process of re-training it or updating it in a robust and reproducible way, supported by a clean CI/CD process, is daunting task! If anything, I ‘d dare say that ML-Engineering, as a domain, fully encapsulates SWE in addition to many more challenges (highly recommend reading Hidden Technical Debt in Machine Learning Systems), for some of which we are still trying to standardise how we work in terms of best tooling or practices.

In many cases, organisations are forced to come up with their own ways of working to accommodate the unique challenges of their custom use-cases. Then again, it all comes down to the requirements of a project. Netflix has streamlined the process of putting python notebooks into production using papermil. Others, go as far as to standardise the whole ML Engineering process using tools like Airflow or Kubeflow, relying on AI pipelines (on GCP) or SageMaker (on AWS), etc.

So what do we do…?

At Vortexa, we are heavy users of Airflow and have recently embarked into a journey to include Kubeflow into our tech stack. As an ML-Engineer, my job usually concerns receiving a successful prototype of a model and implementing a complete end-to-end ML pipeline out of it; one that can be easily maintained and reused. In many ways, this process is very common to a traditional SWE project, only more complex, since ML projects come with more requirements and a strong dependency on data. Hence, it easily follows that everything one cares to implement for a SWE project needs to also be implemented for an ML-Engineering (MLE) project; and more.

But let’s start simple…

Here is my notebook! I am done; your turn now!

So you are handed a notebook, and you inspect it; you spend time with the Data Scientist and understand all crucial aspects in the procedural logic, and you start splitting the process into various tasks. You, usually, end up with something like this:

You think about the structure of your codebase, about how everything will be deployed, how you want to decouple orchestration from the logic of your ML-pipeline, and then you start thinking about domain driven development (DDD). You start thinking about abstractions and encapsulation, about testing and data validation. That’s when it hits you–testing; you can unit test most things and build a robust pipeline, but you also want fast feedback for when you introduce changes and improvements to your pipeline (shifting to the left)! What if you wanted to run a local regression test? With all data being read from external resources (databases, object storage service) you ‘ll have to mock all these calls (doable, but takes time) and replace actual data with sample input. And, finally, what about schema and data validations? How do you guarantee after data ingestion that all your expectations on the input are respected?

You have a look at the code again. Filled with various I/O operations. Sometimes it’s csv, others parquet, and others it’s json, sometimes you read from a database and others from an object storage service (s3 or gcp). Different libraries used to facilitate all these: gcsfs, s3fs and fsspec, boto3 sql-alchemy, tables; and pandas, of course, sits at the core of this process. As if that’s not enough, each file comes with a series of peculiar set of requirements supported through the use of kwargs; in your python code: orientation of json files, row-group-sizes for parquet files, coercions on certain timestamp columns–the list keeps going… This won’t be the last time you need to do this either!

It’s just too many details–way too many details–for you to worry about. A clear violation of the dependency inversion principle:

Business logic (high level code) should not be implemented in a way that “depends” on technical details (low level code, e.g., I/O in our case); instead both should be managed through abstractions!

You need abstractions to facilitate the flexibility to easily introduce changes. More often than not, business needs will require high-level modules to be modified. Low level code, on the other hand, is usually more cumbersome and difficult to change. The two should be independent; a database migration or a switch to an object storage service should have no impact on your work to generate a new valuable feature for your model, and vice-versa. Abstracting both of these using distinct layers can achieve this!

As David Wheeler said:

All problems in computer science can be solved by adding a layer of indirection.

What is `dynamicio` then?

Wouldn’t it be great if you could:

have an abstraction that encapsulates all I/O logic;
be able to seamlessly handle reading or writing from and to different resource types or data types;
have an interface that is easy to understand and use with minimum configuration;
respect your expectations on schema types and data quality;
automatically generate metrics that would be used to leverage further insights, and more importantly;
be able to seamlessly switch between local, dev, staging and prod environments, performing dynamic I/O against different datasets and effectively supporting development, testing and qa use cases?

Well, dynamic(i/o) is exactly that; a layer of indirection for pandas I/O operations.

If you want to find out more about it then register to attend this year’s ODSC and attend the presentation by myself and my colleague Tyler Ferguson on dynamic(i/o). Come and learn about how its implementation and adoption has helped us go beyond just achieving consistency across our ML repos, effectively dealing with glue code and keeping our code-bases DRY, but also acting as an interface between different teams.

Remember to like my post and re-share it (if you really liked it)!

See you soon!

Complete Guide to Python Envs (MacOS)

2021-02-14T00:00:00+00:00

Configuring Python on your machine for the first time is a definite headache for any software engineer that decides to delve into the world of Python. Doing it properly confuses a lot of people and can prove to be very challenging.

It is often the case that many developers have numerous interpreters configured on their machines, without knowing where they live.

Most common ways of setting up Python

Firstly, there is a Python version that ships with macOS, but it is usually v2.7, which is not just out of date but also deprecated.

So, commonly, most users will download the latest Python release and move it to their $PATH or use brew install python3 (which does this for them).

Both of these solutions can cause many problems that will not be evident straight away. The main challenge, is usually not knowing, at any given time, what is the “default Python” that your system is using. Ideally, this is something you shouldn’t care about, but if you don’t set up things properly, you end up installing packages for the wrong environment or the wrong active Python interpreter, un-intentionally created from the wrong Python distribution and… well, you get the point (…this is pretty much summed up in the xkcd image above).

To find out more details, read this excellent December 2020, post, by Matthew Broberg.

How to avoid all these?

The short answer is “use pyenv”. pyenv will enable you to not only setup python properly on your machine, but to also manage different versions and python environments in a simple and straightforward way. As explained on the package’s github page:

“It’s simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well.”

For me, its main benefits are:

It depends on Python itself, i.e. since it was made from pure shell scripts, there is no bootstrap problem of Python.
It manages the need to be loaded into your shell though pyenv’s shim approach, which adds a directory to your $PATH.
It manages virtual environments, though I recommend using pyenv-virtualenv to automate the process.

Let’s get to it

Before you do anything make sure you start with a clean sheet. To do so, uninstall or remove any python distributions you already have. I strongly advise you to follow this link.

Now, assuming you have brew installed on your machine, do:

$ brew update
$ brew install pyenv

We will now need pyenv-virtualenv. pyenv-virtualenv is a pyenv plugin that provides features to manage virtualenvs and conda environments for Python on UNIX-like systems.

$ brew install pyenv-virtualenv

Setting up your `global` interpreter

So, the first thing you want to do is set up your global interpreter. This is the python environment that will be used by default by your system, unless you dictate otherwise.

If you run:

$ pyenv install --list
Available versions:
  2.1.3
...
  3.10-dev
  activepython-2.7.14
...
  activepython-3.6.0
  anaconda-1.4.0
...
  anaconda3-2020.07
  graalpython-20.1.0
  graalpython-20.2.0
  ironpython-dev
...
  ironpython-2.7.7
  jython-dev
...
  jython-2.7.2
  micropython-dev
...
  miniconda-latest
...
  miniconda3-4.7.12
  pypy-c-jit-latest
...
  pypy3.6-7.3.1
  pyston-0.5.1
...
  pyston-0.6.1
  stackless-dev
...
  stackless-3.7.5

You will see the full list of Python distributions available for installation.

Choose the one you want and do, e.g. 3.9.0:

$ pyenv install 3.9.0
python-build: use openssl@1.1 from homebrew
python-build: use readline from homebrew
Downloading Python-3.9.0.tar.xz...
-> https://www.python.org/ftp/python/3.9.0/Python-3.9.0.tar.xz
Installing Python-3.9.0...
python-build: use readline from homebrew
python-build: use zlib from xcode sdk
Installed Python-3.9.0 to /Users/<username>/.pyenv/versions/3.9.0

Once installation is complete, you can set this version as your global:

$ pyenv global 3.9.0

At this point, one should be able to find the full executable path to each of these using pyenv version.

$ pyenv version
3.9.0 (set by /Users/<username>/.pyenv/version)

Creating and managing virtual environments automatically

This is a standard practice when working with Python. The idea is to keep different environments isolated. Each Python environment can be associated to multiple projects, but it is generally better to just go for a one to one mapping.

Why you say? Well, for starters, this helps you maintain your system clean by not installing system-wide libraries that you are only going to need in a small project. It also allows you to use a certain version of a library for one project and a different version for another. Finally, it helps make your project reproducible and ensures it is configured in an identical manner across local environments amongst collaborating developers.

Let’s go through an example.

Suppose you have a github root directory where you clone and maintain all your projects and it looks like this:

GitHub
├── project_a
└── project_b

What you want to do is setup a different python virtual environment per project. What’s more is that you would like to automatically activate that virtual environment by means of simply accessing (cd-ing) into that project. Let’s see how we can do that.

First, I ‘ll assume you are using .zshrc as your default shell and have configured oh-my-zsh. If not, then just set it up. Note that this is not a pre-requisite; it’s more of a personal preference, but using oh-my-zsh does come with many benefits, like showing the current active python environment on your console, which is why I am recommending it.

In order to enable the above automations, we will need two pre-requisites. The first, is to include 2 files in each project (you can version control these files). The first is .python-version and the second is .python-virtualenv, as per the below tree:

GitHub
├── project_a
│   ├── .python-version
│   └── .python-virtualenv
└── project_b
    ├── .python-version
    └── .python-virtualenv

In each of these files you just add a line at the very top of the file with:

the python version you want to use
the name of the virtual environment you want to create.

For example, the contents of

GitHub
├── project_a
│   ├── .python-version 

can be:

3.9.0

and:

GitHub
├── project_a
│   └── .python-virtualenv

can be:

project-a-venv

similarly, for project b you can have 3.8.2 and project-b-venv.

Now, on to your .zshrc. Do:

$ vi ~/.zshrc

and add the following script:

# Define your $PATH
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"

# Automatic venv activation
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
export PYENV_VIRTUALENV_DISABLE_PROMPT=1

# Undo any existing alias for `cd`
unalias cd 2>/dev/null

# Method that verifies all requirements and activates the virtualenv
hasAndSetVirtualenv() {
  # .python-version is mandatory for .python-virtualenv but not vice versa
  if [ -f .python-virtualenv ]; then
    if [ ! -f .python-version ]; then
      echo "To use .python-virtualenv you need a .python-version"
      return 1
    fi
  fi

  # Check if pyenv has the Python version needed.
  # If not (or pyenv not available) exit with code 1 and the respective instructions.
  if [ -f .python-version ]; then
    if [ -z "`which pyenv`" ]; then
      echo "Install pyenv see https://github.com/yyuu/pyenv"
      return 1
    elif [ -n "`pyenv versions 2>&1 | grep 'not installed'`" ]; then
      # Message "not installed" is automatically generated by `pyenv versions`
      echo 'run "pyenv install"'
      return 1
    fi
  fi

  # Create and activate the virtualenv if all conditions above are successful
  # Also, if virtualenv is already created, then just activate it.
  if [ -f .python-virtualenv ]; then
    VIRTUALENV_NAME="`cat .python-virtualenv`"
    PYTHON_VERSION="`cat .python-version`"
    MY_ENV=$PYENV_ROOT/versions/$PYTHON_VERSION/envs/$VIRTUALENV_NAME
    ([ -d $MY_ENV ] || virtualenv $MY_ENV -p `which python`) && \
    source $MY_ENV/bin/activate
  fi
}

pythonVirtualenvCd () {
  # move to a folder + run the pyenv + virtualenv script
  cd "$@" && hasAndSetVirtualenv
}

# Every time you move to a folder, run the pyenv + virtualenv script
alias cd="pythonVirtualenvCd"

Save your changes, return to your terminal and either restart your terminal or do:

$ source ~/.zshrc

Now, let’s assume that you are in GitHub directory:

$ pwd
/Users/<username>/Github

Then, if you do:

~/GitHub $ cd project_a
created virtual environment CPython3.9.0.final.0-64 in 448ms
  creator CPython3Posix(dest=/Users/<username>/.pyenv/versions/3.9.0/envs/project-a-venvo, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/<username>/Library/Application Support/virtualenv)
    added seed packages: pip==20.3.1, setuptools==51.3.3, wheel==0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
(project-a-venv) --------------------------------------------------------------------------------
~/GitHub/project_a $

and, if you come out of it and change to project b:

$ cd ../project_b
created virtual environment CPython3.8.2.final.0-64 in 932ms
  creator CPython3Posix(dest=/Users/<username>/.pyenv/versions/3.8.2/envs/project-b-venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/<username>/Library/Application Support/virtualenv)
    added seed packages: pip==20.3.1, setuptools==51.3.3, wheel==0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
(project-b-venv) --------------------------------------------------------------------------------
~/GitHub/project_b $  

Now, two new virtual environments have been created:

$ pyenv versions
system
* 3.8.2 (set by /Users/<username>/GitHub/project_b/.python-version)
  3.8.2/envs/project-b-venv
  3.9.0
  3.9.0/envs/project-a-venv

and every time you cd into these directories, your pyenv will switch automatically.

Note 1: You may face some issues with python 3.8.7.
Note 2: To uninstall a python env, do: pyenv uninstall 3.8.2/envs/project-b-venv

Using `jupyter notebook` or `jupyter lab` with a virtual environment of your choice

Finally, suppose you want to use a python environment with a jupyter notebook. This is not as straightforward as one would think. Here is how you would do it.

Let’s continue from where we left things in the previous section. You are in:

GitHub
├── project_a

and you have project-a-venv activated:

(project-a-venv) --------------------------------------------------------------------------------
~/Github/project_a $

First thing you need to do is install ipykernel using pip:

$ pip install ipykernel

Next, you need to install a new kernel:

ipython kernel install --user --name=project-a-venv

Finally, assuming you have jupyter or jupyterlab installed, you can start jupyter, create a new notebook and select the kernel that lives inside your environment.

$ jupyter notebook

Final notes

I really hope that this was a helpful post and if you are new to python, I hope that I have helped you disambiguate some confusing aspects of configuring python at the start of your journey!

The bellow references were very helpful for putting together this post:

Remember to like my post and re-share it (if you really liked it)!

See you soon!

A BREXIT NLP Dataset!

2020-09-02T00:00:00+00:00

So here is the thing… I love discussing politics; I think that everyone should, at least occasionally, bother themselves with what is happening in their country’s political scenery.

Regardless of whether you are into politics or not, it would be practically impossible to escape debating BREXIT back in the summer of 2016. At the time, I had just been hired by Data Reply UK and the company’s annual XChange conference was around the corner.

My boss at the time, wanted to us to come up with something interesting and eye catching for our demo pod at the conference. So, being that BREXIT was a trending and highly debated topic, I thought that maybe I can come up with a way to predict peoples’ political stance by means of their social activity.

The idea

The idea was simple:

Provided one’s twitter @handle, try to infer their political views on BREXIT.

The original approach was to:

Collect people’s tweets through the twitter API;
Label tweets related to BREXIT as either PRO or CON;
Calculate a ratio between the 2 and produce a number that would represent their political stance.

After experimenting a bit, I figured out that using one’s own tweets would not be enough. Many twitter users don’t tweet that often and when they do, they are not really concerned with the EU or BREXIT. So I thought that maybe we can use the tweets of the people that one follows. This draws from social science and ideas behind tribalism:

“…you are likely to be ideologically aligned with the positions of your peers [or of those you follow on twitter ;)]!”

The dataset

In order to be able to label tweets, I had to develop an NLP ML model. To do so, I needed a relatively big corpus of labelled tweets.

I turned to an article by BBC at the time, which categorised MPs according to the public stance on BREXIT. Using a twitter list that had the twitter handles of 449 MPs at the time and using the twitter API, I accumulated a corpus of 60,941 tweets from 449 UK MPs (at the time). Tweets had one or more of the following keywords:

key_words = ['European union', 'European Union', 'european union', 'EUROPEAN UNION',
    'Brexit', 'brexit', 'BREXIT',
    'euref', 'EUREF', 'euRef', 'eu_ref', 'EUref',
    'leaveeu', 'leave_eu', 'leaveEU', 'leaveEu',
    'borisvsdave', 'BorisVsDave',
    'StrongerI', 'strongerI', 'strongeri', 'strongerI',
    'votestay', 'vote_stay', 'voteStay',
    'votein', 'voteout', 'voteIn', 'voteOut', 'vote_In', 'vote_Out',
    'referendum', 'Referendum', 'REFERENDUM']

and were automatically labelled based on the views of the MP who tweeted them.

You can find more details on how I worked to generate the ML model and how the demo solution worked if you follow this github repository.

Dataset now available on Kaggle

It took me some time to publish it, but the dataset is now available to everyone to use on Kaggle. You can find it if you follow this link.

I hope that the ML community will make good use of it. It’s 4 years after the referendum but BREXIT is yet to really happen and unfortunately it remains a concerning issue. So, who knows, maybe someone wants to use this dataset in some other equally interesting way.

Remember to like my post and re-share it (if you really liked it)!

See you soon!

Style Transfer in Heraklion

2020-08-15T00:00:00+00:00

I am currently in Crete for my annual get away. Crete is an amazing island with many beautiful places to visit and a vast history that goes all the way back to the Minoans in 3500 BC.

One of the things I love doing whenever I am here is strolling around the city of Heraklion and taking pictures of the many hidden alleys, which reveal an amazing graffiti culture! I really wanted to write about it in my blog and I thought that maybe I can do so by using some amazing images I gathered just last week in a style-transfer post. So this is it: Style Transfer in Heraklion.

A bit of history: A Neural Algorithm of Artistic Style

Neural Style Transfer (NST) is a class of algorithms that process images to adopt the visual style of another image. A seminal paper that introduced this concept was “A Neural Algorithm of Artistic Style” by Leon A. Gatys, Alexander S. Ecker and Matthias Bethge. In their work, the authors emphasize that:

“…representations of content and style in Neural Networks are seperable”.

This is the foundation of this work, since if these two notions are indeed separable then provided two images you can get the style of the first, the content of the second and merge them together. So, how is this done exactly?

Delving into the details

The first figure in the paper shows the original setup and how a pre-trained NN, referred to as VGG19, was modified to do NST. What is VGG19? Well, the basic building blocks of traditional convolutional networks are the following layers:

a convolutional layer (with padding to maintain the resolution);
a non-linear activation layer such as a ReLU, and;
a pooling layer such as a max pooling layer.

A VGG block consists of a sequence of convolutional layers, followed by a max pooling layer for spatial down-sampling. What we are interested in is how this network will respond to the inputs.

Retrieving the content

Notice that the authors prefer to use paintings for the style and a random image; it seems like these combinations work best. The main idea is abstracting the content and putting more emphasis on the style!

At the top left you see “The Starry Night” by Vincent van Gogh and below it is just a random content image; let’s start with the latter.

Provided both an input (style) image and a content image, each neuron and respectively each layer in the NN will either activate or it won’t. Each image is processed, or better yet filtered, in a different way (by nature of the activation or not of different neurons). Looking at how the content image is gradually filtered in the above image you will notice that the first layer leaves the image seemingly intact. But looking all the way to the last filtered output you see that this is not the case at all; shapes are there but the inside is not so much the same. This is because of how the resulting high-level features are generated on earlier abstractions of the same image produced by previous layers. This is the intended behaviour to retrieve the content.

Retrieving the style

So, for the style, the authors explain that they have built a new feature-space, which focuses on the style of an input image on top of the original CNN representations. The style representation computes correlations between the different features in different layers of the CNN. They reconstruct the style of the input image from style representations built on different subsets of CNN layers and this results in images that match the style of the input on an increasing scale while discarding information of the arrangement of the scene.

It’s all in the formulas (or formul$ae$)

The authors also discuss the impact of the number of layers used to infer the style or the content of images before they are merged (visually depicted in Figure 3 of the paper). In the first row (A) only one layer is used in contrast to 5 layers used at the bottom row where the result is much better.

To generate the images which are a mixture of the content of an image-A with the style of another (image-B) the authors explain that they jointly minimise the distance of a “white noise” image from the content representation of image-A in one layer of the network and the style representation of image-B in a number of layers of the CNN. This is gracefully captured by the below loss function:

where $\overrightarrow{p}$ is image-A (usually a photograph where we care about the content) and $\overrightarrow{a}$ is image-B (usually a painting where we care to retrieve the style). Then $\alpha$ and $\beta$ respectively concern weighting factors for content and style reconstruction.

Going back to Figure 3 of the paper, looking at it from left to right we see what happens when we tweak these weighting factors ($\alpha$ and $\beta$). The left-most column concerns cases where $\alpha$ is low compared to $\beta$ and the right-most layer is the other way around. These two factors are practically the optimisers of the content and style errors respectively. If $\alpha$ is high, it means that content error is more important and vice-versa for increasing $\beta$.

The objective of the formula is to minimize $\mathcal{L}_{total}$. $\overrightarrow{x}$ is the image that we are gradually building through multiple iterations and it initially comes either from the photograph ($\overrightarrow{p}$) or it is initialised as white noise. $\alpha$ and $\beta$ are the weights that we need to set, and they are basically our hyper-parameters in this problem.

What is now left is understanding $\mathcal{L}_{content}$ and $\mathcal{L}_{style}$.

$\mathcal{L}_{content}$

Here is where everything gets a bit complicated but at the same time, you get to piece everything together nicely.

$\mathcal{L}_{content}$ is described as the squared-error loss between two feature representations: one concerned with the random photograph $\overrightarrow{p}$ and the generated image $\overrightarrow{x}$ which is originally a white noise image. $P^l$ and $F^l$ are the respective feature representations for the two images in layer $l$. The authors used the feature space provided by the 16 convolutional and 5 pooling layers of the 19 layer VGG Network. Here, $F^l$ represents an activation function ($F$) at a given layer $l$ or, plainly, a bank of non-linear filters for that layer. The complexity of these filters increases with the position of the layer in the network. $F$ is practically a matrix of size $N\times M$ where $N$ is the number of filters within a given layer with $N_l$ feature maps of size $M_l$; the latter is the height $\times$ width if the feature map.

So, a given input image $\overrightarrow{x}$ is encoded in each layer of the CNN by the filter responses to that image. To visualise the image information that is encoded at different layers of the hierarchy the authors perform gradient descent on the white noise image to find another image that matches the feature responses of the original image. So, the approach is to gradually changes the initially random image $\overrightarrow{x}$ until it generates the same response in a certain layer of the CNN as the original image.

$\mathcal{L}_{style}$

The style loss function is described by the following equation:
which is basically a sum of the weighted distances between feature correlations across the different filter (layer) responses for two images:

the original image $\overrightarrow{a}$, and;
a white noise image $\overrightarrow{x}$, used to generate a style representation of the original image.

Let’s break this down a bit more; what are these feature correlations? Practically they are a way to express a relationship between a feature map $F$ and the filters ($i$ and $j$) of the different layers ($l$) applied on it. This is beautifully expressed as a matrix of all possible inner products between the generated set of feature vectors, called a “Gram matrix $G$”, as per the below equation:

One such matrix is generated for each of the two images (the original $\overrightarrow{a}$ and $\overrightarrow{x}$), namely $A_{ij}^l$ and $G_{ij}^l$, and a squared distance is calculated between these two. The objective is to minimise the distance. So, practically, as with every ML problem, what we have is an optimisation problem and a cost function! Minimising this distance can be achieved through the application of gradient descent using standard error back-propagation to adjust the weight values of equation $5$.

Putting it all together

Finally, in order to generate the final image with the style transfer, we return to equation 7, which practically jointly minimises the distance of a white noise image from the content representation of the photograph in one layer of the network and the style representation of the painting in a number of layers of the CNN. The authors also note that:

For image synthesis they found that replacing the max-pooling operation by average pooling improves the gradient flow and one obtains slightly more appealing results.

That’s it! So, what’s left now is getting our hands dirty!

Using `PyTorch` for Style transfer

If you following this link to the official PyTorch website you will find a very well written tutorial on how to apply style transfer with PyTorch. I provide my own take of it $\rightarrow$here$\leftarrow$. You can follow the link to the python notebook and copy-paste the code to give it a try.

I intend to work on creating a package for it and will provide an updated post on it once I do (it will be developed in the same repo as the link). The intention is to be able to style images through the packages through an intuitive api that would take the image to be styled as the input and a choice between famous images that will be available through the package (provided as a text parameter) to produce the desired output (along with some other flags and side parameters). Something like:

import pytorch_style_transfer as pst

pst.generate(
    input_image_path="path_to_input_image", 
    style="starry_night", 
    resolution=128, 
    output_dir="path/to/output")

Enjoy some of the outputs:

Here are some of the results of this work. I tried blending the fortress of Koules with 4 different grafittis I was able to photograph.

The original picture:

The result is not always great, but it was still very interesting to try:

That’s it!

Remember to like my post and re-share it (if you really liked it)!

See you soon!

Agile Data Science

2020-08-11T00:00:00+00:00

Re-posting from https://www.iunera.com/

Around two weeks ago I was approached by Dr. Tim Frey, General Manager at Iunera GmbH & Co. KG. I was quite surprised to read his message:

Hi Christos, We met at the mind mastering machines conference in London. We operate a company blog (https://iunera.com/kraken ) and one of our writers wrote about agile in Data Science. I liked your talk two years ago and I thought she can approach you to ask a few questions like kind of an in-article interview with an expert. Hope that is fine with you. Would be super glad to get your insights.

I must admit this was a first for me! Then again, that talk in 2018 was quite an interesting one for me too.

How it all happened…

You see, 3 years ago I was asked to join an exceptional team over at UBS to help with a graph analytics project. If you asked me then I would proudly tell you that “…I am a Data Scientist”; that is how I saw myself. However, that was bound to change forever.

The first three months were amazing. I worked with a vast amount of data and revealed some very interesting insights. So, inevitably, my project manager approached me and asked “…how about we take this work of yours into production”!

I didn’t have a clue about what that meant in reality, but I was about to find out. He said: “Well, don’t worry, we will pair you with an engineer and you both can get started on it”. So we did!

This is basically the story about how I was exposed to software engineering and the Agile way of working–about how I was converted into an ML Engineer. Two years later I decided to take my learnings from this experience and share it with my community, and so I did at mCubed London in 2018:

That’s where I also met Tim. Turns out that a year and half later a colleague of Tim’s (Dhanhyaashri Mahendran) was doing a bit iof research on “Doing Data Science the Agile way” and Tim suggested that she gets in touch with me to ask me some questions, which I welcomed.

Some very interesting questions were thrown my way…

I really liked the questions that Dhanhyaashri had prepared. She had obviously done her research. I did my best to respond and two weeks later the interview was published on the Iunera blog. You can read it here but I also felt like re-posting the interview on my personal blog too.

Besides the cutting of time-consuming planning and quicker turnaround of projects, what other benefits are there in applying the Agile approach in data science?

For the community, I would say that that would be the emergence of new Data-Science-oriented practices that will drive the application of Agile in the research domain.

The problem with applying Agile in Data Science is that, traditionally, Agile is practiced in software development projects where experimentation, testing and tuning are minimum (usually dealt as spikes). The focus there is about delivering business requirements, in the form of features and products, fast in a volatile, constantly evolving environment. To support this, a number of underpinning practices have been developed, covering areas like modelling and design, coding and testing, risk handling and quality assurance. But all these, focus primarily on feature delivery (backlogs, user-stories, CI/CD, TDD or BDD to name a few). Some of these underpinning practices can directly be transferred in the Data Science world (e.g. user stories and backlogs, timeboxing and retrospectives) but others, not so much; for instance, how can TDD be useful when experimenting with what is the optimal k with which to cluster customer datasets? So, a clear benefit of trying to apply Agile in Data Science is that gradually, similar Data Science-specific underpinning practices will eventually be developed and these will, of course, be based on the same Agile drives: adaptive planning, evolutionary development, early delivery and continual improvement, and more generally, flexible responses to change.

For the Data Scientists I would say it is mostly about adjusting to the requirement of working in a way to deliver business value from their experimentation.

The feature-oriented focus that Agile is characterised by in the software development world is not so familiar to data scientists and researchers. What’s more is that “value”—business value—is perceived in very different ways across these two worlds as well. Have you ever discussed the “value” of an experiment with a project manager? Not an easy task I assure you! My experience tells me that most of the time this comes down to project managers fearing that no tangible outputs will be produced through experimentation. This is completely wrong, but only as long as experiments are well-structured and well-thought. To me, Agile Data science is all about iterative hypothesis testing. Proving or disproving a hypothesis is always useful; it minimises the risk of failure and increases decision awareness when choosing what needs to be prioritised! But these outcomes can only be achieved when Data Scientists know exactly what they are trying to prove, discover or disprove and how that would be valuable to their team’s objective. Gradually, Data Scientists become better at it and this benefits both themselves as well as their teams.

What are the downsides of Agile in data science? What can we do about these downsides?

Agile is a set of values and principles; as such, I can’t really say that there is something wrong with it. What is surely wrong is to assume that Agile is the only way that a team can work and be productive—it’s not. Ever since Agile emerged—in the concrete form that we know it today through the Agile manifesto—many hurried to undermine the effectiveness of other development models, e.g. Waterfall.

There is nothing wrong with the Waterfall model either; the real question is whether these practices or models are fit for purpose! There are surely research projects as well as business requirements around the delivery of software that could potentially be delivered through the Waterfall approach or maybe through a combination of the two. What project managers and teams should strive for is increasing their effectiveness and efficiency. If that can be done by building on top of the Agile values then great; if not, then maybe they will need to try and come up with a different formula.

Project managers focusing too much on what Agile is and what is not—if it needs to be Scrum or Kanban or if too much documentation or too much time spent in design is not Agile—are bound to make mistakes.

Do you think that the imposition of Agile on teams (the Agile Industrial Complex) is defeating the purpose of Agile in finding what works best for teams in working adaptively?

In similar spirit to my previous response, I do! Once more, I can’t stress enough how there is no single perfect development model. Project managers need to always assess what is fit for purpose. Primarily though, they should focus on the underpinning values and principles that Agile or other development models are characterised by. When they do, a recurrent mistake that I have experienced through my consulting career is the oversimplification of Agile as an anti-methodology, anti-documentation and anti-planning development model. I appreciate that this makes understanding Agile much easier, but at the same time it is a very unfair representation of what Agile is! Imposing it on this basis is surely wrong. Equally, practicing Agile is definitely not something that comes through imposition.

I was exposed to the Agile methodology through a passionate software engineer who was an evangelist of Xtreme Programming. To him, the way he worked was a way of seeing the software engineering world and was supported by many more things than just sprints and Jira tickets and user-stories. Knowledge transferring and evolution through an unparallel team spirit and an overall culture to do things in a way that will help everyone grow (people and software) in a fast-paced and fast-evolving world. Empathy was found at the centre of everything he did and his ability to convey this passion was extreme! @tumbarumba, all the best wherever you are!

This is because Agile is, above all, a culture—a way of thinking; a way of caring about the impact and consequences of every individual’s contribution to a team goal. When it is collectively addressed as such, then only good things can come out of it.

Is there a possible reason for many data scientists to not be aware of the Agile manifesto?

I can’t be too sure about this but I if I was to point at anything, that would be how Data Science has, until recently (5 years ago), been so disjoint from the delivery of production-ready solutions. It was more focussed on research and discovery to aid decision making. Lately, the evolution and growth of ML as well as of cost-effective services to support it, necessitates the interaction of the two worlds.

Never before has it been so much the case that ML models are such an integral component of software. Before, Data Scientists did not need to worry about the operationalisation and maintenance of their model. Concepts like versioning, robustness, code-coverage and testing where not so much imposed or needed, let alone challenges related to things like dealing with technical debt and refactoring. The traditional work environment would be a Jupyter notebook with access to a database! So, Data Scientists did not need to be exposed to so many practices to govern how they would work to deliver new insights.

What kind of challenges stand in the way of operational production level DS solutions?

This mostly has to do with bridging the gap between software engineers and data scientists. Software engineers not exposed to data science can’t really do this because they fail to appreciate how exactly to maintain ML-pipelines. Note that in contrast to traditional software pipelines, there are many more issues that need to be addressed; I would refer your readers to the 2014, NIPS seminal paper on the “Hidden Technical Debt in Machine Learning Systems”. Equally, Data Scientists don’t appreciate the complexity of developing and maintaining code-bases and software solutions in a flexible and robust way to allow for things like CI/CD to be supported. This gap is now partially addressed through the emergence of a new paradigm: the ML engineer, a hybrid data scientist and software engineer, equipped with the knowledge to deal with challenges from both worlds. However, that is not enough to account for everything. What is also necessary is the emergence of appropriate tooling to support the development and maintenance of ML pipelines. A good example is Apache KubeFlow, AWS Sagemaker and the less mature but fast evolving Google AI platform.

What is surely not helpful is the bad practice of finding ways to schedule and run python notebooks in production, and I purposely changed paragraphs to highlight this! I can’t stress enough how many times I have dealt with this in my career! Python notebooks are not made to be run as part of production pipelines—yet so many companies just do so!

This is a plea to every project manager running an ML project out there: This is madness! Please stop it!

This. Is. Sparta! GIF from 300 GIFs

In your opinion, what is the most important factor in making ML-Ops agile?

I think that the answer to this question is “culture”. ML-Ops are here to help cultivate collaboration between data scientists and engineers to support the ML-lifecycle. They are a manifestation of Agile for Data Science in a way! What’s needed is for this mentality towards the development of production level ML solutions to be supported by practitioners, project managers and stakeholders the same. Everyone needs to take risks and own responsibility. Data Scientists need to develop the courage of supporting their experiments even if they may appear to delay production; they need to help stakeholders and project managers appreciate the actual value of experimentation. This will often prove to be very challenging; loss aversion will eventually kick in and when it does people will be more reluctant to change, and they will want to stick to what they know. But this is to be expected! It is natural human behaviour, and this is what we, as a community, are up against.

At the end of the day, we need to remember that it is almost impossible to find the right balance or get it perfectly right. There is no formula for it. Nevertheless, value will come simply from trying to get it right, and that is more than enough!

Many thanks again to both Tim and Dhanhyaashri for their time and effort!

Remember to like my post and re-share it (if you really liked it)!

See you soon!

AWS ML Certification

2020-07-29T00:00:00+00:00

I recently took the AWS Certified Machine Learning - Specialty, which remains one of the most demanding AWS certifications. I went through a lot of work in order to adequately prepare for this exam and I can tell you that it is indeed one of the hardest AWS certifications. Nevertheless, with proper preparation and a bit of dedication you should be fine.

How long do I need to study for this?

Well it depends; if you are an experienced Data Scientist and have been applying Data Science for about 3+ years then an hour per day for a month should be enough. This also holds if you are an engineer already exposed to the AWS infrastructure and services but are not familiar with Data Science topics.

You see, this certification is labelled as hard simply because it is not just about AWS services. 50% of it is concerned with purely Data Science topics; the other 50% is about AWS services that support Data Science and ML activities. If you are neither exposed to Data Science nor to the AWS services then at least 2 months of studying is recommended.

What does the exam cover?

Data Engineering covers 20% of the exam, then Exploratory Data Analysis concerns another 24%, modelling is 36% and Machine Learning Implementation and Operations is 20%.

I put together a list below, in an attempt to summarise the content:

Data Concepts:
- Deals with data preparation routines; things like:
  - Feature selection and;
  - Feature engineering
  - PCA,
  - dealing with missing data or unbalanced datasets,
  - labels and one-hot encoding as well as;
  - splitting and randomisation of data.
ML Concepts: Covers:
- Classical ML Categories of Algorithms
- Deep Learning
- The ML-Life-cycle
- Optimisation: Gradient Descent
- Regularisation
- Hyperparameter Tuning
- Cross-Validation
- Record I/O
ML Algorithms: A list of algorithms you should be familiar with:
- Logistic Regression
- Linear Regression
- Support Vector Machine
- Decision Trees
- Random Forests
- K-Means
- K-Nearest Neighbours
- Latent Dirichlet Allocation (LDA) Algorithm
Deep Learning:
- Cover Neural Networks in a general sense
- Convolutional Neural Networks: High-level understanding
- Recurrent Neural Networks: High-level understanding
Model Optimisation:
- Confusion Matrix
- Sensitivity and Specificity
- Accuracy & Precision
- ROC/AUC
- Gini Impurity
- F1-Score
ML Tools & Frameworks: Cover basic ML tools (know what they do and what they are used for)
- Jupyter Notebooks
- Pytorch
- MXNet
- TensorFlow
- Keras
- Scikit-learn
Amazon Serverless Services: Not everything; think about the things that a Data Scientist of ML engineer would need to do.
- Simple Storage Services - S3
- Glue
- Athena
- Amazon Quicksight
- Kinesis, Streams, Firehose, Video & Analytics (S.O.S. this one ;) )
- EMR with Spark
- EC2 for ML
- Lambda Functions
- Step Functions
Amazon Serverless ML Services: These are out-of-the-box ML solutions offered by AWS.
- Rekognition (image/video)
- Amazon Poly (Text-to-Speech)
- Amazon Transcribe (Speech-to-Text)
- Amazon Translate
- Amazon Comprehend (Text Analysis Service)
- Amazon Lex (Conversation Interface Service - Chatbots)
- Amazon Service Chaining with AWS Step Functions
SageMaker: A service that you really need to spend time with!
- What is it exactly?
- Benefits? Advantages?
- Supported Algorithms (huge list; learn most popular ones)
- Building and Pre-processing / Ground Truth
- Training and Data sourcing
- Hyper-parameter Tuning
- Model Servicing (Https endpoints)
- Elastic inference
- Batch Transform

This is by no means an exhaustive list, but you will at least get an idea about what is generally involved.

How should I prepare?

There are many ways to prepare. Myself, I covered the relative course on Linux academy, which I highly recommend.

Ideally I would recommend spending some time with SageMaker and try to interact with services like lambda functions and step-functions as well as Kinesis, Glue and Athena. However, that would take a while to do plus, using these resources does not come for free.

The Linux Academy Course has a number of labs that will help you develop an adequate understanding of these services. You can worry about honing your skills and knowledge at a later point.

How long does the exam last?

The exam consists of 65 multiple-choice, multi-selection questions. It is 3 hours long, which I think is more than enough to answer all questions and then review your responses (…or take a nap while waiting for your colleagues to finish; I do have a colleague who actually did this–myself I can never relax that much when it comes to exams).

In general, AWS exams are taken at authorised exam centers. Due to the COVID-19 lockdown, this was adjusted to satisfy the high demand in exam takers and people can take the test from home. However, the process is equally strict:

You need to provide information about the room you will be sitting in;
Room needs to be completely quiet during the exam session;
You need to be alone in the room;
You need to provide pictures of your surroundings to show you have no notes or anything suspicious close to you;
A proctor will login at the time of the exam and will ask to inspect the space around you (he asked me to show him the back of my computer prior beginning and doing so with my iMac was quite a challenge… so if you have an option go for laptop).
The exam session will be recorded.

Note that, as one would expect, looking away from the screen for more than a couple of seconds might prompt the proctor to give you a notice. To be honest, as soon as the exam began it was quite easy to just focus on the screen. It took me less than an hour to cover all questions and then used all the remaining time reviewing my responses. I received a positive notification that I passed on exam completion, but it was subject to a committee review. I guess that examiners inspect the video of yourself taking the exam to identify if you tried cheating or something. In no more than 3 days I got the official certification.

Any tips? Advice?

Well, tip number one is: “If you don’t know which is the right answer, then just go for the AWS solution in the list of options”. At large, this exam tests whether you are familiar with what is available to you through the AWS platform. If a client wants to use ML for image moderation and you recommend anything other than Rekognition then you clearly don’t know how Rekognition is used! This has generally worked for me as a way of filtering in and out options.

I would definitely recommend covering the SageMaker FAQs which I see as a wonderful source for exam material.

Do cover the official AWS practice exam; it is just 20 questions, but it is enough to give you an idea about what you are up against.

That’s it! I really wish that this article will help you get started with your learning journey and I hope that soon enough you will be joining the AWS Certified Global Community to share your badge with everyone.

Remember to like my post and re-share it (if you really liked it)!

See you soon!

Just do it!

2020-07-28T00:00:00+00:00

The thing about writing a blog-post is that you are exposing yourself to the world; it feels a lot like flying for the first time.

You will be criticised! Some will appreciate your work. Others will say it’s wrong, they ‘ll disagree–which is actually promoting healthy public debates and hence is a good thing–or they will just not care. Ultimately, blogging has nothing to do with being right neither it is about writing the perfect post. Put simply, it is just about doing it.

One of my favourite novels is “The Plague, by A. Camus”. At the moment of writing I can’t recommend this book enough given the global COVID-19 commotion. In this book, Camus’ characters are engaged in helping and saving people in the name of no ideology; people dying so unfairly (especially children) is enough to move anyone to act irrespective of whether this is suported by some moral justification.

There is one particular character, a side-character that came to mind when I sat down to write this post; Joseph Grand. Joseph is a fifty-year-old clerk operating for the city government. He lives an austere life, and in his spare time, he is writing a book. However, he is such a perfectionist that he ends up rewriting the first sentence over and over and never gets to proceed any further. No words are ever good enough! What if meaning can be elevated to a higher level if a different wording is used. He is self-blocking himself feeling helpless and devastated.

We ‘ve all been there–I am sure. If only he could let go of his perfectionism and move on to complete that first paragraph and write the first chapter. What would be the story that he would say? What morals and learnings would be revealed and shared?

I guess we will never find out about Joseph Grand, but my blogging journey begins here and now. Looking forward to hear your thoughts and I welcome all of your comments.

Remember to like my post and re-share it (if you really liked it)!

See you soon!

ML Affairs

When the Map Lies: Why AIS Trust Matters in Contested Waters

The New Vocabulary of Maritime Risk

Hormuz Makes The Problem Concrete

AIS Is A Claim, Not A Fact

What Vortexa Adds

The Trust Layer

Why This Is An ML Systems Problem

What the Gulf Examples Show

Why Vortexa Matters Here

Closing Thought

Further Reading

MCP Finally Clicked: It Is Plumbing. Trust Is The Product.

Why MCP Had To Happen

The Jargon That Tripped Me

A Tool Is Usually An API Call

The Simplest Mental Model

The Context Window Is The Real Tool Problem

Docker MCP Toolkit Is A Manager, Not The Trust Model

Read-Only Is Not A Vibe

Where I Would Start

Closing Thoughts

Useful References

ML Engineering Needs A Taxonomy

The Ambiguity Problem

T-Shaped Is Good. Unlimited Width Is Not.

A Working Taxonomy

1. Software engineers

2. Data engineers

3. Data scientists

4. ML engineers

Where The Industry Gets It Wrong

Why This Matters In Practice

1. It distorts hiring

2. It creates unfair expectations

3. It weakens team design

What We Need Instead

The Real Takeaway

Coding Got Cheap. Verification Did Not.

From Writing To Verification

The Wrong Fix: More Agents

The Better Framing: Verification Systems Design

Review Guarantees, Not Just Diffs

Back To Fundamentals

Force Decomposition At Generation Time

Shift Validation Left Into Machines

1. Property-based testing

2. Static analysis gates

3. Runtime assertions

Add Risk Awareness To Review

Trust Is What Makes Automation Scale

Kafka Streams vs Flink Is The Wrong Question

The Bias I Had To Correct

What I Still Love About Flink

Where Kafka Streams Grew On Me

1. The State Model Is Different, Not Just Worse

2. Kafka-Native Integration Is A Real Strength

Where Flink Still Pulls Away

1. Scaling Stops At The Broker Boundary Much Earlier In Kafka Streams

2. Checkpointing And Recovery Are In A Different League

The Real Trade-Off

The Harder Lesson

What I Would Actually Do

PyFlink In 2026: Better Than Its Reputation, Still Not Frictionless

What Has Improved Since The Older Evaluation

1. The Python Story Is Better Documented

2. The DataStream Story Is No Longer Hand-Wavy

3. Debugging And Packaging Are Better Acknowledged

Why PyFlink Is Genuinely Attractive

1. It Keeps The Streaming Layer Closer To The Actual ML Ecosystem

2. It Meets Python-Heavy Teams Where They Already Work

3. It Makes Flink More Reachable Without Hiding Flink

4. There Is A Real Connector Surface

Why PyFlink Is Still Not “Flink, But Easy”

1. You Still Have To Think In Two Worlds

2. The Connector Story Still Leaks JVM Reality

3. Debugging Still Tells You What The System Really Is

4. Native Python Models Are Not An Automatic Architectural Win

5. The Performance Question Never Fully Goes Away

So When Would I Use It?

Generating Python Data Classes from `*.avsc` files

What is `dynamicio` then?

Setting up your `global` interpreter

Using `jupyter notebook` or `jupyter lab` with a virtual environment of your choice