<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>ML Affairs</title>
        <description>Posts by C. Hadjinikolis</description>
        <link>https://christos-hadjinikolis.github.io</link>
        
        
        <item>
            <title>ML Engineering Needs A Taxonomy</title>
            <description>&lt;p&gt;If you spend enough time reading job descriptions in this space, a pattern starts to feel impossible to ignore.&lt;/p&gt;

&lt;p&gt;Everyone says they want an &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineer&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;And honestly, the ambiguity is getting tiring.&lt;/p&gt;

&lt;p&gt;That reaction is not theoretical for me. I started much closer to data science: notebooks, experiments, signal, modelling. Then the work pulled me toward software engineering, data engineering, pipelines, deployment, observability, and production support, because useful &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; does not stop at the point where a model looks promising.&lt;/p&gt;

&lt;p&gt;Over time, that is how I grew into &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineering&lt;/span&gt;: not as a clean title switch, but as a set of responsibilities that kept appearing at the boundary between learning systems and production systems.&lt;/p&gt;

&lt;p&gt;So my issue is not that titles need to be perfect. They never will be. Real work is messy, teams evolve, and people grow across boundaries. Some overlap is not only healthy; it is necessary.&lt;/p&gt;

&lt;p&gt;But there is a difference between healthy overlap and lazy role design. The market keeps blurring the difference, then acts surprised when hiring becomes noisy and delivery becomes uneven.&lt;/p&gt;

&lt;p&gt;Part of this is normal. Almost law-governed, in the sense that it was always going to happen once a young field started expanding quickly. First the work appears. Then the titles arrive. Then everyone tries to sound a bit more complete than they really are.&lt;/p&gt;

&lt;p&gt;That is how you end up with profiles that drift from useful description into theatre:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;AI Strategist&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Data Whisperer&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Full-Stack ML Scientist&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;MLOps Architect Evangelist&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/ml-engineering-needs-a-taxonomy/job-title-theatre.png&quot; alt=&quot;Four ninja engineers posing as exaggerated AI and ML job-title archetypes: AI Strategist, Data Whisperer, Full-Stack ML Scientist, and MLOps Architect Evangelist.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;The theatre is funny until it becomes the hiring spec.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Some of that is harmless branding. Some of it is people trying to survive a confusing market. But the same exaggeration that helps people market themselves also makes the field harder to describe clearly.&lt;/p&gt;

&lt;p&gt;And when companies copy that ambiguity into job descriptions, it stops being funny.&lt;/p&gt;

&lt;p&gt;What they often seem to mean is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;a data scientist who can productionise models&lt;/li&gt;
  &lt;li&gt;a software engineer who understands modelling&lt;/li&gt;
  &lt;li&gt;a data engineer who can own features and pipelines&lt;/li&gt;
  &lt;li&gt;an ML platform engineer who understands infrastructure, MLOps, serving, observability, reproducibility, and the surrounding ecosystem of tools and practices&lt;/li&gt;
  &lt;li&gt;if senior, someone who can also think about product, experimentation, and stakeholder communication&lt;/li&gt;
  &lt;li&gt;and finally, someone who keeps up with a research landscape that has recently exploded in both depth and breadth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, a unicorn.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/ml-engineering-needs-a-taxonomy/unicorn-role-spec.png&quot; alt=&quot;A ninja engineer pointing at an overloaded ML Engineer job specification scroll while a faint unicorn outline appears in the background.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;The unicorn appears when one role quietly absorbs five centres of gravity.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This is not just a wording issue.&lt;/p&gt;

&lt;p&gt;It is a &lt;strong&gt;taxonomy failure&lt;/strong&gt;, and taxonomy is how teams coordinate work.&lt;/p&gt;

&lt;p&gt;Role language decides what hiring loops test for, what teams staff for, what people own, and what they are judged against. Once that language becomes vague enough, the problem stops being semantic and becomes operational.&lt;/p&gt;

&lt;p&gt;The field has become too wide for vague labels to carry this much responsibility.&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;When taxonomy fails, organisations do not just misname work. They mis-coordinate it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-ambiguity-problem&quot;&gt;The Ambiguity Problem&lt;/h2&gt;

&lt;p&gt;Terms like &lt;em&gt;data scientist&lt;/em&gt;, &lt;em&gt;applied data scientist&lt;/em&gt;, &lt;em&gt;data engineer&lt;/em&gt;, &lt;em&gt;software engineer&lt;/em&gt;, and &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineer&lt;/span&gt; are all used inconsistently.&lt;/p&gt;

&lt;p&gt;Sometimes that is understandable. Real teams evolve. Smaller companies need broader people. Titles drift.&lt;/p&gt;

&lt;p&gt;But the ambiguity is now large enough to create real problems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;hiring managers ask for impossible overlap&lt;/li&gt;
  &lt;li&gt;candidates do not know what success in the role actually means&lt;/li&gt;
  &lt;li&gt;teams build fuzzy ownership boundaries&lt;/li&gt;
  &lt;li&gt;delivery slows down because responsibilities are unclear&lt;/li&gt;
  &lt;li&gt;people get judged against expectations nobody made explicit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When roles are unclear, &lt;strong&gt;ownership fragments&lt;/strong&gt;. Work gets duplicated, responsibilities fall through gaps, and accountability becomes negotiable. The system does not just slow down; it becomes harder to reason about.&lt;/p&gt;

&lt;p&gt;That is where this stops being semantics and starts becoming an organisational failure mode.&lt;/p&gt;

&lt;h2 id=&quot;t-shaped-is-good-unlimited-width-is-not&quot;&gt;T-Shaped Is Good. Unlimited Width Is Not.&lt;/h2&gt;

&lt;p&gt;I am strongly in favour of people being &lt;strong&gt;T-shaped&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That matters. Breadth improves collaboration. It reduces blind spots. It helps teams understand each other’s constraints.&lt;/p&gt;

&lt;p&gt;My own path depended on that kind of breadth. Moving from data science into software and data engineering was uncomfortable at times, but it made me a better engineer. It taught me why a beautiful model can still be useless if the features are unstable, the pipeline is fragile, the deployment path is unclear, or nobody can explain the production behaviour six months later.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/ml-engineering-needs-a-taxonomy/t-shaped-vs-unlimited-width.png&quot; alt=&quot;A ninja engineer comparing a balanced T-shaped ML systems profile with an overloaded unlimited-width profile carrying too many responsibilities.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;Breadth helps. Unlimited width collapses into role design theatre.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;But there is a difference between healthy breadth and role collapse.&lt;/p&gt;

&lt;p&gt;There is a point where &lt;em&gt;“be cross-functional”&lt;/em&gt; turns into &lt;em&gt;“please cover four disciplines badly enough that we can pretend one headcount is enough.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is not maturity.&lt;/p&gt;

&lt;p&gt;That is under-specification disguised as ambition.&lt;/p&gt;

&lt;h2 id=&quot;a-working-taxonomy&quot;&gt;A Working Taxonomy&lt;/h2&gt;

&lt;p&gt;I do not think these roles need perfectly rigid borders, but I do think we need clearer centres of gravity.&lt;/p&gt;

&lt;h3 id=&quot;1-software-engineers&quot;&gt;1. Software engineers&lt;/h3&gt;

&lt;p&gt;Their centre of gravity is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;application design&lt;/li&gt;
  &lt;li&gt;interfaces and boundaries&lt;/li&gt;
  &lt;li&gt;maintainability&lt;/li&gt;
  &lt;li&gt;testing discipline&lt;/li&gt;
  &lt;li&gt;deployment, reliability, and operational quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They may or may not know much about modelling.&lt;/p&gt;

&lt;p&gt;That is not the point of the role.&lt;/p&gt;

&lt;h3 id=&quot;2-data-engineers&quot;&gt;2. Data engineers&lt;/h3&gt;

&lt;p&gt;Their centre of gravity is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;data movement&lt;/li&gt;
  &lt;li&gt;storage and retrieval patterns&lt;/li&gt;
  &lt;li&gt;pipeline reliability&lt;/li&gt;
  &lt;li&gt;batch and streaming data infrastructure&lt;/li&gt;
  &lt;li&gt;data quality and availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They make sure the data side of the system can actually support the work being asked of it.&lt;/p&gt;

&lt;h3 id=&quot;3-data-scientists&quot;&gt;3. Data scientists&lt;/h3&gt;

&lt;p&gt;Their centre of gravity is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;extracting signal from data&lt;/li&gt;
  &lt;li&gt;experimentation&lt;/li&gt;
  &lt;li&gt;hypothesis testing&lt;/li&gt;
  &lt;li&gt;feature and model development&lt;/li&gt;
  &lt;li&gt;evaluation and interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They sit closest to learning from data and turning uncertainty into useful insight.&lt;/p&gt;

&lt;h3 id=&quot;4-ml-engineers&quot;&gt;4. ML engineers&lt;/h3&gt;

&lt;p&gt;This is the ambiguous one, which is precisely why it needs more care.&lt;/p&gt;

&lt;p&gt;To me, the &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineer&lt;/span&gt; owns the boundary between experimentation and production, where models, data, and systems must behave reliably under real-world constraints.&lt;/p&gt;

&lt;p&gt;It is not &lt;em&gt;“person who does some of all of the above.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is the role that cares about:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;turning model-driven logic into production systems&lt;/li&gt;
  &lt;li&gt;managing the boundary between experimentation and serving&lt;/li&gt;
  &lt;li&gt;making inference, features, deployment, monitoring, rollback, and reproducibility actually work together&lt;/li&gt;
  &lt;li&gt;understanding enough of software, data, and modelling to make the whole thing operationally coherent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is already a serious role.&lt;/p&gt;

&lt;p&gt;It does not need to secretly absorb every adjacent discipline to be legitimate.&lt;/p&gt;

&lt;h2 id=&quot;where-the-industry-gets-it-wrong&quot;&gt;Where The Industry Gets It Wrong&lt;/h2&gt;

&lt;p&gt;The confusion starts when companies use &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineer&lt;/span&gt; as a placeholder for unfinished thinking.&lt;/p&gt;

&lt;p&gt;Instead of deciding what the team is missing, they post a role that quietly asks for:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;modelling depth&lt;/li&gt;
  &lt;li&gt;platform depth&lt;/li&gt;
  &lt;li&gt;data pipeline ownership&lt;/li&gt;
  &lt;li&gt;backend engineering&lt;/li&gt;
  &lt;li&gt;experimentation design&lt;/li&gt;
  &lt;li&gt;stakeholder fluency&lt;/li&gt;
  &lt;li&gt;production support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes one person can cover a surprising amount of that.&lt;/p&gt;

&lt;p&gt;In many cases, though, this is not confusion. It is a &lt;strong&gt;team design shortcut&lt;/strong&gt;, trying to compress multiple roles into one headcount.&lt;/p&gt;

&lt;p&gt;But building a role definition around the best-case outlier is not a sound organisational strategy.&lt;/p&gt;

&lt;h2 id=&quot;why-this-matters-in-practice&quot;&gt;Why This Matters In Practice&lt;/h2&gt;

&lt;p&gt;This ambiguity creates at least three practical problems.&lt;/p&gt;

&lt;h3 id=&quot;1-it-distorts-hiring&quot;&gt;1. It distorts hiring&lt;/h3&gt;

&lt;p&gt;If the role is unclear, the interview loop becomes unclear too.&lt;/p&gt;

&lt;p&gt;You end up testing fragments of four disciplines and then pretending the aggregate signal means something precise.&lt;/p&gt;

&lt;h3 id=&quot;2-it-creates-unfair-expectations&quot;&gt;2. It creates unfair expectations&lt;/h3&gt;

&lt;p&gt;People join thinking they were hired for one centre of gravity and then discover they are being evaluated against three others.&lt;/p&gt;

&lt;p&gt;That is bad management, not professional growth.&lt;/p&gt;

&lt;h3 id=&quot;3-it-weakens-team-design&quot;&gt;3. It weakens team design&lt;/h3&gt;

&lt;p&gt;When roles are vague, interfaces become vague too.&lt;/p&gt;

&lt;p&gt;And when interfaces are vague, teams stop designing good collaboration boundaries and start relying on heroic overlap. That creates ownership gaps, blame diffusion, duplicated work, burnout, and systems that degrade because nobody clearly owns end-to-end quality.&lt;/p&gt;

&lt;p&gt;That does not scale well.&lt;/p&gt;

&lt;h2 id=&quot;what-we-need-instead&quot;&gt;What We Need Instead&lt;/h2&gt;

&lt;p&gt;This is not really an argument about job titles.&lt;/p&gt;

&lt;p&gt;It is an argument about how work is partitioned in complex systems: where the interfaces are, where ownership sits, and where accountability lands.&lt;/p&gt;

&lt;p&gt;We need a clearer taxonomy and a more honest way of describing overlap.&lt;/p&gt;

&lt;p&gt;Not rigid boxes.&lt;/p&gt;

&lt;p&gt;But explicit definitions, shared language, and a better sense of where one role’s centre of gravity ends and another begins.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/ml-engineering-needs-a-taxonomy/taxonomy-as-interfaces.png&quot; alt=&quot;Ninja engineers using a Venn diagram to define ownership, collaboration, and boundaries between software, data, modelling, and ML systems roles.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;The goal is not rigid boxes. The goal is honest interfaces.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The interesting part is not eliminating overlap. The interesting part is naming it properly.&lt;/p&gt;

&lt;p&gt;That is where tools like &lt;strong&gt;Venn diagrams&lt;/strong&gt;, capability maps, and role definitions actually help. They force teams to say:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;what this role owns&lt;/li&gt;
  &lt;li&gt;what it touches&lt;/li&gt;
  &lt;li&gt;what it collaborates with&lt;/li&gt;
  &lt;li&gt;what it is not expected to carry alone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is healthier for hiring and much healthier for delivery.&lt;/p&gt;

&lt;h2 id=&quot;the-real-takeaway&quot;&gt;The Real Takeaway&lt;/h2&gt;

&lt;p&gt;I do not think the industry needs fewer broad engineers.&lt;/p&gt;

&lt;p&gt;I think it needs more honesty about what breadth costs.&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;T-shaped growth is real.&lt;/p&gt;
  &lt;p&gt;But asking one person to fully cover software engineering, data engineering, data science, and &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineering&lt;/span&gt; is usually not ambition. It is taxonomy failure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Clearer role definitions are not admin. They are part of how complex work gets partitioned.&lt;/p&gt;

&lt;p&gt;Once role definitions become vague enough, organisations stop designing teams and start relying on exceptional individuals. That might work occasionally. It does not scale.&lt;/p&gt;

&lt;p&gt;A taxonomy is not bureaucracy. It is how the work gets divided clearly enough for systems, teams, and people to survive contact with production.&lt;/p&gt;
</description>
            <pubDate>2026-04-14T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2026/04/14/ml-engineering-needs-a-taxonomy.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2026/04/14/ml-engineering-needs-a-taxonomy.html</guid>
        </item>
        
        
        
        <item>
            <title>Coding Got Cheap. Verification Did Not.</title>
            <description>&lt;p&gt;Right now, the loudest claim around &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;LLM&lt;/span&gt; coding tools is that coding is becoming a commodity.&lt;/p&gt;

&lt;p&gt;I think that is directionally right. What I do not think follows automatically is the part people usually jump to next: that software delivery will therefore speed up by the same factor. The more I use these tools, the less convinced I am by that leap.&lt;/p&gt;

&lt;p&gt;Yes, they can write routine code quickly; they can refactor at a pace that would have felt absurd not long ago. But one &lt;strong&gt;friction point&lt;/strong&gt; keeps getting sharper every time:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;We have increased &lt;span class=&quot;blog-highlight blog-highlight--signal&quot;&gt;write throughput&lt;/span&gt;.&lt;/p&gt;
  &lt;p&gt;We have not increased &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification throughput&lt;/span&gt; at the same rate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the part I think many teams are about to feel much more acutely: &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review friction&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;At least, that was obvious in my own team within a week of all of us adopting &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;LLM&lt;/span&gt; CLIs more seriously in our workflow. Code was appearing faster. Refactors were cheaper. Experiments were easier to try. But the moment those changes started piling up, the real constraint showed itself again: someone still had to understand them, &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; them, and decide whether they were safe to merge.&lt;/p&gt;

&lt;p&gt;And while this is easiest to see with &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;LLM&lt;/span&gt; CLIs and all the current code-vibing enthusiasm, I do think the point extends to &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agents&lt;/span&gt; too.&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;Agents&lt;/span&gt; do not have the &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agency&lt;/span&gt; they would need to make software delivery scale in a production environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They can generate code. They can propose plans. They can widen the search space. But they do not own production risk. They do not carry on-call duty. They do not defend the change in front of a customer. They do not absorb the cost of being wrong.&lt;/p&gt;

&lt;p&gt;That responsibility is still human.&lt;/p&gt;

&lt;p&gt;And because that responsibility is still human, the bottleneck has moved.&lt;/p&gt;

&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/llm-clis-have-a-new-friction-point/write-throughput-vs-verification-bottleneck.png&quot; alt=&quot;Ninja engineers generating pull requests faster than a slower verification station can review, verify, and merge them.&quot; /&gt;
  &lt;p class=&quot;image-credit&quot;&gt;The new imbalance is simple: code generation is accelerating faster than &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; and &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;from-writing-to-verification&quot;&gt;From Writing To Verification&lt;/h2&gt;

&lt;p&gt;For a while, most of the conversation around coding &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agents&lt;/span&gt; was about output:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;how many files they can touch&lt;/li&gt;
  &lt;li&gt;how quickly they can scaffold&lt;/li&gt;
  &lt;li&gt;how much code they can produce in one go&lt;/li&gt;
  &lt;li&gt;whether coding itself is becoming a commodity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That framing is no longer enough.&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;If code generation gets ten times faster while &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt;, integration, and &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; stay roughly flat, the system does not become ten times faster.&lt;/p&gt;
  &lt;p&gt;It becomes unstable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What used to be scarce was code production. What is scarce now is trust.  And trust is slower.  It lives inside:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; bandwidth&lt;/li&gt;
  &lt;li&gt;change understanding&lt;/li&gt;
  &lt;li&gt;test quality&lt;/li&gt;
  &lt;li&gt;integration sequencing&lt;/li&gt;
  &lt;li&gt;rollback confidence&lt;/li&gt;
  &lt;li&gt;the ability to explain why a change is safe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I do not find &lt;em&gt;“these tools make engineers faster”&lt;/em&gt; a very useful claim on its own. Faster at producing diffs is not the same thing as faster at delivering software.  Worse, if you leave the system unchanged, the imbalance compounds:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;more code appears&lt;/li&gt;
  &lt;li&gt;reviewers get overloaded&lt;/li&gt;
  &lt;li&gt;&lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; quality drops&lt;/li&gt;
  &lt;li&gt;defects move downstream&lt;/li&gt;
  &lt;li&gt;rollback frequency rises&lt;/li&gt;
  &lt;li&gt;trust in generated changes starts to erode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So no, the bottleneck did not disappear; it moved from writing code to trusting code.&lt;/p&gt;

&lt;h2 id=&quot;the-wrong-fix-more-agents&quot;&gt;The Wrong Fix: More Agents&lt;/h2&gt;

&lt;p&gt;I think many teams are still responding to this with the wrong instinct.&lt;/p&gt;

&lt;p&gt;If generation is cheap, they assume the answer is to introduce even more &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agents&lt;/span&gt;, even more automatic change, even more output.&lt;/p&gt;

&lt;p&gt;But more &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agents&lt;/span&gt; do not solve a trust bottleneck; they amplify it. Without strong engineering constraints, cheap generation gives you:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;bigger pull requests because exploration is cheap&lt;/li&gt;
  &lt;li&gt;noisier pull requests because changing code is cheap&lt;/li&gt;
  &lt;li&gt;more speculative diffs because rewriting is cheap&lt;/li&gt;
  &lt;li&gt;slower &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;reviews&lt;/span&gt; because understanding still costs the same&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not scale; it is faster chaos. If teams do not build a stronger trust system around these tools, they will not really scale AI-assisted development. They will just generate more change than they can responsibly absorb.&lt;/p&gt;

&lt;h2 id=&quot;the-better-framing-verification-systems-design&quot;&gt;The Better Framing: Verification Systems Design&lt;/h2&gt;

&lt;p&gt;This is why I think the right framing is not &lt;em&gt;“how do we optimise the PR process?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;but:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;How do we design a &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; system that can keep up with generated change?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Smaller PRs matter. Merge queues matter. I believe that strongly. But they are not enough on their own. They improve the shape of change.  They do not automatically make change trustworthy.&lt;/p&gt;

&lt;p&gt;If you want AI-assisted development to scale, you need a system that turns fast code generation into &lt;em&gt;verifiable, reviewable, bounded&lt;/em&gt; progress. That means moving from &lt;em&gt;reviewing code&lt;/em&gt; to &lt;em&gt;reviewing guarantees&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; system is not just a pile of checks. It is a structured way of turning change into bounded, testable, explainable units of risk.&lt;/p&gt;

&lt;h2 id=&quot;review-guarantees-not-just-diffs&quot;&gt;Review Guarantees, Not Just Diffs&lt;/h2&gt;

&lt;p&gt;Right now, too many AI-assisted workflows still look like this:&lt;/p&gt;

&lt;div class=&quot;blog-flow&quot;&gt;
  &lt;div class=&quot;blog-flow__step&quot;&gt;Tool writes code&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step&quot;&gt;Human reviews diff&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step&quot;&gt;Human approves&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step blog-flow__step--warning&quot;&gt;Hope nothing subtle broke&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;That does not scale; it just shifts cognitive load onto the reviewer.&lt;/p&gt;

&lt;p&gt;The better pattern is to require every serious change to state clearly:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;what changed;&lt;/li&gt;
  &lt;li&gt;what must remain true;&lt;/li&gt;
  &lt;li&gt;how we know it works;&lt;/li&gt;
  &lt;li&gt;what failure modes were considered.&lt;/li&gt;
&lt;/ul&gt;
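
&lt;p&gt;In PR-description form, those four questions can become a fixed skeleton. The sketch below is one possible shape for such a skeleton, not a canonical template:&lt;/p&gt;

```markdown
## What changed
One or two sentences describing the change in plain language.

## What must remain true
The invariants this change is not allowed to break.

## How we know it works
Tests added or extended, property checks, manual verification steps.

## Failure modes considered
What could go wrong, and what the rollback path is.
```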

&lt;p&gt;If that information is missing, the reviewer is being asked to &lt;em&gt;reconstruct intent from the diff, infer risk from context, and simulate behaviour in their head&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That is expensive, and that is exactly the kind of &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review friction&lt;/span&gt; we should be trying to remove. The important part is to make those guarantees tangible. For example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;this transformation preserves ordering invariants;&lt;/li&gt;
  &lt;li&gt;this refactor is behaviourally equivalent under property tests;&lt;/li&gt;
  &lt;li&gt;this change cannot affect downstream state transitions because the boundary remains unchanged.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once a reviewer sees that kind of claim backed by evidence, the whole exercise changes. They stop scanning raw volume and start checking &lt;strong&gt;bounded risk&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/llm-clis-have-a-new-friction-point/review-guarantees-not-just-diffs.png&quot; alt=&quot;Ninja engineers reviewing guarantees, invariants, tests, and failure modes instead of just scanning raw diffs.&quot; /&gt;
  &lt;p class=&quot;image-credit&quot;&gt;A better &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; model is not “read more diff.” It is “check stronger guarantees.”&lt;/p&gt;
&lt;/div&gt;
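
&lt;p&gt;A claim like &lt;em&gt;“behaviourally equivalent under property tests”&lt;/em&gt; does not need heavy machinery to become checkable. The sketch below uses hypothetical &lt;code&gt;dedupe_legacy&lt;/code&gt; and &lt;code&gt;dedupe_refactored&lt;/code&gt; functions to stand in for the old and new implementations; even a hand-rolled loop over random inputs turns “trust me, behaviour is unchanged” into evidence a reviewer can rerun. Property-testing libraries do this more thoroughly, but the principle is the same:&lt;/p&gt;

```python
import random

def dedupe_legacy(items):
    # original implementation: explicit loop, first occurrence wins
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def dedupe_refactored(items):
    # proposed refactor: dicts preserve insertion order in Python 3.7+
    return list(dict.fromkeys(items))

def check_equivalence(trials=1000, seed=42):
    # property test: both versions must agree on many random inputs,
    # which also exercises the ordering invariant (first occurrence wins)
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(0, 20) for _ in range(rng.randint(0, 40))]
        assert dedupe_refactored(items) == dedupe_legacy(items), items
    return True
```

&lt;p&gt;Attached to a refactor PR, a check like this narrows the review question from “read the whole diff” to “does the property cover the behaviour that matters?”&lt;/p&gt;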

&lt;h2 id=&quot;back-to-fundamentals&quot;&gt;Back To Fundamentals&lt;/h2&gt;

&lt;p&gt;This is the part I find slightly amusing. Once you follow the argument through, the answer starts sounding strangely old-fashioned. If &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review friction&lt;/span&gt; is the bottleneck, then we do not get out of it with more theatrical tooling.&lt;/p&gt;

&lt;p&gt;We get out of it by returning to fundamentals:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;smaller PRs;&lt;/li&gt;
  &lt;li&gt;clearer intent;&lt;/li&gt;
  &lt;li&gt;narrower scope;&lt;/li&gt;
  &lt;li&gt;better tests;&lt;/li&gt;
  &lt;li&gt;merge queues; and&lt;/li&gt;
  &lt;li&gt;easier rollback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not because these are fashionable process ideas. It is because they reduce the cost of &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; and &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;Large PRs force reviewers into archaeology. They have to reverse-engineer intent, infer boundaries, and simulate outcomes in their head.&lt;/p&gt;

&lt;p&gt;Small PRs let them ask a much narrower question:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Is this one change understandable, bounded, and safe to merge?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a real throughput advantage.&lt;/p&gt;

&lt;p&gt;In an &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agent&lt;/span&gt;-assisted workflow, this matters even more. The natural temptation is to let the tool range widely and submit one impressive diff. That is exactly the wrong shape of change if trust is the bottleneck.&lt;/p&gt;

&lt;p&gt;So yes, smaller PRs, stacked changes, narrow intent, and one decision per &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; unit become a must. They are no longer simply hygiene; they are part of the &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; system.&lt;/p&gt;

&lt;p&gt;This is also where a simple &lt;strong&gt;test-driven&lt;/strong&gt; instinct helps a lot. For example, if someone wants to do a refactor, one very clean pattern is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;first PR&lt;/strong&gt;: add tests and increase coverage&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;second PR&lt;/strong&gt;: do the refactor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The separation matters.&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;In the first PR, the intent is obvious: we are improving confidence.&lt;/li&gt;
    &lt;li&gt;In the second PR, the tests stay fixed, which makes the claim much narrower: behaviour should stay the same.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That lowers cognitive load immediately.&lt;/p&gt;

&lt;p&gt;The same principle generalises. If a change is behavioural, keep the scope small. If a feature is large, deliver it in steps. The hardest work is usually restructuring, and that is exactly where thinking hard about incremental delivery matters most.&lt;/p&gt;

&lt;p&gt;If you want something practical to adapt for your own team, I put together a reusable reference here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/references/pr-template-for-ai-assisted-delivery/&quot;&gt;&lt;strong&gt;PR template for higher-trust AI-assisted delivery&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;force-decomposition-at-generation-time&quot;&gt;Force Decomposition At Generation Time&lt;/h2&gt;

&lt;p&gt;This is where I would push the workflow harder. Do not wait until &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; time to discover that the diff is too large. Force decomposition earlier.&lt;/p&gt;

&lt;p&gt;The correct shape is:&lt;/p&gt;

&lt;div class=&quot;blog-flow&quot;&gt;
  &lt;div class=&quot;blog-flow__step&quot;&gt;Task&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step&quot;&gt;Plan&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step&quot;&gt;Substeps&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step&quot;&gt;PR sequence&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Not:&lt;/p&gt;

&lt;div class=&quot;blog-flow&quot;&gt;
  &lt;div class=&quot;blog-flow__step&quot;&gt;Task&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step blog-flow__step--warning&quot;&gt;Giant AI diff&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step blog-flow__step--warning&quot;&gt;Panic review&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;This is one of the most useful things these tools can do, by the way. They should not just write code. They should help propose the incremental delivery plan by which the code can be introduced safely.&lt;/p&gt;

&lt;p&gt;That is a much better use of an &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agent&lt;/span&gt; than simply asking it for more implementation.&lt;/p&gt;

&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/llm-clis-have-a-new-friction-point/small-prs-and-merge-queue.png&quot; alt=&quot;Ninja engineers breaking a large feature into small pull requests that move through CI, checks, review, and merge in an orderly queue.&quot; /&gt;
  &lt;p class=&quot;image-credit&quot;&gt;Small PRs are not tidiness theatre. They are one of the cleanest ways to lower &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review friction&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;shift-validation-left-into-machines&quot;&gt;Shift Validation Left Into Machines&lt;/h2&gt;

&lt;p&gt;If humans remain the primary validators of AI-generated code, I do not think the model scales very far.&lt;/p&gt;

&lt;p&gt;Humans should still own risk. But they should not be forced to simulate execution in their head for every meaningful change. That means stronger machine-side &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt;.&lt;/p&gt;

&lt;h3 id=&quot;1-property-based-testing&quot;&gt;1. Property-based testing&lt;/h3&gt;

&lt;p&gt;I think &lt;strong&gt;property-based testing&lt;/strong&gt; is one of the most underused tools here.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Well, because many AI-generated bugs are not obvious syntax bugs. They are edge-case bugs. Boundary bugs. &lt;em&gt;“This looked correct for three examples and broke on the fourth”&lt;/em&gt; bugs.&lt;/p&gt;

&lt;p&gt;Property-based testing helps because it checks invariants across many generated inputs instead of blessing one or two happy-path examples.&lt;/p&gt;

&lt;p&gt;A few practical cases (skip these if you get the point):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;a parser should round-trip valid inputs without losing structure&lt;/li&gt;
  &lt;li&gt;a serialization layer should preserve data after encode/decode&lt;/li&gt;
  &lt;li&gt;a ranking function should preserve ordering invariants you care about&lt;/li&gt;
  &lt;li&gt;a pricing or allocation function should never produce negative totals or violate conservation constraints&lt;/li&gt;
  &lt;li&gt;a stream transformation should preserve event counts when it is not supposed to drop or duplicate events&lt;/li&gt;
  &lt;li&gt;an aggregate that should only grow as more events arrive should remain monotonic&lt;/li&gt;
  &lt;li&gt;a pipeline that depends on arrival order should preserve event ordering where that contract is supposed to hold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters because it turns &lt;em&gt;“I read the diff and it seemed fine”&lt;/em&gt; into &lt;em&gt;“the core property stayed true under many cases.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is a better &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; signal.&lt;/p&gt;
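&lt;p&gt;Here is a minimal sketch of the idea using only the standard library; in real projects a framework such as Hypothesis generates and shrinks the inputs for you. The encode/decode pair below is a stand-in for whatever serialization layer you actually have.&lt;/p&gt;

```python
import json
import random

# Property-style check with stdlib-only input generation. The property:
# decode(encode(x)) == x for every generated record, not just one or two
# happy-path examples.

def encode(record):
    return json.dumps(record, sort_keys=True)

def decode(blob):
    return json.loads(blob)

def random_record(rng):
    # A small generator of structurally varied inputs.
    return {
        "id": rng.randint(0, 10**9),
        "name": "".join(rng.choice("abcdef") for _ in range(rng.randint(0, 8))),
        "score": rng.random(),
    }

rng = random.Random(42)
for _ in range(500):
    record = random_record(rng)
    assert decode(encode(record)) == record
print("round-trip property held for 500 generated records")
```

&lt;p&gt;The failure mode this catches is exactly the AI-generated one: code that works for the three examples in the prompt and breaks on the fourth shape of input nobody typed out.&lt;/p&gt;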

&lt;h3 id=&quot;2-static-analysis-gates&quot;&gt;2. Static analysis gates&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Static analysis&lt;/strong&gt; is another place where teams should be more aggressive.&lt;/p&gt;

&lt;p&gt;Not static analysis theatre. Not one more badge in CI. Real gates.&lt;/p&gt;

&lt;p&gt;Practical examples:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;type errors should fail fast;&lt;/li&gt;
  &lt;li&gt;nullability violations should fail fast;&lt;/li&gt;
  &lt;li&gt;unsafe imports or forbidden dependencies should fail fast;&lt;/li&gt;
  &lt;li&gt;obvious dead code or unhandled branches should fail fast, and;&lt;/li&gt;
  &lt;li&gt;insecure patterns or dangerous API usage should fail fast.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more routine structural mistakes a machine can reject automatically, the less human energy gets wasted on basic hygiene.&lt;/p&gt;

&lt;p&gt;That leaves humans freer to &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; the part that actually matters: design, guarantees, and risk.&lt;/p&gt;
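&lt;p&gt;As an illustration of what a real gate can look like, here is a CI fragment in GitHub Actions syntax. The specific tools (mypy, ruff, bandit) and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/&lt;/code&gt; path are examples, not a recommendation; the point is that every step fails the pipeline rather than printing a badge.&lt;/p&gt;

```yaml
# Illustrative static-analysis gate. Each step exits non-zero on a
# violation, so the PR cannot merge past it.
name: static-gates
on: [pull_request]
jobs:
  gates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install mypy ruff bandit
      - run: mypy --strict src/   # type and nullability errors fail fast
      - run: ruff check src/      # dead code, unused imports, lint traps
      - run: bandit -r src/       # insecure patterns, dangerous API usage
```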

&lt;h3 id=&quot;3-runtime-assertions&quot;&gt;3. Runtime assertions&lt;/h3&gt;

&lt;p&gt;I am much less enthusiastic about &lt;strong&gt;runtime assertions&lt;/strong&gt; than about tests, validation, or stronger system boundaries.&lt;/p&gt;

&lt;p&gt;Most of the time, if you need an assertion, it is worth asking whether the system should have prevented that state earlier through better design, clearer contracts, or stricter validation.&lt;/p&gt;

&lt;p&gt;In other words, I would not treat assertions as a primary &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; strategy.&lt;/p&gt;

&lt;p&gt;They still have a narrow place, though, around internal invariants that should be impossible if the rest of the system is behaving correctly. For example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;a state machine reaches an illegal transition;&lt;/li&gt;
  &lt;li&gt;two mutually exclusive internal flags are both true;&lt;/li&gt;
  &lt;li&gt;an event-ordering assumption inside one component is suddenly broken, and;&lt;/li&gt;
  &lt;li&gt;an internal contract is violated in a way that risks silent corruption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where a loud failure can be better than quietly propagating bad state.&lt;/p&gt;

&lt;p&gt;So ok, assertions can help, but only as a last line of defence. I would much rather prevent bad states than merely notice them at runtime.&lt;/p&gt;
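&lt;p&gt;To make that narrow place concrete, here is a sketch of a last-line-of-defence assertion around an internal state machine whose legal transitions are known up front. The states and names are illustrative.&lt;/p&gt;

```python
# Legal transitions for an internal job state machine. If an illegal
# transition is ever attempted, some other part of the system is already
# broken, and failing loudly beats silently corrupting downstream state.

LEGAL = {
    "pending":   {"running", "cancelled"},
    "running":   {"succeeded", "failed"},
    "succeeded": set(),
    "failed":    {"pending"},  # allow retry
    "cancelled": set(),
}

class Job:
    def __init__(self):
        self.state = "pending"

    def transition(self, new_state):
        # Internal invariant, not input validation: callers are expected
        # never to request an illegal transition.
        assert new_state in LEGAL[self.state], (
            f"illegal transition from {self.state} to {new_state}"
        )
        self.state = new_state

job = Job()
job.transition("running")
job.transition("succeeded")
print(job.state)
```

&lt;p&gt;Note what the assertion is not doing: it is not validating user input, and it is not the primary correctness mechanism. It only makes an already-impossible state loud instead of silent.&lt;/p&gt;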

&lt;h2 id=&quot;add-risk-awareness-to-review&quot;&gt;Add Risk Awareness To Review&lt;/h2&gt;

&lt;p&gt;Another thing I think teams need is a more explicit notion of &lt;strong&gt;change risk&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not every AI-generated change should go through the same &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; path.&lt;/p&gt;

&lt;p&gt;There is a difference between:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;a local refactor;&lt;/li&gt;
  &lt;li&gt;a business-logic change;&lt;/li&gt;
  &lt;li&gt;a concurrency change;&lt;/li&gt;
  &lt;li&gt;a stateful systems change, or;&lt;/li&gt;
  &lt;li&gt;a distributed recovery or integration change.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those should not all be treated as the same kind of review object.&lt;/p&gt;

&lt;p&gt;What I would want is some form of confidence or risk scoring:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;🟢 low-risk cosmetic or local changes get a lighter path&lt;/li&gt;
  &lt;li&gt;🟠 medium-risk logic changes get stronger automated evidence&lt;/li&gt;
  &lt;li&gt;🔴 high-risk stateful or distributed changes get narrower scope and deeper human scrutiny&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now, most teams still treat this too uniformly:&lt;/p&gt;

&lt;div class=&quot;blog-flow&quot;&gt;
  &lt;div class=&quot;blog-flow__step blog-flow__step--warning&quot;&gt;Open PR&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step blog-flow__step--warning&quot;&gt;Assign reviewer&lt;/div&gt;
  &lt;div class=&quot;blog-flow__arrow&quot; aria-hidden=&quot;true&quot;&gt;→&lt;/div&gt;
  &lt;div class=&quot;blog-flow__step blog-flow__step--warning&quot;&gt;Hope for the best&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;That is not mature enough for the level of change velocity these tools can produce.&lt;/p&gt;

&lt;h2 id=&quot;trust-is-what-makes-automation-scale&quot;&gt;Trust Is What Makes Automation Scale&lt;/h2&gt;

&lt;p&gt;If there is one broader point underneath all of this, it is that:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
Automation does not scale on capability alone; it scales on trust.
&lt;/blockquote&gt;

&lt;p&gt;If an AI system is not trustworthy, people will hesitate to adopt it, hesitate to depend on it, and ultimately refuse to give it real responsibility. That is true whether we are talking about coding tools, &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agents&lt;/span&gt;, or any other form of automation.&lt;/p&gt;

&lt;p&gt;And trust does not appear by magic. It comes from being able to explain what the system is doing, trace why it did it, bound the risk, and verify that it is behaving safely enough to rely on.&lt;/p&gt;

&lt;p&gt;That is why &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; matters so much. A strong &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; system is how an organisation turns output into trust.&lt;/p&gt;

&lt;p&gt;The self-driving cars example makes that point clear.&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;The problem with self-driving was never just whether people would emotionally accept the absence of a driver.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can put a human in the driver’s seat and solve part of the problem for a while. That gives you supervision, and maybe enough trust to experiment. But it also shows the limit immediately: you still have not built enough trust into the system for automation to carry the responsibility on its own.&lt;/p&gt;

&lt;p&gt;To unlock the real benefit, you need a &lt;strong&gt;validation system&lt;/strong&gt; strong enough to make the absence of a driver trustworthy.&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;Simulation mattered.&lt;/li&gt;
    &lt;li&gt;Certification mattered.&lt;/li&gt;
    &lt;li&gt;Safety cases mattered.&lt;/li&gt;
    &lt;li&gt;&lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;Verification&lt;/span&gt; pipelines mattered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We did not start trusting self-driving because models improved. We trusted it only to the extent that validation systems became industrial.&lt;/p&gt;

&lt;p&gt;We do not need &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agents&lt;/span&gt; with mystical &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agency&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;We need enough trust in their output that automation can carry more of the load without a human having to re-derive everything from scratch.&lt;/p&gt;

&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/llm-clis-have-a-new-friction-point/autonomous-delivery-path.png&quot; alt=&quot;An autonomous delivery car navigating a structured path through tests, review, small PRs, and production while a ninja engineer observes from the side.&quot; /&gt;
  &lt;p class=&quot;image-credit&quot;&gt;Automation starts to scale when trust is built into the delivery path itself, not when a human has to keep rescuing the system from the driver’s seat.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;If I were designing for this bottleneck deliberately, I would want something closer to this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A task is decomposed into a sequence of narrow changes before major implementation begins.&lt;/li&gt;
  &lt;li&gt;Each change states intent, invariants, and how correctness will be validated.&lt;/li&gt;
  &lt;li&gt;Automated checks do the first line of trust work: tests, static analysis, diff classification, CI.&lt;/li&gt;
  &lt;li&gt;Reviewers focus mostly on boundary decisions, guarantees, and system fit.&lt;/li&gt;
  &lt;li&gt;Merge queues and rollback paths keep integration disciplined and stop trust from being wasted in merge thrash.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is a much more serious model than &lt;em&gt;“AI writes, human skims, merge and pray.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The practical takeaway is not to resist &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agents&lt;/span&gt;. It is to build an engineering system where &lt;span class=&quot;blog-highlight blog-highlight--review&quot;&gt;review&lt;/span&gt; and &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; can keep up with them.&lt;/p&gt;

&lt;p&gt;The real unit of speed is not how quickly code appears in a branch. It is how quickly a team can move a change from idea to trusted production without losing control of the system.&lt;/p&gt;

&lt;p&gt;That is the metric that matters. And once you define speed that way, the answer stops sounding futuristic. It becomes strangely familiar:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;smaller PRs;&lt;/li&gt;
  &lt;li&gt;clearer intent;&lt;/li&gt;
  &lt;li&gt;stronger guarantees;&lt;/li&gt;
  &lt;li&gt;better tests;&lt;/li&gt;
  &lt;li&gt;static analysis gates;&lt;/li&gt;
  &lt;li&gt;selective runtime assertions;&lt;/li&gt;
  &lt;li&gt;merge queues, and;&lt;/li&gt;
  &lt;li&gt;low-friction rollback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not bureaucratic leftovers from a slower era. They are what make faster tooling usable.&lt;/p&gt;

&lt;p&gt;If &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;LLM&lt;/span&gt; tooling keeps improving, the teams that win will not be the ones that generate the most code.&lt;/p&gt;

&lt;p&gt;They will be the ones that turn trust into a system.&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;If coding is becoming a commodity, &lt;span class=&quot;blog-highlight blog-highlight--verification&quot;&gt;verification&lt;/span&gt; is not.&lt;/p&gt;
  &lt;p&gt;And if &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agents&lt;/span&gt; do not have &lt;span class=&quot;blog-highlight blog-highlight--agent&quot;&gt;agency&lt;/span&gt;, the burden of trust still sits with us.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many teams are about to discover that the next productivity battle is not about writing code at all. It is about whether their engineering system can metabolise AI-generated change without losing control.&lt;/p&gt;

&lt;p&gt;The best prompt in the world will not save a team that cannot review, verify, and integrate change with discipline.&lt;/p&gt;

&lt;p&gt;That is a much less theatrical advantage. It is also the real one.&lt;/p&gt;
</description>
            <pubDate>2026-04-08T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2026/04/08/llm-clis-have-a-review-speed-problem.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2026/04/08/llm-clis-have-a-review-speed-problem.html</guid>
        </item>
        
        
        
        <item>
            <title>Kafka Streams vs Flink Is The Wrong Question</title>
            <description>&lt;p&gt;I am not neutral about &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;I have spent years advocating for it, using it anywhere I could, organizing London meetups around it before COVID, and talking to anyone who would listen about why the dataflow model is such a good way to think. I still love that model. I love how naturally event-driven systems can align to a domain: &lt;em&gt;a ship enters a port, this state changes, that downstream action happens next.&lt;/em&gt; Both &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; and &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; let you express stateful processes in a way that can stay close to business reality.&lt;/p&gt;

&lt;p&gt;And that is exactly why this lesson was useful for me.&lt;/p&gt;

&lt;p&gt;When I joined a later role, I found myself surrounded by repositories built with &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;. My first instinct was simple: replace them with &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. Some of those repos were chaotic, under-loved, and far away from the kind of streaming architecture I like to build. I felt out of my depth. I wanted to modernize, refactor, migrate, clean the slate.&lt;/p&gt;

&lt;p&gt;But over time, after giving those systems the attention they deserved, I learned something more valuable than another framework argument:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;The useful question is not whether &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is &lt;em&gt;&quot;better&quot;&lt;/em&gt; than &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;.&lt;/p&gt;
  &lt;p&gt;The useful question is when your streaming problem stops being an application concern and becomes a platform concern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is still the line I care about most. But now I care about it with much more respect for both sides.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/when-flink-earns-its-complexity-over-kafka-streams/application-vs-platform-crossroads.png&quot; alt=&quot;A hand-drawn ninja engineer at a crossroads between rewriting toward Flink and curating a Kafka Streams system into a platform-aware architecture.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;This is the real fork in the road: not which mascot wins, but whether the system is still application-shaped or is becoming a platform concern.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;the-bias-i-had-to-correct&quot;&gt;The Bias I Had To Correct&lt;/h2&gt;

&lt;p&gt;There is a recurring engineering mistake hiding in this topic: &lt;em&gt;you inherit a system that feels old, untidy, or unfashionable, and you start reaching for the framework you know better.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I have had to relearn this lesson more than once in my career. It is almost embarrassing how often it comes back, which is probably proof of how important it is.&lt;/p&gt;

&lt;p&gt;I originally wanted to replace those &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; solutions largely because I was more fluent in &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. That fluency gave me clarity in one framework and discomfort in the other, and I briefly mistook that feeling for architecture.&lt;/p&gt;

&lt;p&gt;That is a dangerous mistake.&lt;/p&gt;

&lt;p&gt;Once I slowed down, cleaned up the code, made the domain model clearer, and brought more disciplined engineering practices to those codebases, I ended up with a much less dramatic conclusion:&lt;/p&gt;

&lt;p&gt;If you give an existing streaming system enough love, enough structure, and enough respect for the underlying model, you can get very far without rewriting it.&lt;/p&gt;

&lt;p&gt;That does not make &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; less good. It just makes engineering judgment less theatrical.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/when-flink-earns-its-complexity-over-kafka-streams/rewrite-or-repair.png&quot; alt=&quot;A hand-drawn ninja engineer illustration showing the temptation to rewrite a messy Kafka Streams system while a cleaner architectural repair path is explained.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;The urge to rewrite is strong. The better question is whether the system is structurally wrong or simply under-engineered.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;div class=&quot;blog-insight&quot;&gt;
  &lt;span class=&quot;blog-insight__label&quot;&gt;The Lesson&lt;/span&gt;
  &lt;p&gt;&lt;strong&gt;Framework preference is not architecture.&lt;/strong&gt; My first instinct was to rewrite messy &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; systems into &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. The better answer was to clean the model first, then decide whether the runtime was actually the problem.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;what-i-still-love-about-flink&quot;&gt;What I Still Love About Flink&lt;/h2&gt;

&lt;p&gt;Let me be clear: I am still a very strong &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; advocate.&lt;/p&gt;

&lt;p&gt;I still think the &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; dataflow model is one of the cleanest ways to reason about stateful stream processing. Operator boundaries are explicit. State feels local to the operator that owns it. Checkpointing, recovery, repartitioning, and event-time semantics feel like first-class runtime concepts instead of side effects of a library attached to a broker.&lt;/p&gt;

&lt;p&gt;That is a big deal to me, because I care a lot about how easily a streaming system can be explained.&lt;/p&gt;

&lt;p&gt;When a framework makes the flow of state and events easy to communicate, it usually also makes the system easier to maintain.&lt;/p&gt;

&lt;p&gt;But none of that comes for free.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; asks you to pay an upfront complexity tax in operations, onboarding, debugging, and platform maturity. Misconfigured jobs are not charming. They are expensive. The model feels cleaner once you have paid that tax, not before.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/when-flink-earns-its-complexity-over-kafka-streams/flink-complexity-tax.png&quot; alt=&quot;A hand-drawn ninja engineer facing a Flink complexity tax toll booth before entering a powerful streaming platform city.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;This is the part many framework comparisons skip: the platform is powerful, but you do pay for the privilege of operating it well.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This is why I still reach for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; eagerly when the runtime itself needs to be a serious part of the design.&lt;/p&gt;

&lt;h2 id=&quot;where-kafka-streams-grew-on-me&quot;&gt;Where Kafka Streams Grew On Me&lt;/h2&gt;

&lt;p&gt;What changed for me was not that I stopped liking &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. What changed is that I learned to appreciate where &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; is more enabling than I first allowed.&lt;/p&gt;

&lt;h3 id=&quot;1-the-state-model-is-different-not-just-worse&quot;&gt;1. The State Model Is Different, Not Just Worse&lt;/h3&gt;

&lt;p&gt;One of the things that threw me off at first was the ergonomics of state in &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; gives you state stores, changelog-backed recovery, and table-oriented patterns that can feel more globally available than &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s cleaner operator-local state style. The processor API is very explicit that processors interact with attached state stores, and those stores are fault-tolerant by default. In practice, the default persistent path is a local &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;RocksDB&lt;/span&gt; store backed by a compacted changelog topic. On top of that, table abstractions and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GlobalKTable&lt;/code&gt;-style patterns can make shared reference data or queryable state feel very convenient in the application model.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://kafka.apache.org/40/streams/architecture/&quot;&gt;Kafka Streams architecture&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kafka.apache.org/42/streams/developer-guide/processor-api/&quot;&gt;Kafka Streams processor API and state stores&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That convenience comes with real trade-offs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;local &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;RocksDB&lt;/span&gt; state is fast and useful, but fault tolerance still depends on changelogs&lt;/li&gt;
  &lt;li&gt;restore times can still become painful at scale, especially when local state is lost and the store must rebuild from the changelog&lt;/li&gt;
  &lt;li&gt;the relationship between topology code and materialized state can become messy in under-disciplined repos&lt;/li&gt;
  &lt;li&gt;the convenience of reachable state can encourage poor habits if the model is not kept clear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But convenience is still convenience. There are use cases where having easier access to shared or queryable state is genuinely useful, and it would be dishonest to pretend otherwise.&lt;/p&gt;

&lt;p&gt;My instinct, because of my &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; background, was to push &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; code toward a more operator-local way of thinking anyway: make state ownership clearer, keep logic close to the transform that really owns it, and avoid turning the topology into a stateful soup. That discipline improved those codebases a lot.&lt;/p&gt;

&lt;p&gt;But that is exactly the point: bringing some &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;-style discipline into &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; made the code better. It did not prove that the whole system needed to become &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;.&lt;/p&gt;

&lt;h3 id=&quot;2-kafka-native-integration-is-a-real-strength&quot;&gt;2. Kafka-Native Integration Is A Real Strength&lt;/h3&gt;

&lt;p&gt;I am not even talking here about the obvious ecosystem point in a lazy way. Yes, &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; lives naturally inside the &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; ecosystem. Yes, it works comfortably with keyed messages, schemas, topics, and the usual surrounding tooling. Yes, schema-registry-oriented flows often feel more straightforward there.&lt;/p&gt;

&lt;p&gt;That matters. Not because &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; cannot do these things. It can. But because being native to the ecosystem reduces friction when the whole world around the application is already shaped like &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;You should not dismiss that as a minor detail. It is part of the operating model.&lt;/p&gt;

&lt;h2 id=&quot;where-flink-still-pulls-away&quot;&gt;Where Flink Still Pulls Away&lt;/h2&gt;

&lt;p&gt;This is where my original instincts still hold up.&lt;/p&gt;

&lt;h3 id=&quot;1-scaling-stops-at-the-broker-boundary-much-earlier-in-kafka-streams&quot;&gt;1. Scaling Stops At The Broker Boundary Much Earlier In Kafka Streams&lt;/h3&gt;

&lt;p&gt;The scaling constraint in &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; is tightly tied to partitions, tasks, and instances. That is not a bug. It is the design. It is also why the system stays so close to &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; itself.&lt;/p&gt;

&lt;p&gt;But it has consequences.&lt;/p&gt;

&lt;p&gt;There comes a point where adding more application instances does not really solve the problem because the partitioning boundary is already telling you how far you can go cleanly. You can absolutely scale &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;, but the broker topology keeps exerting a much stronger influence on the application topology.&lt;/p&gt;

&lt;p&gt;At that point, scaling stops being primarily demand-driven and starts becoming topology-constrained.&lt;/p&gt;
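&lt;p&gt;A back-of-the-envelope way to see the boundary: &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; creates stream tasks per input partition, so application instances beyond the task count simply sit idle. The sketch below deliberately ignores threads per instance and multiple sub-topologies; the numbers are illustrative.&lt;/p&gt;

```python
# Simplified view of the Kafka Streams scaling boundary: roughly one task
# per input partition, so extra instances beyond the partition count
# receive no tasks and add no throughput.

def active_instances(num_partitions, num_instances):
    return min(num_partitions, num_instances)

partitions = 12
for instances in (6, 12, 24):
    busy = active_instances(partitions, instances)
    idle = instances - busy
    print(f"{instances} instances: {busy} doing work, {idle} idle")
```

&lt;p&gt;Past that point, the only lever left is repartitioning the topics themselves, which is exactly how the broker topology ends up dictating the application topology.&lt;/p&gt;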

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;, by contrast, is still constrained at the source when consuming from &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;, but once records are inside the runtime it has far more freedom to repartition, redistribute work, and run operators at a different parallelism from the source. I would not call that infinite scaling. I would call it a materially more flexible runtime.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/stateful-stream-processing/&quot;&gt;Stateful stream processing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/overview/&quot;&gt;Flink concepts overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That difference becomes major once traffic spikes, repartition pressure, or uneven workloads start shaping your architecture.&lt;/p&gt;

&lt;h3 id=&quot;2-checkpointing-and-recovery-are-in-a-different-league&quot;&gt;2. Checkpointing And Recovery Are In A Different League&lt;/h3&gt;

&lt;p&gt;This is still one of the clearest differentiators for me.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s checkpointing model is part of the platform. Recovery is an explicit runtime capability, not just the consequence of rebuilding local state from changelogs. The barrier-based snapshotting model, savepoints, and state redistribution semantics are exactly the kind of thing that make &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; feel like an engine rather than a library.&lt;/p&gt;

&lt;p&gt;In &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;, the picture is a little more nuanced than &lt;em&gt;“it always has to read the whole changelog again.”&lt;/em&gt; If the local state store still exists, the runtime can replay from the previously checkpointed offset and catch up from there. If local state is gone, it has to rebuild from the changelog from the beginning of the retained data. That is meaningfully better than a naive full replay every time, and it is one of the reasons the &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;RocksDB&lt;/span&gt; path works as well as it does in practice.&lt;/p&gt;

&lt;p&gt;But the deeper point still holds: fault tolerance and task migration remain anchored in changelog restoration, and on large stateful applications that can become one of the dominant operational pain points. Retention choices matter. Restore time matters. Recovery becomes less predictable under failure. Operational patience starts turning into architecture.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://kafka.apache.org/41/streams/developer-guide/running-app/&quot;&gt;Running Streams applications and state restoration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
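&lt;p&gt;To put rough numbers on that asymmetry, here is a back-of-envelope model. Every figure in it is an assumed input, not a benchmark; the point is the shape of the difference, not the magnitudes.&lt;/p&gt;

```python
# Illustrative assumptions only: real restore times depend on state
# backends, storage, network, and record sizes.
CHANGELOG_RECORDS = 500_000_000    # records retained in the changelog
CHECKPOINTED_OFFSET = 499_000_000  # last offset reflected in local state
REPLAY_RATE = 250_000              # records per second during restore

def streams_restore_seconds(local_state_intact: bool) -> float:
    """Kafka Streams: replay the delta if local state survived,
    otherwise rebuild from the start of the retained changelog."""
    if local_state_intact:
        to_replay = CHANGELOG_RECORDS - CHECKPOINTED_OFFSET
    else:
        to_replay = CHANGELOG_RECORDS
    return to_replay / REPLAY_RATE

warm = streams_restore_seconds(True)   # ~4 seconds
cold = streams_restore_seconds(False)  # ~33 minutes
print(f"warm: {warm:.0f}s, cold: {cold / 60:.0f}min")
```

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s restore cost, by contrast, scales with snapshot size and can be redistributed across new parallelism, which is why it behaves differently under rescaling. Treat the model as directional, nothing more.&lt;/p&gt;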

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/when-flink-earns-its-complexity-over-kafka-streams/restore-and-recovery.png&quot; alt=&quot;A hand-drawn comparison of Kafka Streams changelog restoration and Flink checkpoint-based restore and recovery.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;At smaller scale this looks like an implementation detail. At larger scale it starts deciding how painful failure and recovery really feel in production.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;That is the point where &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; stops being a nice architectural preference and starts becoming a serious operational advantage.&lt;/p&gt;

&lt;h2 id=&quot;the-real-trade-off&quot;&gt;The Real Trade-Off&lt;/h2&gt;

&lt;p&gt;So, here is the trade-off, in two sentences:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; is a very good way to build &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-native streaming applications.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is a very good way to operate stateful dataflows as a platform concern.&lt;/p&gt;

&lt;p&gt;Those are not the same problem, even if the diagrams sometimes look similar.&lt;/p&gt;

&lt;p&gt;And this is why I do not buy generic advice like &lt;em&gt;“use &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; if you need scale”&lt;/em&gt; or &lt;em&gt;“use &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; if you want simplicity.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Both statements are misleading. They sound practical, but they hide the real failure modes, encourage cargo-cult architecture, and make comfort-driven rewrites sound more principled than they are.&lt;/p&gt;

&lt;p&gt;The better rule is this:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote blog-pullquote--compact&quot;&gt;
  &lt;p&gt;If your system is still primarily an application that processes &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; topics, &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; is often the right engineering choice.&lt;/p&gt;
  &lt;p&gt;If your system is becoming a stateful processing layer that needs explicit control over time, state, replay, recovery, and heterogeneous I/O, &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; starts to justify its existence very quickly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-harder-lesson&quot;&gt;The Harder Lesson&lt;/h2&gt;

&lt;p&gt;This is the part I most wanted to say personally.&lt;/p&gt;

&lt;p&gt;I am still a huge &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; proponent. That has not changed.&lt;/p&gt;

&lt;p&gt;What has changed is that I now trust myself less when my first reaction is &lt;em&gt;“we should rewrite this in the framework I prefer.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That reaction is often just comfort seeking.&lt;/p&gt;

&lt;p&gt;Sometimes you really should migrate. Sometimes the runtime boundary is wrong, recovery is too painful, scaling is too constrained, and &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is the more honest architecture.&lt;/p&gt;

&lt;p&gt;But sometimes the better engineering decision is to love the existing system properly: clarify the model, clean the state boundaries, improve the abstractions, respect the domain flow, and stop assuming that old means wrong.&lt;/p&gt;

&lt;p&gt;That was the lesson here for me.&lt;/p&gt;

&lt;p&gt;If I had followed my first instinct blindly, I would have replaced some systems for the wrong reason.&lt;/p&gt;

&lt;h2 id=&quot;what-i-would-actually-do&quot;&gt;What I Would Actually Do&lt;/h2&gt;

&lt;p&gt;If I were starting with a &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-centric JVM team, modest operational requirements, and clean &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-in/&lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-out topologies, I would still be very happy with &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;I would move toward &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; once one or more of these became persistently true:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;stateful jobs became expensive to recover or rescale&lt;/li&gt;
  &lt;li&gt;I needed a broader processing platform rather than a library&lt;/li&gt;
  &lt;li&gt;event-time and replay behaviour started driving design choices&lt;/li&gt;
  &lt;li&gt;the system stopped being comfortably &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-shaped&lt;/li&gt;
  &lt;li&gt;operability and runtime visibility became a daily concern rather than an occasional debugging aid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the moment &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; stops being overkill and starts being the more honest architecture.&lt;/p&gt;

&lt;p&gt;And that brings me back to where I started.&lt;/p&gt;

&lt;p&gt;I still love &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. I still think its model is easier to reason about once runtime concerns become serious. I still think it is the stronger platform when state, recovery, and rescaling dominate the design.&lt;/p&gt;

&lt;p&gt;Many rewrites begin as comfort and only later get dressed up as architecture.&lt;/p&gt;

&lt;p&gt;That is the part I understand better now, and it is probably the most useful thing this comparison taught me.&lt;/p&gt;
</description>
            <pubDate>2026-04-01T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2026/04/01/when-flink-earns-its-complexity-over-kafka-streams.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2026/04/01/when-flink-earns-its-complexity-over-kafka-streams.html</guid>
        </item>
        
        
        
        <item>
            <title>PyFlink In 2026: Better Than Its Reputation, Still Not Frictionless</title>
            <description>&lt;p&gt;I do not think teams reach for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; because &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; feels nicer to type.&lt;/p&gt;

&lt;p&gt;They reach for it when they have already paid the cost of splitting one &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; system across two ecosystems.&lt;/p&gt;

&lt;p&gt;I have seen that pain in the most annoying way possible: training and experimentation lived in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;, but the prediction path had to live in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;. On paper that sounds manageable. In practice it meant that subtle differences in floating-point behaviour, parsing choices, and even heading-angle calculations were enough to create inconsistent predictions. We lost months chasing what looked like model problems but turned out to be feature mismatches.&lt;/p&gt;

&lt;p&gt;That is the part many architecture discussions understate. Once training is in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; and prediction is in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;, the real problem is no longer just inference. It becomes feature parity, interface parity, and the feedback loop between two runtimes that each have their own libraries, their own defaults, and their own ways of being &lt;em&gt;almost&lt;/em&gt; the same.&lt;/p&gt;
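&lt;p&gt;A concrete flavour of that failure mode, reconstructed for illustration rather than copied from the real system: two heading-angle implementations that are both reasonable and quietly incompatible.&lt;/p&gt;

```python
import math

# Both functions are hypothetical reconstructions, not the actual code.
def heading_training(dx: float, dy: float) -> float:
    """Python training side: radians in (-pi, pi]."""
    return math.atan2(dy, dx)

def heading_serving(dx: float, dy: float) -> float:
    """Java serving side, as ported: degrees normalised to [0, 360)."""
    return math.degrees(math.atan2(dy, dx)) % 360.0

# Identical raw inputs, very different feature values:
print(heading_training(-1.0, -1.0))  # -2.356... (radians)
print(heading_serving(-1.0, -1.0))   # 225.0 (degrees)
```

&lt;p&gt;Neither side is wrong in isolation. The model trained on one convention simply never sees the other, and the drift only shows up as predictions that are hard to trust.&lt;/p&gt;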

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/pyflink-pros-cons-in-2026/training-vs-prediction-drift.png&quot; alt=&quot;A hand-drawn illustration of Python training and Java prediction pipelines drifting apart in subtle but painful ways.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;This is the real tax of cross-language serving paths: not dramatic failure, but endless small mismatches that make the system harder to trust.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;You can try to escape that with &lt;span class=&quot;blog-highlight blog-highlight--onnx&quot;&gt;ONNX&lt;/span&gt;. You can rebuild parts of the feature logic in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;. You can expose the model behind a service boundary and call it remotely. All of these are reasonable patterns. None of them are free.&lt;/p&gt;

&lt;p&gt;Four years ago, &lt;span class=&quot;blog-highlight blog-highlight--onnx&quot;&gt;ONNX&lt;/span&gt; was not mature enough for the kinds of models and custom ops we cared about. The easy story broke precisely where real systems stop being toy examples. The fallback was the pattern most teams know well: deploy the model as a service and call it over REST. That works, but now your prediction pipeline owns an extra network hop, another SLA, another scaling surface, and one more place where raw features must remain perfectly aligned.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/pyflink-pros-cons-in-2026/model-service-tradeoffs.png&quot; alt=&quot;A hand-drawn illustration of a model service boundary with a load balancer, showing clean scaling but also latency and operational trade-offs.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;Model-as-a-service is often the sensible compromise. It is also where clean separation starts charging rent in latency, SLAs, and feature-parity work.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This is why I think the case for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; should be stated more bluntly than it usually is:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;If the real source of friction in your system is that your training, feature logic, and model-adjacent code live naturally in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;, then &lt;em&gt;&quot;just use &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;&quot;&lt;/em&gt; is not a neutral suggestion.&lt;/p&gt;
  &lt;p&gt;It is an architectural trade, and often an expensive one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the real driver for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; adoption.&lt;/p&gt;

&lt;p&gt;I went back to an older &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; review recently because I did not want to turn one painful period into a permanent opinion. Some of those frustrations had aged well. Some had not. And &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is exactly the kind of technology people form a durable opinion about after one painful quarter and then never revisit.&lt;/p&gt;

&lt;p&gt;That would have been lazy here, because the story has moved. &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is in a better place now than many engineers assume. The official docs cover installation, packaging &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; environments, debugging, a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API, and connector examples. That is already a more serious platform story than the older dismissive take that it is simply immature.&lt;/p&gt;

&lt;p&gt;But the core trade-off has not disappeared.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is now real enough to take seriously, but it still does not let you forget that &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is fundamentally a JVM-first distributed runtime. That is the part people need to hold in their head at the same time as the improvements.&lt;/p&gt;

&lt;h2 id=&quot;what-has-improved-since-the-older-evaluation&quot;&gt;What Has Improved Since The Older Evaluation&lt;/h2&gt;

&lt;p&gt;The first thing worth saying is that some of the older criticisms are now too blunt.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is no longer just a thin curiosity around the Table API. The current docs cover installation, a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API, debugging, dependency management, packaging &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; environments for cluster execution, and connector examples:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/installation/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; installation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/datastream/intro_to_datastream_api/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/faq/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; FAQ&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/debugging/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; debugging&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/api/python/examples/datastream/connectors.html&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; connector examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is already a materially better story than the one many engineers still carry around in their heads.&lt;/p&gt;

&lt;p&gt;A few concrete improvements stand out:&lt;/p&gt;

&lt;h3 id=&quot;1-the-python-story-is-better-documented&quot;&gt;1. The &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; Story Is Better Documented&lt;/h3&gt;

&lt;p&gt;The installation docs now state clear &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; version requirements. At the time of writing, &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; requires &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; 3.9, 3.10, 3.11, or 3.12:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/installation/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; installation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds minor, but it is not. One of the easiest ways to waste time with cross-language frameworks is by discovering environment assumptions too late. The current docs at least acknowledge that this is a real part of the user experience.&lt;/p&gt;
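&lt;p&gt;This is also the kind of assumption worth asserting early rather than discovering late. A minimal guard along these lines costs nothing; the supported range below mirrors the installation docs at the time of writing, so treat it as an assumption and re-check the docs for your version:&lt;/p&gt;

```python
import sys

# 3.9 through 3.12, per the PyFlink installation docs at time of writing.
SUPPORTED_MINORS = range(9, 13)

def pyflink_supported(version_info=None) -> bool:
    """True when the interpreter falls in the documented support range."""
    vi = sys.version_info if version_info is None else version_info
    return vi[0] == 3 and vi[1] in SUPPORTED_MINORS

if not pyflink_supported():
    # Fail fast instead of hitting an obscure error mid-submission.
    print(f"Unsupported interpreter for PyFlink: {sys.version.split()[0]}")
```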

&lt;h3 id=&quot;2-the-datastream-story-is-no-longer-hand-wavy&quot;&gt;2. The DataStream Story Is No Longer Hand-Wavy&lt;/h3&gt;

&lt;p&gt;One of the old reasons people dismissed &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; was that serious low-level streaming work still felt like &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; territory.&lt;/p&gt;

&lt;p&gt;That is less true now. The &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API is documented, examples exist, and the API surface is real enough that you can reason about it as a deliberate part of the platform rather than a side alley:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/datastream/intro_to_datastream_api/&quot;&gt;Intro to the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would still be careful not to confuse &lt;em&gt;“documented”&lt;/em&gt; with &lt;em&gt;“equally frictionless as the JVM path,”&lt;/em&gt; but the old complaint that &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is barely there is no longer a fair description.&lt;/p&gt;

&lt;h3 id=&quot;3-debugging-and-packaging-are-better-acknowledged&quot;&gt;3. Debugging And Packaging Are Better Acknowledged&lt;/h3&gt;

&lt;p&gt;The older review spent a lot of energy on setup, environment pain, and debugging awkwardness.&lt;/p&gt;

&lt;p&gt;Those pains have not disappeared, but the current docs are more honest about them. They cover packaging &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; environments, adding JARs, client-side versus TaskManager-side logging, local debugging, remote debugging, and profiling:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/faq/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; FAQ&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/debugging/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; debugging&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because it tells you something important about the maturity of the ecosystem: it now documents the pain instead of pretending it is not there.&lt;/p&gt;

&lt;p&gt;That is progress, even if it is not magic.&lt;/p&gt;

&lt;h2 id=&quot;why-pyflink-is-genuinely-attractive&quot;&gt;Why &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; Is Genuinely Attractive&lt;/h2&gt;

&lt;p&gt;Despite the caveats, I do think &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; has a very real value proposition.&lt;/p&gt;

&lt;h3 id=&quot;1-it-keeps-the-streaming-layer-closer-to-the-actual-ml-ecosystem&quot;&gt;1. It Keeps The Streaming Layer Closer To The Actual &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; Ecosystem&lt;/h3&gt;

&lt;p&gt;This is the point I think most comparisons understate, and it is the one that matters most to me.&lt;/p&gt;

&lt;p&gt;The strongest argument for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is not merely &lt;em&gt;“our team prefers &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;.”&lt;/em&gt; The stronger argument is that the surrounding model ecosystem, experimentation culture, libraries, and iteration loops are still centered on &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/pyflink-pros-cons-in-2026/pyflink-same-ecosystem.png&quot; alt=&quot;A hand-drawn illustration showing PyFlink as a serious streaming platform that lets Python-native model and feature logic stay closer together.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;This is why &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; remains attractive: not because the runtime becomes light, but because the surrounding &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; ecosystem can stay closer to the streaming layer.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;That matters when the alternative is forcing teams into one of these patterns:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;re-implementing logic in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;&lt;/li&gt;
  &lt;li&gt;exporting models through formats like &lt;span class=&quot;blog-highlight blog-highlight--onnx&quot;&gt;ONNX&lt;/span&gt; and accepting the translation burden&lt;/li&gt;
  &lt;li&gt;splitting the system so aggressively that the serving boundary becomes the architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are invalid. But all of them are real costs, and in many teams they are the &lt;em&gt;actual&lt;/em&gt; costs driving interest in &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;If the same raw features are calculated in one language for training and another for live prediction, you do not just inherit maintenance overhead. You inherit doubt. When a prediction looks wrong, is the model wrong, is the data wrong, or did one side normalise, round, parse, or order something differently? That uncertainty is corrosive, and it slows every feedback loop around the system.&lt;/p&gt;

&lt;h3 id=&quot;2-it-meets-python-heavy-teams-where-they-already-work&quot;&gt;2. It Meets &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;-Heavy Teams Where They Already Work&lt;/h3&gt;

&lt;p&gt;If your data and &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; teams already live in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;, &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; reduces one major source of organisational friction.&lt;/p&gt;

&lt;p&gt;That does not mean everyone suddenly gets to ignore distributed systems. But it does mean:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;feature logic can stay closer to the surrounding &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; estate&lt;/li&gt;
  &lt;li&gt;model-adjacent transformations feel more natural&lt;/li&gt;
  &lt;li&gt;experimentation paths from notebook thinking to streaming execution become less culturally awkward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For some organisations, that is a very big deal.&lt;/p&gt;

&lt;p&gt;The wrong reaction here is to sneer and say &lt;em&gt;“just learn &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;.”&lt;/em&gt; Sometimes that is the right answer. Often it is just a lazy one.&lt;/p&gt;

&lt;h3 id=&quot;3-it-makes-flink-more-reachable-without-hiding-flink&quot;&gt;3. It Makes &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; More Reachable Without Hiding &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;Good language bindings should not pretend the platform underneath does not exist.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is useful when it gives &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; teams access to &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s real strengths: state, checkpoints, event-time semantics, long-running streaming jobs, and broader dataflow capabilities. If that is what you are buying, then the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; layer can be a practical bridge.&lt;/p&gt;

&lt;p&gt;That is especially true for teams whose work already mixes ETL, feature pipelines, and model-centric logic.&lt;/p&gt;

&lt;h3 id=&quot;4-there-is-a-real-connector-surface&quot;&gt;4. There Is A Real Connector Surface&lt;/h3&gt;

&lt;p&gt;This is another place where the older blanket criticism needs updating.&lt;/p&gt;

&lt;p&gt;The current &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; docs and examples do show &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;, Pulsar, and Elasticsearch connectors used from &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/api/python/examples/datastream/connectors.html&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; connector examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So it would be wrong to say that the connector story is absent.&lt;/p&gt;

&lt;p&gt;But it would also be wrong to say that it feels like a pure &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; ecosystem.&lt;/p&gt;

&lt;p&gt;That brings me to the real downside.&lt;/p&gt;

&lt;h2 id=&quot;why-pyflink-is-still-not-flink-but-easy&quot;&gt;Why &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; Is Still Not &lt;em&gt;“&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;, But Easy”&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;The strongest criticism from the old evaluation still holds:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; reduces language friction, but it does not remove runtime friction.&lt;/p&gt;

&lt;h3 id=&quot;1-you-still-have-to-think-in-two-worlds&quot;&gt;1. You Still Have To Think In Two Worlds&lt;/h3&gt;

&lt;p&gt;The installation and FAQ pages make this clear if you read them carefully.&lt;/p&gt;

&lt;p&gt;You have to think about:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; interpreter version&lt;/li&gt;
  &lt;li&gt;&lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; packaging and archives&lt;/li&gt;
  &lt;li&gt;where &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; executes&lt;/li&gt;
  &lt;li&gt;how dependencies are shipped&lt;/li&gt;
  &lt;li&gt;JAR dependencies for connectors or &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;-side integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That earlier review made this painfully concrete. Getting local execution into a sane state meant lining up:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the right &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; version&lt;/li&gt;
  &lt;li&gt;the right &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; version&lt;/li&gt;
  &lt;li&gt;the right connector JARs&lt;/li&gt;
  &lt;li&gt;the right &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That list is not just setup trivia. It is the operating model announcing itself early.&lt;/p&gt;

&lt;p&gt;That is the day-to-day ergonomics of the platform, not a footnote:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/installation/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; installation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/faq/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; FAQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why I would resist overselling &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; to a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; team as &lt;em&gt;“just write &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; and the rest disappears.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It does not disappear.&lt;/p&gt;

&lt;p&gt;It relocates.&lt;/p&gt;

&lt;h3 id=&quot;2-the-connector-story-still-leaks-jvm-reality&quot;&gt;2. The Connector Story Still Leaks JVM Reality&lt;/h3&gt;

&lt;p&gt;The connector examples are useful, but they also reveal the real shape of things: adding JARs, managing connector dependencies, and living with the fact that some integration points are still fundamentally JVM-shaped.&lt;/p&gt;

&lt;p&gt;Even the current &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; connector docs explicitly talk about bringing connector dependencies yourself for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; jobs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/connectors/datastream/kafka/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; connector docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a deal-breaker. It is just not the same experience as working inside a native &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; framework whose extension model is &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; all the way down.&lt;/p&gt;

&lt;p&gt;It also shows up in deployment. In that earlier review, the easiest workable path for local standalone deployment was not &lt;em&gt;“package a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; app and run it.”&lt;/em&gt; It was closer to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;start from a vanilla &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; image&lt;/li&gt;
  &lt;li&gt;add the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; dependencies&lt;/li&gt;
  &lt;li&gt;mount the repo or bundle the code carefully&lt;/li&gt;
  &lt;li&gt;run the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; entrypoint from inside the live container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a perfectly workable path. It is also a strong reminder that the deployment experience is still shaped by &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s runtime model, not by &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;’s usual ergonomics.&lt;/p&gt;

&lt;h3 id=&quot;3-debugging-still-tells-you-what-the-system-really-is&quot;&gt;3. Debugging Still Tells You What The System Really Is&lt;/h3&gt;

&lt;p&gt;The current debugging docs are better than before, but they are also revealing.&lt;/p&gt;

&lt;p&gt;They distinguish between client-side logging and TaskManager-side logging. They discuss local debug, remote debug, and profiling &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; UDFs. That is helpful, but it also tells you that when things go wrong, you are not debugging a simple &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; program. You are debugging &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; inside a distributed &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; runtime:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/debugging/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; debugging&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, that means some classes of issue still feel cross-boundary by nature:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;packaging bugs&lt;/li&gt;
  &lt;li&gt;dependency mismatches&lt;/li&gt;
  &lt;li&gt;behavioural differences between local and cluster execution&lt;/li&gt;
  &lt;li&gt;performance bottlenecks around &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; execution paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; being uniquely bad. It is just the cost of the abstraction being honest.&lt;/p&gt;

&lt;h3 id=&quot;4-native-python-models-are-not-an-automatic-architectural-win&quot;&gt;4. Native &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; Models Are Not An Automatic Architectural Win&lt;/h3&gt;

&lt;p&gt;This was one of the more useful parts of the earlier review, because it is exactly the kind of point people skip when they are trying to justify a new stack.&lt;/p&gt;

&lt;p&gt;Yes, being able to interact with model code directly inside a &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; job is a real plus. It can simplify some flows and avoid a network hop.&lt;/p&gt;

&lt;p&gt;But that is not the same as saying it is always the better architecture.&lt;/p&gt;

&lt;p&gt;Once the model is served behind a proper boundary, you often gain things that matter a lot in production:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;safer zero-downtime upgrades&lt;/li&gt;
  &lt;li&gt;cleaner readiness and health semantics&lt;/li&gt;
  &lt;li&gt;independent model scaling behind a load balancer&lt;/li&gt;
  &lt;li&gt;a clearer separation between streaming orchestration and serving concerns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, yes, native execution can save some overhead. But it can also collapse boundaries that were doing useful work for you.&lt;/p&gt;

&lt;p&gt;The reason I still take the native path seriously is not hand-wavy elegance. It is that model-as-a-service also comes with a bill:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;every prediction path now pays a network round trip&lt;/li&gt;
  &lt;li&gt;the serving tier becomes another system you need to scale for throughput and protect with its own SLA&lt;/li&gt;
  &lt;li&gt;raw feature generation has to stay perfectly aligned across the caller and the served model boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If demand is modest, teams can live with that for a long time. Once prediction volume rises, that architecture stops being an abstract diagram and starts showing up as latency, capacity planning, and operational drag.&lt;/p&gt;

&lt;h3 id=&quot;5-the-performance-question-never-fully-goes-away&quot;&gt;5. The Performance Question Never Fully Goes Away&lt;/h3&gt;

&lt;p&gt;I would be very careful here not to lean on a benchmark I have not run.&lt;/p&gt;

&lt;p&gt;But I am comfortable saying something narrower and more useful: if your workload is highly latency-sensitive, connector-heavy, or operationally unforgiving, the JVM path still deserves to be the default starting point.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; can absolutely be the right choice. I just would not choose it because I wanted to avoid understanding the &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; side of &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;That is not how this platform works.&lt;/p&gt;

&lt;h2 id=&quot;so-when-would-i-use-it&quot;&gt;So When Would I Use It?&lt;/h2&gt;

&lt;p&gt;I would take &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; seriously when these conditions hold:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the team is materially more fluent in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; than in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;&lt;/li&gt;
  &lt;li&gt;the reason for adopting &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is the runtime model, not fashion&lt;/li&gt;
  &lt;li&gt;the jobs are important, but not balanced on the sharpest latency edge&lt;/li&gt;
  &lt;li&gt;I am willing to own environment packaging and connector dependency management as part of the operating model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would lean back toward &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;connector maturity dominates the problem&lt;/li&gt;
  &lt;li&gt;the hot path is extremely performance-sensitive&lt;/li&gt;
  &lt;li&gt;the team already has strong JVM expertise&lt;/li&gt;
  &lt;li&gt;I expect deep platform integration and want the least surprising execution path&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;if-you-want-to-try-it&quot;&gt;If You Want To Try It&lt;/h2&gt;

&lt;p&gt;If this post pushed you toward experimenting rather than debating in the abstract, I put together a small starter page here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/references/pyflink-agent-starter/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; starter archetype and agent prompt&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is intentionally minimal. The goal is not to hand you a grand framework. The goal is to give you a sensible first project shape and an agent prompt that can get a small &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;-first streaming scaffold off the ground without immediate chaos.&lt;/p&gt;

&lt;h2 id=&quot;the-practical-takeaway&quot;&gt;The Practical Takeaway&lt;/h2&gt;

&lt;p&gt;What matters here is not whether &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is &lt;em&gt;“good”&lt;/em&gt; or &lt;em&gt;“bad.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is far too vague to help anyone.&lt;/p&gt;

&lt;p&gt;The better question is this:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote blog-pullquote--compact&quot;&gt;
  &lt;p&gt;Do I want &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; as the working language for a &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; system badly enough to own the extra operational boundary that comes with it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer is yes, &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is now mature enough to be a serious option.&lt;/p&gt;

&lt;p&gt;If the answer is no, then &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is still the cleaner way to get the full benefits of &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; without pretending the JVM underneath is someone else’s problem.&lt;/p&gt;

&lt;p&gt;That, at least, is the view I would hold today.&lt;/p&gt;
</description>
            <pubDate>2026-03-27T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2026/03/27/pyflink-pros-cons-in-2026.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2026/03/27/pyflink-pros-cons-in-2026.html</guid>
        </item>
        
        
        
        <item>
            <title>From Model Validation To Pipeline Validation</title>
            <description>&lt;p&gt;Originally published on Medium on July 15, 2024. Lightly edited for the ML-Affairs archive.&lt;/p&gt;

&lt;p&gt;Imagine making a decision today with the knowledge of tomorrow.&lt;/p&gt;

&lt;p&gt;Sounds like an unfair advantage, right?&lt;/p&gt;

&lt;p&gt;In machine learning, it is often a trap.&lt;/p&gt;

&lt;p&gt;As an &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; engineer at Vortexa, a lot of my work has lived in the space between abstract models and production tools that people can actually depend on. Over the years, my team and I have built and maintained data pipelines that feed downstream decisions in the energy domain. These systems do not just provide a snapshot of the market. They also provide signals that customers may use inside their own analysis, models, and decision workflows.&lt;/p&gt;

&lt;p&gt;That creates a very natural retrospective question:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;Had we incorporated Vortexa&apos;s predictions back in 2018, would the outcomes have been better?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question is simple to ask and surprisingly easy to answer badly.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2024/from-model-validation-to-pipeline-validation/retrospective-validation-question.png&quot; alt=&quot;A visual introducing retrospective validation and the question of whether historical predictions would have changed past decisions.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;The business question is retrospective. The validation problem is temporal.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;the-future-leakage-paradox&quot;&gt;The Future Leakage Paradox&lt;/h2&gt;

&lt;p&gt;The usual temptation is to “travel back in time” by applying today’s model to historical scenarios.&lt;/p&gt;

&lt;p&gt;That sounds reasonable until you notice the contradiction. If the model was trained using data that includes what happened after the period we are evaluating, then it is not really predicting the past. It is replaying the past with knowledge it should not have had.&lt;/p&gt;

&lt;p&gt;Put differently:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote blog-pullquote--compact&quot;&gt;
  &lt;p&gt;A model should not be asked to predict an outcome from a past it has already learned.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not a small modelling detail. It changes the meaning of the whole evaluation. The model is no longer being tested as a prediction system. It is being tested as a memory system.&lt;/p&gt;

&lt;p&gt;I started calling this the &lt;strong&gt;Future Leakage Paradox&lt;/strong&gt;, or FLiP: a situation where future information seeps into a past prediction and makes the retrospective evaluation look more realistic than it really is.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2024/from-model-validation-to-pipeline-validation/future-leakage-paradox.png&quot; alt=&quot;A visual explaining the Future Leakage Paradox, where future knowledge leaks into historical prediction.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;Future leakage is subtle because the evaluation still looks technical. The problem is that the timeline is wrong.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;why-this-matters-in-a-real-domain&quot;&gt;Why This Matters In A Real Domain&lt;/h2&gt;

&lt;p&gt;Take vessel destination prediction as an example.&lt;/p&gt;

&lt;p&gt;Suppose we want to evaluate how well a model would have predicted vessel destinations in 2018. The energy and shipping domains are volatile. Trade routes, demand patterns, sanctions, operational behaviour, and geopolitical constraints all change over time.&lt;/p&gt;

&lt;p&gt;If a model trained after those changes is used to predict 2018, the retrospective result becomes misleading.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2024/from-model-validation-to-pipeline-validation/vessel-destination-prediction.png&quot; alt=&quot;A visual describing vessel destination prediction as a temporally sensitive machine learning problem.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;In a domain like shipping, time is not just an index column. It is part of the system.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Consider COVID-19. The lockdowns in 2020 triggered a major drop in oil demand and changed shipping behaviour. If this information leaks into a model used to retrospectively evaluate 2018 predictions, the model can assign importance to patterns that were not available in the pre-pandemic world.&lt;/p&gt;

&lt;p&gt;The same applies to the war in Ukraine and the subsequent sanctions on Russia. Those events affected vessel movements and trade flows. A model trained after those changes may encode relationships that did not exist, or were not knowable, in 2018.&lt;/p&gt;

&lt;p&gt;That is the practical danger. Future leakage can make retrospective predictions look strong for the wrong reason.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2024/from-model-validation-to-pipeline-validation/future-events-skew-validation.png&quot; alt=&quot;A visual showing how later world events can distort retrospective model validation.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;The model may look informed. The issue is that it is informed by events the historical model could not have known.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;the-shift-validate-the-pipeline&quot;&gt;The Shift: Validate The Pipeline&lt;/h2&gt;

&lt;p&gt;This is where I think the conversation should move from &lt;strong&gt;model validation&lt;/strong&gt; to &lt;strong&gt;pipeline validation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Taken too literally, that may sound provocative. Of course model performance matters. But as an engineer, I do not only care about whether one model trained once looks good. I care about whether the training pipeline can repeatedly produce good models under the constraints of time, data freshness, and production reality.&lt;/p&gt;

&lt;p&gt;That distinction matters because retrospective prediction should not usually be done with one model.&lt;/p&gt;

&lt;p&gt;If we have shipping data from 2016 onward and we want to predict 2018, one sensible approach is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;train on 2016 and 2017&lt;/li&gt;
  &lt;li&gt;predict 2018&lt;/li&gt;
  &lt;li&gt;incorporate what actually happened in 2018&lt;/li&gt;
  &lt;li&gt;train a new model for 2019&lt;/li&gt;
  &lt;li&gt;repeat this process through later years&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are then two common strategies:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;an &lt;strong&gt;expanding window&lt;/strong&gt;, where the training data grows over time&lt;/li&gt;
  &lt;li&gt;a &lt;strong&gt;sliding window&lt;/strong&gt;, where the model is trained on a fixed recent period&lt;/li&gt;
&lt;/ul&gt;
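
&lt;p&gt;The rolling process above can be sketched in a few lines of plain Python. The years and window width are toy values, and the train and predict steps are placeholders for a real pipeline:&lt;/p&gt;

```python
# Sketch of the two rolling-backtest strategies described above.

def expanding_windows(years, min_history=2):
    """Each step trains on all years before the target year (growing set)."""
    for i in range(min_history, len(years)):
        yield years[:i], years[i]

def sliding_windows(years, width=2):
    """Each step trains on a fixed-size recent period (bounded set)."""
    for i in range(width, len(years)):
        yield years[i - width:i], years[i]

years = [2016, 2017, 2018, 2019, 2020]

for train, target in expanding_windows(years):
    print("expanding:", train, "->", target)
for train, target in sliding_windows(years):
    print("sliding:  ", train, "->", target)
```

&lt;p&gt;Each yielded pair is one backtest step: train on the left side, predict the right side, then move forward and retrain.&lt;/p&gt;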

&lt;p&gt;In both cases, the evaluation target has changed. We are no longer asking, “Is this one model good?” We are asking, “Can this pipeline keep producing reliable models as time moves forward?”&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2024/from-model-validation-to-pipeline-validation/rolling-backtest-windows.png&quot; alt=&quot;A visual showing rolling backtest windows for historical model training and prediction.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;Rolling windows force the validation process to respect the timeline instead of flattening history into one training set.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Model drift and new data will always push teams toward retraining. That means the training pipeline deserves the same level of care we already give to production ETL pipelines.&lt;/p&gt;

&lt;p&gt;This is not just a nuance. It changes the engineering standard.&lt;/p&gt;

&lt;p&gt;The objective is not to produce one impeccable model in isolation. The objective is to prove that the &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; pipeline can generate a sequence of useful, traceable, reproducible models.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2024/from-model-validation-to-pipeline-validation/pipeline-validation-over-model-validation.png&quot; alt=&quot;A visual contrasting single model validation with validating the whole machine learning pipeline.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;A model is an output. The pipeline is the production capability.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;what-pipeline-validation-needs-to-prove&quot;&gt;What Pipeline Validation Needs To Prove&lt;/h2&gt;

&lt;p&gt;Once multiple models become the norm, several engineering properties become central.&lt;/p&gt;

&lt;h3 id=&quot;idempotence-and-determinism&quot;&gt;Idempotence And Determinism&lt;/h3&gt;

&lt;p&gt;Given a specific data snapshot and configuration, the pipeline should produce the same model, or at least an equivalent one, every time.&lt;/p&gt;

&lt;p&gt;This matters because data scientists and engineers need to separate the impact of a code change from the noise of an unstable training process. If the same input can produce meaningfully different outputs without explanation, debugging becomes guesswork.&lt;/p&gt;

&lt;h3 id=&quot;consistency&quot;&gt;Consistency&lt;/h3&gt;

&lt;p&gt;The models produced across different windows should be held to a consistent standard.&lt;/p&gt;

&lt;p&gt;One strong year is not enough. If the pipeline performs well only when the data is favourable, then the system is fragile. Pipeline validation should expose that fragility instead of hiding it inside aggregate metrics.&lt;/p&gt;

&lt;h3 id=&quot;temporal-stability&quot;&gt;Temporal Stability&lt;/h3&gt;

&lt;p&gt;Performance over time matters.&lt;/p&gt;

&lt;p&gt;If recent windows behave very differently from older windows, that may reveal changes in the domain, gaps in the feature set, data quality issues, or a pipeline that no longer captures the right signal.&lt;/p&gt;

&lt;p&gt;Temporal instability is not always bad. Sometimes the world really has changed. But the pipeline should make that visible.&lt;/p&gt;

&lt;h2 id=&quot;the-quest-for-temporal-stability&quot;&gt;The Quest For Temporal Stability&lt;/h2&gt;

&lt;p&gt;Temporal stability is influenced by both the domain and the computational setup.&lt;/p&gt;

&lt;h3 id=&quot;nature-of-data-changes&quot;&gt;Nature Of Data Changes&lt;/h3&gt;

&lt;p&gt;In the energy domain, the structure of the data can evolve. Geopolitical events, operational shifts, and changes in trade flows can all affect the patterns a model needs to learn.&lt;/p&gt;

&lt;p&gt;If the world is changing quickly, a sliding window may be more appropriate because it gives more weight to recent data. If there are longer-term cyclic patterns, an expanding window may provide a clearer view.&lt;/p&gt;

&lt;h3 id=&quot;business-objectives&quot;&gt;Business Objectives&lt;/h3&gt;

&lt;p&gt;If the goal is to understand long-term patterns, an expanding window may be the better fit. If the goal is to respond quickly to market changes, a sliding window may be more useful.&lt;/p&gt;

&lt;p&gt;This is not only a data science choice. It is a product and business choice as well.&lt;/p&gt;

&lt;h3 id=&quot;computational-costs&quot;&gt;Computational Costs&lt;/h3&gt;

&lt;p&gt;As the available data grows, training on all historical data becomes more expensive.&lt;/p&gt;

&lt;p&gt;If resources are constrained, a sliding window may be more practical because the dataset size stays bounded. That trade-off is not purely technical either. It affects how often the pipeline can run and how quickly the team can iterate.&lt;/p&gt;

&lt;h3 id=&quot;the-models-ability-to-forget&quot;&gt;The Model’s Ability To Forget&lt;/h3&gt;

&lt;p&gt;Some model classes can retain old patterns even when newer data suggests the world has moved on.&lt;/p&gt;

&lt;p&gt;In those cases, a sliding window can help force the model to shed outdated patterns. An expanding window, by contrast, may overemphasise history that is no longer representative.&lt;/p&gt;

&lt;h2 id=&quot;sliding-vs-expanding-windows&quot;&gt;Sliding Vs Expanding Windows&lt;/h2&gt;

&lt;p&gt;There is no universal answer. The right choice depends on the problem, the data-generating process, and the cost of being wrong.&lt;/p&gt;

&lt;h3 id=&quot;1-sliding-window&quot;&gt;1. Sliding Window&lt;/h3&gt;

&lt;p&gt;A sliding window trains on a fixed-size recent period. For example, train on 2017-2018 to predict 2019, then slide forward and train on 2018-2019 to predict 2020.&lt;/p&gt;

&lt;p&gt;The main advantage is &lt;strong&gt;temporal relevance&lt;/strong&gt;. The model is always trained on recent data, which is useful in fast-changing environments.&lt;/p&gt;

&lt;p&gt;The drawbacks are also real:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;it can miss longer-term patterns&lt;/li&gt;
  &lt;li&gt;it can produce more variable results across windows&lt;/li&gt;
  &lt;li&gt;it may discard useful historical context too aggressively&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-expanding-window&quot;&gt;2. Expanding Window&lt;/h3&gt;

&lt;p&gt;An expanding window grows over time. For example, train on 2017-2018 to predict 2019, then train on 2017-2019 to predict 2020, and so on.&lt;/p&gt;

&lt;p&gt;The main advantage is &lt;strong&gt;historical context&lt;/strong&gt;. The model sees more of the past and may capture longer-term patterns.&lt;/p&gt;

&lt;p&gt;The drawbacks are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;computational cost grows over time&lt;/li&gt;
  &lt;li&gt;old data may become less relevant&lt;/li&gt;
  &lt;li&gt;the model may become slower to adapt to structural change&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-hybrid-approaches&quot;&gt;3. Hybrid Approaches&lt;/h3&gt;

&lt;p&gt;In some systems, a hybrid approach is more appropriate.&lt;/p&gt;

&lt;p&gt;For example, an expanding window can be used up to a certain point, after which a sliding window keeps the training set bounded. Another option is a weighted expanding window, where recent data carries more weight but older data is not fully discarded.&lt;/p&gt;
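
&lt;p&gt;One hedged way to express the weighted variant: keep all history, but down-weight samples by age with an exponential decay. The decay rate here is an illustrative assumption; a real pipeline would tune it against backtest results.&lt;/p&gt;

```python
def recency_weights(years, target_year, decay=0.5):
    """Weight each training year by decay ** (age in years)."""
    return {y: decay ** (target_year - y) for y in years}

w = recency_weights([2016, 2017, 2018, 2019], target_year=2020)
print(w)  # 2019 carries weight 0.5, 2016 only 0.0625
```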

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2024/from-model-validation-to-pipeline-validation/sliding-vs-expanding-window-tradeoffs.png&quot; alt=&quot;A table comparing sliding windows, expanding windows, and hybrid strategies for retrospective validation.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;The windowing strategy is part of the system design. It encodes assumptions about how much the past should matter.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;measuring-pipeline-effectiveness&quot;&gt;Measuring Pipeline Effectiveness&lt;/h2&gt;

&lt;p&gt;Once the pipeline is the target, the metrics also need to widen.&lt;/p&gt;

&lt;h3 id=&quot;aggregate-metrics&quot;&gt;Aggregate Metrics&lt;/h3&gt;

&lt;p&gt;Evaluate models across multiple periods and then look at aggregate metrics such as accuracy, precision, recall, F1 score, median performance, and variance.&lt;/p&gt;

&lt;p&gt;The variance matters. A high median with unstable windows may still be operationally risky. A lower but more stable model may sometimes be more useful, depending on the product.&lt;/p&gt;

&lt;h3 id=&quot;adaptability&quot;&gt;Adaptability&lt;/h3&gt;

&lt;p&gt;Data sources change. Feature sets evolve. Domain conditions shift.&lt;/p&gt;

&lt;p&gt;A strong &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; pipeline should adapt to these changes without silently degrading. That means versioning, traceability, and clear ownership of feature logic are not optional.&lt;/p&gt;

&lt;h3 id=&quot;data-leakage-detection&quot;&gt;Data Leakage Detection&lt;/h3&gt;

&lt;p&gt;Data leakage is a silent killer in retrospective analysis.&lt;/p&gt;

&lt;p&gt;Performance that looks too good to be true often is. Suspicious correlations, unrealistic jumps in performance, or features that depend on future outcomes should trigger investigation.&lt;/p&gt;

&lt;p&gt;Some practical safeguards:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Feature construction:&lt;/strong&gt; features must not be calculated using future data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;External data alignment:&lt;/strong&gt; external datasets must obey the same temporal restrictions as the primary data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Shuffling care:&lt;/strong&gt; random shuffling can destroy the meaning of time-series evaluation.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Time-aware cross-validation:&lt;/strong&gt; conventional cross-validation is usually the wrong tool for sequential data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Feature engineering per window:&lt;/strong&gt; cleaning, normalisation, standardisation, and feature engineering should be re-executed for each data window.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is easy to underestimate. If normalisation statistics are computed across the full dataset and then used inside older windows, future information has already leaked into the past.&lt;/p&gt;
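
&lt;p&gt;That per-window rule can be made concrete with a small sketch. The numbers are toy values; the point is only where the statistics come from:&lt;/p&gt;

```python
# Normalisation statistics must be fitted on the training window only,
# then applied to the evaluation window. Computing them over the full
# history would leak the future into the past.

def fit_stats(train):
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def normalise(values, mean, std):
    return [(x - mean) / std for x in values]

train_window = [10.0, 12.0, 14.0]   # e.g. 2016-2017 feature values
eval_window = [20.0, 22.0]          # e.g. 2018 feature values

mean, std = fit_stats(train_window)  # fitted on the past only
scaled_eval = normalise(eval_window, mean, std)
```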

&lt;h2 id=&quot;periodic-validation-applies-to-live-models-too&quot;&gt;Periodic Validation Applies To Live Models Too&lt;/h2&gt;

&lt;p&gt;The same principles apply to live models.&lt;/p&gt;

&lt;p&gt;Retrospective validation makes the timeline problem obvious, but live models face the same pressure. Data changes, external conditions move, and the model’s assumptions age.&lt;/p&gt;

&lt;p&gt;For neural networks, validation is often discussed around epochs. But the broader need for regular validation is not specific to neural networks. Any model that operates in a changing domain needs periodic checks that respect time.&lt;/p&gt;

&lt;p&gt;Time-series cross-validation is useful because it tests performance across chronological splits. It helps expose overfitting, leakage, and temporal brittleness.&lt;/p&gt;

&lt;p&gt;The goal is not only to keep a model fresh. The goal is to keep the validation story honest.&lt;/p&gt;

&lt;h2 id=&quot;efficiency-and-traceability&quot;&gt;Efficiency And Traceability&lt;/h2&gt;

&lt;p&gt;Efficiency metrics are also part of the picture.&lt;/p&gt;

&lt;p&gt;If training gets slower every time the data grows, the pipeline may become too expensive to run frequently enough. If traceability is weak, the team may not know which data, features, code, and hyperparameters produced a given model.&lt;/p&gt;

&lt;p&gt;That lineage matters.&lt;/p&gt;

&lt;p&gt;When multiple models are generated periodically, each one needs a clear record:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;data snapshot&lt;/li&gt;
  &lt;li&gt;feature definitions&lt;/li&gt;
  &lt;li&gt;training code version&lt;/li&gt;
  &lt;li&gt;hyperparameters&lt;/li&gt;
  &lt;li&gt;evaluation window&lt;/li&gt;
  &lt;li&gt;output artefact&lt;/li&gt;
&lt;/ul&gt;
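
&lt;p&gt;That record can be as lightweight as a frozen dataclass. The field names mirror the list above; the values are invented for illustration:&lt;/p&gt;

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainingRun:
    data_snapshot: str       # immutable snapshot id or path
    feature_version: str     # version of the feature definitions
    code_version: str        # training code commit
    hyperparameters: tuple   # (name, value) pairs, kept hashable
    eval_window: str         # e.g. "2018"
    artifact_uri: str        # where the output model lives

run = TrainingRun(
    data_snapshot="snap-2018-01-01",
    feature_version="features-v3",
    code_version="a1b2c3d",
    hyperparameters=(("max_depth", 6), ("eta", 0.1)),
    eval_window="2018",
    artifact_uri="s3://models/vessel-dest/2018/model.bin",
)
print(asdict(run)["eval_window"])  # prints "2018"
```

&lt;p&gt;Frozen instances are hashable and immutable, which makes them easy to log, compare, and attach to the model artefact itself.&lt;/p&gt;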

&lt;p&gt;This is not bureaucracy. It is how teams make iteration explainable.&lt;/p&gt;

&lt;p&gt;Without traceability, improvement becomes folklore. With traceability, each refinement builds on something the team can actually understand.&lt;/p&gt;

&lt;h2 id=&quot;last-words&quot;&gt;Last Words&lt;/h2&gt;

&lt;p&gt;Machine learning in the energy sector keeps evolving, as it does everywhere else. But the core lesson here is broader than one domain.&lt;/p&gt;

&lt;p&gt;If the system needs to make claims about historical predictions, the validation process must respect history.&lt;/p&gt;

&lt;p&gt;That means moving beyond a narrow question of whether one model performs well. The more useful question is whether the pipeline can repeatedly produce reliable, traceable, temporally honest models as the world changes around it.&lt;/p&gt;

&lt;p&gt;In practice, that is the shift from model validation to pipeline validation.&lt;/p&gt;

&lt;p&gt;And for production &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt;, that shift is not cosmetic. It is the difference between a model that looks good in retrospect and a system that could actually have made the prediction at the time.&lt;/p&gt;
</description>
            <pubDate>2024-07-15T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2024/07/15/from-model-validation-to-pipeline-validation.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2024/07/15/from-model-validation-to-pipeline-validation.html</guid>
        </item>
        
        
        
        <item>
            <title>Harmonizing Avro and Python: A Dance of Data Classes</title>
            <description>&lt;p&gt;Reposting from the &lt;a href=&quot;https://medium.com/vortechsa/harmonizing-avro-and-python-a-dance-of-data-classes-d1cc7bf6bb33&quot;&gt;Vortexa medium blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the realm of data engineering, managing data types and schemas efficiently is of paramount importance. The crux of the matter? When data schemas are poorly managed, a myriad of issues arise, ranging from data incompatibility to runtime errors. What I am aiming for in this article is to introduce &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt;, a binary serialization format born from the Apache Hadoop project, through which I hope to highlight the significance of Avro schemas in data engineering. Finally, I will provide you with a hands-on guide on converting Avro files into &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; data classes. By the end of this read, you’ll grasp the fundamentals of Avro schemas, understand the advantages of using them, and be equipped with a practical example of generating &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; data classes from these schemas.&lt;/p&gt;

&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2023/avro-schema-management/2023-11-07-break-screen.png&quot; alt=&quot;Break-Screen&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;the-issue-at-hand&quot;&gt;The Issue at Hand&lt;/h2&gt;
&lt;p&gt;Imagine the following scenario:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Your application’s new update starts crashing for a specific set of users.&lt;/li&gt;
  &lt;li&gt;Upon investigation, you discover the root cause: a mismatch between the expected data format and the actual data sent from the backend.&lt;/li&gt;
  &lt;li&gt;Such mismatches can occur due to several reasons — maybe a field was renamed, or its data type got changed without proper communication to all stakeholders.&lt;/li&gt;
  &lt;li&gt;These are real-world problems arising from the lack of efficient schema management.&lt;/li&gt;
  &lt;li&gt;So, how can &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt; and particularly &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt; schemas help deal with these predicaments?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;avro-what-now&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;… what now?&lt;/h2&gt;
&lt;p&gt;In the grand scheme of data engineering and big data, one might compare the efficient storage and transmission of data to the very lifeblood of the show. Now, if this show needed a backstage hero, it would be Apache Avro. This binary serialization format, conceived in the heart of the Apache Hadoop project, is swift, concise, and unparalleled in dealing with huge data loads. When the curtain rises for powerhouses like Data Lakes, Apache Kafka, and Apache Hadoop, it’s Avro that steals the limelight.&lt;/p&gt;

&lt;h3 id=&quot;the-evolution-of-data-serialization&quot;&gt;The Evolution of Data Serialization&lt;/h3&gt;
&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2023/avro-schema-management/2023-11-07-Package.png&quot; alt=&quot;Package&quot; /&gt;
&lt;/div&gt;
&lt;p&gt;Before diving into the tapestry of data’s history, let’s demystify a foundational concept here: serialization. At its core, serialization is the process of converting complex data structures or objects into a format that can be easily stored or transmitted and later reconstructed. Imagine packing for a trip; you organize and fold your clothes (data) into a suitcase (a serialized format) so that they fit neatly and can be effortlessly unpacked at your destination.&lt;/p&gt;
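&lt;p&gt;The suitcase analogy takes only a few lines to make concrete. The sketch below uses JSON purely as a stand-in serialization format:&lt;/p&gt;

```python
import json

# The in-memory structure: the "clothes" to pack.
person = {"name": "Alice", "age": 30, "address": "1 Main St"}

# Serialize: fold the structure into a flat, transportable form (the "suitcase").
packed = json.dumps(person)

# Deserialize: unpack at the destination, reconstructing the original structure.
unpacked = json.loads(packed)

assert unpacked == person
```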

&lt;p&gt;With that in mind, the story of data storage and transmission is a dynamic saga filled with innovation, challenges, and breakthroughs. Cast your mind back to the times of simple flat files–text files abiding by a specific structure. They were the humble beginning, like parchment scrolls in a digital era. But as data grew in complexity, our digital scrolls evolved into intricate relational databases, swift NoSQL solutions, and vast data lakes.&lt;/p&gt;

&lt;p&gt;Now, imagine various systems, microservices, or extract-transform-load (ETL) pipelines, trying to communicate with one another by attempting to read unfamiliar data formats. It’s like trying to read a book when you don’t know the language it’s written in. To solve this, data had to be serialized–essentially translating complex data structures into a universally understood format. The early translators in this world were XML and JSON. Effective? Yes. Efficient? Not quite. They often felt like scribes painstakingly inking each letter, especially when handling vast amounts of data. The world needed a faster scribe; one that was both concise and precise.&lt;/p&gt;

&lt;p&gt;Enter &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;. Inspired by the bustling highways of big data scenarios–from the lightning speed of &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; to the vastness of Hadoop–Avro was born to ensure that data packets glided smoothly without unexpected stops. It became the guardian of data integrity and compatibility.&lt;/p&gt;

&lt;h2 id=&quot;whats-in-a-pojo&quot;&gt;What’s in a POJO?&lt;/h2&gt;
&lt;p&gt;So, integrity is the keyword here, and in the context of this blog, we care about integrity breaches concerned with schema changes in a service that are not properly propagated to its consumers, rendering them unable to accommodate the new schema of the data they consume–like reading a book in a foreign language 😉.&lt;/p&gt;

&lt;h3 id=&quot;the-dawn-of-the-pojo-era&quot;&gt;The Dawn of the POJO Era&lt;/h3&gt;
&lt;p&gt;In the realm of programming, particularly within Java, a hero emerged named the Plain Old Java Object (POJO). This simple, unadorned object didn’t extend or implement any specific Java framework or class, allowing it to represent data without any preset behaviors or constraints. Imagine a Person POJO, detailing fields like name, age, and address without binding rules on how you should engage with these fields. Simple and elegant.&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Person&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// Default constructor&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Person&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// Constructor with parameters&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Person&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// Getters and setters for each field&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;setName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getAge&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;setAge&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;setAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Person{&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
               &lt;span class=&quot;s&quot;&gt;&quot;name=&apos;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;\&apos;&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
               &lt;span class=&quot;s&quot;&gt;&quot;, age=&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
               &lt;span class=&quot;s&quot;&gt;&quot;, address=&apos;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;\&apos;&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
               &lt;span class=&quot;sc&quot;&gt;&apos;}&apos;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;However, as data complexity increased and systems multiplied, ensuring that these straightforward representations, our POJOs, maintained their integrity when transmitted or stored across varying systems became a challenge. Manual serialization, translating each POJO for different systems, wasn’t just laborious — it was a minefield of potential errors.&lt;/p&gt;

&lt;p&gt;Enter the need for an efficient and consistent serialization mechanism. One that could not only describe these POJOs but also seamlessly encode and decode them, ensuring data looked and felt the same everywhere.&lt;/p&gt;

&lt;h2 id=&quot;apache-avro--the-magic-of-schemas&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt; &amp;amp; the Magic of Schemas&lt;/h2&gt;
&lt;p&gt;Amidst this backdrop, &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt; took centre stage. While the POJO painted the picture, &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt; became the artist’s brush, allowing the artwork to be replicated without losing its original essence. Integral to &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;’s magic are its schemas. These files, with their distinctive &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.avsc&lt;/code&gt; extension, serve as a blueprint, dictating an entity’s structure, its data types, and any nullable fields or default values (see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Person.avsc&lt;/code&gt; below as an example).&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;record&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Person&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;namespace&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;com.example&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;fields&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;age&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;int&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;address&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
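&lt;p&gt;Note that an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.avsc&lt;/code&gt; file is itself plain JSON, so the contract it describes can be loaded and inspected with nothing but the standard library:&lt;/p&gt;

```python
import json

# The Person.avsc contents shown above.
schema_text = """
{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "address", "type": "string"}
  ]
}
"""

schema = json.loads(schema_text)

# The schema doubles as a machine-readable contract: producers and consumers
# can both derive the expected field names and types from it.
field_types = {f["name"]: f["type"] for f in schema["fields"]}
print(field_types)  # {'name': 'string', 'age': 'int', 'address': 'string'}
```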
&lt;p&gt;Pairing the intuitive design of POJOs with the precision of &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt; schemas, developers had a formidable toolkit. Now, data could be managed, shuttled, and transformed without ever losing its core essence or structure. But what if these changes weren’t properly communicated amongst interacting systems?&lt;/p&gt;

&lt;h2 id=&quot;challenges-in-schema-communication&quot;&gt;Challenges in Schema Communication&lt;/h2&gt;
&lt;p&gt;Imagine two services: Service A (the Producer) that creates and sends data, and Service B (the Consumer) that receives and processes it. Service A updates its schema — perhaps it added a new field or modified an existing one. But if Service B is unaware of this change, it might end up expecting apples and receiving oranges.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;The Domino Effect&lt;/strong&gt;: Let’s say Service A, our producer, changes a field from being a number to a string. Service B, expecting a number, might crash or perform incorrect operations when it encounters a string. In a real-world scenario, this could mean misinterpretation of important metrics, corrupted databases, or application failures.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Versioning Nightmares&lt;/strong&gt;: If every schema change requires updating the application logic in both the producer and consumer, this can quickly spiral into a versioning nightmare. How does one ensure that Service B is always compatible with Service A’s data, especially when they are updated at different intervals?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Enter the Schema Registry&lt;/strong&gt;: A centralized Schema Registry can be the saviour in this scenario. Instead of letting every service decide how to send or interpret data, the Schema Registry sets the standard.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Registration &amp;amp; Validation&lt;/strong&gt;: When Service A wishes to update its schema, it first registers the new schema with the registry. The registry validates this schema, ensuring backward compatibility with its previous versions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Schema Sharing&lt;/strong&gt;: Service B, before processing any data, checks with the registry to get the most recent schema. This ensures it knows exactly how to interpret the data it receives.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Library Generation&lt;/strong&gt;: On successful registration, the producer can then trigger a script to create or update the corresponding POJO or Python data class. This automatically generated class can be used directly, ensuring that the code aligns with the latest schema.&lt;/li&gt;
&lt;/ul&gt;
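&lt;p&gt;To make the registration-and-validation flow tangible, here is a toy in-memory registry. This is only an illustration of the idea, not the API of any real registry product, and the backward-compatibility rule is deliberately simplified: a new field must either already exist with the same type, or carry a default the consumer can fall back on.&lt;/p&gt;

```python
class SchemaRegistry:
    """Toy in-memory registry: stores schema versions per subject and
    enforces a simplified backward-compatibility rule on registration."""

    def __init__(self):
        self._subjects = {}  # subject -> list of schema dicts, in version order

    def register(self, subject, schema):
        versions = self._subjects.setdefault(subject, [])
        if versions and not self._is_backward_compatible(versions[-1], schema):
            raise ValueError(f"Schema for '{subject}' breaks backward compatibility")
        versions.append(schema)
        return len(versions)  # the new version number

    def latest(self, subject):
        return self._subjects[subject][-1]

    @staticmethod
    def _is_backward_compatible(old, new):
        # Simplified rule: every field in the new schema must exist in the old
        # one with the same type, or provide a default value.
        old_types = {f["name"]: f["type"] for f in old["fields"]}
        for field in new["fields"]:
            known = old_types.get(field["name"])
            if known is None and "default" not in field:
                return False
            if known is not None and known != field["type"]:
                return False
        return True


registry = SchemaRegistry()
v1 = {"type": "record", "name": "Person",
      "fields": [{"name": "name", "type": "string"}]}
v2 = {"type": "record", "name": "Person",
      "fields": [{"name": "name", "type": "string"},
                 {"name": "age", "type": "int", "default": 0}]}
print(registry.register("person-value", v1))  # 1
print(registry.register("person-value", v2))  # 2 (new field has a default)
```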

&lt;h2 id=&quot;artifact-repository--versioning&quot;&gt;Artifact Repository &amp;amp; Versioning&lt;/h2&gt;
&lt;p&gt;The generated data classes need a home. An Artifact Repository acts as this home. Whenever there’s a change, the updated class is given a new version and stored in this repository. Service B can then reference the specific version of the class it needs, ensuring data compatibility.&lt;/p&gt;

&lt;p&gt;Producers, Consumers, and their Interaction: Once the schema changes are validated and registered, and the respective classes are updated, both the producer and consumer know exactly how to interact. They can reliably share data, knowing that both sides understand the data’s structure and meaning.&lt;/p&gt;

&lt;p&gt;In essence, a centralised schema management system, paired with a robust registry and an efficient artifact repository, ensures that such data incompatibility issues are all but eliminated.&lt;/p&gt;

&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2023/avro-schema-management/2023-11-07-Example-Architecture.png&quot; alt=&quot;Example-Architecture&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;generating-python-data-classes-from-avsc-files&quot;&gt;Generating &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; Data Classes from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*.avsc&lt;/code&gt; files&lt;/h2&gt;
&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;, by its design and origin, has a strong affinity for the &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; ecosystem. &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt;’s project comes with built-in tools and libraries tailored for &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;, which makes generating POJOs straightforward. But when working with &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;, things aren’t as easy.&lt;/p&gt;

&lt;p&gt;It is worth noting that data classes, &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;’s answer to &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;’s POJOs, only arrived with &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; 3.7, and generating them from schemas still relies on external libraries such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dataclasses_avroschema&lt;/code&gt;. While these libraries are effective, their unofficial status can raise concerns about long-term reliability, and clear, well-documented examples of their use are sometimes ambiguous or missing altogether. Furthermore, &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;’s dynamic type system, though flexible, makes it harder to keep data representations consistent when interfacing with &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;’s static schemas.&lt;/p&gt;
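&lt;p&gt;For comparison with the &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; POJO earlier, the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; 3.7+ data-class equivalent of the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Person&lt;/code&gt; record is only a few lines:&lt;/p&gt;

```python
from dataclasses import dataclass, asdict

@dataclass
class Person:
    """Data-class counterpart of the Person POJO: fields only, no boilerplate."""
    name: str
    age: int
    address: str

p = Person(name="Alice", age=30, address="1 Main St")
print(asdict(p))  # {'name': 'Alice', 'age': 30, 'address': '1 Main St'}
```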

&lt;p&gt;In this blog post, I hope to provide a clear example of data-class auto-generation, using an easy-to-understand script. So, let’s dive in.&lt;/p&gt;

&lt;p&gt;Suppose, as shown above, that we have the following &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Person.avsc&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;record&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Person&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;namespace&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;com.example&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;fields&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;age&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;int&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;address&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Before providing the script, let’s discuss the sample project structure, which can help clarify why, later on, I state that the generated files must be read-only.&lt;/p&gt;

&lt;h3 id=&quot;sample-project-structure&quot;&gt;Sample Project Structure&lt;/h3&gt;
&lt;p&gt;Your project structure might look like this:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;project/
│
├── resources/
│   └── schemas/
│       └── Person.avsc
├── src/
│   └── types/
│       └── Person.py
├── scripts/
│   └── generate_dataclasses.py
└── Makefile
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;resources/schemas/&lt;/code&gt;: This directory contains the Avro schema files (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.avsc&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/types/&lt;/code&gt;: This directory will contain the generated Python data classes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.py&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scripts/generate_dataclasses.py&lt;/code&gt;: This script generates the Python data classes from the Avro schemas.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Makefile&lt;/code&gt;: This file contains the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make&lt;/code&gt; command to run the script.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, you can use the following Python script to generate a Python data class from this Avro schema:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;json&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;os&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;subprocess&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;dataclasses_avroschema.model_generator.generator&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ModelGenerator&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Starting script...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_generator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ModelGenerator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Ensure the output directory exists
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;../src/types&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;makedirs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exist_ok&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Scan the directory for .avsc files
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;files&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;walk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;../resources/schemas&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endswith&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;.avsc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Generating DataClass for: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;schema_file&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;.avsc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;.py&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Load the schema
&lt;/span&gt;                &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schema_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Generate the python code for the schema
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model_generator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Unlock any previously generated file (it was made read-only below)
&lt;/span&gt;                &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exists&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chmod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mo&quot;&gt;0o644&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Open the output file
&lt;/span&gt;                &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;w&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                    &lt;span class=&quot;c1&quot;&gt;# Write a comment at the top of the file
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;# This is an autogenerated python class&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;c1&quot;&gt;# Write the imports to the output file
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;from dataclasses_avroschema import AvroModel&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;import dataclasses&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                    &lt;span class=&quot;c1&quot;&gt;# Remove the imports from the result because we have already written them to the output file
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;from dataclasses_avroschema import AvroModel&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;import dataclasses&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                    &lt;span class=&quot;c1&quot;&gt;# Write the generated python code to the output file
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Format the output file using isort and black
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;subprocess&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;isort&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;subprocess&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;black&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Make the file read-only
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chmod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mo&quot;&gt;0o444&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Generated &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; from &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schema_file&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;__main__&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This script will generate a Python file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Person.py&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;../src/types&lt;/code&gt; directory with the following content:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# This is an autogenerated python class
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;dataclasses_avroschema&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AvroModel&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;dataclasses&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataclasses&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataclass&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Person&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AvroModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;com.example&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;h3 id=&quot;why-read-only&quot;&gt;Why Read-Only?&lt;/h3&gt;
&lt;p&gt;The generated Python files are made read-only to prevent accidental modifications. Since these files are autogenerated, any changes should be made in the Avro schema files, and then the Python files should be regenerated.&lt;/p&gt;
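&lt;p&gt;To see what that protection buys you in practice, here is a minimal sketch of the same mechanic, using a temporary directory (the paths and the 0o644 unlock mode are my own illustration, not part of the post’s script):&lt;/p&gt;

```python
# Demonstrate the read-only guard that the generator applies.
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "person.py")

with open(path, "w") as f:
    f.write("# autogenerated\n")
os.chmod(path, 0o444)  # same mode the script uses: read-only for everyone

# A plain write will now typically raise PermissionError (unless running
# with elevated privileges), nudging you to edit the .avsc file instead.
try:
    with open(path, "a"):
        pass
    writable = True
except PermissionError:
    writable = False

os.chmod(path, 0o644)  # unlock before regenerating
with open(path, "w") as f:
    f.write("# regenerated\n")
```

&lt;p&gt;The corollary is that any regeneration step has to unlock (or delete) the previous files before writing the new ones.&lt;/p&gt;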

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The integration of Avro files with Python data classes streamlines the complexities of data handling. It’s a union that empowers the data engineering toolkit, delivering precise type-checking, user-friendly code suggestions, rigorous validation, and crystal-clear readability. With the solid foundation provided by the schema registry, the integrity of your data remains uncompromised, no matter how intricate the data operations become. And while the magic lies in the technology and techniques discussed, the real art is in the consistent, reliable data flow it facilitates. As you delve deeper into the vast world of data, know that tools like these are pivotal in weaving the seamless narrative of your data story.&lt;/p&gt;

&lt;p&gt;Stay tuned, as more insights await in follow-up discussions, where we’ll further dissect the intricacies of a comprehensive schema management ecosystem.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;
</description>
            <pubDate>2023-11-07T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2023/11/07/Avro-Schema-Management.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2023/11/07/Avro-Schema-Management.html</guid>
        </item>
        
        
        
        <item>
            <title>Agile In Action: Bridging Data Science and Engineering</title>
            <description>&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2023/agile-in-action/2023-10-31-Turner.png&quot; alt=&quot;Joseph Mallord William Turner | Dutch Boats in a Gale (&apos;The Bridgewater Sea Piece&apos;) | National Gallery, London&quot; /&gt;
  &lt;p class=&quot;image-credit&quot;&gt;Picture taken from &lt;a href=&quot;https://www.nationalgallery.org.uk/paintings/joseph-mallord-william-turner-dutch-boats-in-a-gale-the-bridgewater-sea-piece&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;National Gallery, London&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;A few weeks ago, Bill Raymond invited me onto his &lt;a href=&quot;https://agileinaction.com/agile-in-action-podcast/2023/10/31/bridging-ai-data-science-and-engineering-a-personal-journey.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Agile in Action podcast&lt;/a&gt; after reading an older post of mine on &lt;a href=&quot;/2020/08/11/agile-data-science.html&quot;&gt;doing data science the Agile way&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I said yes because this topic has followed me through most of my career.&lt;/p&gt;

&lt;p&gt;I started as a data scientist. Then I spent years watching perfectly respectable prototypes fail to become products. By the time I reached Vortexa, I was leading a team of data scientists and engineers and living right in the middle of the tension I had been talking about for years.&lt;/p&gt;

&lt;p&gt;That is the version of &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; I wanted to discuss in the episode. Not the clean whiteboard version. The one that appears when a model has to leave a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; notebook, survive production, and still make sense to the people who have to operate it.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The real gap in &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; teams is rarely enthusiasm. It is the distance between a model that works once and a system that can be trusted repeatedly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;why-this-topic-stayed-with-me&quot;&gt;Why This Topic Stayed With Me&lt;/h2&gt;

&lt;p&gt;Part of the reason this topic matters so much to me is that I learned it the frustrating way.&lt;/p&gt;

&lt;p&gt;At Data Reply, I worked on one prototype after another. We would explore a problem, build something promising, show strong results, and then hit the same wall: the client liked the idea, but the system never really made it into production. Sometimes the missing piece was infrastructure. Sometimes it was culture. Sometimes it was simply that nobody owned the hard part after the demo.&lt;/p&gt;

&lt;p&gt;That started to change for me at UBS.&lt;/p&gt;

&lt;p&gt;For the first time, I heard the sentence I had wanted to hear for years: &lt;em&gt;“Great. Now how do we put this into production?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I was paired with an experienced engineer, and that changed the direction of my career. I stopped seeing engineering as the final packaging step after the interesting work was done. I started seeing it as part of the thinking itself.&lt;/p&gt;

&lt;p&gt;That shift is still with me today.&lt;/p&gt;

&lt;h2 id=&quot;the-real-gap-between-data-science-and-engineering&quot;&gt;The Real Gap Between Data Science And Engineering&lt;/h2&gt;

&lt;p&gt;When people talk about cross-functional &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; teams, they often make the collaboration sound natural. In practice, it is not.&lt;/p&gt;

&lt;p&gt;Data scientists are usually optimising for learning:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;trying ideas quickly&lt;/li&gt;
  &lt;li&gt;testing hypotheses&lt;/li&gt;
  &lt;li&gt;moving fast through a messy search space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engineers are usually optimising for control:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;reproducibility&lt;/li&gt;
  &lt;li&gt;determinism&lt;/li&gt;
  &lt;li&gt;maintainability&lt;/li&gt;
  &lt;li&gt;safe change over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both instincts are valid.&lt;/p&gt;

&lt;p&gt;The problem is that they are protecting the system from different failure modes.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The issue is not that data scientists are messy and engineers are rigid. The issue is that both are right about different kinds of breakage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Take a simple pricing model. A data scientist can build a strong prototype in a notebook, engineer the features, train the model, and prove the concept. But once that model becomes part of a product, somebody has to make sure the production path transforms the raw input in exactly the same way. If the training pipeline and the prediction pipeline drift apart, the system lies even when the model itself is good.&lt;/p&gt;
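&lt;p&gt;A concrete way to close that gap is to keep a single feature-engineering function and have both paths call it. A minimal sketch (the function and field names are illustrative, not from a real pricing system):&lt;/p&gt;

```python
# One shared transform: the single source of truth for feature engineering.
def build_features(raw: dict) -> list:
    price = float(raw["base_price"])
    qty = float(raw["quantity"])
    return [price, qty, price * qty]  # e.g. an interaction term

def training_rows(raw_records):
    # The training pipeline reuses the exact same function...
    return [build_features(r) for r in raw_records]

def predict(model, raw: dict):
    # ...as the serving path, so the two cannot silently drift apart.
    return model.predict([build_features(raw)])
```

&lt;p&gt;The moment the two paths stop sharing that function, every “improvement” to one of them becomes invisible skew in the other.&lt;/p&gt;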

&lt;p&gt;That is why the gap matters so much.&lt;/p&gt;

&lt;p&gt;It is not about user interfaces or wrapping code nicely. It is about making sure the system that predicts tomorrow behaves like the system that was validated yesterday.&lt;/p&gt;

&lt;h2 id=&quot;what-agile-actually-helped-with&quot;&gt;What &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; Actually Helped With&lt;/h2&gt;

&lt;p&gt;When I say &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; helped here, I do not mean that Scrum ceremonies somehow solved the problem.&lt;/p&gt;

&lt;p&gt;What helped was having a way to make uncertainty legible.&lt;/p&gt;

&lt;p&gt;For me, that meant three things.&lt;/p&gt;

&lt;h3 id=&quot;1-making-experiments-explicit&quot;&gt;1. Making experiments explicit&lt;/h3&gt;

&lt;p&gt;In &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; work, &lt;em&gt;“we are exploring”&lt;/em&gt; is too vague.&lt;/p&gt;

&lt;p&gt;An experiment becomes useful when the team can answer:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;what assumption are we testing?&lt;/li&gt;
  &lt;li&gt;what would count as useful evidence?&lt;/li&gt;
  &lt;li&gt;what result would tell us to stop?&lt;/li&gt;
&lt;/ul&gt;
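&lt;p&gt;One lightweight way to enforce this is to write the three answers down as a structured record before the experiment starts. A sketch (the field names are my own, not a prescribed template):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class ExperimentCard:
    assumption: str        # what assumption are we testing?
    success_evidence: str  # what would count as useful evidence?
    stop_condition: str    # what result would tell us to stop?

card = ExperimentCard(
    assumption="A new price feature improves model accuracy",
    success_evidence="MAE drops by at least 5% on a held-out month",
    stop_condition="No improvement after two feature iterations",
)
```

&lt;p&gt;The card is not bureaucracy; it is the artefact that lets product and engineering see what the team is actually learning.&lt;/p&gt;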

&lt;p&gt;That sounds simple, but it changes the conversation completely. It stops research from turning into open-ended wandering and gives product and engineering a clearer way to understand what the team is actually learning.&lt;/p&gt;

&lt;h3 id=&quot;2-creating-shared-visibility&quot;&gt;2. Creating shared visibility&lt;/h3&gt;

&lt;p&gt;At Vortexa, one of the most useful habits we built was a regular data science catch-up where engineers and data scientists could present what they were doing, why they were doing it, and where the risks were.&lt;/p&gt;

&lt;p&gt;This was not code review. It was not a status ritual either.&lt;/p&gt;

&lt;p&gt;It was a way to keep everyone on the same mental map.&lt;/p&gt;

&lt;p&gt;That mattered because a lot of problems in &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; systems do not come from one catastrophic mistake. They come from small drifts in understanding. A feature is computed one way in training, another way in production. An assumption about data quality goes unchallenged. A result sounds promising, but nobody else can reproduce it.&lt;/p&gt;

&lt;p&gt;Communication is not a soft add-on here.&lt;/p&gt;

&lt;p&gt;It is part of the control surface of the system.&lt;/p&gt;

&lt;h3 id=&quot;3-putting-discipline-around-handoffs&quot;&gt;3. Putting discipline around handoffs&lt;/h3&gt;

&lt;p&gt;The teams I trust most are not the ones with the nicest process diagrams. They are the ones that make handoffs visible and expensive enough that people try to remove them.&lt;/p&gt;

&lt;p&gt;If the data scientist can disappear after training a model and the engineer is left to guess the rest, the system will eventually reflect that fracture.&lt;/p&gt;

&lt;p&gt;If the engineer is never exposed to how experimental the work really is, they will overestimate how stable the solution already is.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; helped when it forced us to confront those boundaries earlier.&lt;/p&gt;

&lt;h2 id=&quot;what-ml-teams-still-underestimate&quot;&gt;What &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; Teams Still Underestimate&lt;/h2&gt;

&lt;p&gt;One of the themes that came up in the podcast is that many teams still underestimate how much work starts after the model looks good.&lt;/p&gt;

&lt;p&gt;You do not just need versioned code. You need versioned data and a credible way to tie the two together.&lt;/p&gt;

&lt;p&gt;You do not just need a model in production. You need monitoring, drift detection, and a practical way to replace the model without breaking the product.&lt;/p&gt;

&lt;p&gt;You do not just need experimentation. You need a path from experimentation to something deterministic enough to support.&lt;/p&gt;

&lt;p&gt;This is why I often say that notebooks are wonderful research tools and terrible places to leave an idea if you want a system around it to survive.&lt;/p&gt;

&lt;h2 id=&quot;the-lesson-i-was-trying-to-communicate&quot;&gt;The Lesson I Was Trying To Communicate&lt;/h2&gt;

&lt;p&gt;When Bill asked what &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; meant to me in this context, the answer I wanted to give was not especially fashionable.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt;, Agile is useful when it helps the team learn quickly without losing control of the system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is really the heart of it.&lt;/p&gt;

&lt;p&gt;Not velocity in the abstract.&lt;/p&gt;

&lt;p&gt;Not ceremony for its own sake.&lt;/p&gt;

&lt;p&gt;Not pretending that uncertainty can be planned away.&lt;/p&gt;

&lt;p&gt;Just a disciplined way to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;test assumptions early&lt;/li&gt;
  &lt;li&gt;expose the right risks&lt;/li&gt;
  &lt;li&gt;keep engineers and data scientists in sync&lt;/li&gt;
  &lt;li&gt;and make sure the thing you learned can actually survive contact with production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was my view then, and I still think it was the right thing to say.&lt;/p&gt;

&lt;h2 id=&quot;the-podcast&quot;&gt;The Podcast&lt;/h2&gt;

&lt;p&gt;If you prefer the conversation version, the episode is below.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/LdDasrMOJLs?si=dk-YcjCqW6YpBPWZ&quot; title=&quot;Agile in Action podcast episode&quot; frameborder=&quot;0&quot; loading=&quot;lazy&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
</description>
            <pubDate>2023-10-31T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2023/10/31/Agile-In-Action.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2023/10/31/Agile-In-Action.html</guid>
        </item>
        
        
        
        <item>
            <title>Dynamic(i/o): Why you should start your ML-Ops journey with wrapping your I/O</title>
            <description>&lt;p&gt;If you call yourself an &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineer&lt;/span&gt; then you ‘ve been there–you ‘ve seen this before. To productionise your &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; pipeline; well, that’s surely a challenge.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center logo-plate post-logo-plate&quot;&gt;&lt;img src=&quot;/assets/images/posts/2022/dynamicio-at-odsc/2022-06-01-dynamicio.png&quot; alt=&quot;dynamic(i/o)&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I have worked for many years as a Data Science consultant, and I can confirm the statement that &lt;a href=&quot;https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/&quot;&gt;“…more than 87% of Data Science projects never make it to production”&lt;/a&gt;.
There is a reason why the first rule of doing &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;Machine Learning&lt;/span&gt; is to be really sure you need to do &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; at all! Many reasons play into this challenge:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;lack of the right leadership;&lt;/li&gt;
  &lt;li&gt;no or limited access to data in siloed organisations;&lt;/li&gt;
  &lt;li&gt;lack of the necessary tooling or infrastructure support, and even;&lt;/li&gt;
  &lt;li&gt;lack of a research-driven culture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there is one more beast to be tamed out there: the gap between Data Science and &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineering&lt;/span&gt;. This gap exists both between the two kinds of practitioners, data scientists and software engineers, and in the literal sense of getting from a prototype to a production-ready ML pipeline.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2022/dynamicio-at-odsc/xkcd-data-answers.png&quot; alt=&quot;xkcd - data answers&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Simply put, putting a model into production is one thing; but maintaining that model, properly monitoring it to identify possible drifts, and streamlining the process of re-training or updating it in a robust and
reproducible way, supported by a clean CI/CD process, is a daunting task! If anything, I’d dare say that ML-Engineering, as a domain, fully encapsulates SWE and adds many more
challenges (I highly recommend reading &lt;a href=&quot;https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf&quot;&gt;Hidden Technical Debt in Machine Learning Systems&lt;/a&gt;), for some of which we
are still trying to standardise how we work in terms of tooling and best practices.&lt;/p&gt;

&lt;p&gt;In many cases, organisations are forced to come up with their own ways of working to accommodate the unique challenges of their custom use-cases. Then again, it all comes down to the requirements of a project.
&lt;a href=&quot;https://netflixtechblog.com/scheduling-notebooks-348e6c14cfd6&quot;&gt;Netflix has streamlined the process of putting Python notebooks into production using Papermill&lt;/a&gt;.
Others go as far as standardising the whole &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineering&lt;/span&gt; process using tools like &lt;span class=&quot;blog-highlight blog-highlight--graph&quot;&gt;Airflow&lt;/span&gt; or &lt;span class=&quot;blog-highlight blog-highlight--graph&quot;&gt;Kubeflow&lt;/span&gt;, relying on AI Pipelines (on GCP) or &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;SageMaker&lt;/span&gt; (on AWS), etc.&lt;/p&gt;

&lt;h2 id=&quot;so-what-do-we-do&quot;&gt;So what do we do…?&lt;/h2&gt;
&lt;p&gt;At Vortexa, we are heavy users of Airflow and have recently embarked on a journey to include Kubeflow in our tech stack.
As an ML-Engineer, my job usually involves receiving a successful prototype of a model and implementing a complete end-to-end ML pipeline out of it; one that can be easily maintained
and reused. In many ways, this process resembles a traditional SWE project, only more complex, since ML projects come with more requirements and a strong dependency on data.
It easily follows that everything one cares to implement for a SWE project also needs to be implemented for an ML-Engineering (MLE) project; and more.&lt;/p&gt;

&lt;p&gt;But let’s start simple…&lt;/p&gt;

&lt;h2 id=&quot;here-is-my-notebook-i-am-done-your-turn-now&quot;&gt;Here is my notebook! I am done; your turn now!&lt;/h2&gt;
&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2022/dynamicio-at-odsc/xkcd-data-pipelines.png&quot; alt=&quot;xkcd - data pipelines&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;So you are handed a notebook, and you inspect it; you spend time with the Data Scientist to understand all the crucial aspects of the procedural logic, and you start splitting the process into distinct tasks. You usually end up with something like this:
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2022/dynamicio-at-odsc/data-pipeline.png&quot; alt=&quot;xkcd - data pipelines&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;You think about the structure of your codebase, about how everything will be deployed, and about how to decouple orchestration from the logic of your ML pipeline; then you start thinking about domain-driven design (DDD). You start thinking about abstractions and encapsulation, about testing and data validation. That’s when it hits you: testing. You can unit test most things and build a robust pipeline, but you also want fast feedback when you introduce changes and improvements to your pipeline (shifting left)! What if you wanted to run a local regression test? With all data being read from external resources (databases, an object storage service) you’ll have to mock all these calls (doable, but time-consuming) and replace the actual data with sample input. And, finally, what about schema and data validations? How do you guarantee, after data ingestion, that all your expectations on the input are respected?&lt;/p&gt;

&lt;p&gt;You have a look at the code again. It is filled with I/O operations. Sometimes it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;csv&lt;/code&gt;, sometimes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parquet&lt;/code&gt;, sometimes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;json&lt;/code&gt;; sometimes you read from a database and other times from an object storage service (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s3&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcs&lt;/code&gt;). Different libraries facilitate all of this: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcsfs&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s3fs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fsspec&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;boto3&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sql-alchemy&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tables&lt;/code&gt;; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pandas&lt;/code&gt;, of course, sits at the core of this process. As if that’s not enough, each file comes with its own peculiar set of requirements, supported through the use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kwargs&lt;/code&gt; in your Python code: the orientation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;json&lt;/code&gt; files, row-group sizes for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parquet&lt;/code&gt; files, coercions on certain timestamp columns; the list keeps going… And this won’t be the last time you need to do this!&lt;/p&gt;

&lt;p&gt;It’s just too many details, way too many details, for you to worry about. A clear violation of the dependency inversion principle:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Business logic (high-level code) should not be implemented in a way that “depends” on technical details (low-level code, e.g., I/O in our case); instead, both should depend on abstractions!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You need abstractions to give you the flexibility to introduce changes easily. More often than not, business needs will require high-level modules to be modified. Low-level code, on the other hand, is usually more cumbersome and difficult to change. The two should be independent; a database migration or a switch to a different object storage service should have no impact on your work to generate a valuable new feature for your model, and vice versa. Abstracting both behind distinct layers achieves this!&lt;/p&gt;

&lt;p&gt;As David Wheeler said:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;All problems in computer science can be solved by another level of indirection.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;what-is-dynamicio-then&quot;&gt;What is &lt;span class=&quot;blog-highlight blog-highlight--dynamicio&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dynamicio&lt;/code&gt;&lt;/span&gt; then?&lt;/h2&gt;
&lt;p&gt;Wouldn’t it be great if you could:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;have an abstraction that encapsulates all I/O logic;&lt;/li&gt;
  &lt;li&gt;be able to seamlessly handle reading or writing from and to different resource types or data types;&lt;/li&gt;
  &lt;li&gt;have an interface that is easy to understand and use with minimum configuration;&lt;/li&gt;
  &lt;li&gt;respect your expectations on schema types and data quality;&lt;/li&gt;
  &lt;li&gt;automatically generate metrics that would be used to leverage further insights, and more importantly;&lt;/li&gt;
  &lt;li&gt;be able to seamlessly switch between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dev&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;staging&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prod&lt;/code&gt; environments, performing dynamic I/O against different datasets and effectively supporting development, testing and qa use cases?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, &lt;span class=&quot;blog-highlight blog-highlight--dynamicio&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dynamic(i/o)&lt;/code&gt;&lt;/span&gt; is exactly that: a layer of indirection for pandas I/O operations.&lt;/p&gt;
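&lt;p&gt;To give a flavour of the idea, here is a toy sketch of a config-driven I/O layer. To be clear, this is &lt;em&gt;not&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dynamicio&lt;/code&gt;’s actual API; every name below is invented. Each dataset gets one resource definition with environment-specific locations, so switching between environments changes configuration, not pipeline code:&lt;/p&gt;

```python
import csv
import io

# Toy sketch of a config-driven I/O layer (NOT dynamicio's real API).
# One resource definition per dataset, with per-environment locations.
RESOURCES = {
    "vessels": {
        "local": {"type": "inline_csv", "data": "id,name\n1,Aurora\n"},
        "prod": {"type": "s3", "path": "s3://my-bucket/vessels.parquet"},
    }
}

class UnifiedIO:
    def __init__(self, resources: dict, env: str):
        self.resources = resources
        self.env = env

    def read(self, dataset: str):
        spec = self.resources[dataset][self.env]
        if spec["type"] == "inline_csv":
            return list(csv.DictReader(io.StringIO(spec["data"])))
        # An s3/database branch would live here, behind the same call.
        raise NotImplementedError(spec["type"])

# Pipeline code is identical in every environment; only `env` changes.
rows = UnifiedIO(RESOURCES, env="local").read("vessels")
print(rows)
```

Schema checks and metrics hooks slot naturally into the same `read` call, which is exactly why wrapping your I/O is such a good first step on an ML Ops journey.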

&lt;p&gt;If you want to find out more about it, then &lt;a href=&quot;https://odsc.com/europe/&quot;&gt;register to attend this year’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ODSC&lt;/code&gt;&lt;/a&gt; and &lt;a href=&quot;https://odsc.com/speakers/dynamicio-a-pandas-i-o-wrapper-why-you-should-start-your-ml-ops-journey-with-wrapping-your-i-o/&quot;&gt;attend the presentation on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dynamic(i/o)&lt;/code&gt; by myself and my colleague Tyler Ferguson&lt;/a&gt;. Come and learn how its implementation and adoption has helped us go beyond just achieving consistency across our ML repos, effectively dealing with glue code and keeping our code-bases DRY, to also acting as an interface between different teams.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;
</description>
            <pubDate>2022-05-31T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2022/05/31/dynamicio-at-ODSC.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2022/05/31/dynamicio-at-ODSC.html</guid>
        </item>
        
        
        
        <item>
            <title>Complete Guide to Python Envs (MacOS)</title>
            <description>&lt;p&gt;Configuring &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; on your machine for the first time is a definite headache for any software 
engineer that decides to delve into the world of &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;. Doing it properly confuses a lot of 
people and can prove to be very challenging.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2021/python-envs/python_environment_2x.png&quot; alt=&quot;Python Envs&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Many developers end up with numerous interpreters configured on their machines without knowing where they live.&lt;/p&gt;

&lt;h2 id=&quot;most-common-ways-of-setting-up-python&quot;&gt;Most common ways of setting up &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;Firstly, there is a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; version that ships with macOS, but it is usually v2.7, which is not
just out of date but also deprecated.&lt;/p&gt;

&lt;p&gt;So, commonly, most users will download the latest Python release and move it to their &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PATH&lt;/code&gt;
or use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;brew install python3&lt;/code&gt; (which does this for them).&lt;/p&gt;

&lt;p&gt;Both of these solutions can cause problems that will not be evident straight away. The main challenge is usually not knowing, at any given time, which “default Python” your system is using. Ideally, this is something you shouldn’t have to care about, but if you don’t set things up properly, you end up installing packages into the wrong environment or for the wrong active &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; interpreter, unintentionally created from the wrong &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; distribution and… well, you get the point 
(…this is pretty much summed up in the &lt;a href=&quot;https://xkcd.com/1987/&quot;&gt;xkcd image&lt;/a&gt; above).&lt;/p&gt;

&lt;p&gt;To find out more, read this excellent &lt;a href=&quot;https://opensource.com/article/19/5/python-3-default-mac&quot;&gt;December 2020 post&lt;/a&gt; by Matthew Broberg.&lt;/p&gt;
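&lt;p&gt;Whichever setup you end up with, one quick sanity check is to ask the interpreter itself where it lives and what version it is:&lt;/p&gt;

```python
import sys

# Which interpreter is actually running, and from where?
print(sys.executable)  # full path to the active interpreter
print(".".join(map(str, sys.version_info[:3])))  # e.g. 3.9.0
print(sys.prefix)  # where its packages are installed
```

If the paths surprise you, that is usually the first clue that your shell and your tooling disagree about the “default Python”.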

&lt;h2 id=&quot;how-to-avoid-all-these&quot;&gt;How to avoid all these?&lt;/h2&gt;
&lt;p&gt;The short answer is “use &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt;&lt;/span&gt;”. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt; will enable you not only to set up Python properly on your machine, but also to manage different versions and Python environments in a simple and straightforward way. As explained on the &lt;a href=&quot;https://github.com/pyenv/pyenv&quot;&gt;package’s GitHub page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“It’s simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For me, its main benefits are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;It does not depend on Python itself; since it is made of pure shell scripts, there is no Python bootstrap problem.&lt;/li&gt;
  &lt;li&gt;It only needs to be loaded into your shell, thanks to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt;’s shim approach, which adds a directory of lightweight shims to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PATH&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;It can manage virtual environments, though I recommend using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv-virtualenv&lt;/code&gt; to automate the process.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;lets-get-to-it&quot;&gt;Let’s get to it&lt;/h2&gt;
&lt;p&gt;Before you do anything, make sure you start with a clean sheet. To do so, uninstall or remove any Python distributions you already have; I strongly advise you to follow this &lt;a href=&quot;https://www.macupdate.com/app/mac/5880/python/uninstall&quot;&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, assuming you have &lt;a href=&quot;https://brew.sh&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;brew&lt;/code&gt;&lt;/a&gt; installed on your machine, do:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;brew update
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;pyenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We will now need &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv-virtualenv&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv-virtualenv&lt;/code&gt; is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt; plugin that provides features
to manage &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virtualenvs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda&lt;/code&gt; environments for &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNIX-like&lt;/code&gt; systems.&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;pyenv-virtualenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;setting-up-your-global-interpreter&quot;&gt;Setting up your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;global&lt;/code&gt; interpreter&lt;/h2&gt;
&lt;p&gt;So, the first thing you want to do is set up your global interpreter. This is the Python environment your system will use by default, unless you dictate otherwise.&lt;/p&gt;

&lt;p&gt;If you run:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--list&lt;/span&gt;
Available versions:
  2.1.3
...
  3.10-dev
  activepython-2.7.14
...
  activepython-3.6.0
  anaconda-1.4.0
...
  anaconda3-2020.07
  graalpython-20.1.0
  graalpython-20.2.0
  ironpython-dev
...
  ironpython-2.7.7
  jython-dev
...
  jython-2.7.2
  micropython-dev
...
  miniconda-latest
...
  miniconda3-4.7.12
  pypy-c-jit-latest
...
  pypy3.6-7.3.1
  pyston-0.5.1
...
  pyston-0.6.1
  stackless-dev
...
  stackless-3.7.5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;You will see the full list of &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; distributions available for installation.&lt;/p&gt;

&lt;p&gt;Choose the one you want and install it; e.g., for 3.9.0:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;3.9.0
python-build: use openssl@1.1 from homebrew
python-build: use readline from homebrew
Downloading Python-3.9.0.tar.xz...
-&amp;gt; https://www.python.org/ftp/python/3.9.0/Python-3.9.0.tar.xz
Installing Python-3.9.0...
python-build: use readline from homebrew
python-build: use zlib from xcode sdk
Installed Python-3.9.0 to /Users/&amp;lt;username&amp;gt;/.pyenv/versions/3.9.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Once installation is complete, you can set this version as your global:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv global 3.9.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;At this point, you can confirm the active version (and which file set it) with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv version&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv version
3.9.0 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;by /Users/&amp;lt;username&amp;gt;/.pyenv/version&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;creating-and-managing-virtual-environments-automatically&quot;&gt;Creating and managing virtual environments automatically&lt;/h2&gt;
&lt;p&gt;This is a standard practice when working with &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;. The idea is to keep different environments isolated.
Each &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; environment can be associated with multiple projects, but it is generally better to go for a one-to-one mapping.&lt;/p&gt;

&lt;p&gt;Why, you ask? Well, for starters, this helps you keep your system clean by not installing system-wide libraries that you will only need in a single project. It also allows you to use one version of a library for one project and a different version for another. Finally, it helps make your project reproducible and ensures it is configured identically across the local environments of collaborating developers.&lt;/p&gt;

&lt;p&gt;Let’s go through an example.&lt;/p&gt;

&lt;p&gt;Suppose you have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;github&lt;/code&gt; root directory where you clone and maintain all your projects and it looks like this:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
└── project_b

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What you want to do is set up a different Python virtual environment per project. What’s more, you would like that virtual environment to be activated automatically simply by accessing (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cd&lt;/code&gt;-ing) into the project directory. Let’s see how we can do that.&lt;/p&gt;

&lt;p&gt;First, I’ll assume you are using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zsh&lt;/code&gt; as your default shell and have configured &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oh-my-zsh&lt;/code&gt;. 
If not, then &lt;a href=&quot;https://ohmyz.sh&quot;&gt;just set it up&lt;/a&gt;. Note that this is not a prerequisite; it’s more of a personal preference, but using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oh-my-zsh&lt;/code&gt; does come with many benefits, like showing the currently active Python environment in your prompt, which is why I recommend it.&lt;/p&gt;

&lt;p&gt;To enable the automation above, we need two prerequisites. The first is to include two files in each project (you can version-control these files): &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.python-version&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.python-virtualenv&lt;/code&gt;, as per the tree below:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
│   ├── .python-version
│   └── .python-virtualenv
└── project_b
    ├── .python-version
    └── .python-virtualenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In each of these files you just add a single line at the very top:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.python-version&lt;/code&gt;, the Python version you want to use;&lt;/li&gt;
  &lt;li&gt;in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.python-virtualenv&lt;/code&gt;, the name of the virtual environment you want to create.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the contents of&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
│   ├── .python-version 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;can be:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;3.9.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
│   └── .python-virtualenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;can be:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;project-a-venv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Similarly, for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project_b&lt;/code&gt; you can have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3.8.2&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project-b-venv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, on to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.zshrc&lt;/code&gt;. Do:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;vi ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and add the following script:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Define your $PATH&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PYENV_ROOT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/.pyenv&quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PYENV_ROOT&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin:&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PATH&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Automatic venv activation&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;eval&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;pyenv init -&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;eval&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;pyenv virtualenv-init -&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PYENV_VIRTUALENV_DISABLE_PROMPT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1

&lt;span class=&quot;c&quot;&gt;# Undo any existing alias for `cd`&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;unalias cd &lt;/span&gt;2&amp;gt;/dev/null

&lt;span class=&quot;c&quot;&gt;# Method that verifies all requirements and activates the virtualenv&lt;/span&gt;
hasAndSetVirtualenv&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# .python-version is mandatory for .python-virtualenv but not vice versa&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; .python-virtualenv &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
    if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; .python-version &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
      &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;To use .python-virtualenv you need a .python-version&quot;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return &lt;/span&gt;1
    &lt;span class=&quot;k&quot;&gt;fi
  fi&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# Check if pyenv has the Python version needed.&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# If not (or pyenv not available) exit with code 1 and the respective instructions.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; .python-version &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
    if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-z&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;which pyenv&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
      &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Install pyenv see https://github.com/yyuu/pyenv&quot;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return &lt;/span&gt;1
    &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;pyenv versions 2&amp;gt;&amp;amp;1 | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;not installed&apos;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
      &lt;span class=&quot;c&quot;&gt;# Message &quot;not installed&quot; is automatically generated by `pyenv versions`&lt;/span&gt;
      &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;run &quot;pyenv install&quot;&apos;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return &lt;/span&gt;1
    &lt;span class=&quot;k&quot;&gt;fi
  fi&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# Create and activate the virtualenv if all conditions above are successful&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# Also, if virtualenv is already created, then just activate it.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; .python-virtualenv &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;VIRTUALENV_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; .python-virtualenv&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;PYTHON_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; .python-version&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;MY_ENV&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PYENV_ROOT&lt;/span&gt;/versions/&lt;span class=&quot;nv&quot;&gt;$PYTHON_VERSION&lt;/span&gt;/envs/&lt;span class=&quot;nv&quot;&gt;$VIRTUALENV_NAME&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$MY_ENV&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; virtualenv &lt;span class=&quot;nv&quot;&gt;$MY_ENV&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;which python&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$MY_ENV&lt;/span&gt;/bin/activate
  &lt;span class=&quot;k&quot;&gt;fi&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

pythonVirtualenvCd &lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# move to a folder + run the pyenv + virtualenv script&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$@&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; hasAndSetVirtualenv
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Every time you move to a folder, run the pyenv + virtualenv script&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;alias cd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pythonVirtualenvCd&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Save your changes, return to your terminal and either restart it or run:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, let’s assume that you are in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GitHub&lt;/code&gt; directory:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;pwd&lt;/span&gt;
/Users/&amp;lt;username&amp;gt;/Github
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then, if you do:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;~/GitHub &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;project_a
created virtual environment CPython3.9.0.final.0-64 &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;448ms
  creator CPython3Posix&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/Users/&amp;lt;username&amp;gt;/.pyenv/versions/3.9.0/envs/project-a-venv, &lt;span class=&quot;nv&quot;&gt;clear&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;no_vcs_ignore&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;global&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  seeder FromAppData&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;pip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;wheel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;via&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;copy, &lt;span class=&quot;nv&quot;&gt;app_data_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/Users/&amp;lt;username&amp;gt;/Library/Application Support/virtualenv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    added seed packages: &lt;span class=&quot;nv&quot;&gt;pip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;20.3.1, &lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;51.3.3, &lt;span class=&quot;nv&quot;&gt;wheel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;project-a-venv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--------------------------------------------------------------------------------&lt;/span&gt;
~/GitHub/project_a &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and, if you come out of it and change to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project_b&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ../project_b
created virtual environment CPython3.8.2.final.0-64 &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;932ms
  creator CPython3Posix&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/Users/&amp;lt;username&amp;gt;/.pyenv/versions/3.8.2/envs/project-b-venv, &lt;span class=&quot;nv&quot;&gt;clear&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;no_vcs_ignore&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;global&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  seeder FromAppData&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;pip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;wheel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;via&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;copy, &lt;span class=&quot;nv&quot;&gt;app_data_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/Users/&amp;lt;username&amp;gt;/Library/Application Support/virtualenv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    added seed packages: &lt;span class=&quot;nv&quot;&gt;pip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;20.3.1, &lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;51.3.3, &lt;span class=&quot;nv&quot;&gt;wheel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;project-b-venv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--------------------------------------------------------------------------------&lt;/span&gt;
~/GitHub/project_b &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, two new virtual environments have been created:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv versions
system
&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; 3.8.2 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;by /Users/&amp;lt;username&amp;gt;/GitHub/project_b/.python-version&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  3.8.2/envs/project-b-venv
  3.9.0
  3.9.0/envs/project-a-venv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and every time you cd into these directories, your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt; will switch automatically.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Note 1:&lt;/code&gt; You may face some issues with Python 3.8.7.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Note 2:&lt;/code&gt; To uninstall a python env, do: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv uninstall 3.8.2/envs/project-b-venv&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
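&lt;p&gt;As a quick sanity check that the automatic switch picked up the interpreter you expect, you can run a few lines of plain Python from inside either project directory (standard library only; nothing here is specific to pyenv):&lt;/p&gt;

```python
import sys
import platform

# Print which interpreter and environment are currently active.
print(platform.python_version())  # e.g. 3.9.0 in project_a, 3.8.2 in project_b
print(sys.prefix)                 # should point inside ~/.pyenv/versions/...

# In a virtual environment, sys.prefix differs from the base interpreter's
# prefix (modern virtualenv/venv both set base_prefix accordingly).
in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
print("running inside a virtual environment:", in_venv)
```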

&lt;h2 id=&quot;using-jupyter-notebook-or-jupyter-lab-with-a-virtual-environment-of-your-choice&quot;&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyter notebook&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyter lab&lt;/code&gt; with a virtual environment of your choice&lt;/h2&gt;
&lt;p&gt;Finally, suppose you want to use a Python environment with a Jupyter notebook. This is not as 
straightforward as one would think. Here is how to do it.&lt;/p&gt;

&lt;p&gt;Let’s continue from where we left things in the previous section. You are in:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and you have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project-a-venv&lt;/code&gt; activated:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;project-a-venv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--------------------------------------------------------------------------------&lt;/span&gt;
~/Github/project_a &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The first thing you need to do is install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ipykernel&lt;/code&gt; using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ pip install ipykernel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, you need to install a new kernel:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ipython kernel install --user --name=project-a-venv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Finally, assuming you have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyter&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyterlab&lt;/code&gt; installed, you can start &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyter&lt;/code&gt;, create a new notebook and select the kernel that lives inside 
your environment.&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;jupyter notebook
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
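&lt;p&gt;Once the notebook is running with the new kernel selected, you can confirm that the kernel really lives inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project-a-venv&lt;/code&gt; rather than the system Python by running a cell like this (standard library only):&lt;/p&gt;

```python
import sys

# The interpreter backing the selected kernel:
print(sys.executable)  # expect a path under .../envs/project-a-venv/bin/
# The root of the environment it belongs to:
print(sys.prefix)
```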

&lt;h2 id=&quot;final-notes&quot;&gt;Final notes&lt;/h2&gt;
&lt;p&gt;I really hope this was a helpful post and, if you are new to Python, that it has helped you
disambiguate some of the confusing aspects of configuring Python at the start of your journey!&lt;/p&gt;

&lt;p&gt;The references below were very helpful in putting together this post:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://opensource.com/article/19/5/python-3-default-mac&quot;&gt;The right and wrong way to set Python 3 as default on a Mac&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://glhuilli.github.io/virtual-environments.html&quot;&gt;Automatic activation of virtualenv (+ pyenv)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2021-02-14T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2021/02/14/python-envs.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2021/02/14/python-envs.html</guid>
        </item>
        
        
        
        <item>
            <title>A BREXIT NLP Dataset!</title>
            <description>&lt;p&gt;So here is the thing… I love discussing politics; I think that everyone should, at least occasionally, concern 
themselves with what is happening on their country’s political scene.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/brexit-nlp-dataset/eu-brexit-classifier.png&quot; alt=&quot;BREXIT 2016&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Regardless of whether you are into politics or not, it would be practically impossible to escape debating &lt;span class=&quot;blog-highlight blog-highlight--eu&quot;&gt;BREXIT&lt;/span&gt; back 
in the summer of 2016. At the time, I had just been hired by Data Reply UK and the company’s annual XChange conference was
around the corner.&lt;/p&gt;

&lt;p&gt;My boss at the time wanted us to come up with something interesting and eye-catching for our demo pod at the conference. 
So, since BREXIT was a trending and highly debated topic, I thought that maybe I could come up with a way to predict 
people’s political stance by means of their social activity.&lt;/p&gt;

&lt;h2 id=&quot;the-idea&quot;&gt;The idea&lt;/h2&gt;
&lt;p&gt;The idea was simple:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Provided one’s Twitter @handle, try to infer their political views on &lt;span class=&quot;blog-highlight blog-highlight--eu&quot;&gt;BREXIT&lt;/span&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The original approach was to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Collect people’s tweets through the Twitter API;&lt;/li&gt;
  &lt;li&gt;Label tweets related to &lt;span class=&quot;blog-highlight blog-highlight--eu&quot;&gt;BREXIT&lt;/span&gt; as either PRO or CON;&lt;/li&gt;
  &lt;li&gt;Calculate a ratio between the two and produce a number that would represent their political stance.&lt;/li&gt;
&lt;/ol&gt;
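&lt;p&gt;Step 3 of the list above can be sketched in a few lines of Python. Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stance_score&lt;/code&gt; is a hypothetical helper written for illustration, not code from the original demo:&lt;/p&gt;

```python
def stance_score(labels):
    """Map a list of PRO/CON tweet labels to a score in [-1, 1]:
    +1 means all labelled tweets are PRO-Brexit, -1 means all are CON."""
    pro = sum(1 for label in labels if label == "PRO")
    con = sum(1 for label in labels if label == "CON")
    total = pro + con
    if total == 0:
        return 0.0  # no Brexit-related tweets: no signal either way
    return (pro - con) / total

print(stance_score(["PRO", "PRO", "CON"]))  # ~0.33: leaning PRO
```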

&lt;p&gt;After experimenting a bit, I figured out that using one’s own tweets would not be enough. Many Twitter users don’t 
tweet that often and, when they do, they are not really concerned with the EU or BREXIT. So I thought that maybe we could
use the tweets of the people that one follows. This draws from social science and ideas behind tribalism:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;“…you are likely to be ideologically aligned with the positions of your peers [or of those you follow on twitter ;)]!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-dataset&quot;&gt;The dataset&lt;/h2&gt;
&lt;p&gt;In order to be able to label tweets, I had to develop an &lt;span class=&quot;blog-highlight blog-highlight--nlp&quot;&gt;NLP&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; model. To do so, I needed a relatively 
big corpus of labelled tweets.&lt;/p&gt;

&lt;p&gt;I turned to an &lt;a href=&quot;https://www.bbc.com/news/uk-politics-eu-referendum-35616946&quot;&gt;article by the BBC&lt;/a&gt; 
at the time, which categorised MPs according to their public stance on BREXIT. Using a Twitter list that held the handles 
of 449 MPs and the Twitter API, I accumulated a corpus of 60,941 tweets. Each tweet contained one or more of the following keywords:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;key_words&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;European union&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;European Union&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;european union&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;EUROPEAN UNION&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;Brexit&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;brexit&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;BREXIT&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;euref&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;EUREF&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;euRef&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;eu_ref&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;EUref&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;leaveeu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;leave_eu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;leaveEU&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;leaveEu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;borisvsdave&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;BorisVsDave&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;StrongerI&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;strongerI&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;strongeri&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;strongerI&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;votestay&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;vote_stay&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;voteStay&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;votein&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;voteout&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;voteIn&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;voteOut&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;vote_In&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;vote_Out&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;referendum&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;Referendum&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;REFERENDUM&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and were automatically labelled based on the views of the MP who tweeted them.&lt;/p&gt;
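&lt;p&gt;As an aside, the mixed-case variants in the list above can be collapsed with case-insensitive matching. Here is an illustrative sketch (the pattern covers only a subset of the original keywords):&lt;/p&gt;

```python
import re

# Case-insensitive pattern covering a subset of the keyword list above.
BREXIT_KEYWORDS = re.compile(
    r"european union|brexit|eu_?ref|leave_?eu|vote_?(stay|in|out)|referendum",
    re.IGNORECASE,
)

def is_brexit_related(tweet):
    """True if the tweet mentions any of the BREXIT-related keywords."""
    return BREXIT_KEYWORDS.search(tweet) is not None

print(is_brexit_related("Get ready for BREXIT!"))           # True
print(is_brexit_related("Lovely weather in Heraklion"))     # False
```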

&lt;p&gt;You can find more details on how I generated the ML model and how the demo solution worked in this
 &lt;a href=&quot;https://github.com/Christos-Hadjinikolis/eu_tweet_classifier&quot;&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;dataset-now-available-on-kaggle&quot;&gt;Dataset now available on Kaggle&lt;/h2&gt;
&lt;p&gt;It took me some time to publish it, but the dataset is now available for everyone to use on Kaggle. You can find it 
by following this &lt;a href=&quot;https://www.kaggle.com/chadjinik/labelledbrexittweets&quot;&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope that the ML community will make good use of it. It’s four years after the referendum, but BREXIT is yet to really 
happen, and unfortunately it remains a concerning issue. So, who knows, maybe someone will want to use this dataset in some 
other, equally interesting way.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2020-09-02T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/09/02/BREXIT-NLP-dataset.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/09/02/BREXIT-NLP-dataset.html</guid>
        </item>
        
        
        
        <item>
            <title>Style Transfer in Heraklion</title>
            <description>&lt;p&gt;I am currently in Crete for my annual getaway. Crete is an amazing island with many beautiful places to visit and a vast 
history that goes all the way back to the Minoans in 3500 BC.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-koules.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;
One of the things I love doing whenever I am here is strolling around the city of Heraklion and taking pictures of the many hidden alleys, 
which reveal an amazing graffiti culture! I really wanted to write about it in my blog, and I thought that maybe I could do so 
using some amazing images I gathered just last week in a &lt;span class=&quot;blog-highlight blog-highlight--vision&quot;&gt;style-transfer&lt;/span&gt; post. So this is it: &lt;strong&gt;&lt;span class=&quot;blog-highlight blog-highlight--vision&quot;&gt;Style Transfer&lt;/span&gt; in Heraklion&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;a-bit-of-history-a-neural-algorithm-of-artistic-style&quot;&gt;A bit of history: A Neural Algorithm of Artistic Style&lt;/h2&gt;
&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--vision&quot;&gt;Neural Style Transfer (NST)&lt;/span&gt; is a class of algorithms that process images to adopt the visual style of another image. A seminal paper 
that introduced this concept was &lt;a href=&quot;https://arxiv.org/abs/1508.06576&quot;&gt;“A Neural Algorithm of Artistic Style”&lt;/a&gt; by Leon A. Gatys, Alexander 
S. Ecker and Matthias Bethge. In their work, the authors emphasize that:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;“…representations of content and style in Neural Networks are separable”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the foundation of this work: if these two notions are indeed separable, then, provided two images, you can take the style 
of the first and the content of the second and merge them together. So, how is this done exactly?&lt;/p&gt;

&lt;h2 id=&quot;delving-into-the-details&quot;&gt;Delving into the details&lt;/h2&gt;
&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-01.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The first figure in the paper shows the original setup and how a pre-trained NN, referred to as &lt;span class=&quot;blog-highlight blog-highlight--vision&quot;&gt;VGG19&lt;/span&gt;, was modified to do NST. What is &lt;span class=&quot;blog-highlight blog-highlight--vision&quot;&gt;VGG19&lt;/span&gt;? 
Well, the basic building blocks of traditional convolutional networks are the following layers:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;a &lt;a href=&quot;https://www.youtube.com/watch?v=YRhxdVk_sIs&amp;amp;list=RDCMUC4UJ26WkceqONNF5S26OiVw&amp;amp;index=2&quot;&gt;convolutional layer&lt;/a&gt; (with padding to maintain the resolution);&lt;/li&gt;
  &lt;li&gt;a non-linear activation layer such as a &lt;a href=&quot;https://www.youtube.com/watch?v=m0pIlLfpXWE&amp;amp;list=RDCMUC4UJ26WkceqONNF5S26OiVw&amp;amp;index=3&quot;&gt;ReLU&lt;/a&gt;, and;&lt;/li&gt;
  &lt;li&gt;a pooling layer such as a &lt;a href=&quot;https://www.youtube.com/watch?v=ZjM_XQa5s6s&quot;&gt;max pooling layer&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VGG&lt;/code&gt; block consists of a sequence of convolutional layers, followed by a max pooling layer for spatial down-sampling.
What we are interested in is how this network will respond to the inputs.&lt;/p&gt;
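&lt;p&gt;Schematically (the layer names are purely illustrative, not a framework implementation), a VGG block can be described as follows:&lt;/p&gt;

```python
def vgg_block(num_convs, channels):
    """Return a schematic description of one VGG block: a sequence of
    3x3 convolutions (each followed by a ReLU), then a 2x2 max pool."""
    layers = [f"conv3x3({channels}) + ReLU" for _ in range(num_convs)]
    layers.append("maxpool2x2")
    return layers

# The first two blocks of VGG19 use two convolutions each (64, then 128 channels).
print(vgg_block(2, 64))
```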

&lt;h3 id=&quot;retrieving-the-content&quot;&gt;Retrieving the content&lt;/h3&gt;
&lt;p&gt;Notice that the authors prefer to use paintings for the style and a random photograph for the content; these combinations seem to work best. 
The main idea is abstracting the content and putting more emphasis on the style!&lt;/p&gt;

&lt;p&gt;At the top left you see &lt;a href=&quot;https://artsandculture.google.com/asset/the-starry-night/bgEuwDxel93-Pg?hl=en-GB&amp;amp;avm=2&quot;&gt;“The Starry Night”&lt;/a&gt; 
by Vincent van Gogh and below it is just a random content image; let’s start with the latter.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-02.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Provided both an input (style) image and a content image, each neuron, and by extension each layer in the NN, will either activate or it won’t.
Each image is thereby processed, or better yet filtered, in a different way. Looking at 
how the content image is gradually filtered in the above figure, you will notice that the first layer leaves the image seemingly intact. 
But looking all the way to the last filtered output, you can see that this is no longer the case: the shapes are still there, but the detail 
inside them is not. This is because the high-level features of later layers are generated from earlier abstractions of the same image 
produced by previous layers. This is exactly the behaviour we want for retrieving the content.&lt;/p&gt;

&lt;h3 id=&quot;retrieving-the-style&quot;&gt;Retrieving the style&lt;/h3&gt;
&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-03.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;So, for the style, the authors explain that they have built a new feature-space, which focuses on the style of an input image on top 
of the original CNN representations. The style representation computes correlations between the different features in different
layers of the CNN. They reconstruct the style of the input image from style representations built on different subsets of CNN
layers and this results in images that match the style of the input on an increasing scale while discarding information of the 
arrangement of the scene.&lt;/p&gt;

&lt;h2 id=&quot;its-all-in-the-formulas-or-formulae&quot;&gt;It’s all in the formulas (or formul$ae$)&lt;/h2&gt;
&lt;p&gt;The authors also discuss the impact of the number of layers used to infer the style or the content of images before they are merged 
(visually depicted in Figure 3 of the paper). &lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-04.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;
In the first row (A) only one layer is used in contrast to 5 layers used at the bottom row where the result is much better.&lt;/p&gt;

&lt;p&gt;To generate the images which are a mixture of the content of an image-A with the style of another (image-B) the authors explain that 
they jointly minimise the distance of a “white noise” image from the content representation of image-A in one layer of the network 
and the style representation of image-B in a number of layers of the CNN. This is gracefully captured by the below loss function:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-05.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;where $\overrightarrow{p}$ is image-A (usually a photograph whose content we care about) and $\overrightarrow{a}$ is image-B 
(usually a painting whose style we care to retrieve). $\alpha$ and $\beta$ are the respective weighting factors for content 
and style reconstruction.&lt;/p&gt;

&lt;p&gt;Going back to Figure 3 of the paper, looking at it from left to right we see what happens when we tweak these weighting factors ($\alpha$ and $\beta$). 
The left-most column concerns cases where $\alpha$ is low compared to $\beta$, and the right-most column is the other way around. These two 
factors practically weight the content and style errors respectively. If $\alpha$ is high, the content error is more 
important, and vice-versa for an increasing $\beta$.&lt;/p&gt;

&lt;p&gt;The objective of the formula is to minimize $\mathcal{L}_{total}$. $\overrightarrow{x}$ is the image that we are gradually building through multiple iterations and it 
initially comes either from the photograph ($\overrightarrow{p}$) or it is initialised as white noise. $\alpha$ and $\beta$ are the weights that we 
need to set, and they are basically our hyper-parameters in this problem.&lt;/p&gt;
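&lt;p&gt;As a minimal numeric sketch of this objective, assuming the content and style losses have already been computed elsewhere:&lt;/p&gt;

```python
def total_loss(content_loss, style_loss, alpha=1.0, beta=1000.0):
    """L_total = alpha * L_content + beta * L_style (Gatys et al.).
    alpha and beta are the hand-set weighting hyper-parameters."""
    return alpha * content_loss + beta * style_loss

# A low alpha/beta ratio emphasises style over content, as in the
# left-most columns of Figure 3 in the paper.
print(total_loss(0.5, 0.002))  # 1.0 * 0.5 + 1000.0 * 0.002 = 2.5
```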

&lt;p&gt;What is now left is understanding \(\mathcal{L}_{content}\) and \(\mathcal{L}_{style}\).&lt;/p&gt;

&lt;h3 id=&quot;mathcall_content&quot;&gt;$\mathcal{L}_{content}$&lt;/h3&gt;
&lt;p&gt;Here is where everything gets a bit complicated but at the same time, you get to piece everything together nicely.&lt;/p&gt;

&lt;p&gt;$\mathcal{L}_{content}$ is described as the squared-error loss between two feature representations: one for the photograph 
$\overrightarrow{p}$ and one for the generated image $\overrightarrow{x}$, which starts out as white noise. 
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-06.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;
$P^l$ and $F^l$ are the respective feature representations for the two images in layer $l$. The authors used the feature space provided by the 16 convolutional and 5 pooling layers of the 19 layer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VGG&lt;/code&gt; Network. 
Here, $F^l$ represents the filter responses at a given layer $l$ or, plainly, the output of a bank of non-linear filters for that layer. The complexity of these filters increases 
with the position of the layer in the network. $F^l$ is practically a matrix of size $N_l\times M_l$, where $N_l$ is the number of filters within 
the layer and $M_l$ is the size of each feature map; the latter is the height $\times$ width of the feature map.&lt;/p&gt;

&lt;p&gt;So, a given input image $\overrightarrow{x}$ is encoded in each layer of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CNN&lt;/code&gt; by the filter responses to that image. 
To visualise the image information that is encoded at different layers of the hierarchy, the authors perform gradient descent
on the white noise image to find another image that matches the feature responses of the original image. 
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-07.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;
So, the approach is to gradually change the initially random image $\overrightarrow{x}$ until it generates the same response in a certain layer of the CNN as the original image.&lt;/p&gt;
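&lt;p&gt;As an illustration, here is a minimal pure-Python sketch of that squared-error content loss, treating $F^l$ and $P^l$ as plain nested lists; a real implementation would of course use framework tensors, as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyTorch&lt;/code&gt; tutorial discussed further down does:&lt;/p&gt;

```python
def content_loss(F, P):
    """Squared-error content loss for one layer:
    0.5 * sum over all (i, j) of (F_ij - P_ij)^2, where F and P are the
    N x M feature representations of the generated image and the
    photograph respectively."""
    return 0.5 * sum(
        (f - p) ** 2
        for row_f, row_p in zip(F, P)
        for f, p in zip(row_f, row_p)
    )


# Identical feature responses give zero loss; gradient descent on the
# generated image drives the loss towards that point.
print(content_loss([[1, 2], [3, 4]], [[1, 1], [3, 3]]))  # 0.5 * (1 + 1) = 1.0
```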

&lt;h3 id=&quot;mathcall_style&quot;&gt;$\mathcal{L}_{style}$&lt;/h3&gt;
&lt;p&gt;The style loss function is described by the following equation:
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-08.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;  &lt;br /&gt;
which is essentially a sum of weighted distances between the feature correlations (across the filter responses of the different layers) of two images:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the original (style) image $\overrightarrow{a}$, and;&lt;/li&gt;
  &lt;li&gt;the generated image $\overrightarrow{x}$, initialised as white noise, whose style representation is matched to that of the original.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break this down a bit more; what are these feature correlations? Practically, they express the correlations between the responses of the different filters 
($i$ and $j$) within a given layer $l$. This is beautifully captured as the matrix of all possible inner 
products between the vectorised feature maps, called a &lt;a href=&quot;https://www.youtube.com/watch?v=DEK-W5cxG-g&quot;&gt;“Gram matrix $G$”&lt;/a&gt;, as per the below equation:
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-09.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;One such matrix is generated for each of the two images (the original $\overrightarrow{a}$ and the generated $\overrightarrow{x}$), namely $A_{ij}^l$ and $G_{ij}^l$, and a squared 
distance is calculated between the two. The objective is to minimise this distance. So, practically, as with every ML problem, what we have is an optimisation problem and 
a cost function! Minimising the distance can be achieved through the application of gradient descent using standard error back-propagation 
to adjust the generated image $\overrightarrow{x}$, as per equation $5$ of the paper.&lt;/p&gt;
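&lt;p&gt;The feature-correlation machinery above fits in a few lines of plain Python. The sketch below is illustrative only: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;F&lt;/code&gt; stands for a layer’s $N_l\times M_l$ feature matrix and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A_gram&lt;/code&gt; for the style image’s Gram matrix at the same layer; real implementations operate on tensors:&lt;/p&gt;

```python
def gram_matrix(F):
    """G_ij = inner product between vectorised feature maps i and j
    of the same layer (the feature correlations)."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]


def style_layer_loss(F, A_gram):
    """Per-layer contribution to the style loss:
    E_l = (1 / (4 * N^2 * M^2)) * sum over (i, j) of (G_ij - A_ij)^2."""
    N, M = len(F), len(F[0])
    G = gram_matrix(F)
    scale = 1.0 / (4 * N ** 2 * M ** 2)
    return scale * sum(
        (g - a) ** 2
        for row_g, row_a in zip(G, A_gram)
        for g, a in zip(row_g, row_a)
    )
```

The full style loss is then the weighted sum of `style_layer_loss` over the chosen layers.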

&lt;h3 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h3&gt;
&lt;p&gt;Finally, in order to generate the final, style-transferred image, we return to equation 7, which jointly 
minimises the distance of a white noise image from the content representation of the photograph in one layer of the network 
and from the style representation of the painting in a number of layers of the CNN. The authors also note that:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;For image synthesis we found that replacing the max-pooling operation by average pooling improves the gradient flow and one obtains slightly
more appealing results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s it! So, what’s left now is getting our hands dirty!&lt;/p&gt;

&lt;h2 id=&quot;using-pytorch-for-style-transfer&quot;&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyTorch&lt;/code&gt; for Style transfer&lt;/h2&gt;
&lt;p&gt;If you follow this &lt;a href=&quot;https://pytorch.org/tutorials/advanced/neural_style_tutorial.html?highlight=style%20transfer&quot;&gt;link&lt;/a&gt; to the official 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyTorch&lt;/code&gt; website you will find a very well-written tutorial on how to apply style transfer with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyTorch&lt;/code&gt;. I provide
my own take on it &lt;a href=&quot;https://github.com/Christos-Hadjinikolis/style-transfer/blob/master/tests/experiments/Style_Transfer_Tutorial.ipynb&quot;&gt;&lt;strong&gt;$\rightarrow$here$\leftarrow$&lt;/strong&gt;&lt;/a&gt;. You 
can follow the link to the Python notebook and copy-paste the code to give it a try.
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-14.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I intend to work on turning this into a package and will publish an updated post once I do (it will be developed in the same repo as the link above). The intention is
to be able to style images through an intuitive API that takes the image to be styled as 
input, along with a choice between famous artworks bundled with the package (provided as a text parameter), and produces the desired output (plus some other flags and side parameters). 
Something like:&lt;/p&gt;

&lt;pre&gt;
import pytorch_style_transfer as pst

pst.generate(
    input_image_path=&quot;path_to_input_image&quot;, 
    style=&quot;starry_night&quot;, 
    resolution=128, 
    output_dir=&quot;path/to/output&quot;)
&lt;/pre&gt;

&lt;h2 id=&quot;enjoy-some-of-the-outputs&quot;&gt;Enjoy some of the outputs:&lt;/h2&gt;
&lt;p&gt;Here are some of the results of this work. I tried blending the fortress of Koules with 4 different graffiti I was able to photograph.&lt;/p&gt;

&lt;p&gt;The original picture:
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-koules-fortress.jpg&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The result is not always great, but it was still very interesting to try:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-output-10.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-output-11.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-output-12.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-output-13.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;That’s it!&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2020-08-15T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/08/15/on-style-transfer.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/08/15/on-style-transfer.html</guid>
        </item>
        
        
        
        <item>
            <title>Agile Data Science</title>
            <description>&lt;p&gt;Re-posting from &lt;a href=&quot;https://www.iunera.com/kraken/big-data-science-strategy/the-agile-approach-in-data-science-explained-by-an-ml-expert/&quot;&gt;https://www.iunera.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Around two weeks ago I was approached by &lt;a href=&quot;https://www.linkedin.com/in/dr-tim-frey-7b28171/&quot;&gt;Dr. Tim Frey&lt;/a&gt;, General Manager at Iunera GmbH &amp;amp; Co. KG. I was quite surprised to read his message:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Hi Christos, 
We met at the mind mastering machines conference in London.
We operate a company blog (https://iunera.com/kraken ) and one of our writers wrote about &lt;em&gt;agile&lt;/em&gt; in Data Science. 
I liked your talk two years ago and I thought she can approach you to ask a few questions like kind of an in-article 
interview with an expert. 
Hope that is fine with you. Would be super glad to get your insights.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I must admit this was a first for me! Then again, that talk in 2018 was quite an interesting one for me too.&lt;/p&gt;

&lt;h2 id=&quot;how-it-all-happened&quot;&gt;How it all happened…&lt;/h2&gt;
&lt;p&gt;You see, 3 years ago I was asked to join an exceptional team over at UBS to help with a graph analytics project. If you asked me then I would 
proudly tell you that “…I am a Data Scientist”; that is how I saw myself. However, that was bound to change forever.&lt;/p&gt;

&lt;p&gt;The first three months were amazing. I worked with a vast amount of data and revealed some very interesting insights. 
So, inevitably, my project manager approached me and asked: “…how about we take this work of yours into production?”&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/agile-data-science/2020-08-11-agile-ds-01.png&quot; alt=&quot;Agile Data Science&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I didn’t have a clue about what that meant in reality, but I was about to find out. He said: “Well, don’t worry, we will pair 
you with an engineer and you both can get started on it”. So we did!&lt;/p&gt;

&lt;p&gt;This is basically the story of how I was exposed to software engineering and the &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; way of working–of how I was converted into 
an &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineer&lt;/span&gt;. Two years later I decided to take my learnings from this experience and share them with my community, which I did at &lt;strong&gt;mCubed&lt;/strong&gt; London in 2018:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/nRsqFrutfSg&quot; title=&quot;Agile Data Science talk&quot; frameborder=&quot;0&quot; loading=&quot;lazy&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;That’s where I also met Tim. Turns out that a year and a half later a colleague of Tim’s (&lt;a href=&quot;https://www.linkedin.com/in/dhanhyaashri-mahendran/&quot;&gt;Dhanhyaashri Mahendran&lt;/a&gt;) was doing a bit of research on 
&lt;em&gt;“Doing Data Science the &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; way”&lt;/em&gt; and Tim suggested that she get in touch with me to ask me some questions, which I welcomed.&lt;/p&gt;

&lt;h2 id=&quot;some-very-interesting-questions-were-thrown-my-way&quot;&gt;Some very interesting questions were thrown my way…&lt;/h2&gt;
&lt;p&gt;I really liked the questions that Dhanhyaashri had prepared. She had obviously done her research. I did my best to respond and two weeks later 
the interview was published on the Iunera blog. You can read it &lt;a href=&quot;https://www.iunera.com/kraken/big-data-science-strategy/the-agile-approach-in-data-science-explained-by-an-ml-expert/&quot;&gt;here&lt;/a&gt; 
but I also felt like re-posting the interview on my personal blog too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Besides the cutting of time-consuming planning and quicker turnaround of projects, what other benefits are there in applying the Agile approach in data science?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;For the community, I would say that would be the emergence of new Data-Science-oriented practices that will drive the application of Agile in the research domain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;The problem with applying Agile in Data Science is that, traditionally, Agile is practiced in software development projects where experimentation, testing and tuning are minimal (usually dealt with as spikes). The focus there is on delivering business requirements, in the form of features and products, fast, in a volatile, constantly evolving environment. To support this, a number of underpinning practices have been developed, covering areas like modelling and design, coding and testing, risk handling and quality assurance. But all of these focus primarily on feature delivery (backlogs, user stories, CI/CD, TDD or BDD, to name a few). Some of these underpinning practices can be transferred directly into the Data Science world (e.g. user stories and backlogs, timeboxing and retrospectives) but others, not so much; for instance, how can TDD be useful when experimenting to find the optimal k with which to cluster customer datasets? So, a clear benefit of trying to apply Agile in Data Science is that, gradually, similar Data-Science-specific underpinning practices will eventually be developed, and these will, of course, be based on the same Agile drivers: adaptive planning, evolutionary development, early delivery and continual improvement, and, more generally, flexible responses to change.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;For the Data Scientists I would say it is mostly about adjusting to the requirement of working in a way to deliver business value from their experimentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;The feature-oriented focus that characterises Agile in the software development world is not so familiar to data scientists and researchers. What’s more, “value”—business value—is perceived in very different ways across these two worlds as well. Have you ever discussed the “value” of an experiment with a project manager? Not an easy task, I assure you! My experience tells me that most of the time this comes down to project managers fearing that no tangible outputs will be produced through experimentation. This is completely wrong, but only as long as experiments are well-structured and well-thought-out. To me, Agile Data Science is all about iterative hypothesis testing. Proving or disproving a hypothesis is always useful; it minimises the risk of failure and increases decision awareness when choosing what needs to be prioritised! But these outcomes can only be achieved when Data Scientists know exactly what they are trying to prove, discover or disprove and how that would be valuable to their team’s objective. Gradually, Data Scientists become better at it, and this benefits both themselves and their teams.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What are the downsides of Agile in data science? What can we do about these downsides?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Agile is a set of values and principles; as such, I can’t really say that there is something wrong with it. What is surely wrong is to assume that Agile is the only way that a team can work and be productive—it’s not. Ever since Agile emerged—in the concrete form that we know it today through the Agile manifesto—many hurried to undermine the effectiveness of other development models, e.g. Waterfall.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;There is nothing wrong with the Waterfall model either; the real question is whether these practices or models are fit for purpose! There are surely research projects as well as business requirements around the delivery of software that could potentially be delivered through the Waterfall approach or maybe through a combination of the two. What project managers and teams should strive for is increasing their effectiveness and efficiency. If that can be done by building on top of the Agile values then great; if not, then maybe they will need to try and come up with a different formula.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Project managers focusing too much on what Agile is and what is not—if it needs to be Scrum or Kanban or if too much documentation or too much time spent in design is not Agile—are bound to make mistakes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Do you think that the imposition of Agile on teams (the Agile Industrial Complex) is defeating the purpose of Agile in finding what works best for teams in working adaptively?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;In a similar spirit to my previous response, I do! Once more, I can’t stress enough how there is no single perfect development model. Project managers need to always assess what is fit for purpose. Primarily, though, they should focus on the underpinning values and principles that characterise Agile and other development models. A recurrent mistake that I have experienced throughout my consulting career is the oversimplification of Agile as an anti-methodology, anti-documentation and anti-planning development model. I appreciate that this makes understanding Agile much easier, but at the same time it is a very unfair representation of what Agile is! Imposing it on this basis is surely wrong. Equally, practicing Agile is definitely not something that comes through imposition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;I was exposed to the Agile methodology through a passionate software engineer who was an evangelist of Extreme Programming. To him, the way he worked was a way of seeing the software engineering world, supported by many more things than just sprints and Jira tickets and user stories: knowledge transfer and evolution through an unparalleled team spirit, and an overall culture of doing things in a way that helps everyone grow (people and software) in a fast-paced, fast-evolving world. Empathy was found at the centre of everything he did, and his ability to convey this passion was extreme! &lt;a href=&quot;https://twitter.com/tumbarumba&quot;&gt;@tumbarumba&lt;/a&gt;, all the best wherever you are!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is because Agile is, above all, a culture—a way of thinking; a way of caring about the impact and consequences of every individual’s contribution to a team goal. When it is collectively addressed as such, then only good things can come out of it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Is there a possible reason for many data scientists to not be aware of the Agile manifesto?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;I can’t be too sure about this, but if I were to point at anything, it would be how Data Science has, until recently (about 5 years ago), been so disjoint from the delivery of production-ready solutions. It was more focussed on research and discovery to aid decision making. Lately, the evolution and growth of ML, as well as of cost-effective services to support it, necessitates the interaction of the two worlds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Never before has it been so much the case that ML models are such an integral component of software. Before, Data Scientists did not need to worry about the operationalisation and maintenance of their models. Concepts like versioning, robustness, code coverage and testing were not so much imposed or needed, let alone challenges related to things like dealing with technical debt and refactoring. The traditional work environment would be a Jupyter notebook with access to a database! So, Data Scientists did not need to be exposed to so many practices governing how they would work to deliver new insights.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What kind of challenges stand in the way of operational production level DS solutions?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;This mostly has to do with bridging the gap between software engineers and data scientists. Software engineers not exposed to data science can’t really bridge it on their own, because they fail to appreciate how exactly to maintain ML pipelines. Note that, in contrast to traditional software pipelines, there are many more issues that need to be addressed; I would refer your readers to the seminal 2015 NIPS paper on the “Hidden Technical Debt in Machine Learning Systems”. Equally, Data Scientists don’t appreciate the complexity of developing and maintaining code-bases and software solutions in a flexible and robust way that allows for things like CI/CD to be supported. This gap is now partially addressed through the emergence of a new paradigm: the ML Engineer, a hybrid data scientist and software engineer, equipped with the knowledge to deal with challenges from both worlds. However, that is not enough to account for everything. What is also necessary is the emergence of appropriate tooling to support the development and maintenance of ML pipelines. Good examples are Kubeflow, AWS SageMaker and the less mature but fast-evolving Google AI Platform.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;What is surely not helpful is the bad practice of finding ways to schedule and run Python notebooks in production, and I purposely changed paragraphs to highlight this! I can’t stress enough how many times I have dealt with this in my career! Python notebooks are not made to be run as part of production pipelines—yet so many companies just do so!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is a plea to every project manager running an ML project out there: &lt;strong&gt;This is madness! Please stop it!&lt;/strong&gt;&lt;/p&gt;
  &lt;div class=&quot;tenor-gif-embed&quot; data-postid=&quot;3413789&quot; data-share-method=&quot;host&quot; data-width=&quot;100%&quot; data-aspect-ratio=&quot;2.4174757281553396&quot;&gt;&lt;a href=&quot;https://tenor.com/view/300-action-drama-gerard-butler-madness-gif-3413789&quot;&gt;This. Is. Sparta! GIF&lt;/a&gt; from &lt;a href=&quot;https://tenor.com/search/300-gifs&quot;&gt;300 GIFs&lt;/a&gt;&lt;/div&gt;
  &lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://tenor.com/embed.js&quot;&gt;&lt;/script&gt;

&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;In your opinion, what is the most important factor in making ML-Ops agile?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;I think that the answer to this question is “culture”. ML-Ops are here to help cultivate collaboration between data scientists and engineers to support the ML life-cycle. They are a manifestation of Agile for Data Science, in a way! What’s needed is for this mentality towards the development of production-level ML solutions to be supported by practitioners, project managers and stakeholders alike. Everyone needs to take risks and own responsibility. Data Scientists need to develop the courage to support their experiments even if they may appear to delay production; they need to help stakeholders and project managers appreciate the actual value of experimentation. This will often prove very challenging; loss aversion will eventually kick in and, when it does, people will be more reluctant to change and will want to stick to what they know. But this is to be expected! It is natural human behaviour, and this is what we, as a community, are up against.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;At the end of the day, we need to remember that it is almost impossible to find the right balance or get it perfectly right. There is no formula for it. Nevertheless, value will come simply from trying to get it right, and that is more than enough!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many thanks again to both Tim and Dhanhyaashri for their time and effort!&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2020-08-11T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/08/11/agile-data-science.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/08/11/agile-data-science.html</guid>
        </item>
        
        
        
        <item>
            <title>AWS ML Certification</title>
            <description>&lt;p&gt;I recently took the &lt;a href=&quot;https://aws.amazon.com/certification/certified-machine-learning-specialty/&quot;&gt;AWS Certified Machine Learning - Specialty&lt;/a&gt;, which remains one of the most demanding &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; certifications. 
I went through a lot of work in order to adequately prepare for this exam and I can tell you that it is indeed one of 
the hardest &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS certifications&lt;/span&gt;. Nevertheless, with proper preparation and a bit of dedication you should be fine.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/aws-ml-certification/2020-07-29-AWS-Cert.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&quot;how-long-do-i-need-to-study-for-this&quot;&gt;How long do I need to study for this?&lt;/h2&gt;
&lt;p&gt;Well it depends; if you are an experienced Data Scientist and have been applying Data Science for about 3+ years then an hour per day for a month should be enough. This also holds if you are an engineer already exposed to the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; infrastructure and services but are not familiar with Data Science topics.&lt;/p&gt;

&lt;p&gt;You see, this certification is labelled as hard simply because it is not just about &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; services. 50% of it is concerned with purely Data Science topics; the other 50% is about &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; services that support Data Science and &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; activities. If you are neither exposed to Data Science nor to the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; services then at least 2 months of studying is recommended.&lt;/p&gt;

&lt;h2 id=&quot;what-does-the-exam-cover&quot;&gt;What does the exam cover?&lt;/h2&gt;
&lt;p&gt;Data Engineering covers 20% of the exam, Exploratory Data Analysis another 24%, Modelling 36%, and Machine Learning Implementation and Operations the remaining 20%.&lt;/p&gt;

&lt;p&gt;I put together a list below, in an attempt to summarise the content:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Data Concepts&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Deals with data preparation routines; things like:
        &lt;ul&gt;
          &lt;li&gt;Feature selection&lt;/li&gt;
          &lt;li&gt;Feature engineering&lt;/li&gt;
          &lt;li&gt;PCA&lt;/li&gt;
          &lt;li&gt;Dealing with missing data or unbalanced datasets&lt;/li&gt;
          &lt;li&gt;Labels and one-hot encoding&lt;/li&gt;
          &lt;li&gt;Splitting and randomisation of data&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ML Concepts&lt;/strong&gt;: Covers:
    &lt;ul&gt;
      &lt;li&gt;Classical ML Categories of Algorithms&lt;/li&gt;
      &lt;li&gt;Deep Learning&lt;/li&gt;
      &lt;li&gt;The ML-Life-cycle&lt;/li&gt;
      &lt;li&gt;Optimisation: Gradient Descent&lt;/li&gt;
      &lt;li&gt;Regularisation&lt;/li&gt;
      &lt;li&gt;Hyperparameter Tuning&lt;/li&gt;
      &lt;li&gt;Cross-Validation&lt;/li&gt;
      &lt;li&gt;Record I/O&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ML Algorithms&lt;/strong&gt;: A list of algorithms you should be familiar with:
    &lt;ul&gt;
      &lt;li&gt;Logistic Regression&lt;/li&gt;
      &lt;li&gt;Linear Regression&lt;/li&gt;
      &lt;li&gt;Support Vector Machine&lt;/li&gt;
      &lt;li&gt;Decision Trees&lt;/li&gt;
      &lt;li&gt;Random Forests&lt;/li&gt;
      &lt;li&gt;K-Means&lt;/li&gt;
      &lt;li&gt;K-Nearest Neighbours&lt;/li&gt;
      &lt;li&gt;Latent Dirichlet Allocation (LDA) Algorithm&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deep Learning&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Cover Neural Networks in a general sense&lt;/li&gt;
      &lt;li&gt;Convolutional Neural Networks: High-level understanding&lt;/li&gt;
      &lt;li&gt;Recurrent Neural Networks: High-level understanding&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Model Optimisation&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Confusion Matrix&lt;/li&gt;
      &lt;li&gt;Sensitivity and Specificity&lt;/li&gt;
      &lt;li&gt;Accuracy &amp;amp; Precision&lt;/li&gt;
      &lt;li&gt;ROC/AUC&lt;/li&gt;
      &lt;li&gt;Gini Impurity&lt;/li&gt;
      &lt;li&gt;F1-Score&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ML Tools &amp;amp; Frameworks&lt;/strong&gt;: Cover basic ML tools (know what they do and what they are used for)
    &lt;ul&gt;
      &lt;li&gt;Jupyter Notebooks&lt;/li&gt;
      &lt;li&gt;Pytorch&lt;/li&gt;
      &lt;li&gt;MXNet&lt;/li&gt;
      &lt;li&gt;TensorFlow&lt;/li&gt;
      &lt;li&gt;Keras&lt;/li&gt;
      &lt;li&gt;Scikit-learn&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Amazon Serverless Services&lt;/strong&gt;: Not everything; think about the things that a Data Scientist or ML Engineer would need to do.
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Simple Storage Services - S3&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Glue&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Athena&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Quicksight&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Kinesis&lt;/code&gt;, Streams, Firehose, Video &amp;amp; Analytics (S.O.S. this one ;) )&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EMR&lt;/code&gt; with Spark&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EC2&lt;/code&gt; for ML&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Lambda Functions&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Step Functions&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Amazon Serverless ML Services&lt;/strong&gt;: These are out-of-the-box ML solutions offered by AWS.
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Rekognition&lt;/code&gt; (image/video)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Polly&lt;/code&gt; (Text-to-Speech)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Transcribe&lt;/code&gt; (Speech-to-Text)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Translate&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Comprehend&lt;/code&gt; (Text Analysis Service)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Lex&lt;/code&gt; (Conversation Interface Service - Chatbots)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Service Chaining&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AWS Step Functions&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;SageMaker&lt;/span&gt;&lt;/strong&gt;: A service that you really need to spend time with!
    &lt;ul&gt;
      &lt;li&gt;What is it exactly?&lt;/li&gt;
      &lt;li&gt;Benefits? Advantages?&lt;/li&gt;
      &lt;li&gt;Supported Algorithms (a huge list; learn the most popular ones)&lt;/li&gt;
      &lt;li&gt;Building and Pre-processing / Ground Truth&lt;/li&gt;
      &lt;li&gt;Training and Data sourcing&lt;/li&gt;
      &lt;li&gt;Hyper-parameter Tuning&lt;/li&gt;
      &lt;li&gt;Model Serving (HTTPS endpoints)&lt;/li&gt;
      &lt;li&gt;Elastic Inference&lt;/li&gt;
      &lt;li&gt;Batch Transform&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
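&lt;p&gt;To make the service-chaining idea above a little more concrete, here is a minimal Python sketch of an Amazon States Language definition that chains two Lambda-backed tasks (say, one wrapping &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Transcribe&lt;/code&gt; and one wrapping &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Comprehend&lt;/code&gt;). The function names and ARNs are hypothetical placeholders, not real resources:&lt;/p&gt;

```python
import json

# A minimal Amazon States Language (ASL) sketch of "service chaining":
# two Task states, each of which would call a (hypothetical) Lambda
# function wrapping an AWS ML service. The output of the first state
# becomes the input of the second.
definition = {
    "Comment": "Chain Transcribe and Comprehend via Lambda-backed tasks",
    "StartAt": "TranscribeAudio",
    "States": {
        "TranscribeAudio": {
            "Type": "Task",
            # Placeholder ARN for a Lambda that starts a transcription job
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:transcribe-audio",
            "Next": "AnalyseText",
        },
        "AnalyseText": {
            "Type": "Task",
            # Placeholder ARN for a Lambda that runs text analysis
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:analyse-text",
            "End": True,
        },
    },
}

# Step Functions expects the state-machine definition as a JSON string.
asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

&lt;p&gt;You would then hand this JSON string to Step Functions (e.g. via boto3&amp;rsquo;s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_state_machine&lt;/code&gt; call); the point for the exam is simply understanding that each state&amp;rsquo;s output feeds the next state&amp;rsquo;s input.&lt;/p&gt;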

&lt;p&gt;This is by no means an exhaustive list, but it should at least give you an idea of what is generally involved.&lt;/p&gt;

&lt;h2 id=&quot;how-should-i-prepare&quot;&gt;How should I prepare?&lt;/h2&gt;
&lt;p&gt;There are many ways to prepare. Myself, I covered &lt;a href=&quot;https://linuxacademy.com/cp/modules/view/id/340&quot;&gt;the relevant course on Linux Academy&lt;/a&gt;, which I highly recommend.&lt;/p&gt;

&lt;p&gt;Ideally, I would recommend spending some time with &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SageMaker&lt;/code&gt;&lt;/span&gt; and trying to interact with services like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lambda&lt;/code&gt; functions and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;step-functions&lt;/code&gt;, as well as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Kinesis&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Glue&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Athena&lt;/code&gt;. However, that would take a while, and using these resources does not come for free.&lt;/p&gt;
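&lt;p&gt;For a feel of what that hands-on interaction looks like, here is a minimal Python sketch of invoking a deployed &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;SageMaker&lt;/span&gt; endpoint with boto3. The endpoint name is a hypothetical placeholder, and only the request-building helper runs without AWS credentials:&lt;/p&gt;

```python
# A sketch of how you might call a deployed SageMaker model over its HTTPS
# endpoint with boto3. The endpoint name below is hypothetical; the helper
# only builds the request arguments, so no AWS account is needed to try it.

def build_invoke_request(endpoint_name, features):
    """Build the kwargs for sagemaker-runtime's invoke_endpoint call,
    serialising the feature vector as a single CSV row."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "Body": ",".join(str(f) for f in features),
    }

request = build_invoke_request("my-xgboost-endpoint", [5.1, 3.5, 1.4, 0.2])
print(request["Body"])  # 5.1,3.5,1.4,0.2

# With credentials configured and a live endpoint (an assumption here),
# you would then run:
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(**request)
#   prediction = response["Body"].read().decode("utf-8")
```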

&lt;p&gt;The Linux Academy Course has a number of labs that will help you develop an adequate understanding of these services. You can worry about honing your skills and knowledge at a later point.&lt;/p&gt;

&lt;h2 id=&quot;how-long-does-the-exam-last&quot;&gt;How long does the exam last?&lt;/h2&gt;
&lt;p&gt;The exam consists of 65 multiple-choice, multiple-response questions. It is 3 hours long, which I think is more than enough time to answer all the questions and then review your responses (…or take a nap while waiting for your colleagues to finish; I have a colleague who actually did this, but personally I can never relax that much when it comes to exams).&lt;/p&gt;

&lt;p&gt;In general, AWS exams are taken at authorised exam centres. Due to the COVID-19 lockdown, this was adjusted to meet the high demand from exam takers, and people can now take the test from home. However, the process is equally strict:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;You need to provide information about the room you will be sitting in;&lt;/li&gt;
  &lt;li&gt;The room needs to be completely quiet during the exam session;&lt;/li&gt;
  &lt;li&gt;You need to be alone in the room;&lt;/li&gt;
  &lt;li&gt;You need to provide pictures of your surroundings to show that you have no notes or anything suspicious close to you;&lt;/li&gt;
  &lt;li&gt;A proctor will log in at the time of the exam and ask to inspect the space around you (mine asked me to show the back of my computer before beginning, and doing so with my iMac was quite a challenge, so if you have the option, go for a laptop);&lt;/li&gt;
  &lt;li&gt;The exam session will be recorded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that, as one would expect, looking away from the screen for more than a couple of seconds might prompt the proctor to give you a warning. To be honest, as soon as the exam began it was quite easy to just focus on the screen. It took me less than an hour to cover all the questions, and I then used all the remaining time to review my responses. On completing the exam I received a notification that I had passed, subject to a committee review; I assume the examiners inspect the video of you taking the exam to check for cheating. Within 3 days I received the official certification.&lt;/p&gt;

&lt;h2 id=&quot;any-tips-advice&quot;&gt;Any tips? Advice?&lt;/h2&gt;
&lt;p&gt;Well, tip number one is: &lt;em&gt;“If you don’t know which is the right answer, then just go for the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; solution in the list of options”.&lt;/em&gt; By and large, this exam tests whether you are familiar with what is available to you through the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; platform. If a client wants to use &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; for image moderation and you recommend anything other than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Rekognition&lt;/code&gt;, then you clearly don’t know what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Rekognition&lt;/code&gt; is for! This has generally worked for me as a way of filtering options in and out.&lt;/p&gt;

&lt;p&gt;I would definitely recommend covering the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;SageMaker&lt;/span&gt; &lt;a href=&quot;https://aws.amazon.com/sagemaker/faqs/&quot;&gt;FAQs&lt;/a&gt;, which I see as a wonderful source of exam material.&lt;/p&gt;

&lt;p&gt;Do cover the official AWS practice exam; it is just 20 questions, but it is enough to give you an idea about what you are up against.&lt;/p&gt;

&lt;p&gt;That’s it! I really hope this article helps you get started with your learning journey, and that soon enough you will be joining the &lt;a href=&quot;https://www.linkedin.com/groups/6814264/&quot;&gt;AWS Certified Global Community&lt;/a&gt; to share your badge with everyone.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;
</description>
            <pubDate>2020-07-29T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/07/29/aws-ml-certification.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/07/29/aws-ml-certification.html</guid>
        </item>
        
        
        
        <item>
            <title>Just do it!</title>
            <description>&lt;p&gt;The thing about writing a blog post is that you are exposing yourself to the world; it feels a lot like &lt;span class=&quot;blog-highlight blog-highlight--signal&quot;&gt;flying for the first time&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/just-do-it/2020-07-28-Just-do-it.jpeg&quot; alt=&quot;Fly for the first time&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;You will be criticised! Some will appreciate your work. Others will say it’s wrong and disagree, which actually promotes healthy public debate and hence is a good thing, or they will just not care. Ultimately, blogging has nothing to do with being right, nor is it about writing the perfect post. Put simply, it is just about &lt;span class=&quot;blog-highlight blog-highlight--signal&quot;&gt;doing it&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;One of my favourite novels is “The Plague” by Albert Camus. At the time of writing, I cannot recommend this book enough, given the global COVID-19 commotion. In it, Camus’ characters are engaged in helping and saving people in the name of no ideology; people dying so unfairly (especially children) is enough to move anyone to act, irrespective of whether the act is supported by some moral justification.&lt;/p&gt;

&lt;p&gt;There is one particular character, a side character, who came to mind when I sat down to write this post: Joseph Grand. Joseph is a fifty-year-old clerk working for the city government. He lives an austere life, and in his spare time he is writing a book. However, he is such a perfectionist that he ends up rewriting the first sentence over and over and never proceeds any further. No words are ever good enough! What if meaning could be elevated to a higher level with a different wording? He blocks himself, feeling helpless and devastated.&lt;/p&gt;

&lt;p&gt;We’ve all been there, I am sure. If only he could let go of his perfectionism, complete that first paragraph and write the first chapter. What story would he tell? What morals and learnings would be revealed and shared?&lt;/p&gt;

&lt;p&gt;I guess we will never find out about Joseph Grand, but my blogging journey begins here and now. I look forward to hearing your thoughts, and I welcome all of your comments.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2020-07-28T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/07/28/just-do-it.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/07/28/just-do-it.html</guid>
        </item>
        
        
    </channel>
</rss>