<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>ML Affairs</title>
        <description>Posts by C.Hadjinikolis</description>
        <link>https://christos-hadjinikolis.github.io</link>
        
        
        <item>
            <title>Kafka Streams vs Flink Is The Wrong Question</title>
            <description>&lt;p&gt;I am not neutral about &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;I have spent years advocating for it, using it anywhere I could, organizing London meetups around it before COVID, and talking to anyone who would listen about why the dataflow model is such a good way to think. I still love that model. I love how naturally event-driven systems can align to a domain: &lt;em&gt;a ship enters a port, this state changes, that downstream action happens next.&lt;/em&gt; Both &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; and &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; let you express stateful processes in a way that can stay close to business reality.&lt;/p&gt;

&lt;p&gt;And that is exactly why this lesson was useful for me.&lt;/p&gt;

&lt;p&gt;When I joined a later role, I found myself surrounded by repositories built with &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;. My first instinct was simple: replace them with &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. Some of those repos were chaotic, under-loved, and far away from the kind of streaming architecture I like to build. I felt like a fish out of water. I wanted to modernize, refactor, migrate, clean the slate.&lt;/p&gt;

&lt;p&gt;But over time, after giving those systems the attention they deserved, I learned something more valuable than another framework argument:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;The useful question is not whether &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is &lt;em&gt;&quot;better&quot;&lt;/em&gt; than &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;.&lt;/p&gt;
  &lt;p&gt;The useful question is when your streaming problem stops being an application concern and becomes a platform concern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is still the line I care about most. But now I care about it with much more respect for both sides.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/when-flink-earns-its-complexity-over-kafka-streams/application-vs-platform-crossroads.png&quot; alt=&quot;A hand-drawn ninja engineer at a crossroads between rewriting toward Flink and curating a Kafka Streams system into a platform-aware architecture.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;This is the real fork in the road: not which mascot wins, but whether the system is still application-shaped or is becoming a platform concern.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;the-bias-i-had-to-correct&quot;&gt;The Bias I Had To Correct&lt;/h2&gt;

&lt;p&gt;There is a recurring engineering mistake hiding in this topic: &lt;em&gt;you inherit a system that feels old, untidy, or unfashionable, and you start reaching for the framework you know better.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I have had to relearn this lesson more than once in my career. It is almost embarrassing how often it comes back, which is probably proof of how important it is.&lt;/p&gt;

&lt;p&gt;I originally wanted to replace those &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; solutions largely because I was more fluent in &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. That fluency gave me clarity in one framework and discomfort in the other, and I briefly mistook that feeling for architecture.&lt;/p&gt;

&lt;p&gt;That is a dangerous mistake.&lt;/p&gt;

&lt;p&gt;Once I slowed down, cleaned up the code, made the domain model clearer, and brought more disciplined engineering practices to those codebases, I ended up with a much less dramatic conclusion:&lt;/p&gt;

&lt;p&gt;If you give an existing streaming system enough love, enough structure, and enough respect for the underlying model, you can get very far without rewriting it.&lt;/p&gt;

&lt;p&gt;That does not make &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; less good. It just makes engineering judgment less theatrical.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/when-flink-earns-its-complexity-over-kafka-streams/rewrite-or-repair.png&quot; alt=&quot;A hand-drawn ninja engineer illustration showing the temptation to rewrite a messy Kafka Streams system while a cleaner architectural repair path is explained.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;The urge to rewrite is strong. The better question is whether the system is structurally wrong or simply under-engineered.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;div class=&quot;blog-insight&quot;&gt;
  &lt;span class=&quot;blog-insight__label&quot;&gt;The Lesson&lt;/span&gt;
  &lt;p&gt;&lt;strong&gt;Framework preference is not architecture.&lt;/strong&gt; My first instinct was to rewrite messy &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; systems into &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. The better answer was to clean the model first, then decide whether the runtime was actually the problem.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;what-i-still-love-about-flink&quot;&gt;What I Still Love About Flink&lt;/h2&gt;

&lt;p&gt;Let me be clear: I am still a very strong &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; advocate.&lt;/p&gt;

&lt;p&gt;I still think the &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; dataflow model is one of the cleanest ways to reason about stateful stream processing. Operator boundaries are explicit. State feels local to the operator that owns it. Checkpointing, recovery, repartitioning, and event-time semantics feel like first-class runtime concepts instead of side effects of a library attached to a broker.&lt;/p&gt;

&lt;p&gt;That is a big deal to me, because I care a lot about how easily a streaming system can be explained.&lt;/p&gt;

&lt;p&gt;When a framework makes the flow of state and events easy to communicate, it usually also makes the system easier to maintain.&lt;/p&gt;

&lt;p&gt;But none of that comes for free.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; asks you to pay an upfront complexity tax in operations, onboarding, debugging, and platform maturity. Misconfigured jobs are not charming. They are expensive. The model feels cleaner once you have paid that tax, not before.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/when-flink-earns-its-complexity-over-kafka-streams/flink-complexity-tax.png&quot; alt=&quot;A hand-drawn ninja engineer facing a Flink complexity tax toll booth before entering a powerful streaming platform city.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;This is the part many framework comparisons skip: the platform is powerful, but you do pay for the privilege of operating it well.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This is why I still reach for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; eagerly when the runtime itself needs to be a serious part of the design.&lt;/p&gt;

&lt;h2 id=&quot;where-kafka-streams-grew-on-me&quot;&gt;Where Kafka Streams Grew On Me&lt;/h2&gt;

&lt;p&gt;What changed for me was not that I stopped liking &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. What changed is that I learned to appreciate where &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; is more enabling than I first allowed.&lt;/p&gt;

&lt;h3 id=&quot;1-the-state-model-is-different-not-just-worse&quot;&gt;1. The State Model Is Different, Not Just Worse&lt;/h3&gt;

&lt;p&gt;One of the things that threw me off at first was the ergonomics of state in &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; gives you state stores, changelog-backed recovery, and table-oriented patterns that can feel more globally available than &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s cleaner operator-local state style. The processor API is very explicit that processors interact with attached state stores, and those stores are fault-tolerant by default. In practice, the default persistent path is a local &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;RocksDB&lt;/span&gt; store backed by a compacted changelog topic. On top of that, table abstractions and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GlobalKTable&lt;/code&gt;-style patterns can make shared reference data or queryable state feel very convenient in the application model.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://kafka.apache.org/40/streams/architecture/&quot;&gt;Kafka Streams architecture&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kafka.apache.org/42/streams/developer-guide/processor-api/&quot;&gt;Kafka Streams processor API and state stores&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That convenience comes with real trade-offs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;local &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;RocksDB&lt;/span&gt; state is fast and useful, but fault tolerance still depends on changelogs&lt;/li&gt;
  &lt;li&gt;restore times can still become painful at scale, especially when local state is lost and the store must rebuild from the changelog&lt;/li&gt;
  &lt;li&gt;the relationship between topology code and materialized state can become messy in under-disciplined repos&lt;/li&gt;
  &lt;li&gt;the convenience of reachable state can encourage poor habits if the model is not kept clear&lt;/li&gt;
&lt;/ul&gt;
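&lt;p&gt;The changelog-backed pattern is easy to picture with a toy model. This is deliberately plain Python, not real &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; code: every local write also lands on a changelog, and a lost local store is rebuilt by replaying it.&lt;/p&gt;

```python
# Toy model of a changelog-backed state store (illustration only,
# not the real Kafka Streams implementation).

class ChangelogBackedStore:
    def __init__(self, changelog):
        self.local = {}             # stands in for the local RocksDB store
        self.changelog = changelog  # stands in for the compacted changelog topic

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))  # every write is also logged

    def restore(self):
        # If local state is lost, rebuild it by replaying the changelog.
        self.local = {}
        for key, value in self.changelog:
            self.local[key] = value  # later records win, as with compaction

changelog = []
store = ChangelogBackedStore(changelog)
store.put("ship-1", "in-port")
store.put("ship-1", "departed")

# Simulate losing local state, then restoring from the changelog.
store.local = {}
store.restore()
print(store.local)  # {'ship-1': 'departed'}
```

&lt;p&gt;The convenience and the cost live in the same place: the changelog makes the store durable, and the changelog is also what you wait on when local state disappears.&lt;/p&gt;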

&lt;p&gt;But convenience is still convenience. There are use cases where having easier access to shared or queryable state is genuinely useful, and it would be dishonest to pretend otherwise.&lt;/p&gt;

&lt;p&gt;My instinct, because of my &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; background, was to push &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; code toward a more operator-local way of thinking anyway: make state ownership clearer, keep logic close to the transform that really owns it, and avoid turning the topology into a stateful soup. That discipline improved those codebases a lot.&lt;/p&gt;

&lt;p&gt;But that is exactly the point: bringing some &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;-style discipline into &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; made the code better. It did not prove that the whole system needed to become &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;.&lt;/p&gt;

&lt;h3 id=&quot;2-kafka-native-integration-is-a-real-strength&quot;&gt;2. Kafka-Native Integration Is A Real Strength&lt;/h3&gt;

&lt;p&gt;I am not making the lazy version of the ecosystem argument here. Yes, &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; lives naturally inside the &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; ecosystem. Yes, it works comfortably with keyed messages, schemas, topics, and the usual surrounding tooling. Yes, schema-registry-oriented flows often feel more straightforward there.&lt;/p&gt;

&lt;p&gt;That matters. Not because &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; cannot do these things. It can. But because being native to the ecosystem reduces friction when the whole world around the application is already shaped like &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;You should not dismiss that as a minor detail. It is part of the operating model.&lt;/p&gt;

&lt;h2 id=&quot;where-flink-still-pulls-away&quot;&gt;Where Flink Still Pulls Away&lt;/h2&gt;

&lt;p&gt;This is where my original instincts still hold up.&lt;/p&gt;

&lt;h3 id=&quot;1-scaling-stops-at-the-broker-boundary-much-earlier-in-kafka-streams&quot;&gt;1. Scaling Stops At The Broker Boundary Much Earlier In Kafka Streams&lt;/h3&gt;

&lt;p&gt;The scaling constraint in &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; is tightly tied to partitions, tasks, and instances. That is not a bug. It is the design. It is also why the system stays so close to &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; itself.&lt;/p&gt;

&lt;p&gt;But it has consequences.&lt;/p&gt;

&lt;p&gt;There comes a point where adding more application instances does not really solve the problem because the partitioning boundary is already telling you how far you can go cleanly. You can absolutely scale &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;, but the broker topology keeps exerting a much stronger influence on the application topology.&lt;/p&gt;

&lt;p&gt;At that point, scaling stops being primarily demand-driven and starts becoming topology-constrained.&lt;/p&gt;
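&lt;p&gt;A toy calculation makes the bound concrete. This is a simplification that ignores stream threads, standby replicas, and multiple sub-topologies, but it captures the shape of the constraint:&lt;/p&gt;

```python
# Toy illustration of the partition bound on Kafka Streams scaling
# (simplified; real deployments also involve threads, standbys, and
# multiple sub-topologies).

def effective_parallelism(instances, partitions):
    # Each partition is processed by exactly one task at a time, so
    # instances beyond the partition count sit idle for that sub-topology.
    return min(instances, partitions)

print(effective_parallelism(4, 12))   # 4: adding instances still helps
print(effective_parallelism(16, 12))  # 12: four instances are now idle
```

&lt;p&gt;Past that ceiling, the only clean lever left is repartitioning the topic itself, which is a broker-side change with its own blast radius.&lt;/p&gt;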

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;, by contrast, is still constrained at the source when consuming from &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;, but once records are inside the runtime it has far more freedom to repartition, redistribute work, and run operators at a different parallelism from the source. I would not call that infinite scaling. I would call it a materially more flexible runtime.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/stateful-stream-processing/&quot;&gt;Stateful stream processing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/overview/&quot;&gt;Flink concepts overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That difference becomes major once traffic spikes, repartition pressure, or uneven workloads start shaping your architecture.&lt;/p&gt;

&lt;h3 id=&quot;2-checkpointing-and-recovery-are-in-a-different-league&quot;&gt;2. Checkpointing And Recovery Are In A Different League&lt;/h3&gt;

&lt;p&gt;This is still one of the clearest differentiators for me.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s checkpointing model is part of the platform. Recovery is an explicit runtime capability, not just the consequence of rebuilding local state from changelogs. The barrier-based snapshotting model, savepoints, and state redistribution semantics are exactly the kind of thing that make &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; feel like an engine rather than a library.&lt;/p&gt;

&lt;p&gt;In &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;, the picture is a little more nuanced than &lt;em&gt;“it always has to read the whole changelog again.”&lt;/em&gt; If the local state store still exists, the runtime can replay from the previously checkpointed offset and catch up from there. If local state is gone, it has to rebuild from the changelog from the beginning of the retained data. That is meaningfully better than a naive full replay every time, and it is one of the reasons the &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;RocksDB&lt;/span&gt; path works as well as it does in practice.&lt;/p&gt;

&lt;p&gt;But the deeper point still holds: fault tolerance and task migration are still anchored in changelog restoration, and on large stateful applications that can become one of the dominant operational pain points. Retention choices matter. Restore time matters. Recovery becomes less predictable under failure. Operational patience starts turning into architecture.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://kafka.apache.org/41/streams/developer-guide/running-app/&quot;&gt;Running Streams applications and state restoration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
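&lt;p&gt;The two restore paths can be sketched as a toy cost model. This is purely illustrative; the real restoration logic lives inside the runtime:&lt;/p&gt;

```python
# Toy cost model of the two Kafka Streams restore paths described above
# (illustration only; actual restoration is driven by the runtime).

def records_to_replay(changelog_len, checkpointed_offset, local_state_intact):
    if local_state_intact:
        # Local store survived: only catch up from the checkpointed offset.
        return changelog_len - checkpointed_offset
    # Local store lost: rebuild from the start of the retained changelog.
    return changelog_len

print(records_to_replay(1_000_000, 999_500, True))   # 500: a quick catch-up
print(records_to_replay(1_000_000, 999_500, False))  # 1000000: full rebuild
```

&lt;p&gt;The asymmetry is the point: the happy path is cheap, but the failure path scales with retained state, which is exactly when you can least afford it.&lt;/p&gt;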

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/when-flink-earns-its-complexity-over-kafka-streams/restore-and-recovery.png&quot; alt=&quot;A hand-drawn comparison of Kafka Streams changelog restoration and Flink checkpoint-based restore and recovery.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;At smaller scale this looks like an implementation detail. At larger scale it starts deciding how painful failure and recovery really feel in production.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;That is the point where &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; stops being a nice architectural preference and starts becoming a serious operational advantage.&lt;/p&gt;

&lt;h2 id=&quot;the-real-trade-off&quot;&gt;The Real Trade-Off&lt;/h2&gt;

&lt;p&gt;So, here is the trade-off in two sentences:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; is a very good way to build &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-native streaming applications.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is a very good way to operate stateful dataflows as a platform concern.&lt;/p&gt;

&lt;p&gt;Those are not the same problem, even if the diagrams sometimes look similar.&lt;/p&gt;

&lt;p&gt;And this is why I do not buy generic advice like &lt;em&gt;“use &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; if you need scale”&lt;/em&gt; or &lt;em&gt;“use &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; if you want simplicity.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Both statements are misleading. They sound practical, but they hide the real failure modes, encourage cargo-cult architecture, and make comfort-driven rewrites sound more principled than they are.&lt;/p&gt;

&lt;p&gt;The better rule is this:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote blog-pullquote--compact&quot;&gt;
  &lt;p&gt;If your system is still primarily an application that processes &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; topics, &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt; is often the right engineering choice.&lt;/p&gt;
  &lt;p&gt;If your system is becoming a stateful processing layer that needs explicit control over time, state, replay, recovery, and heterogeneous I/O, &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; starts to justify its existence very quickly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-harder-lesson&quot;&gt;The Harder Lesson&lt;/h2&gt;

&lt;p&gt;This is the part I most wanted to say personally.&lt;/p&gt;

&lt;p&gt;I am still a huge &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; proponent. That has not changed.&lt;/p&gt;

&lt;p&gt;What has changed is that I now trust myself less when my first reaction is &lt;em&gt;“we should rewrite this in the framework I prefer.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That reaction is often just comfort seeking.&lt;/p&gt;

&lt;p&gt;Sometimes you really should migrate. Sometimes the runtime boundary is wrong, recovery is too painful, scaling is too constrained, and &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is the more honest architecture.&lt;/p&gt;

&lt;p&gt;But sometimes the better engineering decision is to love the existing system properly: clarify the model, clean the state boundaries, improve the abstractions, respect the domain flow, and stop assuming that old means wrong.&lt;/p&gt;

&lt;p&gt;That was the lesson here for me.&lt;/p&gt;

&lt;p&gt;If I had followed my first instinct blindly, I would have replaced some systems for the wrong reason.&lt;/p&gt;

&lt;h2 id=&quot;what-i-would-actually-do&quot;&gt;What I Would Actually Do&lt;/h2&gt;

&lt;p&gt;If I were starting with a &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-centric JVM team, modest operational requirements, and clean &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-in/&lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-out topologies, I would still be very happy with &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka Streams&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;I would move toward &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; once one or more of these became persistently true:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;stateful jobs became expensive to recover or rescale&lt;/li&gt;
  &lt;li&gt;I needed a broader processing platform rather than a library&lt;/li&gt;
  &lt;li&gt;event-time and replay behaviour started driving design choices&lt;/li&gt;
  &lt;li&gt;the system stopped being comfortably &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;-shaped&lt;/li&gt;
  &lt;li&gt;operability and runtime visibility became a daily concern rather than an occasional debugging aid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the moment &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; stops being overkill and starts being the more honest architecture.&lt;/p&gt;

&lt;p&gt;And that brings me back to where I started.&lt;/p&gt;

&lt;p&gt;I still love &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;. I still think its model is easier to reason about once runtime concerns become serious. I still think it is the stronger platform when state, recovery, and rescaling dominate the design.&lt;/p&gt;

&lt;p&gt;Many rewrites begin as comfort and only later get dressed up as architecture.&lt;/p&gt;

&lt;p&gt;That is the part I understand better now, and it is probably the most useful thing this comparison taught me.&lt;/p&gt;
</description>
            <pubDate>2026-04-01T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2026/04/01/when-flink-earns-its-complexity-over-kafka-streams.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2026/04/01/when-flink-earns-its-complexity-over-kafka-streams.html</guid>
        </item>
        
        
        
        <item>
            <title>PyFlink In 2026: Better Than Its Reputation, Still Not Frictionless</title>
            <description>&lt;p&gt;I do not think teams reach for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; because &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; feels nicer to type.&lt;/p&gt;

&lt;p&gt;They reach for it when they have already paid the cost of splitting one &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; system across two ecosystems.&lt;/p&gt;

&lt;p&gt;I have seen that pain in the most annoying way possible: training and experimentation lived in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;, but the prediction path had to live in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;. On paper that sounds manageable. In practice it meant that subtle differences in floating-point behavior, parsing choices, and even heading-angle calculations were enough to create inconsistent predictions. We lost months chasing what looked like model problems but turned out to be feature mismatches.&lt;/p&gt;

&lt;p&gt;That is the part many architecture discussions understate. Once training is in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; and prediction is in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;, the real problem is no longer just inference. It becomes feature parity, interface parity, and the feedback loop between two runtimes that each have their own libraries, their own defaults, and their own ways of being &lt;em&gt;almost&lt;/em&gt; the same.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/pyflink-pros-cons-in-2026/training-vs-prediction-drift.png&quot; alt=&quot;A hand-drawn illustration of Python training and Java prediction pipelines drifting apart in subtle but painful ways.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;This is the real tax of cross-language serving paths: not dramatic failure, but endless small mismatches that make the system harder to trust.&lt;/figcaption&gt;
&lt;/figure&gt;
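&lt;p&gt;A toy example of the kind of drift I mean, using a hypothetical heading feature. The function names and conventions here are illustrative, not code from that system:&lt;/p&gt;

```python
import math

# Toy example of cross-runtime feature drift (hypothetical feature, for
# illustration): the same velocity vector under two reasonable heading
# conventions.

def heading_math(vx, vy):
    # Math convention: degrees counter-clockwise from the x-axis.
    return math.degrees(math.atan2(vy, vx)) % 360.0

def heading_compass(vx, vy):
    # Compass convention: degrees clockwise from north.
    return math.degrees(math.atan2(vx, vy)) % 360.0

# Heading north-east: both conventions agree, so a spot-check passes.
print(heading_math(1.0, 1.0), heading_compass(1.0, 1.0))

# Heading due north: the conventions disagree by 90 degrees, and every
# downstream feature built on the heading silently disagrees with them.
print(heading_math(0.0, 1.0), heading_compass(0.0, 1.0))
```

&lt;p&gt;Neither convention is wrong. They are just different, and a parity test that only samples the agreeing quadrant will happily wave both runtimes through.&lt;/p&gt;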

&lt;p&gt;You can try to escape that with &lt;span class=&quot;blog-highlight blog-highlight--onnx&quot;&gt;ONNX&lt;/span&gt;. You can rebuild parts of the feature logic in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;. You can expose the model behind a service boundary and call it remotely. All of these are reasonable patterns. None of them are free.&lt;/p&gt;

&lt;p&gt;Four years ago, &lt;span class=&quot;blog-highlight blog-highlight--onnx&quot;&gt;ONNX&lt;/span&gt; was not mature enough for the kinds of models and custom ops we cared about. The easy story broke precisely where real systems stop being toy examples. The fallback was the pattern most teams know well: deploy the model as a service and call it over REST. That works, but now your prediction pipeline owns an extra network hop, another SLA, another scaling surface, and one more place where raw features must remain perfectly aligned.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/pyflink-pros-cons-in-2026/model-service-tradeoffs.png&quot; alt=&quot;A hand-drawn illustration of a model service boundary with a load balancer, showing clean scaling but also latency and operational trade-offs.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;Model-as-a-service is often the sensible compromise. It is also where clean separation starts charging rent in latency, SLAs, and feature-parity work.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This is why I think the case for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; should be stated more bluntly than it usually is:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote&quot;&gt;
  &lt;p&gt;If the real source of friction in your system is that your training, feature logic, and model-adjacent code live naturally in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;, then &lt;em&gt;&quot;just use &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;&quot;&lt;/em&gt; is not a neutral suggestion.&lt;/p&gt;
  &lt;p&gt;It is an architectural trade, and often an expensive one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the real driver for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; adoption.&lt;/p&gt;

&lt;p&gt;I went back to an older &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; review recently because I did not want to turn one painful period into a permanent opinion. Some of those frustrations had aged well. Some had not. And &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is exactly the kind of technology people form a durable opinion about after one painful quarter and then never revisit.&lt;/p&gt;

&lt;p&gt;That would have been lazy here, because the story has moved. &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is in a better place now than many engineers assume. The official docs cover installation, packaging &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; environments, debugging, a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API, and connector examples. That is already a more serious platform story than the older dismissive take that it is simply immature.&lt;/p&gt;

&lt;p&gt;But the core trade-off has not disappeared.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is now real enough to take seriously, but it still does not let you forget that &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is fundamentally a JVM-first distributed runtime. That is the part people need to hold in their head at the same time as the improvements.&lt;/p&gt;

&lt;h2 id=&quot;what-has-improved-since-the-older-evaluation&quot;&gt;What Has Improved Since The Older Evaluation&lt;/h2&gt;

&lt;p&gt;The first thing worth saying is that some of the older criticisms are now too blunt.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is no longer just a thin curiosity around the Table API. The current docs cover installation, a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API, debugging, dependency management, packaging &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; environments for cluster execution, and connector examples:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/installation/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; installation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/datastream/intro_to_datastream_api/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/faq/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; FAQ&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/debugging/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; debugging&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/api/python/examples/datastream/connectors.html&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; connector examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is already a materially better story than the one many engineers still carry around in their heads.&lt;/p&gt;

&lt;p&gt;A few concrete improvements stand out:&lt;/p&gt;

&lt;h3 id=&quot;1-the-python-story-is-better-documented&quot;&gt;1. The &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; Story Is Better Documented&lt;/h3&gt;

&lt;p&gt;The installation docs now state clear &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; version requirements. At the time of writing, &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; requires &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; 3.9, 3.10, 3.11, or 3.12:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/installation/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; installation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds minor, but it is not. One of the easiest ways to waste time with cross-language frameworks is by discovering environment assumptions too late. The current docs at least acknowledge that this is a real part of the user experience.&lt;/p&gt;
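
&lt;p&gt;To make that concrete, here is a small preflight sketch in plain &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;. The version set mirrors the installation docs at the time of writing, and the function is mine, not an official API:&lt;/p&gt;

```python
import shutil
import sys

def pyflink_preflight():
    """Collect environment problems BEFORE attempting to submit a PyFlink job."""
    problems = []
    # PyFlink currently supports Python 3.9 through 3.12 (per the installation docs).
    if sys.version_info[:2] not in {(3, 9), (3, 10), (3, 11), (3, 12)}:
        problems.append('unsupported Python version: %d.%d' % sys.version_info[:2])
    # Flink is a JVM-first runtime: even a local job still needs a Java executable.
    if shutil.which('java') is None:
        problems.append('no java executable on PATH')
    return problems

print(pyflink_preflight())
```

&lt;p&gt;A check like this costs nothing and catches the most common late discovery: an interpreter or JVM mismatch that would otherwise surface only at submission time.&lt;/p&gt;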

&lt;h3 id=&quot;2-the-datastream-story-is-no-longer-hand-wavy&quot;&gt;2. The DataStream Story Is No Longer Hand-Wavy&lt;/h3&gt;

&lt;p&gt;One of the old reasons people dismissed &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; was that serious low-level streaming work still felt like &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; territory.&lt;/p&gt;

&lt;p&gt;That is less true now. The &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API is documented, examples exist, and the API surface is real enough that you can reason about it as a deliberate part of the platform rather than a side alley:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/datastream/intro_to_datastream_api/&quot;&gt;Intro to the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; DataStream API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would still be careful not to confuse &lt;em&gt;“documented”&lt;/em&gt; with &lt;em&gt;“equally frictionless as the JVM path,”&lt;/em&gt; but the old complaint that &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is barely there is no longer a fair description.&lt;/p&gt;

&lt;h3 id=&quot;3-debugging-and-packaging-are-better-acknowledged&quot;&gt;3. Debugging And Packaging Are Better Acknowledged&lt;/h3&gt;

&lt;p&gt;The older review spent a lot of energy on setup, environment pain, and debugging awkwardness.&lt;/p&gt;

&lt;p&gt;Those pains have not disappeared, but the current docs are more honest about them. They cover packaging &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; environments, adding JARs, client-side versus TaskManager-side logging, local debugging, remote debugging, and profiling:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/faq/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; FAQ&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/debugging/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; debugging&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because it tells you something important about the maturity of the ecosystem: it now documents the pain instead of pretending it is not there.&lt;/p&gt;

&lt;p&gt;That is progress, even if it is not magic.&lt;/p&gt;

&lt;h2 id=&quot;why-pyflink-is-genuinely-attractive&quot;&gt;Why &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; Is Genuinely Attractive&lt;/h2&gt;

&lt;p&gt;Despite the caveats, I do think &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; has a very real value proposition.&lt;/p&gt;

&lt;h3 id=&quot;1-it-keeps-the-streaming-layer-closer-to-the-actual-ml-ecosystem&quot;&gt;1. It Keeps The Streaming Layer Closer To The Actual &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; Ecosystem&lt;/h3&gt;

&lt;p&gt;This is the point I think most comparisons understate, and it is the one that matters most to me.&lt;/p&gt;

&lt;p&gt;The strongest argument for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is not merely &lt;em&gt;“our team prefers &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;.”&lt;/em&gt; The stronger argument is that the surrounding model ecosystem, experimentation culture, libraries, and iteration loops are still centered on &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;.&lt;/p&gt;

&lt;figure class=&quot;blog-figure blog-figure--wide&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2026/pyflink-pros-cons-in-2026/pyflink-same-ecosystem.png&quot; alt=&quot;A hand-drawn illustration showing PyFlink as a serious streaming platform that lets Python-native model and feature logic stay closer together.&quot; /&gt;
  &lt;figcaption class=&quot;blog-figure__caption&quot;&gt;This is why &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; remains attractive: not because the runtime becomes light, but because the surrounding &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; ecosystem can stay closer to the streaming layer.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;That matters when the alternative is forcing teams into one of these patterns:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;re-implementing logic in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;&lt;/li&gt;
  &lt;li&gt;exporting models through formats like &lt;span class=&quot;blog-highlight blog-highlight--onnx&quot;&gt;ONNX&lt;/span&gt; and accepting the translation burden&lt;/li&gt;
  &lt;li&gt;splitting the system so aggressively that the serving boundary becomes the architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are invalid. But all of them are real costs, and in many teams they are the &lt;em&gt;actual&lt;/em&gt; costs driving interest in &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;If the same raw features are calculated in one language for training and another for live prediction, you do not just inherit maintenance overhead. You inherit doubt. When a prediction looks wrong, is the model wrong, is the data wrong, or did one side normalise, round, parse, or order something differently? That uncertainty is corrosive, and it slows every feedback loop around the system.&lt;/p&gt;
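
&lt;p&gt;The failure mode is easy to reproduce in miniature. Below, two hypothetical code paths compute the “same” normalised feature and differ only in where they round; every name and number is illustrative:&lt;/p&gt;

```python
# Hypothetical example: the 'same' feature computed by two code paths.
def normalise_for_training(speed_knots):
    # Training pipeline: scale first, round second.
    return round(speed_knots / 30.0, 2)

def normalise_for_serving(speed_knots):
    # Serving pipeline: round first, scale second -- a silent divergence.
    return round(speed_knots, 2) / 30.0

speeds = [12.345, 17.891, 23.456]
skew = [abs(normalise_for_training(s) - normalise_for_serving(s)) for s in speeds]
print(max(skew))  # nonzero: the model now sees features it was never trained on
```

&lt;p&gt;Nothing crashes here. The numbers are just quietly different, which is exactly what turns every investigation into “is it the model or the pipeline?”&lt;/p&gt;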

&lt;h3 id=&quot;2-it-meets-python-heavy-teams-where-they-already-work&quot;&gt;2. It Meets &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;-Heavy Teams Where They Already Work&lt;/h3&gt;

&lt;p&gt;If your data and &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; teams already live in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;, &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; reduces one major source of organisational friction.&lt;/p&gt;

&lt;p&gt;That does not mean everyone suddenly gets to ignore distributed systems. But it does mean:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;feature logic can stay closer to the surrounding &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; estate&lt;/li&gt;
  &lt;li&gt;model-adjacent transformations feel more natural&lt;/li&gt;
  &lt;li&gt;experimentation paths from notebook thinking to streaming execution become less culturally awkward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For some organisations, that is a very big deal.&lt;/p&gt;

&lt;p&gt;The wrong reaction here is to sneer and say &lt;em&gt;“just learn &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;.”&lt;/em&gt; Sometimes that is the right answer. Often it is just a lazy one.&lt;/p&gt;

&lt;h3 id=&quot;3-it-makes-flink-more-reachable-without-hiding-flink&quot;&gt;3. It Makes &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; More Reachable Without Hiding &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;Good language bindings should not pretend the platform underneath does not exist.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is useful when it gives &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; teams access to &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s real strengths: state, checkpoints, event-time semantics, long-running streaming jobs, and broader dataflow capabilities. If that is what you are buying, then the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; layer can be a practical bridge.&lt;/p&gt;

&lt;p&gt;That is especially true for teams whose work already mixes ETL, feature pipelines, and model-centric logic.&lt;/p&gt;

&lt;h3 id=&quot;4-there-is-a-real-connector-surface&quot;&gt;4. There Is A Real Connector Surface&lt;/h3&gt;

&lt;p&gt;This is another place where the older blanket criticism needs updating.&lt;/p&gt;

&lt;p&gt;The current &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; docs do include worked &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt;, Pulsar, and Elasticsearch connector examples in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/api/python/examples/datastream/connectors.html&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; connector examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So it would be wrong to say that the connector story is absent.&lt;/p&gt;

&lt;p&gt;But it would also be wrong to say that it feels like a pure &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; ecosystem.&lt;/p&gt;

&lt;p&gt;That brings me to the real downside.&lt;/p&gt;

&lt;h2 id=&quot;why-pyflink-is-still-not-flink-but-easy&quot;&gt;Why &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; Is Still Not &lt;em&gt;“&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;, But Easy”&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;The strongest criticism from the old evaluation still holds:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; reduces language friction, but it does not remove runtime friction.&lt;/p&gt;

&lt;h3 id=&quot;1-you-still-have-to-think-in-two-worlds&quot;&gt;1. You Still Have To Think In Two Worlds&lt;/h3&gt;

&lt;p&gt;The installation and FAQ pages make this clear if you read them carefully.&lt;/p&gt;

&lt;p&gt;You have to think about:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; interpreter version&lt;/li&gt;
  &lt;li&gt;&lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; packaging and archives&lt;/li&gt;
  &lt;li&gt;where &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; executes&lt;/li&gt;
  &lt;li&gt;how dependencies are shipped&lt;/li&gt;
  &lt;li&gt;JAR dependencies for connectors or &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;-side integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That earlier review made this painfully concrete. Getting local execution into a sane state meant lining up:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the right &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; version&lt;/li&gt;
  &lt;li&gt;the right &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; version&lt;/li&gt;
  &lt;li&gt;the right connector JARs&lt;/li&gt;
  &lt;li&gt;the right &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That list is not just setup trivia. It is the operating model announcing itself early, and it shapes the day-to-day ergonomics of the platform:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/installation/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; installation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/faq/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; FAQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why I would resist overselling &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; to a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; team as &lt;em&gt;“just write &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; and the rest disappears.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It does not disappear.&lt;/p&gt;

&lt;p&gt;It relocates.&lt;/p&gt;

&lt;h3 id=&quot;2-the-connector-story-still-leaks-jvm-reality&quot;&gt;2. The Connector Story Still Leaks JVM Reality&lt;/h3&gt;

&lt;p&gt;The connector examples are useful, but they also reveal the real shape of things: adding JARs, managing connector dependencies, and living with the fact that some integration points are still fundamentally JVM-shaped.&lt;/p&gt;

&lt;p&gt;Even the current &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; connector docs explicitly talk about bringing connector dependencies yourself for &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; jobs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/connectors/datastream/kafka/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; connector docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a deal-breaker. It is just not the same experience as working inside a native &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; framework whose extension model is &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; all the way down.&lt;/p&gt;

&lt;p&gt;It also shows up in deployment. In that earlier review, the easiest workable path for local standalone deployment was not &lt;em&gt;“package a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; app and run it.”&lt;/em&gt; It was closer to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;start from a vanilla &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; image&lt;/li&gt;
  &lt;li&gt;add the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; dependencies&lt;/li&gt;
  &lt;li&gt;mount the repo or bundle the code carefully&lt;/li&gt;
  &lt;li&gt;run the &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; entrypoint from inside the live container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a perfectly workable path. It is also a strong reminder that the deployment experience is still shaped by &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;’s runtime model, not by &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;’s usual ergonomics.&lt;/p&gt;
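
&lt;p&gt;Sketched as commands, that path looks something like this; the image tag, paths, and entrypoint are illustrative rather than canonical:&lt;/p&gt;

```shell
# Sketch only: image tag, paths, and the job file are illustrative.
# 1. Start from a vanilla Flink image matching your cluster version.
docker run -d --name pyflink-dev -v "$(pwd)":/opt/job flink:1.20 jobmanager

# 2. Add the Python side inside the container; the vanilla image ships without it.
docker exec pyflink-dev bash -c 'apt-get update; apt-get install -y python3 python3-pip; pip3 install apache-flink'

# 3. Run the Python entrypoint from inside the live container.
docker exec pyflink-dev python3 /opt/job/main.py
```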

&lt;h3 id=&quot;3-debugging-still-tells-you-what-the-system-really-is&quot;&gt;3. Debugging Still Tells You What The System Really Is&lt;/h3&gt;

&lt;p&gt;The current debugging docs are better than before, but they are also revealing.&lt;/p&gt;

&lt;p&gt;They distinguish between client-side logging and TaskManager-side logging. They discuss local debug, remote debug, and profiling &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; UDFs. That is helpful, but it also tells you that when things go wrong, you are not debugging a simple &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; program. You are debugging &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; inside a distributed &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; runtime:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/debugging/&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; debugging&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
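
&lt;p&gt;One practical consequence: output from inside a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; UDF does not land where a local script would put it. A sketch, with illustrative names, of the pattern the debugging docs point you toward:&lt;/p&gt;

```python
import logging

# Per the PyFlink debugging docs, output written with the standard logging
# module from inside a UDF shows up in the TaskManager logs on the cluster,
# while the client-side log is a separate stream. The function is illustrative.
logger = logging.getLogger('scoring_udf')

def score(event):
    logger.info('scoring event %s', event)  # visible TaskManager-side on a cluster
    return (event, 1.0)

print(score('ship-arrival'))
```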

&lt;p&gt;In practice, that means some classes of issue still feel cross-boundary by nature:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;packaging bugs&lt;/li&gt;
  &lt;li&gt;dependency mismatches&lt;/li&gt;
  &lt;li&gt;behavioural differences between local and cluster execution&lt;/li&gt;
  &lt;li&gt;performance bottlenecks around &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; execution paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; being uniquely bad. It is just the cost of the abstraction being honest.&lt;/p&gt;

&lt;h3 id=&quot;4-native-python-models-are-not-an-automatic-architectural-win&quot;&gt;4. Native &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; Models Are Not An Automatic Architectural Win&lt;/h3&gt;

&lt;p&gt;This was one of the more useful parts of the earlier review, because it is exactly the kind of point people skip when they are trying to justify a new stack.&lt;/p&gt;

&lt;p&gt;Yes, being able to interact with model code directly inside a &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; job is a real plus. It can simplify some flows and avoid a network hop.&lt;/p&gt;

&lt;p&gt;But that is not the same as saying it is always the better architecture.&lt;/p&gt;

&lt;p&gt;Once the model is served behind a proper boundary, you often gain things that matter a lot in production:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;safer zero-downtime upgrades&lt;/li&gt;
  &lt;li&gt;cleaner readiness and health semantics&lt;/li&gt;
  &lt;li&gt;independent model scaling behind a load balancer&lt;/li&gt;
  &lt;li&gt;a clearer separation between streaming orchestration and serving concerns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, yes, native execution can save some overhead. But it can also collapse boundaries that were doing useful work for you.&lt;/p&gt;

&lt;p&gt;The reason I still take the native path seriously is not hand-wavy elegance. It is that model-as-a-service also comes with a bill:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;every prediction path now pays a network round trip&lt;/li&gt;
  &lt;li&gt;the serving tier becomes another system you need to scale for throughput and protect with its own SLA&lt;/li&gt;
  &lt;li&gt;raw feature generation has to stay perfectly aligned across the caller and the served model boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If demand is modest, teams can live with that for a long time. Once prediction volume rises, that architecture stops being an abstract diagram and starts showing up as latency, capacity planning, and operational drag.&lt;/p&gt;
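
&lt;p&gt;A back-of-envelope sketch makes that shift concrete. Every number below is an assumption, not a measurement:&lt;/p&gt;

```python
# Back-of-envelope only: all figures are assumptions, not benchmarks.
RTT_MS = 5.0                 # assumed in-cluster round trip per prediction call
PREDICTIONS_PER_SEC = 2_000  # assumed steady-state prediction volume

# Total network time the system pays per wall-clock second.
network_ms_per_sec = RTT_MS * PREDICTIONS_PER_SEC
# Little's law: requests permanently in flight just to keep up.
in_flight = network_ms_per_sec / 1000.0

print(f'{in_flight:.0f} requests in flight purely for the network hop')
```

&lt;p&gt;At ten predictions per second nobody notices the hop. At two thousand, the hop itself becomes a capacity-planning input.&lt;/p&gt;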

&lt;h3 id=&quot;5-the-performance-question-never-fully-goes-away&quot;&gt;5. The Performance Question Never Fully Goes Away&lt;/h3&gt;

&lt;p&gt;I would be very careful here not to pretend a benchmark I have not run.&lt;/p&gt;

&lt;p&gt;But I am comfortable saying something narrower and more useful: if your workload is highly latency-sensitive, connector-heavy, or operationally unforgiving, the JVM path still deserves to be the default starting point.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; can absolutely be the right choice. I just would not choose it because I wanted to avoid understanding the &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; side of &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;That is not how this platform works.&lt;/p&gt;

&lt;h2 id=&quot;so-when-would-i-use-it&quot;&gt;So When Would I Use It?&lt;/h2&gt;

&lt;p&gt;I would take &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; seriously when these conditions hold:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the team is materially more fluent in &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; than in &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;&lt;/li&gt;
  &lt;li&gt;the reason for adopting &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is the runtime model, not fashion&lt;/li&gt;
  &lt;li&gt;the jobs are important, but not balanced on the sharpest latency edge&lt;/li&gt;
  &lt;li&gt;I am willing to own environment packaging and connector dependency management as part of the operating model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would lean back toward &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;connector maturity dominates the problem&lt;/li&gt;
  &lt;li&gt;the hot path is extremely performance-sensitive&lt;/li&gt;
  &lt;li&gt;the team already has deep JVM experience&lt;/li&gt;
  &lt;li&gt;I expect deep platform integration and want the least surprising execution path&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;if-you-want-to-try-it&quot;&gt;If You Want To Try It&lt;/h2&gt;

&lt;p&gt;If this post pushed you toward experimenting rather than debating in the abstract, I put together a small starter page here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/pyflink-agent-starter.html&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; starter archetype and agent prompt&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is intentionally minimal. The goal is not to hand you a grand framework. The goal is to give you a sensible first project shape and an agent prompt that can get a small &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;-first streaming scaffold off the ground without immediate chaos.&lt;/p&gt;

&lt;h2 id=&quot;the-practical-takeaway&quot;&gt;The Practical Takeaway&lt;/h2&gt;

&lt;p&gt;What matters here is not whether &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is &lt;em&gt;“good”&lt;/em&gt; or &lt;em&gt;“bad.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is far too vague to help anyone.&lt;/p&gt;

&lt;p&gt;The better question is this:&lt;/p&gt;

&lt;blockquote class=&quot;blog-pullquote blog-pullquote--compact&quot;&gt;
  &lt;p&gt;Do I want &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; as the working language for a &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; system badly enough to own the extra operational boundary that comes with it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer is yes, &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;PyFlink&lt;/span&gt; is now mature enough to be a serious option.&lt;/p&gt;

&lt;p&gt;If the answer is no, then &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; is still the cleaner way to get the full benefits of &lt;span class=&quot;blog-highlight blog-highlight--flink&quot;&gt;Flink&lt;/span&gt; without pretending the JVM underneath is someone else’s problem.&lt;/p&gt;

&lt;p&gt;That, at least, is the view I would hold today.&lt;/p&gt;
</description>
            <pubDate>2026-03-27T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2026/03/27/pyflink-pros-cons-in-2026.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2026/03/27/pyflink-pros-cons-in-2026.html</guid>
        </item>
        
        
        
        <item>
            <title>Harmonizing Avro and Python: A Dance of Data Classes</title>
            <description>&lt;p&gt;Reposting from the &lt;a href=&quot;https://medium.com/vortechsa/harmonizing-avro-and-python-a-dance-of-data-classes-d1cc7bf6bb33&quot;&gt;Vortexa medium blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the realm of data engineering, managing data types and schemas efficiently is of paramount importance. The crux of the matter? When data schemas are poorly managed, a myriad of issues arise, ranging from data incompatibility to runtime errors. What I am aiming for in this article is to introduce &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt;, a binary serialization format born from the Apache Hadoop project, through which I hope to highlight the significance of Avro schemas in data engineering. Finally, I will provide you with a hands-on guide on converting Avro files into &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; data classes. By the end of this read, you’ll grasp the fundamentals of Avro schemas, understand the advantages of using them, and be equipped with a practical example of generating &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; data classes from these schemas.&lt;/p&gt;

&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2023/avro-schema-management/2023-11-07-break-screen.png&quot; alt=&quot;Break-Screen&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;the-issue-at-hand&quot;&gt;The Issue at Hand&lt;/h2&gt;
&lt;p&gt;Imagine the following scenario:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Your application’s new update starts crashing for a specific set of users.&lt;/li&gt;
  &lt;li&gt;Upon investigation, you discover the root cause: a mismatch between the expected data format and the actual data sent from the backend.&lt;/li&gt;
  &lt;li&gt;Such mismatches can occur due to several reasons — maybe a field was renamed, or its data type got changed without proper communication to all stakeholders.&lt;/li&gt;
  &lt;li&gt;These are real-world problems arising from the lack of efficient schema management.&lt;/li&gt;
  &lt;li&gt;So, how can &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt; and particularly &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt; schemas help deal with these predicaments?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;avro-what-now&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;… what now?&lt;/h2&gt;
&lt;p&gt;In the grand scheme of data engineering and big data, one might compare the efficient storage and transmission of data to the very lifeblood of the show. Now, if this show needed a backstage hero, it would be Apache Avro. This binary serialization format, conceived in the heart of the Apache Hadoop project, is swift, concise, and unparalleled in dealing with huge data loads. When the curtain rises for powerhouses like Data Lakes, Apache Kafka, and Apache Hadoop, it’s Avro that steals the limelight.&lt;/p&gt;

&lt;h3 id=&quot;the-evolution-of-data-serialization&quot;&gt;The Evolution of Data Serialization&lt;/h3&gt;
&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2023/avro-schema-management/2023-11-07-Package.png&quot; alt=&quot;Package&quot; /&gt;
&lt;/div&gt;
&lt;p&gt;Before diving into the tapestry of data’s history, let’s demystify a foundational concept here: serialization. At its core, serialization is the process of converting complex data structures or objects into a format that can be easily stored or transmitted and later reconstructed. Imagine packing for a trip; you organize and fold your clothes (data) into a suitcase (a serialized format) so that they fit neatly and can be effortlessly unpacked at your destination.&lt;/p&gt;

&lt;p&gt;With that in mind, the story of data storage and transmission is a dynamic saga filled with innovation, challenges, and breakthroughs. Cast your mind back to the times of simple flat files–text files adhering to a specific structure. They were the humble beginning, like parchment scrolls in a digital era. But as data grew in complexity, our digital scrolls evolved into intricate relational databases, swift NoSQL solutions, and vast data lakes.&lt;/p&gt;

&lt;p&gt;Now, imagine various systems, microservices, or extract-transform-load (ETL) pipelines, trying to communicate with one another by attempting to read unfamiliar data formats. It’s like trying to read a book when you don’t know the language it’s written in. To solve this, data had to be serialized–essentially translating complex data structures into a universally understood format. The early translators in this world were XML and JSON. Effective? Yes. Efficient? Not quite. They often felt like scribes painstakingly inking each letter, especially when handling vast amounts of data. The world needed a faster scribe; one that was both concise and precise.&lt;/p&gt;

&lt;p&gt;Enter &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;. Inspired by the bustling highways of big data scenarios–from the lightning speed of &lt;span class=&quot;blog-highlight blog-highlight--kafka&quot;&gt;Kafka&lt;/span&gt; to the vastness of Hadoop–Avro was born to ensure that data packets glided smoothly without unexpected stops. It became the guardian of data integrity and compatibility.&lt;/p&gt;

&lt;h2 id=&quot;whats-in-a-pojo&quot;&gt;What’s in a POJO?&lt;/h2&gt;
&lt;p&gt;So, integrity is the keyword here, and in the context of this blog we care about integrity breaches where schema changes in a service are not properly propagated to its consumers, rendering them unable to accommodate the new schema of the data they consume–like reading a book in a foreign language 😉.&lt;/p&gt;

&lt;h3 id=&quot;the-dawn-of-the-pojo-era&quot;&gt;The Dawn of the POJO Era&lt;/h3&gt;
&lt;p&gt;In the realm of programming, particularly within Java, a hero emerged named the Plain Old Java Object (POJO). This simple, unadorned object didn’t extend or implement any specific Java framework or class, allowing it to represent data without any preset behaviors or constraints. Imagine a Person POJO, detailing fields like name, age, and address without binding rules on how you should engage with these fields. Simple and elegant.&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Person&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// Default constructor&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Person&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// Constructor with parameters&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Person&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// Getters and setters for each field&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;setName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getAge&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;setAge&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;setAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Person{&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
               &lt;span class=&quot;s&quot;&gt;&quot;name=&apos;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;\&apos;&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
               &lt;span class=&quot;s&quot;&gt;&quot;, age=&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
               &lt;span class=&quot;s&quot;&gt;&quot;, address=&apos;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&apos;\&apos;&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
               &lt;span class=&quot;sc&quot;&gt;&apos;}&apos;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;However, as data complexity increased and systems multiplied, ensuring that these straightforward representations, our POJOs, maintained their integrity when transmitted or stored across varying systems became a challenge. Manual serialization, translating each POJO for different systems, wasn’t just laborious — it was a minefield of potential errors.&lt;/p&gt;

&lt;p&gt;Enter the need for an efficient and consistent serialization mechanism. One that could not only describe these POJOs but also seamlessly encode and decode them, ensuring data looked and felt the same everywhere.&lt;/p&gt;
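&lt;p&gt;A small sketch of that minefield (the field names here are purely illustrative): the consumer below was written against the old shape of the data, and a quietly re-typed field only fails at runtime, deep inside its logic:&lt;/p&gt;

```python
import json

# The producer originally publishes age as a number ...
old_payload = json.dumps({"name": "Ada", "age": 36, "address": "12 Crescent Rd"})

# ... then a later release quietly starts sending it as a string.
new_payload = json.dumps({"name": "Ada", "age": "36", "address": "12 Crescent Rd"})

def consume(payload):
    # Consumer logic written against the old shape: it assumes age is an int.
    person = json.loads(payload)
    return person["age"] + 1

print(consume(old_payload))   # 37
try:
    consume(new_payload)
except TypeError as err:
    print(f"consumer broke: {err}")
```

&lt;p&gt;Nothing in the payload itself announces the change; only a shared, enforced schema can.&lt;/p&gt;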

&lt;h2 id=&quot;apache-avro--the-magic-of-schemas&quot;&gt;&lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt; &amp;amp; the Magic of Schemas&lt;/h2&gt;
&lt;p&gt;Amidst this backdrop, &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt; took centre stage. While the POJO painted the picture, &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt; became the artist’s brush, allowing the artwork to be replicated without losing its original essence. Integral to &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;’s magic were its schemas. These files, with their distinctive &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.avsc&lt;/code&gt; extension, served as blueprints, dictating an entity’s structure, its data types, and any nullable fields or default values (see the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Person.avsc&lt;/code&gt; below as an example).&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;record&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Person&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;namespace&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;com.example&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;fields&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;age&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;int&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;address&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;By pairing the intuitive design of POJOs with the precision of &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt; schemas, developers gained a formidable toolkit. Now, data could be managed, shuttled, and transformed without ever losing its core essence or structure. But what if schema changes weren’t properly communicated amongst interacting systems?&lt;/p&gt;
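&lt;p&gt;To see why the schema is the key that unlocks this, here is a hand-rolled sketch of &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;’s binary encoding for the Person schema above. In practice you would use a library such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fastavro&lt;/code&gt;; this toy encoder only handles string and int fields, but it follows the real wire format (zigzag varints, length-prefixed strings):&lt;/p&gt;

```python
import json

# The writer and the reader share this schema, so the bytes on the wire
# can carry values only: no field names, no delimiters.
SCHEMA = json.loads("""
{
  "type": "record", "name": "Person", "namespace": "com.example",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "address", "type": "string"}
  ]
}
""")

def zigzag(n):
    # Avro maps 0, -1, 1, -2, ... onto 0, 1, 2, 3, ... so small
    # magnitudes (positive or negative) stay small on the wire.
    if n == abs(n):           # non-negative
        return n * 2
    return abs(n) * 2 - 1

def varint(value):
    # Variable-length base-128 encoding: 7 bits per byte,
    # with a continuation bit set on every byte except the last.
    out = bytearray()
    while True:
        value, low = divmod(value, 128)
        if value:
            out.append(low + 128)
        else:
            out.append(low)
            return bytes(out)

def encode_record(schema, record):
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            data = value.encode("utf-8")
            out += varint(zigzag(len(data))) + data   # length, then raw bytes
        elif field["type"] == "int":
            out += varint(zigzag(value))
    return bytes(out)

payload = encode_record(SCHEMA, {"name": "Ada", "age": 36, "address": "12 Crescent Rd"})
print(len(payload), payload.hex())
```

&lt;p&gt;Because both sides read the same schema, not a single field name travels on the wire–the reader reconstructs the record purely from field order and types.&lt;/p&gt;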

&lt;h2 id=&quot;challenges-in-schema-communication&quot;&gt;Challenges in Schema Communication&lt;/h2&gt;
&lt;p&gt;Imagine two services: Service A (the Producer), which creates and sends data, and Service B (the Consumer), which receives and processes it. Service A updates its schema — perhaps adding a new field or modifying an existing one. But if Service B is unaware of this change, it might end up expecting apples and receiving oranges.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;The Domino Effect&lt;/strong&gt;: Let’s say Service A, our producer, changes a field from being a number to a string. Service B, expecting a number, might crash or perform incorrect operations when it encounters a string. In a real-world scenario, this could mean misinterpretation of important metrics, corrupted databases, or application failures.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Versioning Nightmares&lt;/strong&gt;: If every schema change requires updating the application logic in both the producer and consumer, this can quickly spiral into a versioning nightmare. How does one ensure that Service B is always compatible with Service A’s data, especially when they are updated at different intervals?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Enter the Schema Registry&lt;/strong&gt;: A centralized Schema Registry can be the saviour in this scenario. Instead of letting every service decide how to send or interpret data, the Schema Registry sets the standard.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Registration &amp;amp; Validation&lt;/strong&gt;: When Service A wishes to update its schema, it first registers the new schema with the registry. The registry validates this schema, ensuring backward compatibility with its previous versions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Schema Sharing&lt;/strong&gt;: Service B, before processing any data, checks with the registry to get the most recent schema. This ensures it knows exactly how to interpret the data it receives.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Library Generation&lt;/strong&gt;: On successful registration, the producer can then trigger a script to create or update the corresponding POJO or Python data class. This automatically generated class can be used directly, ensuring that the code aligns with the latest schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;artifact-repository--versioning&quot;&gt;Artifact Repository &amp;amp; Versioning&lt;/h2&gt;
&lt;p&gt;The generated data classes need a home. An Artifact Repository acts as this home. Whenever there’s a change, the updated class is given a new version and stored in this repository. Service B can then reference the specific version of the class it needs, ensuring data compatibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producers, Consumers, and their Interaction&lt;/strong&gt;: Once the schema changes are validated and registered, and the respective classes are updated, both the producer and consumer know exactly how to interact. They can reliably share data, knowing that both sides understand the data’s structure and meaning.&lt;/p&gt;

&lt;p&gt;In essence, a centralised schema management system, paired with a robust registry and an efficient artifact repository, ensures that such data-incompatibility issues are caught before they ever reach a consumer.&lt;/p&gt;

&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2023/avro-schema-management/2023-11-07-Example-Architecture.png&quot; alt=&quot;Package&quot; /&gt;
&lt;/div&gt;

&lt;h2 id=&quot;generating-python-data-classes-from-avsc-files&quot;&gt;Generating &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; Data Classes from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;*.avsc&lt;/code&gt; files&lt;/h2&gt;
&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;, by its design and origin, has a strong affinity for the &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt; ecosystem. &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Apache Avro&lt;/span&gt;’s project comes with built-in tools and libraries tailored for &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;, which makes generating POJOs straightforward. But when working with &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;, things aren’t as easy.&lt;/p&gt;

&lt;p&gt;Historically, it is worth noting that data classes, a feature similar to &lt;span class=&quot;blog-highlight blog-highlight--java&quot;&gt;Java&lt;/span&gt;’s POJOs, only arrived with &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; 3.7. Even then, schema-based generation necessitated reliance on external libraries, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dataclasses_avroschema&lt;/code&gt;. While these libraries are effective, their unofficial status can raise concerns about long-term reliability. Moreover, using them well often depends on clear, well-documented examples, which are sometimes ambiguous or lacking altogether. Furthermore, &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;’s dynamic type system, though offering flexibility, poses challenges in maintaining consistent data representations when interfacing with &lt;span class=&quot;blog-highlight blog-highlight--avro&quot;&gt;Avro&lt;/span&gt;’s static schemas.&lt;/p&gt;
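&lt;p&gt;For reference, the generated artifact is roughly an ordinary data class. The real output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dataclasses_avroschema&lt;/code&gt; also subclasses the library’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AvroModel&lt;/code&gt; base and carries the schema with it; this stdlib-only sketch shows just the field layout you can expect:&lt;/p&gt;

```python
from dataclasses import asdict, dataclass

# Roughly the shape of the class generated from Person.avsc: one typed
# field per schema field, in schema order.
@dataclass
class Person:
    name: str
    age: int
    address: str

p = Person(name="Ada", age=36, address="12 Crescent Rd")
print(asdict(p))
```

&lt;p&gt;Because the class is regenerated from the schema, hand-editing it would silently fork it from the contract–hence the read-only convention discussed below.&lt;/p&gt;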

&lt;p&gt;In this blog post, I hope to provide a clear example of data-class autogeneration, using an easy-to-understand script. So, let’s dive into an example.&lt;/p&gt;

&lt;p&gt;Suppose, as shown earlier, that we have the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Person.avsc&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;record&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Person&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;namespace&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;com.example&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;fields&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;age&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;int&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;address&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;string&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Before providing the script, let’s discuss the sample project structure, which can help clarify why, later on, I state that the generated files must be read-only.&lt;/p&gt;

&lt;h3 id=&quot;sample-project-structure&quot;&gt;Sample Project Structure&lt;/h3&gt;
&lt;p&gt;Your project structure might look like this:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;project/
│
├── resources/
│   └── schemas/
│       └── Person.avsc
├── src/
│   └── types/
│       └── Person.py
├── scripts/
│   └── generate_dataclasses.py
└── Makefile
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;resources/schemas/&lt;/code&gt;: This directory contains the Avro schema files (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.avsc&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/types/&lt;/code&gt;: This directory will contain the generated Python data classes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.py&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scripts/generate_dataclasses.py&lt;/code&gt;: This script generates the Python data classes from the Avro schemas.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Makefile&lt;/code&gt;: This file contains the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make&lt;/code&gt; command to run the script.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, you can use the following Python script to generate a Python data class from this Avro schema:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;json&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;os&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;subprocess&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;dataclasses_avroschema.model_generator.generator&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ModelGenerator&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Starting script...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model_generator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ModelGenerator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Ensure the output directory exists
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;../src/types&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;makedirs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;exist_ok&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Scan the directory for .avsc files
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;files&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;walk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;../resources/schemas&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endswith&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;.avsc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Generating DataClass for: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;schema_file&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;.avsc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;.py&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Load the schema
&lt;/span&gt;                &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schema_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Generate the python code for the schema
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model_generator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# If a previous run left the output file read-only, make it writable again
&lt;/span&gt;                &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exists&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chmod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mo&quot;&gt;0o644&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Open the output file
&lt;/span&gt;                &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;w&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                    &lt;span class=&quot;c1&quot;&gt;# Write a comment at the top of the file
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;# This is an autogenerated python class&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;c1&quot;&gt;# Write the imports to the output file
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;from dataclasses_avroschema import AvroModel&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;import dataclasses&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                    &lt;span class=&quot;c1&quot;&gt;# Remove the imports from the result because we have already written them to the output file
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;from dataclasses_avroschema import AvroModel&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;import dataclasses&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                    &lt;span class=&quot;c1&quot;&gt;# Write the generated python code to the output file
&lt;/span&gt;                    &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Format the output file using isort and black
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;subprocess&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;isort&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;subprocess&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;black&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

                &lt;span class=&quot;c1&quot;&gt;# Make the file read-only
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chmod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mo&quot;&gt;0o444&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

                &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Generated &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; from &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schema_file&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;__main__&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This script will generate a Python file, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Person.py&lt;/code&gt;, in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;../src/types&lt;/code&gt; directory, with the following content:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# This is an autogenerated python class
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;dataclasses_avroschema&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AvroModel&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;dataclasses&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataclasses&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataclass&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Person&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AvroModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;address&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;com.example&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;h3 id=&quot;why-read-only&quot;&gt;Why Read-Only?&lt;/h3&gt;
&lt;p&gt;The generated Python files are made read-only to prevent accidental modifications. Since these files are autogenerated, any changes should be made in the Avro schema files, and then the Python files should be regenerated.&lt;/p&gt;
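&lt;p&gt;As a minimal, standard-library-only sketch of that workflow (the file name here is illustrative, not the generator’s actual output), the permission round-trip looks like this:&lt;/p&gt;

```python
import os
import stat
import tempfile

# Illustrative file name; any generated module behaves the same way.
path = os.path.join(tempfile.mkdtemp(), "person.py")

with open(path, "w") as f:
    f.write("# autogenerated\n")
os.chmod(path, 0o444)  # read-only: accidental edits now fail for normal users

# To regenerate: restore write permission, rewrite, then lock the file again.
os.chmod(path, 0o644)
with open(path, "w") as f:
    f.write("# regenerated\n")
os.chmod(path, 0o444)

print(oct(stat.S_IMODE(os.stat(path).st_mode)))  # 0o444
```

&lt;p&gt;The lock-unlock-relock cycle is what keeps the schema file, not the generated class, as the single source of truth.&lt;/p&gt;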

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Integrating Avro schemas with Python data classes takes much of the complexity out of data handling. It is a union that strengthens the data engineering toolkit: precise type-checking, useful code suggestions, rigorous validation, and clear, readable models. With the schema registry as a solid foundation, the integrity of your data holds up no matter how intricate the data operations become. The techniques discussed do the heavy lifting, but the real payoff is the consistent, reliable data flow they enable.&lt;/p&gt;

&lt;p&gt;Stay tuned, as more insights await in follow-up discussions, where we’ll further dissect the intricacies of a comprehensive schema management ecosystem.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;
</description>
            <pubDate>2023-11-07T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2023/11/07/Avro-Schema-Management.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2023/11/07/Avro-Schema-Management.html</guid>
        </item>
        
        
        
        <item>
            <title>Agile In Action: Bridging Data Science and Engineering</title>
            <description>&lt;div class=&quot;image center&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2023/agile-in-action/2023-10-31-Turner.png&quot; alt=&quot;Joseph Mallord William Turner | Dutch Boats in a Gale (&apos;The Bridgewater Sea Piece&apos;) | National Gallery, London&quot; /&gt;
  &lt;p class=&quot;image-credit&quot;&gt;Picture taken from &lt;a href=&quot;https://www.nationalgallery.org.uk/paintings/joseph-mallord-william-turner-dutch-boats-in-a-gale-the-bridgewater-sea-piece&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;National Gallery, London&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;A few weeks ago, Bill Raymond invited me onto his &lt;a href=&quot;https://agileinaction.com/agile-in-action-podcast/2023/10/31/bridging-ai-data-science-and-engineering-a-personal-journey.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Agile in Action podcast&lt;/a&gt; after reading an older post of mine on &lt;a href=&quot;/2020/08/11/agile-data-science.html&quot;&gt;doing data science the Agile way&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I said yes because this topic has followed me through most of my career.&lt;/p&gt;

&lt;p&gt;I started as a data scientist. Then I spent years watching perfectly respectable prototypes fail to become products. By the time I reached Vortexa, I was leading a team of data scientists and engineers and living right in the middle of the tension I had been talking about for years.&lt;/p&gt;

&lt;p&gt;That is the version of &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; I wanted to discuss in the episode. Not the clean whiteboard version. The one that appears when a model has to leave a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; notebook, survive production, and still make sense to the people who have to operate it.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The real gap in &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; teams is rarely enthusiasm. It is the distance between a model that works once and a system that can be trusted repeatedly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;why-this-topic-stayed-with-me&quot;&gt;Why This Topic Stayed With Me&lt;/h2&gt;

&lt;p&gt;Part of the reason this topic matters so much to me is that I learned it the frustrating way.&lt;/p&gt;

&lt;p&gt;At Data Reply, I worked on one prototype after another. We would explore a problem, build something promising, show strong results, and then hit the same wall: the client liked the idea, but the system never really made it into production. Sometimes the missing piece was infrastructure. Sometimes it was culture. Sometimes it was simply that nobody owned the hard part after the demo.&lt;/p&gt;

&lt;p&gt;That started to change for me at UBS.&lt;/p&gt;

&lt;p&gt;For the first time, I heard the sentence I had wanted to hear for years: &lt;em&gt;“Great. Now how do we put this into production?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I was paired with an experienced engineer, and that changed the direction of my career. I stopped seeing engineering as the final packaging step after the interesting work was done. I started seeing it as part of the thinking itself.&lt;/p&gt;

&lt;p&gt;That shift is still with me today.&lt;/p&gt;

&lt;h2 id=&quot;the-real-gap-between-data-science-and-engineering&quot;&gt;The Real Gap Between Data Science And Engineering&lt;/h2&gt;

&lt;p&gt;When people talk about cross-functional &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; teams, they often make the collaboration sound natural. In practice, it is not.&lt;/p&gt;

&lt;p&gt;Data scientists are usually optimising for learning:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;trying ideas quickly&lt;/li&gt;
  &lt;li&gt;testing hypotheses&lt;/li&gt;
  &lt;li&gt;moving fast through a messy search space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Engineers are usually optimising for control:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;reproducibility&lt;/li&gt;
  &lt;li&gt;determinism&lt;/li&gt;
  &lt;li&gt;maintainability&lt;/li&gt;
  &lt;li&gt;safe change over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both instincts are valid.&lt;/p&gt;

&lt;p&gt;The problem is that they are protecting the system from different failure modes.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The issue is not that data scientists are messy and engineers are rigid. The issue is that both are right about different kinds of breakage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Take a simple pricing model. A data scientist can build a strong prototype in a notebook, engineer the features, train the model, and prove the concept. But once that model becomes part of a product, somebody has to make sure the production path transforms the raw input in exactly the same way. If the training pipeline and the prediction pipeline drift apart, the system lies even when the model itself is good.&lt;/p&gt;
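&lt;p&gt;One way to guard against that drift is to make the transformation a single function that both pipelines import. The sketch below uses invented feature names; it is the pattern, not any actual pricing code:&lt;/p&gt;

```python
# A single source of truth for feature engineering: the training pipeline
# and the serving path both call this function, so the raw input is
# transformed identically in both places. (Feature names are illustrative.)
def build_features(raw: dict) -> list:
    return [
        float(raw["area_sqm"]),
        float(raw["rooms"]),
        1.0 if raw["city"] == "London" else 0.0,
    ]

# Training:  features = [build_features(r) for r in historical_records]
# Serving:   prediction = model.predict([build_features(incoming_request)])

example = {"area_sqm": "85", "rooms": 3, "city": "London"}
print(build_features(example))
```

&lt;p&gt;The moment the transformation is copy-pasted instead of shared, the two systems can only stay in agreement by accident.&lt;/p&gt;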

&lt;p&gt;That is why the gap matters so much.&lt;/p&gt;

&lt;p&gt;It is not about user interfaces or wrapping code nicely. It is about making sure the system that predicts tomorrow behaves like the system that was validated yesterday.&lt;/p&gt;

&lt;h2 id=&quot;what-agile-actually-helped-with&quot;&gt;What &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; Actually Helped With&lt;/h2&gt;

&lt;p&gt;When I say &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; helped here, I do not mean that Scrum ceremonies somehow solved the problem.&lt;/p&gt;

&lt;p&gt;What helped was having a way to make uncertainty legible.&lt;/p&gt;

&lt;p&gt;For me, that meant three things.&lt;/p&gt;

&lt;h3 id=&quot;1-making-experiments-explicit&quot;&gt;1. Making experiments explicit&lt;/h3&gt;

&lt;p&gt;In &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; work, &lt;em&gt;“we are exploring”&lt;/em&gt; is too vague.&lt;/p&gt;

&lt;p&gt;An experiment becomes useful when the team can answer:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;what assumption are we testing?&lt;/li&gt;
  &lt;li&gt;what would count as useful evidence?&lt;/li&gt;
  &lt;li&gt;what result would tell us to stop?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds simple, but it changes the conversation completely. It stops research from turning into open-ended wandering and gives product and engineering a clearer way to understand what the team is actually learning.&lt;/p&gt;
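&lt;p&gt;If it helps to make this concrete, the three questions can even travel with the experiment as a lightweight record; a sketch with illustrative field names, not a prescribed format:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    """Forces the three questions to be answered before the work starts."""
    assumption: str       # what assumption are we testing?
    evidence: str         # what would count as useful evidence?
    stop_condition: str   # what result would tell us to stop?

exp = Experiment(
    assumption="port-congestion features improve ETA accuracy",
    evidence="MAE improves by at least 5 percent on the holdout set",
    stop_condition="no measurable gain after two feature iterations",
)
print(exp.stop_condition)
```

&lt;p&gt;The value is not the object itself; it is that an experiment without a stop condition becomes visibly incomplete.&lt;/p&gt;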

&lt;h3 id=&quot;2-creating-shared-visibility&quot;&gt;2. Creating shared visibility&lt;/h3&gt;

&lt;p&gt;At Vortexa, one of the most useful habits we built was a regular data science catch-up where engineers and data scientists could present what they were doing, why they were doing it, and where the risks were.&lt;/p&gt;

&lt;p&gt;This was not code review. It was not a status ritual either.&lt;/p&gt;

&lt;p&gt;It was a way to keep everyone on the same mental map.&lt;/p&gt;

&lt;p&gt;That mattered because a lot of problems in &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; systems do not come from one catastrophic mistake. They come from small drifts in understanding. A feature is computed one way in training, another way in production. An assumption about data quality goes unchallenged. A result sounds promising, but nobody else can reproduce it.&lt;/p&gt;

&lt;p&gt;Communication is not a soft add-on here.&lt;/p&gt;

&lt;p&gt;It is part of the control surface of the system.&lt;/p&gt;

&lt;h3 id=&quot;3-putting-discipline-around-handoffs&quot;&gt;3. Putting discipline around handoffs&lt;/h3&gt;

&lt;p&gt;The teams I trust most are not the ones with the nicest process diagrams. They are the ones that make handoffs visible and expensive enough that people try to remove them.&lt;/p&gt;

&lt;p&gt;If the data scientist can disappear after training a model and the engineer is left to guess the rest, the system will eventually reflect that fracture.&lt;/p&gt;

&lt;p&gt;If the engineer is never exposed to how experimental the work really is, they will overestimate how stable the solution already is.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; helped when it forced us to confront those boundaries earlier.&lt;/p&gt;

&lt;h2 id=&quot;what-ml-teams-still-underestimate&quot;&gt;What &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; Teams Still Underestimate&lt;/h2&gt;

&lt;p&gt;One of the themes that came up in the podcast is that many teams still underestimate how much work starts after the model looks good.&lt;/p&gt;

&lt;p&gt;You do not just need versioned code. You need versioned data and a credible way to tie the two together.&lt;/p&gt;

&lt;p&gt;You do not just need a model in production. You need monitoring, drift detection, and a practical way to replace the model without breaking the product.&lt;/p&gt;

&lt;p&gt;You do not just need experimentation. You need a path from experimentation to something deterministic enough to support.&lt;/p&gt;
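&lt;p&gt;Drift detection does not have to start sophisticated. Even a baseline comparison of summary statistics catches gross shifts; the threshold below is illustrative:&lt;/p&gt;

```python
# Compare a live batch of one feature against its training-time baseline.
# A z-score-style check on the mean; the threshold of 3.0 is illustrative.
def mean_drifted(baseline_mean, baseline_std, live_values, threshold=3.0):
    live_mean = sum(live_values) / len(live_values)
    z = abs(live_mean - baseline_mean) / baseline_std
    return z > threshold

# Training said this feature averaged 10.0 with a standard deviation of 2.0.
print(mean_drifted(10.0, 2.0, [9.8, 10.4, 10.1]))   # within the baseline
print(mean_drifted(10.0, 2.0, [24.0, 25.5, 23.9]))  # clearly shifted
```

&lt;p&gt;Real monitoring adds distributional tests and alerting on top, but a check like this is already enough to notice that the world the model sees has stopped looking like the world it was trained on.&lt;/p&gt;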

&lt;p&gt;This is why I often say that notebooks are wonderful research tools and terrible places to leave an idea if you want a system around it to survive.&lt;/p&gt;

&lt;h2 id=&quot;the-lesson-i-was-trying-to-communicate&quot;&gt;The Lesson I Was Trying To Communicate&lt;/h2&gt;

&lt;p&gt;When Bill asked what &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; meant to me in this context, the answer I wanted to give was not especially fashionable.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt;, Agile is useful when it helps the team learn quickly without losing control of the system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is really the heart of it.&lt;/p&gt;

&lt;p&gt;Not velocity in the abstract.&lt;/p&gt;

&lt;p&gt;Not ceremony for its own sake.&lt;/p&gt;

&lt;p&gt;Not pretending that uncertainty can be planned away.&lt;/p&gt;

&lt;p&gt;Just a disciplined way to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;test assumptions early&lt;/li&gt;
  &lt;li&gt;expose the right risks&lt;/li&gt;
  &lt;li&gt;keep engineers and data scientists in sync&lt;/li&gt;
  &lt;li&gt;and make sure the thing you learned can actually survive contact with production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was my view then, and I still think it was the right thing to say.&lt;/p&gt;

&lt;h2 id=&quot;the-podcast&quot;&gt;The Podcast&lt;/h2&gt;

&lt;p&gt;If you prefer the conversation version, the episode is below.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/LdDasrMOJLs?si=dk-YcjCqW6YpBPWZ&quot; title=&quot;Agile in Action podcast episode&quot; frameborder=&quot;0&quot; loading=&quot;lazy&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
</description>
            <pubDate>2023-10-31T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2023/10/31/Agile-In-Action.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2023/10/31/Agile-In-Action.html</guid>
        </item>
        
        
        
        <item>
            <title>Dynamic(i/o) Why you should start your ML-Ops journey with wrapping your I/O</title>
            <description>&lt;p&gt;If you call yourself an &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineer&lt;/span&gt; then you ‘ve been there–you ‘ve seen this before. To productionise your &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; pipeline; well, that’s surely a challenge.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2022/dynamicio-at-odsc/2022-06-01-dynamicio.png&quot; alt=&quot;dynamic(i/o)&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I have worked for many years as a Data Science consultant, and I can confirm the statement that &lt;a href=&quot;https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/&quot;&gt;“…more than 87% of Data Science projects never make it to production”&lt;/a&gt;.
There is a reason why the first rule of doing &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;Machine Learning&lt;/span&gt; is to be really sure you need to do &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt;! Many reasons play into this challenge:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;lack of the right leadership;&lt;/li&gt;
  &lt;li&gt;no or limited access to data in siloed organisations;&lt;/li&gt;
  &lt;li&gt;lack of the necessary tooling or infrastructure support, and even;&lt;/li&gt;
  &lt;li&gt;lack of a research-driven culture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there is one more beast to be tamed out there: the gap between Data Science and &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineering&lt;/span&gt;. It is a gap you can see both between the two kinds of practitioner, data scientists and software engineers, and in the literal work of getting from a prototype to a production-ready ML pipeline.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2022/dynamicio-at-odsc/xkcd-data-answers.png&quot; alt=&quot;xkcd - data answers&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Simply put, putting a model into production is one thing; but maintaining that model, properly monitoring it to identify possible drifts, and streamlining the process of re-training or updating it in a robust and
reproducible way, supported by a clean CI/CD process, is a daunting task! If anything, I’d dare say that ML-Engineering, as a domain, fully encapsulates SWE in addition to many more
challenges (I highly recommend reading &lt;a href=&quot;https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf&quot;&gt;Hidden Technical Debt in Machine Learning Systems&lt;/a&gt;), for some of which we
are still trying to standardise best tooling and practices.&lt;/p&gt;

&lt;p&gt;In many cases, organisations are forced to come up with their own ways of working to accommodate the unique challenges of their custom use-cases. Then again, it all comes down to the requirements of a project.
&lt;a href=&quot;https://netflixtechblog.com/scheduling-notebooks-348e6c14cfd6&quot;&gt;Netflix has streamlined the process of putting Python notebooks into production using Papermill&lt;/a&gt;.
Others go as far as to standardise the whole &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineering&lt;/span&gt; process using tools like &lt;span class=&quot;blog-highlight blog-highlight--graph&quot;&gt;Airflow&lt;/span&gt; or &lt;span class=&quot;blog-highlight blog-highlight--graph&quot;&gt;Kubeflow&lt;/span&gt;, relying on AI pipelines (on GCP) or &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;SageMaker&lt;/span&gt; (on AWS), etc.&lt;/p&gt;

&lt;h2 id=&quot;so-what-do-we-do&quot;&gt;So what do we do…?&lt;/h2&gt;
&lt;p&gt;At Vortexa, we are heavy users of Airflow and have recently embarked on a journey to include Kubeflow in our tech stack.
As an ML-Engineer, my job usually involves receiving a successful prototype of a model and implementing a complete end-to-end ML pipeline out of it; one that can be easily maintained
and reused. In many ways, this process is very similar to a traditional SWE project, only more complex, since ML projects come with more requirements and a strong dependency on data.
Hence, everything one cares to implement for a SWE project needs to also be implemented for an ML-Engineering (MLE) project; and more.&lt;/p&gt;

&lt;p&gt;But let’s start simple…&lt;/p&gt;

&lt;h2 id=&quot;here-is-my-notebook-i-am-done-your-turn-now&quot;&gt;Here is my notebook! I am done; your turn now!&lt;/h2&gt;
&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2022/dynamicio-at-odsc/xkcd-data-pipelines.png&quot; alt=&quot;xkcd - data pipelines&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;So you are handed a notebook, and you inspect it; you spend time with the Data Scientist, understand all the crucial aspects of the procedural logic, and start splitting the
process into various tasks. You usually end up with something like this:
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2022/dynamicio-at-odsc/data-pipeline.png&quot; alt=&quot;data pipeline&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;You think about the structure of your codebase, about how everything will be deployed, about how you want to decouple orchestration from the logic of your ML pipeline, and then you start thinking
about domain-driven design (DDD). You start thinking about abstractions and encapsulation, about testing and data validation. That’s when it hits you: testing. You can unit test
most things and build a robust pipeline, but you also want fast feedback when you introduce changes and improvements to your pipeline (shifting to the left)! What if you wanted to run a
local regression test? With all data being read from external resources (databases, an object storage service) you’ll have to mock all of these calls (doable, but it takes time) and replace the actual
data with sample input. And, finally, what about schema and data validations? How do you guarantee, after data ingestion, that all your expectations on the input are respected?&lt;/p&gt;

&lt;p&gt;You have a look at the code again. It is filled with all kinds of I/O operations. Sometimes it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;csv&lt;/code&gt;, other times &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parquet&lt;/code&gt;, and other times &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;json&lt;/code&gt;; sometimes you read from a database and other times
from an object storage service (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s3&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcp&lt;/code&gt;). Different libraries are used to facilitate all of this: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcsfs&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s3fs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fsspec&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;boto3&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sqlalchemy&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tables&lt;/code&gt;; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pandas&lt;/code&gt;, of course, sits at the core
of this process. As if that were not enough, each file comes with its own peculiar set of requirements, supported through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kwargs&lt;/code&gt; in your Python code: the orientation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;json&lt;/code&gt;
files, row-group sizes for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parquet&lt;/code&gt; files, coercions on certain timestamp columns; the list keeps going… And this won’t be the last time you have to do this, either!&lt;/p&gt;

&lt;p&gt;It’s just too many details, way too many details, for you to worry about. It is a clear violation of the dependency inversion principle:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Business logic (high level code) should not be implemented in a way that “depends” on technical details (low level code, e.g., I/O in our case); instead both should be managed through abstractions!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You need abstractions to give you the flexibility to introduce changes easily. More often than not, business needs will require high-level modules to be modified. Low-level code, on the 
other hand, is usually more cumbersome and difficult to change. The two should be independent; a database migration or a switch to an object storage service should have no impact on your
work to generate a new, valuable feature for your model, and vice versa. Abstracting both behind distinct layers achieves this!&lt;/p&gt;
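&lt;p&gt;As a minimal sketch of this idea (the names here are hypothetical, not a prescribed design), high-level feature code can depend on a small I/O abstraction, while the concrete format and location live behind it:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

import pandas as pd


class DataSource(ABC):
    """The abstraction both layers depend on (dependency inversion)."""

    @abstractmethod
    def read(self) -> pd.DataFrame: ...

    @abstractmethod
    def write(self, df: pd.DataFrame) -> None: ...


class CsvSource(DataSource):
    """Low-level detail: a CSV file on a local path.

    Swapping this for a parquet-, S3- or database-backed source
    leaves the business logic below completely untouched.
    """

    def __init__(self, path: str):
        self.path = path

    def read(self) -> pd.DataFrame:
        return pd.read_csv(self.path)

    def write(self, df: pd.DataFrame) -> None:
        df.to_csv(self.path, index=False)


def add_total_feature(source: DataSource, sink: DataSource) -> None:
    """High-level business logic: it has no idea where the data lives."""
    df = source.read()
    df["total"] = df["price"] * df["quantity"]
    sink.write(df)
```

&lt;p&gt;Replacing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CsvSource&lt;/code&gt; with another implementation then requires no change at all to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_total_feature&lt;/code&gt;.&lt;/p&gt;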

&lt;p&gt;As David Wheeler said:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;All problems in computer science can be solved by another level of indirection.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;what-is-dynamicio-then&quot;&gt;What is &lt;span class=&quot;blog-highlight blog-highlight--dynamicio&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dynamicio&lt;/code&gt;&lt;/span&gt; then?&lt;/h2&gt;
&lt;p&gt;Wouldn’t it be great if you could:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;have an abstraction that encapsulates all I/O logic;&lt;/li&gt;
  &lt;li&gt;be able to seamlessly handle reading or writing from and to different resource types or data types;&lt;/li&gt;
  &lt;li&gt;have an interface that is easy to understand and use with minimum configuration;&lt;/li&gt;
  &lt;li&gt;respect your expectations on schema types and data quality;&lt;/li&gt;
  &lt;li&gt;automatically generate metrics that can be used to derive further insights; and, more importantly,&lt;/li&gt;
  &lt;li&gt;be able to seamlessly switch between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dev&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;staging&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prod&lt;/code&gt; environments, performing dynamic I/O against different datasets and effectively supporting development, testing and QA use cases?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, &lt;span class=&quot;blog-highlight blog-highlight--dynamicio&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dynamic(i/o)&lt;/code&gt;&lt;/span&gt; is exactly that: a layer of indirection for pandas I/O operations.&lt;/p&gt;
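&lt;p&gt;To make the environment-switching idea concrete (this is a hypothetical illustration of the pattern, not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dynamicio&lt;/code&gt;’s actual API), a logical dataset name can be resolved to a physical location through per-environment configuration, so the calling code never changes:&lt;/p&gt;

```python
import os

# Hypothetical per-environment catalogue: one logical dataset,
# a different physical location in each environment.
CATALOG = {
    "local": {"transactions": "file:///tmp/transactions.parquet"},
    "staging": {"transactions": "s3://staging-bucket/transactions.parquet"},
    "prod": {"transactions": "s3://prod-bucket/transactions.parquet"},
}


def resolve(dataset: str) -> str:
    """Map a logical dataset name to its physical path for the active env."""
    env = os.environ.get("APP_ENV", "local")
    return CATALOG[env][dataset]
```

&lt;p&gt;Flipping an environment variable is then all it takes to point the same pipeline at a different dataset.&lt;/p&gt;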

&lt;p&gt;If you want to find out more about it then &lt;a href=&quot;https://odsc.com/europe/&quot;&gt;register to attend this year’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ODSC&lt;/code&gt;&lt;/a&gt; and &lt;a href=&quot;https://odsc.com/speakers/dynamicio-a-pandas-i-o-wrapper-why-you-should-start-your-ml-ops-journey-with-wrapping-your-i-o/&quot;&gt;attend the presentation by myself and my colleague Tyler Ferguson on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dynamic(i/o)&lt;/code&gt;&lt;/a&gt;. 
Come and learn how its implementation and adoption has helped us not only achieve consistency across our ML repos, effectively deal with glue code and keep our code-bases DRY, but also act as an interface between different teams.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;
</description>
            <pubDate>2022-05-31T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2022/05/31/dynamicio-at-ODSC.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2022/05/31/dynamicio-at-ODSC.html</guid>
        </item>
        
        
        
        <item>
            <title>Complete Guide to Python Envs (MacOS)</title>
            <description>&lt;p&gt;Configuring &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; on your machine for the first time is a definite headache for any software 
engineer that decides to delve into the world of &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;. Doing it properly confuses a lot of 
people and can prove to be very challenging.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2021/python-envs/python_environment_2x.png&quot; alt=&quot;Python Envs&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It is often the case that many developers have numerous interpreters configured on their machines,
without knowing where they live.&lt;/p&gt;

&lt;h2 id=&quot;most-common-ways-of-setting-up-python&quot;&gt;Most common ways of setting up &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;Firstly, there is a &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; version that ships with macOS, but it is usually v2.7, which is not
just out of date but also deprecated.&lt;/p&gt;

&lt;p&gt;So, commonly, most users will download the latest Python release and add it to their &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PATH&lt;/code&gt;
or use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;brew install python3&lt;/code&gt; (which does this for them).&lt;/p&gt;

&lt;p&gt;Both of these solutions can cause many problems that will not be evident straight away. The main 
challenge is usually not knowing, at any given time, which “default Python” your system 
is using. Ideally, this is something you shouldn’t have to care about, but if you don’t set things up 
properly, you end up installing packages for the wrong environment or the wrong active &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; interpreter,
unintentionally created from the wrong &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; distribution and… well, you get the point 
(…this is pretty much summed up in the &lt;a href=&quot;https://xkcd.com/1987/&quot;&gt;xkcd image&lt;/a&gt; above).&lt;/p&gt;

&lt;p&gt;To find out more details, read this excellent &lt;a href=&quot;https://opensource.com/article/19/5/python-3-default-mac&quot;&gt;December 2020 post&lt;/a&gt;
by Matthew Broberg.&lt;/p&gt;

&lt;h2 id=&quot;how-to-avoid-all-these&quot;&gt;How to avoid all these?&lt;/h2&gt;
&lt;p&gt;The short answer is “use &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt;&lt;/span&gt;”. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt; will enable you not only to set up Python properly on your machine, but
also to manage different versions and Python environments in a simple and straightforward way. As
explained on the &lt;a href=&quot;https://github.com/pyenv/pyenv&quot;&gt;package’s GitHub page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“It’s simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For me, its main benefits are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;It does not depend on Python itself; since it is made from pure shell scripts, there is no Python bootstrap problem.&lt;/li&gt;
  &lt;li&gt;It does not need to be loaded into your shell; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt;’s shim approach works by simply adding a directory to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PATH&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;It manages virtual environments, though I recommend using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv-virtualenv&lt;/code&gt; to automate the process.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;lets-get-to-it&quot;&gt;Let’s get to it&lt;/h2&gt;
&lt;p&gt;Before you do anything, make sure you start with a clean sheet. To do so, uninstall or remove any Python distributions
you already have. I strongly advise you to follow this &lt;a href=&quot;https://www.macupdate.com/app/mac/5880/python/uninstall&quot;&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, assuming you have &lt;a href=&quot;https://brew.sh&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;brew&lt;/code&gt;&lt;/a&gt; installed on your machine, do:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;brew update
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;pyenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We will now need &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv-virtualenv&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv-virtualenv&lt;/code&gt; is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt; plugin that provides features
to manage &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virtualenvs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda&lt;/code&gt; environments for &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNIX-like&lt;/code&gt; systems.&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;pyenv-virtualenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;setting-up-your-global-interpreter&quot;&gt;Setting up your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;global&lt;/code&gt; interpreter&lt;/h2&gt;
&lt;p&gt;So, the first thing you want to do is set up your global interpreter. This is the Python version
your system will use by default, unless you dictate otherwise.&lt;/p&gt;

&lt;p&gt;If you run:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--list&lt;/span&gt;
Available versions:
  2.1.3
...
  3.10-dev
  activepython-2.7.14
...
  activepython-3.6.0
  anaconda-1.4.0
...
  anaconda3-2020.07
  graalpython-20.1.0
  graalpython-20.2.0
  ironpython-dev
...
  ironpython-2.7.7
  jython-dev
...
  jython-2.7.2
  micropython-dev
...
  miniconda-latest
...
  miniconda3-4.7.12
  pypy-c-jit-latest
...
  pypy3.6-7.3.1
  pyston-0.5.1
...
  pyston-0.6.1
  stackless-dev
...
  stackless-3.7.5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;You will see the full list of &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; distributions available for installation.&lt;/p&gt;

&lt;p&gt;Choose the one you want and install it, e.g. 3.9.0:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;3.9.0
python-build: use openssl@1.1 from homebrew
python-build: use readline from homebrew
Downloading Python-3.9.0.tar.xz...
-&amp;gt; https://www.python.org/ftp/python/3.9.0/Python-3.9.0.tar.xz
Installing Python-3.9.0...
python-build: use readline from homebrew
python-build: use zlib from xcode sdk
Installed Python-3.9.0 to /Users/&amp;lt;username&amp;gt;/.pyenv/versions/3.9.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Once installation is complete, you can set this version as your global:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv global 3.9.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;At this point, you can confirm which version is active, and where it was set from, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv version&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv version
3.9.0 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;by /Users/&amp;lt;username&amp;gt;/.pyenv/version&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;creating-and-managing-virtual-environments-automatically&quot;&gt;Creating and managing virtual environments automatically&lt;/h2&gt;
&lt;p&gt;This is a standard practice when working with &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt;. The idea is to keep different environments isolated.
Each &lt;span class=&quot;blog-highlight blog-highlight--python&quot;&gt;Python&lt;/span&gt; environment can be associated with multiple projects, but it is generally better to go for a 
one-to-one mapping.&lt;/p&gt;

&lt;p&gt;Why, you ask? Well, for starters, this helps you keep your system clean by not installing system-wide libraries
that you are only going to need in a single small project. It also allows you to use a certain version of
a library for one project and a different version for another. Finally, it helps make your project 
reproducible and ensures it is configured in an identical manner across the local environments of
collaborating developers.&lt;/p&gt;

&lt;p&gt;Let’s go through an example.&lt;/p&gt;

&lt;p&gt;Suppose you have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GitHub&lt;/code&gt; root directory where you clone and maintain all your projects, and it looks like this:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
└── project_b

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What you want to do is set up a different Python virtual environment per project. What’s more,
you would like that virtual environment to be activated automatically simply by changing (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cd&lt;/code&gt;-ing)
into the project. Let’s see how we can do that.&lt;/p&gt;

&lt;p&gt;First, I’ll assume you are using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zsh&lt;/code&gt; as your default shell and have configured &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oh-my-zsh&lt;/code&gt;. 
If not, then &lt;a href=&quot;https://ohmyz.sh&quot;&gt;just set it up&lt;/a&gt;. Note that this is not a prerequisite; it’s more of a
personal preference, but using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;oh-my-zsh&lt;/code&gt; does come with many benefits, like showing the currently active 
Python environment in your prompt, which is why I am recommending it.&lt;/p&gt;

&lt;p&gt;In order to enable the above automation, we will need two prerequisites. The first is to include 
two files in each project (you can version-control these files): &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.python-version&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.python-virtualenv&lt;/code&gt;, as per the tree below:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
│   ├── .python-version
│   └── .python-virtualenv
└── project_b
    ├── .python-version
    └── .python-virtualenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In each of these files you just add a single line at the very top with:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;the python version you want to use&lt;/li&gt;
  &lt;li&gt;the name of the virtual environment you want to create.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the contents of&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
│   ├── .python-version 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;can be:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;3.9.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
│   └── .python-virtualenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;can be:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;project-a-venv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Similarly, for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project_b&lt;/code&gt; you can have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3.8.2&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project-b-venv&lt;/code&gt;.&lt;/p&gt;
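&lt;p&gt;Equivalently, you can create both files from the shell (the paths assume the example layout above):&lt;/p&gt;

```shell
# Create the project directory if needed and move into it
mkdir -p ~/GitHub/project_a
cd ~/GitHub/project_a

# One line per file: the interpreter version and the venv name
echo "3.9.0" > .python-version
echo "project-a-venv" > .python-virtualenv
```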

&lt;p&gt;Now, on to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.zshrc&lt;/code&gt;. Do:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;vi ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and add the following script:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Define your $PATH&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PYENV_ROOT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/.pyenv&quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PYENV_ROOT&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin:&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PATH&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Automatic venv activation&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;eval&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;pyenv init -&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;eval&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;pyenv virtualenv-init -&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PYENV_VIRTUALENV_DISABLE_PROMPT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1

&lt;span class=&quot;c&quot;&gt;# Undo any existing alias for `cd`&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;unalias cd &lt;/span&gt;2&amp;gt;/dev/null

&lt;span class=&quot;c&quot;&gt;# Method that verifies all requirements and activates the virtualenv&lt;/span&gt;
hasAndSetVirtualenv&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# .python-version is mandatory for .python-virtualenv but not vice versa&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; .python-virtualenv &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
    if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; .python-version &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
      &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;To use .python-virtualenv you need a .python-version&quot;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return &lt;/span&gt;1
    &lt;span class=&quot;k&quot;&gt;fi
  fi&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# Check if pyenv has the Python version needed.&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# If not (or pyenv not available) exit with code 1 and the respective instructions.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; .python-version &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
    if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-z&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;which pyenv&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
      &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Install pyenv see https://github.com/yyuu/pyenv&quot;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return &lt;/span&gt;1
    &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;pyenv versions 2&amp;gt;&amp;amp;1 | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;not installed&apos;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
      &lt;span class=&quot;c&quot;&gt;# Message &quot;not installed&quot; is automatically generated by `pyenv versions`&lt;/span&gt;
      &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;run &quot;pyenv install&quot;&apos;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return &lt;/span&gt;1
    &lt;span class=&quot;k&quot;&gt;fi
  fi&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# Create and activate the virtualenv if all conditions above are successful&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# Also, if virtualenv is already created, then just activate it.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; .python-virtualenv &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then
    &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;VIRTUALENV_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; .python-virtualenv&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;PYTHON_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; .python-version&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;MY_ENV&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PYENV_ROOT&lt;/span&gt;/versions/&lt;span class=&quot;nv&quot;&gt;$PYTHON_VERSION&lt;/span&gt;/envs/&lt;span class=&quot;nv&quot;&gt;$VIRTUALENV_NAME&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;([&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$MY_ENV&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; virtualenv &lt;span class=&quot;nv&quot;&gt;$MY_ENV&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;which python&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$MY_ENV&lt;/span&gt;/bin/activate
  &lt;span class=&quot;k&quot;&gt;fi&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

pythonVirtualenvCd &lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# move to a folder + run the pyenv + virtualenv script&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$@&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; hasAndSetVirtualenv
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Every time you move to a folder, run the pyenv + virtualenv script&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;alias cd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;pythonVirtualenvCd&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Save your changes, return to your terminal and either restart your terminal or do:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, let’s assume that you are in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GitHub&lt;/code&gt; directory:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;pwd&lt;/span&gt;
/Users/&amp;lt;username&amp;gt;/Github
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then, if you do:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;~/GitHub &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;project_a
created virtual environment CPython3.9.0.final.0-64 &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;448ms
  creator CPython3Posix&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/Users/&amp;lt;username&amp;gt;/.pyenv/versions/3.9.0/envs/project-a-venv, &lt;span class=&quot;nv&quot;&gt;clear&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;no_vcs_ignore&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;global&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  seeder FromAppData&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;pip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;wheel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;via&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;copy, &lt;span class=&quot;nv&quot;&gt;app_data_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/Users/&amp;lt;username&amp;gt;/Library/Application Support/virtualenv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    added seed packages: &lt;span class=&quot;nv&quot;&gt;pip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;20.3.1, &lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;51.3.3, &lt;span class=&quot;nv&quot;&gt;wheel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;project-a-venv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--------------------------------------------------------------------------------&lt;/span&gt;
~/GitHub/project_a &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And if you come out of it and change to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project_b&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ../project_b
created virtual environment CPython3.8.2.final.0-64 &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;932ms
  creator CPython3Posix&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;dest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/Users/&amp;lt;username&amp;gt;/.pyenv/versions/3.8.2/envs/project-b-venv, &lt;span class=&quot;nv&quot;&gt;clear&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;no_vcs_ignore&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;global&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  seeder FromAppData&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;False, &lt;span class=&quot;nv&quot;&gt;pip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;wheel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;bundle, &lt;span class=&quot;nv&quot;&gt;via&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;copy, &lt;span class=&quot;nv&quot;&gt;app_data_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/Users/&amp;lt;username&amp;gt;/Library/Application Support/virtualenv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    added seed packages: &lt;span class=&quot;nv&quot;&gt;pip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;20.3.1, &lt;span class=&quot;nv&quot;&gt;setuptools&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;51.3.3, &lt;span class=&quot;nv&quot;&gt;wheel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;project-b-venv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--------------------------------------------------------------------------------&lt;/span&gt;
~/GitHub/project_b &lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, two new virtual environments have been created:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pyenv versions
system
&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; 3.8.2 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;by /Users/&amp;lt;username&amp;gt;/GitHub/project_b/.python-version&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  3.8.2/envs/project-b-venv
  3.9.0
  3.9.0/envs/project-a-venv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and every time you &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cd&lt;/code&gt; into one of these directories, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv&lt;/code&gt; will switch to the corresponding environment automatically.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Note 1:&lt;/code&gt; You may face some issues with Python 3.8.7.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Note 2:&lt;/code&gt; To uninstall a Python environment, run: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyenv uninstall 3.8.2/envs/project-b-venv&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;using-jupyter-notebook-or-jupyter-lab-with-a-virtual-environment-of-your-choice&quot;&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyter notebook&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyter lab&lt;/code&gt; with a virtual environment of your choice&lt;/h2&gt;
&lt;p&gt;Finally, suppose you want to use a Python environment with a Jupyter notebook. This is not as 
straightforward as one would think. Here is how to do it.&lt;/p&gt;

&lt;p&gt;Let’s continue from where we left things in the previous section. You are in:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;GitHub
├── project_a
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and you have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;project-a-venv&lt;/code&gt; activated:&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;project-a-venv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--------------------------------------------------------------------------------&lt;/span&gt;
~/Github/project_a &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The first thing you need to do is install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ipykernel&lt;/code&gt; using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pip&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ pip install ipykernel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, register a new kernel that points at this environment:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ipython kernel install --user --name=project-a-venv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Finally, assuming you have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyter&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyterlab&lt;/code&gt; installed, you can start &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jupyter&lt;/code&gt;, create a new notebook and select the kernel that lives inside 
your environment.&lt;/p&gt;
&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;jupyter notebook
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;final-notes&quot;&gt;Final notes&lt;/h2&gt;
&lt;p&gt;I really hope this was a helpful post and, if you are new to Python, that it has helped you
disambiguate some of the confusing aspects of configuring Python at the start of your journey!&lt;/p&gt;

&lt;p&gt;The references below were very helpful when putting together this post:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://opensource.com/article/19/5/python-3-default-mac&quot;&gt;The right and wrong way to set Python 3 as default on a Mac&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://glhuilli.github.io/virtual-environments.html&quot;&gt;Automatic activation of virtualenv (+ pyenv)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2021-02-14T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2021/02/14/python-envs.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2021/02/14/python-envs.html</guid>
        </item>
        
        
        
        <item>
            <title>A BREXIT NLP Dataset!</title>
<description>&lt;p&gt;So here is the thing… I love discussing politics; I think that everyone should, at least occasionally, concern 
themselves with what is happening in their country’s political scene.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/brexit-nlp-dataset/eu-brexit-classifier.png&quot; alt=&quot;BREXIT 2016&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Regardless of whether you are into politics or not, it would be practically impossible to escape debating &lt;span class=&quot;blog-highlight blog-highlight--eu&quot;&gt;BREXIT&lt;/span&gt; back 
in the summer of 2016. At the time, I had just been hired by Data Reply UK and the company’s annual XChange conference was
around the corner.&lt;/p&gt;

&lt;p&gt;My boss at the time wanted us to come up with something interesting and eye-catching for our demo pod at the conference. 
So, since BREXIT was a trending and highly debated topic, I thought that maybe I could come up with a way to predict 
people’s political stance by means of their social activity.&lt;/p&gt;

&lt;h2 id=&quot;the-idea&quot;&gt;The idea&lt;/h2&gt;
&lt;p&gt;The idea was simple:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Provided one’s twitter @handle, try to infer their political views on &lt;span class=&quot;blog-highlight blog-highlight--eu&quot;&gt;BREXIT&lt;/span&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The original approach was to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Collect people’s tweets through the twitter API;&lt;/li&gt;
  &lt;li&gt;Label tweets related to &lt;span class=&quot;blog-highlight blog-highlight--eu&quot;&gt;BREXIT&lt;/span&gt; as either PRO or CON;&lt;/li&gt;
  &lt;li&gt;Calculate a ratio between the two and produce a number representing their political stance.&lt;/li&gt;
&lt;/ol&gt;
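&lt;p&gt;As a rough sketch of step 3 (my own illustration with a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stance_score&lt;/code&gt; helper, not the code used for the demo), the PRO/CON ratio can be mapped to a score in $[-1, 1]$:&lt;/p&gt;

&lt;pre&gt;
def stance_score(labels):
    # labels: the 'PRO' / 'CON' labels assigned to one user's BREXIT tweets
    pro = labels.count('PRO')
    con = labels.count('CON')
    if pro + con == 0:
        return 0.0  # no relevant tweets: no signal, treat as neutral
    # +1.0 means entirely PRO, -1.0 entirely CON
    return (pro - con) / (pro + con)
&lt;/pre&gt;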

&lt;p&gt;After experimenting a bit, I figured out that using one’s own tweets would not be enough. Many twitter users don’t 
tweet that often and, when they do, they are not really concerned with the EU or BREXIT. So I thought that maybe we could 
use the tweets of the people that one follows. This draws from social science and ideas behind tribalism:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;“…you are likely to be ideologically aligned with the positions of your peers [or of those you follow on twitter ;)]!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-dataset&quot;&gt;The dataset&lt;/h2&gt;
&lt;p&gt;In order to be able to label tweets, I had to develop an &lt;span class=&quot;blog-highlight blog-highlight--nlp&quot;&gt;NLP&lt;/span&gt; &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; model. To do so, I needed a relatively 
big corpus of labelled tweets.&lt;/p&gt;

&lt;p&gt;I turned to an &lt;a href=&quot;https://www.bbc.com/news/uk-politics-eu-referendum-35616946&quot;&gt;article by the BBC&lt;/a&gt; 
at the time, which categorised MPs according to their public stance on BREXIT. Using a twitter list with the 
handles of 449 UK MPs at the time, and the twitter API, I accumulated a corpus of 60,941 tweets. Each tweet 
contained one or more of the following keywords:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;key_words&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;European union&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;European Union&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;european union&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;EUROPEAN UNION&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;Brexit&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;brexit&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;BREXIT&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;euref&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;EUREF&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;euRef&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;eu_ref&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;EUref&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;leaveeu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;leave_eu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;leaveEU&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;leaveEu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;borisvsdave&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;BorisVsDave&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;StrongerI&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;strongerI&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;strongeri&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;strongerI&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;votestay&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;vote_stay&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;voteStay&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;votein&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;voteout&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;voteIn&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;voteOut&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;vote_In&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;vote_Out&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&apos;referendum&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;Referendum&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;REFERENDUM&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and were automatically labelled based on the views of the MP who tweeted them.&lt;/p&gt;
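&lt;p&gt;Incidentally, the case variants in the keyword list above can be collapsed by lower-casing both the tweet text and the keywords; a minimal sketch (my own, not the original matching code):&lt;/p&gt;

&lt;pre&gt;
key_words = ['european union', 'brexit', 'euref', 'eu_ref',
             'leaveeu', 'leave_eu', 'borisvsdave', 'strongeri',
             'votestay', 'vote_stay', 'votein', 'voteout',
             'vote_in', 'vote_out', 'referendum']

def mentions_referendum(tweet):
    # case-insensitive substring match against the deduplicated keywords
    text = tweet.lower()
    return any(k in text for k in key_words)
&lt;/pre&gt;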

&lt;p&gt;You can find more details on how I worked to generate the ML model and how the demo solution worked if you follow this
 &lt;a href=&quot;https://github.com/Christos-Hadjinikolis/eu_tweet_classifier&quot;&gt;github repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;dataset-now-available-on-kaggle&quot;&gt;Dataset now available on Kaggle&lt;/h2&gt;
&lt;p&gt;It took me some time to publish it, but the dataset is now available to everyone to use on Kaggle. You can find it 
if you follow this &lt;a href=&quot;https://www.kaggle.com/chadjinik/labelledbrexittweets&quot;&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope that the ML community will make good use of it. It’s 4 years after the referendum, but BREXIT is yet to really 
happen and, unfortunately, it remains a concerning issue. So, who knows, maybe someone will want to use this dataset in some 
other equally interesting way.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2020-09-02T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/09/02/BREXIT-NLP-dataset.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/09/02/BREXIT-NLP-dataset.html</guid>
        </item>
        
        
        
        <item>
            <title>Style Transfer in Heraklion</title>
<description>&lt;p&gt;I am currently in Crete for my annual getaway. Crete is an amazing island with many beautiful places to visit and a vast 
history that goes all the way back to the Minoans in 3500 BC.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-koules.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;
One of the things I love doing whenever I am here is strolling around the city of Heraklion and taking pictures of the many hidden alleys, 
which reveal an amazing graffiti culture! I really wanted to write about it in my blog, and I thought that maybe I could do so 
using some amazing images I gathered just last week in a &lt;span class=&quot;blog-highlight blog-highlight--vision&quot;&gt;style-transfer&lt;/span&gt; post. So this is it: &lt;strong&gt;&lt;span class=&quot;blog-highlight blog-highlight--vision&quot;&gt;Style Transfer&lt;/span&gt; in Heraklion&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;a-bit-of-history-a-neural-algorithm-of-artistic-style&quot;&gt;A bit of history: A Neural Algorithm of Artistic Style&lt;/h2&gt;
&lt;p&gt;&lt;span class=&quot;blog-highlight blog-highlight--vision&quot;&gt;Neural Style Transfer (NST)&lt;/span&gt; is a class of algorithms that re-render an image so that it adopts the visual style of another image. The seminal paper 
that introduced this concept was &lt;a href=&quot;https://arxiv.org/abs/1508.06576&quot;&gt;“A Neural Algorithm of Artistic Style”&lt;/a&gt; by Leon A. Gatys, Alexander 
S. Ecker and Matthias Bethge. In their work, the authors emphasize that:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;“…representations of content and style in Neural Networks are separable”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the foundation of the work: if these two notions are indeed separable, then, given two images, you can take the style 
of the first and the content of the second and merge them together. So, how is this done exactly?&lt;/p&gt;

&lt;h2 id=&quot;delving-into-the-details&quot;&gt;Delving into the details&lt;/h2&gt;
&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-01.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The first figure in the paper shows the original setup and how a pre-trained NN, referred to as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VGG19&lt;/code&gt;, was modified to do NST. What is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VGG19&lt;/code&gt;? 
Well, the basic building blocks of traditional convolutional networks are the following layers:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;a &lt;a href=&quot;https://www.youtube.com/watch?v=YRhxdVk_sIs&amp;amp;list=RDCMUC4UJ26WkceqONNF5S26OiVw&amp;amp;index=2&quot;&gt;convolutional layer&lt;/a&gt; (with padding to maintain the resolution);&lt;/li&gt;
  &lt;li&gt;a non-linear activation layer such as a &lt;a href=&quot;https://www.youtube.com/watch?v=m0pIlLfpXWE&amp;amp;list=RDCMUC4UJ26WkceqONNF5S26OiVw&amp;amp;index=3&quot;&gt;ReLU&lt;/a&gt;, and;&lt;/li&gt;
  &lt;li&gt;a pooling layer such as a &lt;a href=&quot;https://www.youtube.com/watch?v=ZjM_XQa5s6s&quot;&gt;max pooling layer&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VGG&lt;/code&gt; block consists of a sequence of convolutional layers, followed by a max pooling layer for spatial down-sampling.
What we are interested in is how this network will respond to the inputs.&lt;/p&gt;

&lt;h3 id=&quot;retrieving-the-content&quot;&gt;Retrieving the content&lt;/h3&gt;
&lt;p&gt;Notice that the authors prefer to use a painting for the style and an ordinary photograph for the content; these combinations seem to work best. 
The main idea is abstracting the content and putting more emphasis on the style!&lt;/p&gt;

&lt;p&gt;At the top left you see &lt;a href=&quot;https://artsandculture.google.com/asset/the-starry-night/bgEuwDxel93-Pg?hl=en-GB&amp;amp;avm=2&quot;&gt;“The Starry Night”&lt;/a&gt; 
by Vincent van Gogh and below it is just a random content image; let’s start with the latter.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-02.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Given both a style image and a content image, each neuron, and by extension each layer, in the NN will either activate or it won’t, 
so each image is processed, or better yet filtered, in a different way. Looking at 
how the content image is gradually filtered in the above figure, you will notice that the first layer leaves the image seemingly intact. 
But looking all the way to the last filtered output, you see that this is no longer the case; the shapes are there, but the detail inside them is 
not quite the same. This is because the resulting high-level features are built on earlier abstractions of the same image 
produced by previous layers. This is exactly the behaviour we want for retrieving the content.&lt;/p&gt;

&lt;h3 id=&quot;retrieving-the-style&quot;&gt;Retrieving the style&lt;/h3&gt;
&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-03.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;For the style, the authors explain that they built a new feature space on top of the original CNN representations, one that focuses on the style of an input image. 
The style representation computes correlations between the different features in different 
layers of the CNN. They reconstruct the style of the input image from style representations built on different subsets of CNN 
layers, and this results in images that match the style of the input on an increasing scale while discarding information about the 
arrangement of the scene.&lt;/p&gt;

&lt;h2 id=&quot;its-all-in-the-formulas-or-formulae&quot;&gt;It’s all in the formulas (or formul&lt;em&gt;ae&lt;/em&gt;)&lt;/h2&gt;
&lt;p&gt;The authors also discuss the impact of the number of layers used to infer the style or the content of images before they are merged 
(visually depicted in Figure 3 of the paper). &lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-04.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;
In the first row (A) only one layer is used, in contrast to the 5 layers used in the bottom row, where the result is much better.&lt;/p&gt;

&lt;p&gt;To generate the images which are a mixture of the content of an image-A with the style of another (image-B) the authors explain that 
they jointly minimise the distance of a “white noise” image from the content representation of image-A in one layer of the network 
and the style representation of image-B in a number of layers of the CNN. This is gracefully captured by the below loss function:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-05.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;where $\overrightarrow{p}$ is image-A (usually a photograph, where we care about the content) and $\overrightarrow{a}$ is image-B 
(usually a painting, from which we care to retrieve the style). $\alpha$ and $\beta$ are the respective weighting factors for content 
and style reconstruction.&lt;/p&gt;
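&lt;p&gt;For reference, the loss function pictured above, written out, is:&lt;/p&gt;

&lt;p&gt;\[\mathcal{L}_{total}(\overrightarrow{p}, \overrightarrow{a}, \overrightarrow{x}) = \alpha\,\mathcal{L}_{content}(\overrightarrow{p}, \overrightarrow{x}) + \beta\,\mathcal{L}_{style}(\overrightarrow{a}, \overrightarrow{x})\]&lt;/p&gt;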

&lt;p&gt;Going back to Figure 3 of the paper, looking at it from left to right we see what happens when we tweak these weighting factors ($\alpha$ and $\beta$). 
The left-most column concerns cases where $\alpha$ is low compared to $\beta$, and the right-most column is the other way around. These two 
factors practically weight the content and style errors respectively: if $\alpha$ is high, the content error matters more, 
and vice-versa for an increasing $\beta$.&lt;/p&gt;

&lt;p&gt;The objective of the formula is to minimize $\mathcal{L}_{total}$. $\overrightarrow{x}$ is the image that we are gradually building through multiple iterations and it 
initially comes either from the photograph ($\overrightarrow{p}$) or it is initialised as white noise. $\alpha$ and $\beta$ are the weights that we 
need to set, and they are basically our hyper-parameters in this problem.&lt;/p&gt;
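&lt;p&gt;To build some intuition for how $\alpha$ and $\beta$ trade off against each other, here is a toy one-dimensional analogue (my own illustration, not the paper’s algorithm): gradient descent pulls a single value $x$ towards a “content” target $c$ and a “style” target $s$ at the same time, and the weights decide where it settles.&lt;/p&gt;

&lt;pre&gt;
def optimise(c, s, alpha, beta, lr=0.05, steps=200):
    # Minimise alpha * (x - c)**2 + beta * (x - s)**2 by gradient descent,
    # starting from x = 0 (the 'white noise' initialisation).
    x = 0.0
    for _ in range(steps):
        grad = 2 * alpha * (x - c) + 2 * beta * (x - s)
        x = x - lr * grad
    return x
&lt;/pre&gt;

&lt;p&gt;With equal weights, $x$ settles halfway between the two targets; increasing $\alpha$ drags it towards the content value, just as a high $\alpha$ in the real loss preserves more of the photograph.&lt;/p&gt;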

&lt;p&gt;What is now left is understanding \(\mathcal{L}_{content}\) and \(\mathcal{L}_{style}\).&lt;/p&gt;

&lt;h3 id=&quot;mathcall_content&quot;&gt;$\mathcal{L}_{content}$&lt;/h3&gt;
&lt;p&gt;Here is where everything gets a bit complicated but at the same time, you get to piece everything together nicely.&lt;/p&gt;

&lt;p&gt;$\mathcal{L}_{content}$ is described as the squared-error loss between two feature representations: one for the photograph 
$\overrightarrow{p}$ and one for the generated image $\overrightarrow{x}$, which starts out as white noise. 
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-06.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;
$P^l$ and $F^l$ are the respective feature representations of the two images in layer $l$. The authors used the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VGG&lt;/code&gt; Network. 
Here, $F^l$ holds the responses in layer $l$ of what is, plainly, a bank of non-linear filters, whose complexity increases 
with the position of the layer in the network. $F^l$ is practically a matrix of size $N_l\times M_l$, where $N_l$ is the number of filters (feature maps) in 
layer $l$ and $M_l$ is the size (height $\times$ width) of each feature map.&lt;/p&gt;
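&lt;p&gt;For reference, the content loss pictured above, written out, is:&lt;/p&gt;

&lt;p&gt;\[\mathcal{L}_{content}(\overrightarrow{p}, \overrightarrow{x}, l) = \frac{1}{2}\sum_{i,j}\left(F_{ij}^l - P_{ij}^l\right)^2\]&lt;/p&gt;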

&lt;p&gt;So, a given input image $\overrightarrow{x}$ is encoded in each layer of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CNN&lt;/code&gt; by the filter responses to that image. 
To visualise the image information that is encoded at different layers of the hierarchy, the authors perform gradient descent 
on the white noise image to find another image that matches the feature responses of the original image. 
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-07.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;
So, the approach is to gradually change the initially random image $\overrightarrow{x}$ until it generates the same response in a certain layer of the CNN as the original image.&lt;/p&gt;

&lt;h3 id=&quot;mathcall_style&quot;&gt;$\mathcal{L}_{style}$&lt;/h3&gt;
&lt;p&gt;The style loss function is described by the following equation:
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-08.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;  &lt;br /&gt;
which is basically a sum of the weighted distances between feature correlations across the different filter (layer) responses for two images:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the original image $\overrightarrow{a}$, and;&lt;/li&gt;
  &lt;li&gt;a white noise image $\overrightarrow{x}$, used to generate a style representation of the original image.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break this down a bit more; what are these feature correlations? Practically, they express the correlations between the responses of different filters 
($i$ and $j$) within a given layer ($l$) of a feature map $F$. This is beautifully expressed as a matrix of all possible inner 
products between the generated set of feature vectors, called a &lt;a href=&quot;https://www.youtube.com/watch?v=DEK-W5cxG-g&quot;&gt;“Gram matrix $G$”&lt;/a&gt;, as per the below equation:
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-09.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;
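&lt;p&gt;As a toy numeric sketch of the Gram computation (pure Python, with each feature map flattened into a plain list; not the paper’s implementation), the matrix is just every pairwise inner product between a layer’s feature maps:&lt;/p&gt;

&lt;pre&gt;
def gram_matrix(feature_maps):
    # feature_maps: N flattened feature maps, each a list of length M.
    # Entry (i, j) is the inner product of maps i and j, i.e. how strongly
    # the responses of filters i and j correlate across spatial positions.
    n = len(feature_maps)
    return [[sum(a * b for a, b in zip(feature_maps[i], feature_maps[j]))
             for j in range(n)]
            for i in range(n)]
&lt;/pre&gt;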

&lt;p&gt;One such matrix is generated for each of the two images (the original $\overrightarrow{a}$ and the generated $\overrightarrow{x}$), namely $A_{ij}^l$ and $G_{ij}^l$, and a squared 
distance is calculated between the two. The objective is to minimise this distance. So, practically, as with every ML problem, what we have is an optimisation problem and 
a cost function! Minimising this distance can be achieved through gradient descent, using standard error back-propagation 
to adjust the weight values of equation $5$.&lt;/p&gt;
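&lt;p&gt;For reference, the per-layer contribution $E_l$ and the total style loss pictured above, written out, are:&lt;/p&gt;

&lt;p&gt;\[E_l = \frac{1}{4N_l^2 M_l^2}\sum_{i,j}\left(G_{ij}^l - A_{ij}^l\right)^2, \qquad \mathcal{L}_{style}(\overrightarrow{a}, \overrightarrow{x}) = \sum_{l=0}^{L} w_l E_l\]&lt;/p&gt;

&lt;p&gt;where $w_l$ are the weighting factors of each layer’s contribution.&lt;/p&gt;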

&lt;h3 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h3&gt;
&lt;p&gt;Finally, in order to generate the final image with the style transfer, we return to equation 7, which practically jointly 
minimises the distance of a white noise image from the content representation of the photograph in one layer of the network 
and the style representation of the painting in a number of layers of the CNN. The authors also note that:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;For image synthesis they found that replacing the max-pooling operation by average pooling improves the gradient flow and one obtains slightly
more appealing results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s it! So, what’s left now is getting our hands dirty!&lt;/p&gt;

&lt;h2 id=&quot;using-pytorch-for-style-transfer&quot;&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyTorch&lt;/code&gt; for Style transfer&lt;/h2&gt;
&lt;p&gt;If you follow this &lt;a href=&quot;https://pytorch.org/tutorials/advanced/neural_style_tutorial.html?highlight=style%20transfer&quot;&gt;link&lt;/a&gt; to the official 
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyTorch&lt;/code&gt; website, you will find a very well written tutorial on how to apply style transfer with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PyTorch&lt;/code&gt;. I provide 
my own take on it &lt;a href=&quot;https://github.com/Christos-Hadjinikolis/style-transfer/blob/master/tests/experiments/Style_Transfer_Tutorial.ipynb&quot;&gt;&lt;strong&gt;$\rightarrow$here$\leftarrow$&lt;/strong&gt;&lt;/a&gt;. You 
can follow the link to the python notebook and copy-paste the code to give it a try.
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-paper-14.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I intend to work on packaging it and will publish an updated post once I do (it will be developed in the same repo as the link above). The intention is 
to expose an intuitive API that takes the image to be styled as input, plus a choice among famous style images bundled with the package 
(provided as a text parameter), and produces the desired output (along with some other flags and side parameters). 
Something like:&lt;/p&gt;

&lt;pre&gt;
import pytorch_style_transfer as pst

pst.generate(
    input_image_path=&quot;path_to_input_image&quot;, 
    style=&quot;starry_night&quot;, 
    resolution=128, 
    output_dir=&quot;path/to/output&quot;)
&lt;/pre&gt;

&lt;h2 id=&quot;enjoy-some-of-the-outputs&quot;&gt;Enjoy some of the outputs:&lt;/h2&gt;
&lt;p&gt;Here are some of the results of this work. I tried blending the fortress of Koules with four different graffiti pieces I was able to photograph.&lt;/p&gt;

&lt;p&gt;The original picture:
&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-koules-fortress.jpg&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The result is not always great, but it was still very interesting to try:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-output-10.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-output-11.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-output-12.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/on-style-transfer/2020-08-15-style-transfer-output-13.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;That’s it!&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2020-08-15T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/08/15/on-style-transfer.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/08/15/on-style-transfer.html</guid>
        </item>
        
        
        
        <item>
            <title>Agile Data Science</title>
            <description>&lt;p&gt;Re-posting from &lt;a href=&quot;https://www.iunera.com/kraken/big-data-science-strategy/the-agile-approach-in-data-science-explained-by-an-ml-expert/&quot;&gt;https://www.iunera.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Around two weeks ago I was approached by &lt;a href=&quot;https://www.linkedin.com/in/dr-tim-frey-7b28171/&quot;&gt;Dr. Tim Frey&lt;/a&gt;, General Manager at Iunera GmbH &amp;amp; Co. KG. I was quite surprised to read his message:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Hi Christos, 
We met at the mind mastering machines conference in London.
We operate a company blog (https://iunera.com/kraken ) and one of our writers wrote about &lt;em&gt;agile&lt;/em&gt; in Data Science. 
I liked your talk two years ago and I thought she can approach you to ask a few questions like kind of an in-article 
interview with an expert. 
Hope that is fine with you. Would be super glad to get your insights.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I must admit this was a first for me! Then again, that talk in 2018 was quite an interesting one for me too.&lt;/p&gt;

&lt;h2 id=&quot;how-it-all-happened&quot;&gt;How it all happened…&lt;/h2&gt;
&lt;p&gt;You see, 3 years ago I was asked to join an exceptional team over at UBS to help with a graph analytics project. If you had asked me then, I would 
proudly have told you that “…I am a Data Scientist”; that is how I saw myself. However, that was bound to change forever.&lt;/p&gt;

&lt;p&gt;The first three months were amazing. I worked with a vast amount of data and revealed some very interesting insights. 
So, inevitably, my project manager approached me and asked: “…how about we take this work of yours into production?”&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/agile-data-science/2020-08-11-agile-ds-01.png&quot; alt=&quot;Agile Data Science&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I didn’t have a clue about what that meant in reality, but I was about to find out. He said: “Well, don’t worry, we will pair 
you with an engineer and you both can get started on it”. So we did!&lt;/p&gt;

&lt;p&gt;This is basically the story of how I was exposed to software engineering and the &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; way of working–of how I was converted into 
an &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML Engineer&lt;/span&gt;. Two years later I decided to take my learnings from this experience and share them with my community, and so I did at &lt;strong&gt;mCubed&lt;/strong&gt; London in 2018:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/nRsqFrutfSg&quot; title=&quot;Agile Data Science talk&quot; frameborder=&quot;0&quot; loading=&quot;lazy&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;That’s where I also met Tim. It turns out that a year and a half later a colleague of Tim’s (&lt;a href=&quot;https://www.linkedin.com/in/dhanhyaashri-mahendran/&quot;&gt;Dhanhyaashri Mahendran&lt;/a&gt;) was doing a bit of research on 
&lt;em&gt;“Doing Data Science the &lt;span class=&quot;blog-highlight blog-highlight--agile&quot;&gt;Agile&lt;/span&gt; way”&lt;/em&gt;, and Tim suggested that she get in touch with me to ask me some questions, which I welcomed.&lt;/p&gt;

&lt;h2 id=&quot;some-very-interesting-questions-were-thrown-my-way&quot;&gt;Some very interesting questions were thrown my way…&lt;/h2&gt;
&lt;p&gt;I really liked the questions that Dhanhyaashri had prepared. She had obviously done her research. I did my best to respond, and two weeks later 
the interview was published on the Iunera blog. You can read it &lt;a href=&quot;https://www.iunera.com/kraken/big-data-science-strategy/the-agile-approach-in-data-science-explained-by-an-ml-expert/&quot;&gt;here&lt;/a&gt;, 
but I felt like re-posting the interview on my personal blog too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Besides the cutting of time-consuming planning and quicker turnaround of projects, what other benefits are there in applying the Agile approach in data science?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;For the community, I would say that it would be the emergence of new Data-Science-oriented practices that will drive the application of Agile in the research domain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;The problem with applying Agile in Data Science is that, traditionally, Agile is practiced in software development projects where experimentation, testing and tuning are minimal (usually dealt with as spikes). The focus there is on delivering business requirements, in the form of features and products, fast in a volatile, constantly evolving environment. To support this, a number of underpinning practices have been developed, covering areas like modelling and design, coding and testing, risk handling and quality assurance. But all of these focus primarily on feature delivery (backlogs, user-stories, CI/CD, TDD or BDD, to name a few). Some of these underpinning practices can directly be transferred to the Data Science world (e.g. user stories and backlogs, timeboxing and retrospectives) but others, not so much; for instance, how can TDD be useful when experimenting to find the optimal k with which to cluster customer datasets? So, a clear benefit of trying to apply Agile in Data Science is that, gradually, similar Data-Science-specific underpinning practices will eventually be developed, and these will, of course, be based on the same Agile drivers: adaptive planning, evolutionary development, early delivery and continual improvement, and more generally, flexible responses to change.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;For Data Scientists, I would say it is mostly about adjusting to the requirement of working in a way that delivers business value from their experimentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;The feature-oriented focus that Agile is characterised by in the software development world is not so familiar to data scientists and researchers. What’s more, “value”—business value—is perceived in very different ways across these two worlds as well. Have you ever discussed the “value” of an experiment with a project manager? Not an easy task, I assure you! My experience tells me that most of the time this comes down to project managers fearing that no tangible outputs will be produced through experimentation. This is completely wrong, but only as long as experiments are well structured and well thought out. To me, Agile Data Science is all about iterative hypothesis testing. Proving or disproving a hypothesis is always useful; it minimises the risk of failure and increases decision awareness when choosing what needs to be prioritised! But these outcomes can only be achieved when Data Scientists know exactly what they are trying to prove, discover or disprove, and how that would be valuable to their team’s objective. Gradually, Data Scientists become better at it, and this benefits both them and their teams.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What are the downsides of Agile in data science? What can we do about these downsides?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Agile is a set of values and principles; as such, I can’t really say that there is something wrong with it. What is surely wrong is to assume that Agile is the only way that a team can work and be productive—it’s not. Ever since Agile emerged—in the concrete form that we know it today through the Agile manifesto—many hurried to undermine the effectiveness of other development models, e.g. Waterfall.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;There is nothing wrong with the Waterfall model either; the real question is whether these practices or models are fit for purpose! There are surely research projects as well as business requirements around the delivery of software that could potentially be delivered through the Waterfall approach or maybe through a combination of the two. What project managers and teams should strive for is increasing their effectiveness and efficiency. If that can be done by building on top of the Agile values then great; if not, then maybe they will need to try and come up with a different formula.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Project managers focusing too much on what Agile is and what is not—if it needs to be Scrum or Kanban or if too much documentation or too much time spent in design is not Agile—are bound to make mistakes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Do you think that the imposition of Agile on teams (the Agile Industrial Complex) is defeating the purpose of Agile in finding what works best for teams in working adaptively?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;In a similar spirit to my previous response, I do! Once more, I can’t stress enough how there is no single perfect development model. Project managers need to always assess what is fit for purpose. Primarily though, they should focus on the underpinning values and principles that Agile or other development models are characterised by. A recurrent mistake that I have encountered throughout my consulting career is the oversimplification of Agile as an anti-methodology, anti-documentation and anti-planning development model. I appreciate that this makes understanding Agile much easier, but at the same time it is a very unfair representation of what Agile is! Imposing it on this basis is surely wrong. Equally, practicing Agile is definitely not something that comes through imposition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;I was exposed to the Agile methodology through a passionate software engineer who was an evangelist of Extreme Programming. To him, the way he worked was a way of seeing the software engineering world, and it was supported by many more things than just sprints and Jira tickets and user-stories: knowledge transfer and evolution through an unparalleled team spirit, and an overall culture of doing things in a way that helps everyone grow (people and software) in a fast-paced and fast-evolving world. Empathy was found at the centre of everything he did, and his ability to convey this passion was extreme! &lt;a href=&quot;https://twitter.com/tumbarumba&quot;&gt;@tumbarumba&lt;/a&gt;, all the best wherever you are!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is because Agile is, above all, a culture—a way of thinking; a way of caring about the impact and consequences of every individual’s contribution to a team goal. When it is collectively addressed as such, then only good things can come out of it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Is there a possible reason for many data scientists to not be aware of the Agile manifesto?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;I can’t be too sure about this, but if I were to point at anything, it would be how Data Science has, until recently (about 5 years ago), been so disjoint from the delivery of production-ready solutions. It was focussed more on research and discovery to aid decision making. Lately, the evolution and growth of ML, as well as of cost-effective services to support it, has necessitated the interaction of the two worlds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Never before has it been so much the case that ML models are such an integral component of software. Before, Data Scientists did not need to worry about the operationalisation and maintenance of their models. Concepts like versioning, robustness, code coverage and testing were not so much imposed or needed, let alone challenges related to things like dealing with technical debt and refactoring. The traditional work environment would be a Jupyter notebook with access to a database! So, Data Scientists did not need to be exposed to so many practices governing how they would work to deliver new insights.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What kind of challenges stand in the way of operational production level DS solutions?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;This mostly has to do with bridging the gap between software engineers and data scientists. Software engineers not exposed to data science can’t really do this because they fail to appreciate how exactly to maintain ML-pipelines. Note that in contrast to traditional software pipelines, there are many more issues that need to be addressed; I would refer your readers to the seminal 2015 NIPS paper on the “Hidden Technical Debt in Machine Learning Systems”. Equally, Data Scientists don’t appreciate the complexity of developing and maintaining code-bases and software solutions in a flexible and robust way that allows for things like CI/CD to be supported. This gap is now partially addressed through the emergence of a new paradigm: the ML engineer, a hybrid data scientist and software engineer, equipped with the knowledge to deal with challenges from both worlds. However, that is not enough to account for everything. What is also necessary is the emergence of appropriate tooling to support the development and maintenance of ML pipelines. Good examples are Kubeflow, AWS SageMaker and the less mature but fast-evolving Google AI Platform.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;What is surely not helpful is the bad practice of finding ways to schedule and run Python notebooks in production, and I purposely changed paragraphs to highlight this! I can’t stress enough how many times I have dealt with this in my career! Python notebooks are not made to be run as part of production pipelines—yet so many companies just do so!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is a plea to every project manager running an ML project out there: &lt;strong&gt;This is madness! Please stop it!&lt;/strong&gt;&lt;/p&gt;
  &lt;div class=&quot;tenor-gif-embed&quot; data-postid=&quot;3413789&quot; data-share-method=&quot;host&quot; data-width=&quot;100%&quot; data-aspect-ratio=&quot;2.4174757281553396&quot;&gt;&lt;a href=&quot;https://tenor.com/view/300-action-drama-gerard-butler-madness-gif-3413789&quot;&gt;This. Is. Sparta! GIF&lt;/a&gt; from &lt;a href=&quot;https://tenor.com/search/300-gifs&quot;&gt;300 GIFs&lt;/a&gt;&lt;/div&gt;
  &lt;script type=&quot;text/javascript&quot; async=&quot;&quot; src=&quot;https://tenor.com/embed.js&quot;&gt;&lt;/script&gt;

&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;In your opinion, what is the most important factor in making ML-Ops agile?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;I think that the answer to this question is “culture”. ML-Ops is here to help cultivate collaboration between data scientists and engineers to support the ML lifecycle. It is a manifestation of Agile for Data Science in a way! What’s needed is for this mentality towards the development of production-level ML solutions to be supported by practitioners, project managers and stakeholders alike. Everyone needs to take risks and own responsibility. Data Scientists need to develop the courage to stand behind their experiments even if they may appear to delay production; they need to help stakeholders and project managers appreciate the actual value of experimentation. This will often prove to be very challenging; loss aversion will eventually kick in, and when it does people will be more reluctant to change, and they will want to stick to what they know. But this is to be expected! It is natural human behaviour, and this is what we, as a community, are up against.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;At the end of the day, we need to remember that it is almost impossible to find the right balance or get it perfectly right. There is no formula for it. Nevertheless, value will come simply from trying to get it right, and that is more than enough!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many thanks again to both Tim and Dhanhyaashri for their time and effort!&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2020-08-11T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/08/11/agile-data-science.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/08/11/agile-data-science.html</guid>
        </item>
        
        
        
        <item>
            <title>AWS ML Certification</title>
            <description>&lt;p&gt;I recently took the &lt;a href=&quot;https://aws.amazon.com/certification/certified-machine-learning-specialty/&quot;&gt;AWS Certified Machine Learning - Specialty&lt;/a&gt;, which remains one of the most demanding &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; certifications. 
I put in a lot of work to prepare adequately for this exam, and I can confirm that it is indeed one of 
the hardest &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS certifications&lt;/span&gt;. Nevertheless, with proper preparation and a bit of dedication you should be fine.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/aws-ml-certification/2020-07-29-AWS-Cert.png&quot; alt=&quot;&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&quot;how-long-do-i-need-to-study-for-this&quot;&gt;How long do I need to study for this?&lt;/h2&gt;
&lt;p&gt;Well, it depends: if you are an experienced Data Scientist and have been applying Data Science for 3+ years, then an hour per day for a month should be enough. The same holds if you are an engineer already exposed to the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; infrastructure and services but not familiar with Data Science topics.&lt;/p&gt;

&lt;p&gt;You see, this certification is labelled as hard simply because it is not just about &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; services. 50% of it is concerned with purely Data Science topics; the other 50% is about &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; services that support Data Science and &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; activities. If you are exposed to neither Data Science nor the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; services, then at least 2 months of studying is recommended.&lt;/p&gt;

&lt;h2 id=&quot;what-does-the-exam-cover&quot;&gt;What does the exam cover?&lt;/h2&gt;
&lt;p&gt;Data Engineering covers 20% of the exam, Exploratory Data Analysis another 24%, Modelling 36%, and Machine Learning Implementation and Operations the remaining 20%.&lt;/p&gt;

&lt;p&gt;I put together a list below, in an attempt to summarise the content:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Data Concepts&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Deals with data preparation routines; things like:
        &lt;ul&gt;
          &lt;li&gt;Feature selection&lt;/li&gt;
          &lt;li&gt;Feature engineering&lt;/li&gt;
          &lt;li&gt;PCA&lt;/li&gt;
          &lt;li&gt;Dealing with missing data or unbalanced datasets&lt;/li&gt;
          &lt;li&gt;Labels and one-hot encoding&lt;/li&gt;
          &lt;li&gt;Splitting and randomisation of data&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ML Concepts&lt;/strong&gt;: Covers:
    &lt;ul&gt;
      &lt;li&gt;Classical ML Categories of Algorithms&lt;/li&gt;
      &lt;li&gt;Deep Learning&lt;/li&gt;
      &lt;li&gt;The ML-Life-cycle&lt;/li&gt;
      &lt;li&gt;Optimisation: Gradient Descent&lt;/li&gt;
      &lt;li&gt;Regularisation&lt;/li&gt;
      &lt;li&gt;Hyperparameter Tuning&lt;/li&gt;
      &lt;li&gt;Cross-Validation&lt;/li&gt;
      &lt;li&gt;Record I/O&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ML Algorithms&lt;/strong&gt;: A list of algorithms you should be familiar with:
    &lt;ul&gt;
      &lt;li&gt;Logistic Regression&lt;/li&gt;
      &lt;li&gt;Linear Regression&lt;/li&gt;
      &lt;li&gt;Support Vector Machine&lt;/li&gt;
      &lt;li&gt;Decision Trees&lt;/li&gt;
      &lt;li&gt;Random Forests&lt;/li&gt;
      &lt;li&gt;K-Means&lt;/li&gt;
      &lt;li&gt;K-Nearest Neighbours&lt;/li&gt;
      &lt;li&gt;Latent Dirichlet Allocation (LDA) Algorithm&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deep Learning&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Cover Neural Networks in a general sense&lt;/li&gt;
      &lt;li&gt;Convolutional Neural Networks: High-level understanding&lt;/li&gt;
      &lt;li&gt;Recurrent Neural Networks: High-level understanding&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Model Optimisation&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Confusion Matrix&lt;/li&gt;
      &lt;li&gt;Sensitivity and Specificity&lt;/li&gt;
      &lt;li&gt;Accuracy &amp;amp; Precision&lt;/li&gt;
      &lt;li&gt;ROC/AUC&lt;/li&gt;
      &lt;li&gt;Gini Impurity&lt;/li&gt;
      &lt;li&gt;F1-Score&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ML Tools &amp;amp; Frameworks&lt;/strong&gt;: Cover basic ML tools (know what they do and what they are used for)
    &lt;ul&gt;
      &lt;li&gt;Jupyter Notebooks&lt;/li&gt;
      &lt;li&gt;Pytorch&lt;/li&gt;
      &lt;li&gt;MXNet&lt;/li&gt;
      &lt;li&gt;TensorFlow&lt;/li&gt;
      &lt;li&gt;Keras&lt;/li&gt;
      &lt;li&gt;Scikit-learn&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Amazon Serverless Services&lt;/strong&gt;: Not everything; think about the things that a Data Scientist or ML Engineer would need to do.
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Simple Storage Services - S3&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Glue&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Athena&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Quicksight&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Kinesis&lt;/code&gt;, Streams, Firehose, Video &amp;amp; Analytics (S.O.S. this one ;) )&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EMR&lt;/code&gt; with Spark&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EC2&lt;/code&gt; for ML&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Lambda Functions&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Step Functions&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Amazon Serverless ML Services&lt;/strong&gt;: These are out-of-the-box ML solutions offered by AWS.
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Rekognition&lt;/code&gt; (image/video)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Polly&lt;/code&gt; (Text-to-Speech)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Transcribe&lt;/code&gt; (Speech-to-Text)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Translate&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Comprehend&lt;/code&gt; (Text Analysis Service)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Lex&lt;/code&gt; (Conversation Interface Service - Chatbots)&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Amazon Service Chaining&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AWS Step Functions&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;SageMaker&lt;/span&gt;&lt;/strong&gt;: A service that you really need to spend time with!
    &lt;ul&gt;
      &lt;li&gt;What is it exactly?&lt;/li&gt;
      &lt;li&gt;Benefits? Advantages?&lt;/li&gt;
      &lt;li&gt;Supported Algorithms (huge list; learn most popular ones)&lt;/li&gt;
      &lt;li&gt;Building and Pre-processing / Ground Truth&lt;/li&gt;
      &lt;li&gt;Training and Data sourcing&lt;/li&gt;
      &lt;li&gt;Hyper-parameter Tuning&lt;/li&gt;
      &lt;li&gt;Model Serving (HTTPS endpoints)&lt;/li&gt;
      &lt;li&gt;Elastic inference&lt;/li&gt;
      &lt;li&gt;Batch Transform&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
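&lt;p&gt;The entries under &lt;strong&gt;Model Optimisation&lt;/strong&gt; are all simple functions of a binary confusion matrix, so it is worth internalising them rather than memorising. A small, self-contained Python sketch (illustrative names of my own, not exam material):&lt;/p&gt;

```python
def binary_metrics(tp, fp, fn, tn):
    # derive the headline exam metrics from a 2x2 confusion matrix
    sensitivity = tp / (tp + fn)            # recall / true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    precision   = tp / (tp + fp)
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    # F1 is the harmonic mean of precision and sensitivity
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, accuracy=accuracy, f1=f1)
```

&lt;p&gt;For example, a matrix with 8 true positives, 2 false positives, 2 false negatives and 8 true negatives yields 0.8 for every one of these metrics.&lt;/p&gt;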

&lt;p&gt;This is by no means an exhaustive list, but you will at least get an idea about what is generally involved.&lt;/p&gt;

&lt;h2 id=&quot;how-should-i-prepare&quot;&gt;How should I prepare?&lt;/h2&gt;
&lt;p&gt;There are many ways to prepare. Myself, I covered &lt;a href=&quot;https://linuxacademy.com/cp/modules/view/id/340&quot;&gt;the relevant course on Linux Academy&lt;/a&gt;, which I highly recommend.&lt;/p&gt;

&lt;p&gt;Ideally, I would recommend spending some time with &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SageMaker&lt;/code&gt;&lt;/span&gt; and trying to interact with services like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lambda&lt;/code&gt; functions and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;step-functions&lt;/code&gt; as well as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Kinesis&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Glue&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Athena&lt;/code&gt;. 
However, that would take a while to do; plus, using these resources does not come for free.&lt;/p&gt;

&lt;p&gt;The Linux Academy Course has a number of labs that will help you develop an adequate understanding of these services. You can worry about honing your skills and knowledge at a later point.&lt;/p&gt;

&lt;h2 id=&quot;how-long-does-the-exam-last&quot;&gt;How long does the exam last?&lt;/h2&gt;
&lt;p&gt;The exam consists of 65 multiple-choice, multiple-response questions. It is 3 hours long, which I think is more than enough time 
to answer all the questions and then review your responses (…or take a nap while waiting for your colleagues to finish; I have a colleague who actually did this–I myself can never relax that much when it comes to exams).&lt;/p&gt;

&lt;p&gt;In general, AWS exams are taken at authorised exam centres. Due to the COVID-19 lockdown, this was adjusted to meet the high demand from exam takers, and
people can now take the test from home. However, the process is equally strict:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;You need to provide information about the room you will be sitting in;&lt;/li&gt;
  &lt;li&gt;Room needs to be completely quiet during the exam session;&lt;/li&gt;
  &lt;li&gt;You need to be alone in the room;&lt;/li&gt;
  &lt;li&gt;You need to provide pictures of your surroundings to show you have no notes or anything suspicious close to you;&lt;/li&gt;
  &lt;li&gt;A proctor will log in at the time of the exam and will ask to inspect the space around you (mine asked me to show him the back of my computer before beginning, and doing so with my iMac was quite a challenge… so if you have the option, go for a laptop);&lt;/li&gt;
  &lt;li&gt;The exam session will be recorded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that, as one would expect, looking away from the screen for more than a couple of seconds might prompt the proctor to give you a notice. To be honest, as soon as the exam began it was quite easy to just focus on the screen. It took me 
less than an hour to cover all the questions, and I then used the remaining time to review my responses. On completing the exam I received a notification that I had passed, subject to a committee review; I guess the examiners inspect the video of 
you taking the exam to check whether you tried to cheat. Within no more than 3 days I had the official certification.&lt;/p&gt;

&lt;h2 id=&quot;any-tips-advice&quot;&gt;Any tips? Advice?&lt;/h2&gt;
&lt;p&gt;Well, tip number one is: &lt;em&gt;“If you don’t know which is the right answer, then just go for the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; solution in the list of options”.&lt;/em&gt; By and large, this exam tests whether you are familiar with what is available to you through the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;AWS&lt;/span&gt; platform. If a client wants to use &lt;span class=&quot;blog-highlight blog-highlight--ml&quot;&gt;ML&lt;/span&gt; for image moderation and you recommend anything other than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Rekognition&lt;/code&gt;, then you clearly don’t know how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Rekognition&lt;/code&gt; is used! This has generally worked for me as a way of filtering options in and out.&lt;/p&gt;

&lt;p&gt;I would definitely recommend covering the &lt;span class=&quot;blog-highlight blog-highlight--aws&quot;&gt;SageMaker&lt;/span&gt; &lt;a href=&quot;https://aws.amazon.com/sagemaker/faqs/&quot;&gt;FAQs&lt;/a&gt;, which I consider a wonderful source of exam material.&lt;/p&gt;

&lt;p&gt;Do cover the official AWS practice exam; it is just 20 questions, but it is enough to give you an idea of what you are up against.&lt;/p&gt;

&lt;p&gt;That’s it! I really hope this article helps you get started on your learning journey, and that soon enough you will be joining the &lt;a href=&quot;https://www.linkedin.com/groups/6814264/&quot;&gt;AWS Certified Global Community&lt;/a&gt; to share your badge with everyone.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;
</description>
            <pubDate>2020-07-29T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/07/29/aws-ml-certification.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/07/29/aws-ml-certification.html</guid>
        </item>
        
        
        
        <item>
            <title>Just do it!</title>
            <description>&lt;p&gt;The thing about writing a blog-post is that you are exposing yourself to the world; it feels a lot like &lt;span class=&quot;blog-highlight blog-highlight--signal&quot;&gt;flying for the first time&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image center&quot;&gt;&lt;img src=&quot;/assets/images/posts/2020/just-do-it/2020-07-28-Just-do-it.jpeg&quot; alt=&quot;Fly for the first time&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;You will be criticised! Some will appreciate your work. Others will say it’s wrong and disagree (which actually promotes healthy public debate, and hence is a good thing), or they will simply not care. Ultimately, blogging is neither about being right nor about writing the perfect post. Put simply, it is just about &lt;span class=&quot;blog-highlight blog-highlight--signal&quot;&gt;doing it&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;One of my favourite novels is “The Plague” by Albert Camus. At the time of writing, I can’t recommend this book enough, given 
the global COVID-19 commotion. In it, Camus’ characters are engaged in helping and saving people in the name of no ideology: 
people dying so unfairly (especially children) is enough to move anyone to act, irrespective of whether this is supported by some
moral justification.&lt;/p&gt;

&lt;p&gt;There is one particular character, a side character, who came to mind when I sat down to write this post: Joseph Grand. 
Joseph is a fifty-year-old clerk working for the city government. He lives an austere life, and in his spare time he 
is writing a book. However, he is such a perfectionist that he ends up rewriting the first sentence over and over and never 
gets any further. No words are ever good enough! What if the meaning could be elevated to a higher level with a different 
wording? He blocks himself, feeling helpless and devastated.&lt;/p&gt;

&lt;p&gt;We’ve all been there, I am sure. If only he could let go of his perfectionism, complete that first paragraph 
and write the first chapter. What story would he tell? What morals and lessons would be revealed and shared?&lt;/p&gt;

&lt;p&gt;I guess we will never find out about Joseph Grand, but my blogging journey begins here and now. I look forward to hearing 
your thoughts and welcome all of your comments.&lt;/p&gt;

&lt;p&gt;Remember to like my post and re-share it (if you really liked it)!&lt;/p&gt;

&lt;p&gt;See you soon!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;&lt;img src=&quot;//feedburner.google.com/fb/images/pub/feed-icon32x32.png&quot; alt=&quot;&quot; style=&quot;vertical-align:middle;border:0&quot; /&gt;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://feeds.feedburner.com/MlAffairs&quot; rel=&quot;alternate&quot; type=&quot;application/rss+xml&quot;&gt;Register to the ML-Affairs RSS Feed&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>2020-07-28T00:00:00+00:00</pubDate>
            <link>https://christos-hadjinikolis.github.io/2020/07/28/just-do-it.html</link>
            <guid isPermaLink="true">https://christos-hadjinikolis.github.io/2020/07/28/just-do-it.html</guid>
        </item>
        
        
    </channel>
</rss>