<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Wojciech Sznapka Web Technologies</title>
	<atom:link href="https://blog.sznapka.pl/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.sznapka.pl</link>
	<description>Blog about programming, software architecture, Big Data, data streaming</description>
	<lastBuildDate>Sun, 01 Nov 2020 22:14:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.5.14</generator>
	<item>
		<title>Robust API communication with exponential backoff</title>
		<link>https://blog.sznapka.pl/robust-api-communication-with-exponential-backoff/</link>
					<comments>https://blog.sznapka.pl/robust-api-communication-with-exponential-backoff/#respond</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Sun, 01 Nov 2020 22:10:33 +0000</pubDate>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[webdev]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[api-design]]></category>
		<category><![CDATA[backoff]]></category>
		<category><![CDATA[php]]></category>
		<guid isPermaLink="false">https://blog.sznapka.pl/?p=903</guid>

					<description><![CDATA[Every API fails at random points in time, and that's unavoidable. Sadly, this is often not handled correctly when integrating third-party APIs. The simple solution is to retry. In this post, I'll show how to easily implement an efficient retry mechanism.]]></description>
										<content:encoded><![CDATA[
<p>Every API fails at random points in time and that&#8217;s unavoidable. Sadly, this is often not handled correctly when integrating third-party APIs. I see it very often. &#8220;Hey there! But I&#8217;m using try-catch and handle errors, sometimes I even log them to a file&#8230;&#8221; one might say. Well, so what? What happens when it fails and you miss data that needed to be fetched during the daily ETL process? Or your business partner misses information because you send data to their API and, for some reason, it fails. What then? As long as you use <code>cron</code> and have the output emailed to a mailbox that is being monitored &#8211; you&#8217;ll notice. Maybe you use Sentry or other application monitoring/error tracking software and you&#8217;ll spot an anomaly. But imagine having dozens of such jobs running on a daily basis &#8211; it&#8217;s easy to lose track.</p>



<p>I think you get my point now. API errors occur quite often. Most of them are due to temporary service unavailability, caused mainly by too much traffic at the moment. The simple solution is to retry. In this post, I&#8217;ll show how to easily implement an efficient retry mechanism.</p>



<span id="more-903"></span>



<p>According to <a href="https://en.wikipedia.org/wiki/Exponential_backoff">Wikipedia</a>:</p>



<blockquote class="wp-block-quote"><p>Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate.</p></blockquote>



<p>In the case I&#8217;ll explore, it&#8217;s a way to retry a piece of code when an exception occurs. Every attempt is delayed with an exponentially growing pause, according to the equation <code>(2^attempt) * baseTime</code>.</p>
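<p>To get a feel for the growth rate, here is a quick sketch of the delay schedule produced by <code>(2^attempt) * baseTime</code> (in Python, purely for illustration; the 100&nbsp;ms base time is an arbitrary choice):</p>

```python
# Delays produced by the exponential backoff formula (2^attempt) * baseTime.
def backoff_delay_ms(attempt: int, base_ms: int = 100) -> int:
    return (2 ** attempt) * base_ms

# The first five attempts double the wait each time:
schedule = [backoff_delay_ms(a) for a in range(5)]
print(schedule)  # [100, 200, 400, 800, 1600]
```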



<p>The example below uses PHP&#8217;s <a href="https://github.com/stechstudio/Backoff#exponential">backoff library</a>. Similar libraries can be found for <a href="https://pypi.org/project/backoff/">Python</a>, <a href="https://github.com/MathieuTurcotte/node-backoff">node.js</a>, and probably any language of choice.</p>



<pre class="wp-block-code"><code>use GuzzleHttp\Exception\RequestException;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Processor\UidProcessor;
use STS\Backoff\Backoff; # nothing to do with sts.pl ;)

$log = new Logger('x');
$log->pushHandler(new StreamHandler('php://stdout', Logger::DEBUG));
$log->pushProcessor(new UidProcessor());

$client = new GuzzleHttp\Client();
// the API below fails 60% of the time and
// always fails the first call in a 5-minute timespan
$url = 'https://sznapka.pl/fakeapi.php';
$log->debug(sprintf('Fetching from %s', $url));

$backoff = new Backoff(10, 'exponential', 10000, true);
$result = $backoff->run(function() use ($client, $url, $log) {
    try {
        $res = $client->request('GET', $url);
        $data = json_decode($res->getBody(), true);
        $log->info(sprintf('Got response, %d items', count($data)));
    } catch (RequestException $e) {
        // getResponse() may be null for connection-level errors
        $log->error($e->getResponse() ? $e->getResponse()->getBody() : $e->getMessage());
        throw $e; // causes backoff lib to retry
    }
});
$log->debug('All done');
</code></pre>



<p>When you run it with <code>php ./console.php</code> you&#8217;ll get:</p>



<pre class="wp-block-code"><code>&#91;22:29:15.649] x.DEBUG: Fetching from https://sznapka.pl/fakeapi.php &#91;] {"uid":"6b3d944"}
&#91;22:29:15.759] x.ERROR: API failed, what a surprise! &#91;] {"uid":"6b3d944"}
&#91;22:29:15.832] x.ERROR: API failed, what a surprise! &#91;] {"uid":"6b3d944"}
&#91;22:29:16.268] x.INFO: Got response, 5 items &#91;] {"uid":"6b3d944"}
&#91;22:29:16.269] x.DEBUG: All done &#91;] {"uid":"6b3d944"}</code></pre>






<p>As you can see, the backoff library retries with exponential intervals until the wrapped closure stops throwing an exception. It is configured with 10 retries and a <code>waitCap</code> of 10 seconds, so it stops retrying as soon as either limit is reached. I&#8217;ve also set the <code>jitter</code> parameter to true to spread out retries and minimize collisions.</p>
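<p>The library&#8217;s behaviour can be approximated with a plain retry loop. The following Python sketch (an illustration of the algorithm, not the library&#8217;s actual code) shows how the maximum attempts, the wait cap, and full jitter interact:</p>

```python
import random
import time

def run_with_backoff(fn, max_attempts=10, base_ms=100, cap_ms=10_000):
    """Retry fn with capped exponential delays and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: propagate the last error
            # exponential delay, capped at cap_ms, randomized ("full jitter")
            # so that many clients do not retry in lockstep
            delay_ms = min(cap_ms, (2 ** attempt) * base_ms)
            time.sleep(random.uniform(0, delay_ms) / 1000.0)
```

<p>Wrapping a flaky call is then a one-liner, analogous to the <code>$backoff->run()</code> closure above.</p>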



<p>Last but not least &#8211; always log your external API usage. It will help you a lot during the debugging phase. It&#8217;s also very handy to use <code>UidProcessor</code>, which puts a uid (per session) into the logger&#8217;s context. It allows filtering logs from a given invocation, which is especially helpful with overlapping calls or concurrent usage.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/robust-api-communication-with-exponential-backoff/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Google BigQuery &#8211; querying repeated fields</title>
		<link>https://blog.sznapka.pl/google-bigquery-querying-repeated-fields/</link>
					<comments>https://blog.sznapka.pl/google-bigquery-querying-repeated-fields/#respond</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Sun, 25 Oct 2020 17:19:59 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[big query]]></category>
		<category><![CDATA[big-data]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[data warehouse]]></category>
		<guid isPermaLink="false">https://blog.sznapka.pl/?p=890</guid>

					<description><![CDATA[Google BigQuery is probably one of the best data warehouses on the market nowadays. It has dominated the Big Data landscape with its infinite scaling capabilities (querying over petabytes of data), ANSI SQL support, and ease of use. It has proven its worth in many use cases.

One of the least used and least appreciated features, in my opinion, is repeated fields.]]></description>
										<content:encoded><![CDATA[
<p>Google BigQuery is probably one of the best data warehouses on the market nowadays. It has dominated the Big Data landscape with its infinite scaling capabilities (querying over petabytes of data), ANSI SQL support, and ease of use. It has proven its worth in many use cases.</p>



<p>One of the least used and least appreciated features, in my opinion, is repeated fields. The name doesn&#8217;t indicate the intention well enough, so for the sake of simplicity consider it an array field or nested field. You can define any structure you like inside the repeated field, using the same types that regular columns can have. The important part is to set the mode to REPEATED for a field of type RECORD.</p>



<span id="more-890"></span>



<p>In the example below, you&#8217;ll see the table definition and the SQL INSERT query which populates it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="679" height="794" src="https://blog.sznapka.pl/wp-content/uploads/2020/10/image-3.png" alt="" class="wp-image-895" srcset="https://blog.sznapka.pl/wp-content/uploads/2020/10/image-3.png 679w, https://blog.sznapka.pl/wp-content/uploads/2020/10/image-3-257x300.png 257w" sizes="(max-width: 679px) 100vw, 679px" /></figure>



<pre class="wp-block-code"><code>insert into test.oscars values  
  ('Best Actor', 2019, &#91;
    struct('Joaquin Phoenix', 'Arthur Fleck / Joker', 'Joker', true),
    struct('Antonio Banderas', 'Salvador Mallo', 'Pain and Glory', false),
  /* .. */
  ]),
  ('Best Actor', 2018, &#91;
    struct('Rami Malek', 'Freddie Mercury', 'Bohemian Rhapsody', true),
    struct('Christian Bale', 'Dick Cheney', 'Vice', false),
    /* .. */
  ]);</code></pre>



<p>Now you can query this table and immediately see a benefit of it &#8211; the records will be flattened. What&#8217;s worth noting &#8211; if you use a BigQuery SDK for any language, or even the <code>bq</code> CLI tool, you&#8217;ll get 3 records and every one of them will have our nominees field.</p>



<figure class="wp-block-image size-large"><img loading="lazy" width="787" height="681" src="https://blog.sznapka.pl/wp-content/uploads/2020/10/image-4.png" alt="" class="wp-image-896" srcset="https://blog.sznapka.pl/wp-content/uploads/2020/10/image-4.png 787w, https://blog.sznapka.pl/wp-content/uploads/2020/10/image-4-300x260.png 300w, https://blog.sznapka.pl/wp-content/uploads/2020/10/image-4-768x665.png 768w" sizes="(max-width: 787px) 100vw, 787px" /></figure>



<p>Now, bear in mind that our example is very simple. In real-world scenarios you may have 100+ columns in the repeated field but only need a few. Also, due to the column-oriented nature of BigQuery&#8217;s underlying storage, it&#8217;s always best to limit the number of columns (see: <a href="https://cloud.google.com/bigquery/docs/best-practices-performance-input" data-type="URL" data-id="https://cloud.google.com/bigquery/docs/best-practices-performance-input">Control projection &#8211; Avoid&nbsp;<code>SELECT *</code></a>).</p>



<p>Let&#8217;s say we want to get only the award, the year, the nominated actor, and the winner flag. My first bet was to use:</p>



<pre class="wp-block-code"><code>SELECT award, year, nominees.actor, nominees.winner
FROM test.oscars</code></pre>



<p>Wrong! It turns out that&#8217;s not how it works. You need to use the ARRAY construct and UNNEST the nominees in order to subquery them and return a RECORD in the form you need. Here goes:</p>



<pre class="wp-block-code"><code>SELECT
  award,
  year,
  ARRAY(
    SELECT STRUCT(actor, winner)
    FROM UNNEST(nominees)
    ORDER BY winner DESC, actor
  ) AS nominees
FROM test.oscars
ORDER BY year</code></pre>
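<p>If the semantics feel unclear: the <code>ARRAY(SELECT STRUCT(&#8230;) FROM UNNEST(nominees) &#8230;)</code> projection is conceptually a per-row subquery over the nested array. A Python sketch of the same idea (illustrative data only, not how BigQuery actually executes it):</p>

```python
# One source row with a repeated "nominees" field.
rows = [
    {"award": "Best Actor", "year": 2019, "nominees": [
        {"actor": "Antonio Banderas", "winner": False},
        {"actor": "Joaquin Phoenix", "winner": True},
    ]},
]

# ARRAY(SELECT STRUCT(actor, winner) FROM UNNEST(nominees)
#       ORDER BY winner DESC, actor) AS nominees
projected = [
    {
        "award": row["award"],
        "year": row["year"],
        "nominees": sorted(
            ({"actor": n["actor"], "winner": n["winner"]} for n in row["nominees"]),
            key=lambda n: (not n["winner"], n["actor"]),  # winner DESC, actor ASC
        ),
    }
    for row in rows
]
print(projected[0]["nominees"][0]["actor"])  # Joaquin Phoenix
```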



<figure class="wp-block-image size-large"><img loading="lazy" width="686" height="773" src="https://blog.sznapka.pl/wp-content/uploads/2020/10/image-5.png" alt="" class="wp-image-897" srcset="https://blog.sznapka.pl/wp-content/uploads/2020/10/image-5.png 686w, https://blog.sznapka.pl/wp-content/uploads/2020/10/image-5-266x300.png 266w" sizes="(max-width: 686px) 100vw, 686px" /></figure>



<p>By using the ARRAY construct you can run a full query against a repeated field, which gives you a lot of flexibility. Hope you&#8217;ll find it useful!</p>






<p class="has-text-align-right has-small-font-size">Cover photo <a href="https://www.freepik.com/photos/business">created by natanaelginting &#8211; www.freepik.com</a></p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/google-bigquery-querying-repeated-fields/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Quickly ingest initial data to Redis</title>
		<link>https://blog.sznapka.pl/quickly-ingest-initial-data-to-redis/</link>
					<comments>https://blog.sznapka.pl/quickly-ingest-initial-data-to-redis/#respond</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Fri, 16 Oct 2020 15:54:39 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[big-data]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[ingest]]></category>
		<category><![CDATA[mass-insert]]></category>
		<category><![CDATA[migration]]></category>
		<category><![CDATA[redis]]></category>
		<category><![CDATA[redis-pipe]]></category>
		<guid isPermaLink="false">https://blog.sznapka.pl/?p=874</guid>

					<description><![CDATA[Imagine you have a massive data pipeline where thousands of requests per second need to read (that&#8217;s easy) or write (that&#8217;s harder) data. The obvious and often right choice would be to use Redis to handle all that. But what happens when you start it on production and need to have some historical data, in [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Imagine you have a massive data pipeline where thousands of requests per second need to read (that&#8217;s easy) or write (that&#8217;s harder) data. The obvious and often right choice would be to use Redis to handle all that.</p>



<p>But what happens when you launch it on production and need some historical data in order to keep consistency? Of course &#8211; you need to import it. There are many ways to achieve that, including writing a custom script. I urge you to have a look at the <code>redis-cli --pipe</code> option, also called <a href="https://redis.io/topics/mass-insert" data-type="URL" data-id="https://redis.io/topics/mass-insert">Redis Mass Insertion</a>, where you leverage the Redis protocol to ingest a lot of data really quickly (way faster than writing a custom migration script using a Redis SDK).</p>



<span id="more-874"></span>



<p>Once you have <code>redis-cli</code> tool, prepare a file with Redis commands and run <code>redis-cli --pipe</code>:</p>



<pre class="wp-block-code"><code>$ cat migration.txt
SET migrated-test-1 value-1
SET migrated-test-2 value-2
SET migrated-test-3 value-3

$ cat migration.txt | redis-cli  --pipe
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 3

$ redis-cli get migrated-test-1
"value-1"</code></pre>



<p>If you want to use a different data structure provided by Redis, you can do that too. Let&#8217;s take Sorted Sets as an example:</p>



<pre class="wp-block-code"><code>$ cat migration-sortedsets.txt
ZADD migrated-ss-1 1 value1
ZADD migrated-ss-1 2 value2
ZADD migrated-ss-2 1 value3
ZADD migrated-ss-2 2 value4

$ cat migration-sortedsets.txt | redis-cli  --pipe
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 4

$ redis-cli zrange migrated-ss-1 0 -1
1) "value1"
2) "value2"

$ redis-cli zrange migrated-ss-2 0 -1
1) "value3"
2) "value4"</code></pre>



<p>Everything seems perfect so far. But how do you quickly create those import files? I often use a dirty trick which is not the most elegant but works like a charm: use CONCAT, which is provided by virtually every SQL engine:</p>



<pre class="wp-block-code"><code>mysql> select * from games;
+----------------+----------------+------------+
| studio         | game           | popularity |
+----------------+----------------+------------+
| CD Project Red | Witcher III    |          1 |
| CD Project Red | Cyberpunk 2077 |          2 |
| EA Sports      | Fifa 21        |          1 |
| EA Sports      | NHL 21         |          2 |
+----------------+----------------+------------+
4 rows in set (0.00 sec)
</code></pre>



<p>Create a query that transforms your data into Redis commands:</p>



<pre class="wp-block-code"><code>SELECT
  CONCAT('ZADD games-', replace(studio, ' ', '_'), ' ', popularity, ' "', game, '"')
FROM
  games;</code></pre>
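<p>If the SQL trick feels too dirty, the same transformation is only a few lines of script. A Python sketch (with the sample rows hard-coded for illustration):</p>

```python
# Rows as they would come back from the games table.
rows = [
    ("CD Project Red", "Witcher III", 1),
    ("CD Project Red", "Cyberpunk 2077", 2),
    ("EA Sports", "Fifa 21", 1),
    ("EA Sports", "NHL 21", 2),
]

def to_zadd(studio, game, popularity):
    # Same shape as the CONCAT output: ZADD games-<studio> <score> "<game>"
    return 'ZADD games-%s %d "%s"' % (studio.replace(" ", "_"), popularity, game)

lines = [to_zadd(*row) for row in rows]
print("\n".join(lines))
```

<p>Writing <code>lines</code> to a file gives the same <code>redis-import.txt</code> as above, ready for <code>redis-cli --pipe</code>.</p>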



<p>Now run the query, save the result to a file, and ingest it:</p>



<pre class="wp-block-code"><code>$ cat query.sql | mysql -uuser -ppass dbname 2>/dev/null | tail -n +2 > redis-import.txt

$ cat redis-import.txt
ZADD games-CD_Project_Red 1 "Witcher III"
ZADD games-CD_Project_Red 2 "Cyberpunk 2077"
ZADD games-EA_Sports 1 "Fifa 21"
ZADD games-EA_Sports 2 "NHL 21"

$ cat redis-import.txt | redis-cli --pipe
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 4</code></pre>



<p>Our data is here!</p>



<pre class="wp-block-code"><code>$ redis-cli zrange games-CD_Project_Red 0 -1
1) "Witcher III"
2) "Cyberpunk 2077"
$ redis-cli zrange games-EA_Sports 0 -1
1) "Fifa 21"
2) "NHL 21"
</code></pre>



<p>Please note that for larger data sets Redis might refuse to ingest everything at once, so I suggest splitting the file into smaller ones and importing them in a bash for-loop.</p>
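<p>A minimal sketch of that splitting step (Python; the chunk size is an arbitrary choice):</p>

```python
def chunked(lines, size=100_000):
    """Yield successive chunks of at most `size` lines each."""
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

# 250 commands split into chunks of 100 -> 3 files to pipe one by one
commands = ['SET migrated-test-%d value-%d' % (i, i) for i in range(250)]
chunks = list(chunked(commands, size=100))
print(len(chunks))  # 3
```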
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/quickly-ingest-initial-data-to-redis/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Commarize &#8211; publicly available and open-sourced</title>
		<link>https://blog.sznapka.pl/commarize-publicly-available-and-open-sourced/</link>
					<comments>https://blog.sznapka.pl/commarize-publicly-available-and-open-sourced/#respond</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Sun, 11 Oct 2020 19:33:28 +0000</pubDate>
				<category><![CDATA[webdev]]></category>
		<category><![CDATA[comma-separated]]></category>
		<category><![CDATA[commarize]]></category>
		<category><![CDATA[csv]]></category>
		<category><![CDATA[excel]]></category>
		<category><![CDATA[multi-line]]></category>
		<category><![CDATA[sql]]></category>
		<guid isPermaLink="false">https://blog.sznapka.pl/?p=875</guid>

					<description><![CDATA[commarize.com changes multi-line input into comma-separated output]]></description>
										<content:encoded><![CDATA[
<p>TL;DR: <a href="http://commarize.com" data-type="URL" data-id="commarize.com">commarize.com</a> changes multi-line input into the comma-separated output.</p>



<p>Full story: around 6 years ago I created a simple tool to speed up my daily job. The problem was that our Affiliate Manager kept giving me an Excel file with one column &#8211; the IDs of customers whose affiliate association had to be changed in the database. There was a simple query behind it:</p>



<pre class="wp-block-code"><code>UPDATE clients
SET affiliate_id = 100001
WHERE id IN (&lt;here goes comma separated list of clients>);</code></pre>



<p>Of course, you can do that somehow in Excel. I often pasted that column into VIM and inserted the commas with a macro. But that was becoming a hassle when I was asked a couple of times per week, sometimes per day.</p>



<p>I decided to create a simple tool, which looked ugly, but worked just fine.</p>
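<p>The core transformation behind the tool is tiny; in Python terms it boils down to:</p>

```python
def commarize(text: str) -> str:
    """Join non-empty input lines into a comma-separated list."""
    items = [line.strip() for line in text.splitlines() if line.strip()]
    return ", ".join(items)

print(commarize("1001\n1002\n1003"))  # 1001, 1002, 1003
```

<p>The result can be pasted straight into the <code>IN (&#8230;)</code> clause of the UPDATE query above.</p>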



<span id="more-875"></span>



<figure class="wp-block-image size-large"><img loading="lazy" width="850" height="748" src="https://blog.sznapka.pl/wp-content/uploads/2020/10/image.png" alt="" class="wp-image-876" srcset="https://blog.sznapka.pl/wp-content/uploads/2020/10/image.png 850w, https://blog.sznapka.pl/wp-content/uploads/2020/10/image-300x264.png 300w, https://blog.sznapka.pl/wp-content/uploads/2020/10/image-768x676.png 768w" sizes="(max-width: 850px) 100vw, 850px" /></figure>



<p>I&#8217;ve changed jobs since then, but I still use <a href="https://commarize.com">commarize.com</a> almost every day. I&#8217;ve seen people on my teams having the same problem, so I&#8217;ve given them the tool as well. Today is a good time to open-source it and make it slightly prettier, thanks to <a href="https://tailwindcss.com/" data-type="URL" data-id="https://tailwindcss.com/">Tailwind CSS</a>.</p>



<figure class="wp-block-image size-large"><img src="https://commarize.com/commarize-screenshot.png" alt=""/></figure>



<p>I&#8217;ve also pushed the source to GitHub &#8211; feel free to use it your own way or submit a Pull Request or Issue: <a href="https://github.com/wowo/commarize">https://github.com/wowo/commarize</a></p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/commarize-publicly-available-and-open-sourced/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Producing AVRO messages with PHP for Kafka Connect</title>
		<link>https://blog.sznapka.pl/producing-avro-messages-with-php-for-kafka-connect/</link>
					<comments>https://blog.sznapka.pl/producing-avro-messages-with-php-for-kafka-connect/#respond</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Thu, 24 Sep 2020 19:26:17 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[big-data]]></category>
		<category><![CDATA[kafka]]></category>
		<category><![CDATA[kafka-connect]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[streaming]]></category>
		<guid isPermaLink="false">https://blog.sznapka.pl/?p=859</guid>

					<description><![CDATA[Apache Kafka became an obvious choice and industry standard for data streaming. Let's see how to send data to Kafka in AVRO format, so that Kafka Connect can parse it and put it to a sink.]]></description>
										<content:encoded><![CDATA[
<p>Apache Kafka has become an obvious choice and industry standard for data streaming. When streaming large amounts of data it&#8217;s often reasonable to use&nbsp;<a href="http://avro.apache.org/">AVRO</a>&nbsp;format, which has at least three advantages:</p>



<ul><li>it&#8217;s one of the most size-efficient formats (compared to JSON, protobuf, or parquet); an AVRO-serialized payload can be 10 times smaller than its JSON equivalent,</li><li>it enforces usage of a schema,</li><li>it works out of the box with Kafka Connect (it&#8217;s a requirement if you&#8217;d like to use the BigQuery sink connector).</li></ul>



<p>Let&#8217;s see how to send data to Kafka in AVRO format from a PHP producer, so that Kafka Connect can parse it and put the data to a sink.</p>



<span id="more-859"></span>



<p>I assume you know the basics of AVRO and Schema Registry, but if not &#8211; let me know in the comments, I&#8217;d be happy to help with setting up Schema Registry! Footnote: the easiest way is to use Confluent&#8217;s official Docker image and deploy it on Kubernetes.</p>



<p class="has-text-align-left">In order to use a PHP producer that serializes payloads in AVRO, we need to send them in a particular &#8220;envelope&#8221; which contains the schema ID. This way Kafka Connect will know which schema should be retrieved from Schema Registry.</p>
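<p>That envelope is the Confluent wire format: a zero &#8220;magic byte&#8221;, the schema ID as a 4-byte big-endian integer, and then the AVRO-encoded body. A Python sketch of the framing (the AVRO body is stubbed out with placeholder bytes here):</p>

```python
import struct

def frame_for_kafka_connect(schema_id: int, avro_body: bytes) -> bytes:
    # magic byte 0 + big-endian uint32 schema ID + AVRO binary payload
    return b"\x00" + struct.pack(">I", schema_id) + avro_body

payload = frame_for_kafka_connect(42, b"avro-bytes")
print(payload[:5].hex())  # 000000002a
```

<p>The PHP code later in this post builds exactly the same header with <code>pack('C', 0)</code> and <code>pack('N', $id)</code>.</p>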



<p>Let&#8217;s consider that our PHP Kafka producer will send a simple payload:</p>



<pre class="wp-block-code"><code>$data = &#91;
  'time' => '2020-09-24 20:45:00',
  'level' => 'info',
  'channel' => 'main',
  'message' => 'Some log entry has been produced',
];</code></pre>



<p>First of all, you need to create an AVRO schema for it. You can generate one using a number of tools available online, like <a href="https://toolslick.com/generation/metadata/avro-schema-from-json">this one</a>.</p>



<p>Our AVRO schema will look like this:</p>



<pre class="wp-block-code"><code>$schema = &lt;&lt;&lt;SCHEMA
{
  "name": "LogEntry",
  "type": "record",
  "namespace": "pl.sznapka",
  "fields": &#91;
    {
      "name": "time",
      "type": "string"
    },
    {
      "name": "level",
      "type": "string"
    },
    {
      "name": "channel",
      "type": "string"
    },
    {
      "name": "message",
      "type": "string"
    }
  ]
}
SCHEMA;</code></pre>



<p>The next step is to register our AVRO schema in Schema Registry and obtain its ID. You can either call the Schema Registry API directly or use the PHP library <a href="https://github.com/flix-tech/schema-registry-php-client">flix-tech/confluent-schema-registry-api</a> (which I recommend).</p>



<p>Note &#8211; you should register your schema under the Kafka topic name with the suffix &#8216;-value&#8217;:</p>



<pre class="wp-block-code"><code>$kafkaTopicName = 'logs';
$subject = $kafkaTopicName . '-value';
$schemaRegistry->register($subject, \AvroSchema::parse($schema));</code></pre>



<p>Once your schema is in the Schema Registry, you need to retrieve the ID for your subject:</p>



<pre class="wp-block-code"><code>$schema = $schemaRegistry->latestVersion($subject);
$id = $schemaRegistry->schemaId($subject, $schema);
</code></pre>



<p>The last part is to create the AVRO-serialized payload with a header that will be readable for Kafka Connect:</p>



<pre class="wp-block-code"><code>$io = new \AvroStringIO();
$io->write(pack('C', 0)); // magic byte (wire format version)
$io->write(pack('N', $id));
$encoder = new \AvroIOBinaryEncoder($io);
$writer = new \AvroIODatumWriter($schema);
$writer->write($data, $encoder);

$kafkaPayload = $io->string();</code></pre>



<p>Now we have <code>$kafkaPayload</code>, which we can send to Kafka (using <a href="https://github.com/arnaud-lb/php-rdkafka">https://github.com/arnaud-lb/php-rdkafka</a>) and then distribute to any sink you register in Kafka Connect. In my case it&#8217;s often BigQuery and Redis.</p>



<pre class="wp-block-code"><code>$topic = $kafkaProducer->newTopic($kafkaTopicName);
$topic->produce(RD_KAFKA_PARTITION_UA, 0, $kafkaPayload);
$kafkaProducer->flush(1000);</code></pre>



<p>That&#8217;s it, happy streaming!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/producing-avro-messages-with-php-for-kafka-connect/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Real-time big data processing with Spark Streaming</title>
		<link>https://blog.sznapka.pl/real-time-big-data-processing-with-spark-streaming/</link>
					<comments>https://blog.sznapka.pl/real-time-big-data-processing-with-spark-streaming/#respond</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Fri, 09 Sep 2016 13:15:08 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[big-data]]></category>
		<category><![CDATA[flume]]></category>
		<category><![CDATA[kafka]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[streaming]]></category>
		<guid isPermaLink="false">http://blog.sznapka.pl/?p=826</guid>

					<description><![CDATA[Big Data is a trending topic in the IT sector and has been for quite some time. Nowadays vast amounts of data are being produced, especially by web applications, HTTP logs, or Internet of Things devices.

For such volumes, traditional tools like Relational Database Management Systems are no longer suitable. Terabytes or even petabytes are quite common numbers in big data context, which is definitely not the capacity that MySQL, PostgreSQL, or any other database can pick up.

To harness huge amounts of data, Apache Hadoop would generally be the first and natural choice, and it’s probably right, with one assumption: Apache Hadoop is a great tool for batch processing. It proved to be extremely successful for many companies, such as Spotify. Their recommendations, radio, playlist workloads, etc. are suitable for batch processing. However, it has one downside – you need to wait for your turn. It usually takes about one day to process everything, scheduled accordingly and executed in a fail-over manner.]]></description>
										<content:encoded><![CDATA[<p>Big Data is a trending topic in the IT sector and has been for quite some time. Nowadays vast amounts of data are being produced, especially by web applications, HTTP logs, or Internet of Things devices.</p>
<p>For such volumes, traditional tools like Relational Database Management Systems are no longer suitable. Terabytes or even petabytes are quite common numbers in big data context, which is definitely not the capacity that MySQL, PostgreSQL, or any other database can pick up.</p>
<p>To harness huge amounts of data, Apache Hadoop would generally be the first and natural choice, and it’s probably right, with one assumption: Apache Hadoop is a great tool for batch processing. It proved to be extremely successful for many companies, such as Spotify. Their recommendations, radio, playlist workloads, etc. are suitable for batch processing. However, it has one downside – you need to wait for your turn. It usually takes about one day to process everything, scheduled accordingly and executed in a fail-over manner.</p>
<p>But what if we don’t want or can’t wait?</p>
<p><span id="more-826"></span></p>
<p>In this instance, streaming technology comes to the rescue. Everything started with Apache Storm project, released in September 2011 and later acquired by Twitter. It’s still a significant player in the streaming market, but nowadays Apache Spark and its streaming module have gained incredible popularity. Spark Streaming provides scalable, high-throughput, and fault-tolerant stream processing of live data streams.</p>
<p>Unlike Apache Storm, which is purely a stream events processor, Spark Streaming can be combined with other Spark libraries, such as machine learning, SQL, or graph processing.</p>
<p>That gives endless possibilities and use case coverage.</p>
<p>Overall, Spark Streaming doesn’t differ that much from regular Spark batch jobs. In both cases, we operate on some input, apply transformations and/or compute something out of our input data and then output it somewhere. The only difference is the continuous character of streaming jobs – they run indefinitely until we terminate them (just like a stream of water in a river compared to a bucket filled with water, as an analogy to batch processing). That said, we can choose one of the sources of our streams:</p>
<ul>
<li>Apache Kafka</li>
<li>Apache Flume</li>
<li>HDFS or S3 filesystem</li>
<li>Amazon Kinesis</li>
<li>Twitter</li>
<li>TCP socket</li>
<li>MQTT</li>
<li>ZeroMQ</li>
</ul>
<p>A custom source can also be implemented (not available for Python yet, just Java and Scala).<br />
As seen in the list above, there&#8217;s quite a rich choice of inputs. Of those, two are really fascinating: Kafka and Flume.</p>
<p>Apache Kafka is a natural way forward when one needs to deal with a huge throughput. Designed by LinkedIn engineering team, it’s capable of handling millions of requests per second and can guarantee the order of messages in the queue. It was built with scalability in mind and it is one of the easiest ways to integrate with Spark Streaming. What’s worth highlighting – Kafka comes very handy in terms of installation and publishing/subscribing (provides binding in the most popular languages).</p>
<p>On the other hand, it often happens that sources of streams are varied and come in various formats. In that case, Apache Flume is an intelligent choice. It is a standalone Linux program which takes care of collecting and moving large amounts of data. The concept is quite similar to Spark Streaming, with one exception &#8211; its only responsibility is to move data from a source to a sink.</p>
<p>Flume supports various sources, like execute command output, directory spool, Kafka, Syslog, HTTP. There’s also a decent choice of sinks, like HDFS, file roll, Kafka, ElasticSearch to name just a few. But the power of Flume rests in custom sinks and sources. One of them is custom Spark Sink, which is used to either push data to Spark directly (through Avro buffer) or expose that data, so Spark can pull it. There’s an impressive list of Flume plugins, be it sinks or sources, which allows you to connect multiple sources to Spark sink for further stream processing.</p>
<p>There are a few points to keep in mind when working with streaming, as opposed to regular batch processing:</p>
<ul>
<li>Stateful operations – each DStream (discretized stream) consists of RDDs delivered in defined intervals. We can process them with pretty much the same API functions as normal Map-Reduce tasks. The only difference is that every interval starts from a blank slate; to relate to data processed in previous intervals, state functions need to be applied to the stream (mapWithState or updateStateByKey),</li>
<li>Window operations – transformations can be applied over a sliding window of data,</li>
<li>Check-pointing – to ensure resilience and fault-tolerance, Spark Streaming can periodically checkpoint its state, which is useful for recovering from node failures,</li>
<li>Machine Learning operations – MLlib can be used in streaming jobs, and there are dedicated streaming algorithms (like Streaming Linear Regression or Streaming KMeans). For other applications, traditional MLlib models already trained on historical data can be applied to streaming data.</li>
</ul>
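<p>The semantics of the stateful operations mentioned above can be illustrated without a Spark cluster. The plain-Python sketch below is my own simplification, not Spark’s API: each micro-batch’s new values are folded into the state carried over from previous intervals, here implementing a running word count in the spirit of updateStateByKey.</p>

```python
def update_state_by_key(batches, update_func):
    """Apply update_func(new_values, previous_state) per key, batch by batch,
    mimicking how Spark Streaming carries state across intervals."""
    state = {}
    for batch in batches:
        # Group this interval's (key, value) pairs by key.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        for key, new_values in grouped.items():
            state[key] = update_func(new_values, state.get(key))
    return state

def running_count(new_values, previous):
    # previous is None the first time a key is seen.
    return (previous or 0) + sum(new_values)

batches = [
    [("spark", 1), ("kafka", 1)],   # interval 1
    [("spark", 1), ("spark", 1)],   # interval 2
]
print(update_state_by_key(batches, running_count))  # {'spark': 3, 'kafka': 1}
```

<p>In real Spark Streaming, the state dictionary is itself a distributed, check-pointed structure, but the update function you supply has exactly this shape.</p>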
<p>As seen above, Apache Spark is a tremendously powerful way to deal with streaming workloads. It leverages a common API for RDD transformations, which eases the way towards a clean and coherent Lambda Architecture and lets you utilize one Apache Spark setup for all kinds of big data processing.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/real-time-big-data-processing-with-spark-streaming/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Type Hinting is important</title>
		<link>https://blog.sznapka.pl/type-hinting-is-important/</link>
					<comments>https://blog.sznapka.pl/type-hinting-is-important/#comments</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Wed, 11 Jun 2014 19:57:19 +0000</pubDate>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[ddd]]></category>
		<category><![CDATA[OOP]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[software architecture]]></category>
		<guid isPermaLink="false">http://blog.sznapka.pl/?p=787</guid>

					<description><![CDATA[One of my favorite PHP interview questions, is: what is Type Hinting and why it&#8217;s important? Putting definition in one sentence, Type Hinting is a way to define type of parameter in function signature and it&#8217;s a sine qua non to leverage polymorphism. Because of dynamic typing in PHP, parameters don&#8217;t need to have type [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>One of my favorite PHP interview questions, is: what is Type Hinting and why it&#8217;s important? Putting definition in one sentence, <a href="http://www.php.net/manual/en/language.oop5.typehinting.php">Type Hinting</a> is a way to define type of parameter in function signature and it&#8217;s a sine qua non to leverage <a href="http://en.wikipedia.org/wiki/Polymorphism_(computer_science)">polymorphism</a>. Because of dynamic typing in PHP, parameters don&#8217;t need to have type used. Also, by type here, I mean complex types (class, abstract class, interface, array, closure), not primitives like integer or double.<br />
<span id="more-787"></span><br />
So given the fact that Type Hinting is optional and we don&#8217;t need to specify the types of parameters passed to a method &#8211; why bother? The answer is easy: well-prepared method signatures define your model and are part of the &#8222;contract&#8221; that your code reveals to its consumers. They also prevent many silly errors and keep the codebase clean and coherent.</p>
<p>Now, if we all agree that Type Hinting is the way to go, what should we put in method signatures? We have a few options: a concrete class, a base class or an interface. It all depends on the situation. The most flexible way is the interface, because it&#8217;s a small definition of object behavior and one class can implement multiple interfaces. Moreover, interfaces can be mocked very easily (both by mocking tools, like Mockery, and by mocks written by hand). All of this adds up to great flexibility.</p>
<p>Another option is to use a class as the type hint. You&#8217;re then limited to instances of this class and its descendants. With a multi-tier inheritance hierarchy, it&#8217;s often a good idea to use a class near the root, or even an abstract class, which makes the method applicable to the whole inheritance graph.</p>
<p>The least flexible way is to use a concrete class (one that is effectively final, or never extended in the current system). In such a case you limit the method to serving only such objects, which is understandable when the method does a specific job.</p>
<p>Three options were presented above (interface, base class and concrete class). There&#8217;s one more very important thing that needs to be kept in mind. Although PHP allows you to call methods that are not part of the type declared for a parameter, you <strong>should never</strong> do that! It can lead to some bizarre errors and breaks the Liskov Substitution Principle. Simply put: if you use a type in a method&#8217;s signature, then the method should rely only on that type, not its descendants (even if we know about their existence), so that you can substitute the given type with a new subclass without altering the method body.</p>
<p>Have a look at possible violation of Liskov Substitution Principle:</p>
<pre class="lang:php decode:true ">class UserRepository extends \Doctrine\ORM\EntityRepository
{
    public function findActiveUsers()
    {
        // do some query to retrieve the result
        return $activeUserCollection;
    }
}

// .. 

public function notifyActiveUsers(EntityRepository $repo)
{
    if ($repo instanceof UserRepository) {
         $usersCollection = $repo-&gt;findActiveUsers();
    } elseif ($repo instanceof ManagersRepository) {
         // .. do something else
     }
    // do something with $usersCollection
}</pre>
<p>As we can see, the type-hinted method notifyActiveUsers internally relies on a specific extension of EntityRepository. This breaks the LSP and leads to an unreadable model. An even worse situation is the following:</p>
<pre class="lang:php decode:true ">public function notifyActiveUsers(EntityRepository $repo)
{
    $usersCollection = $repo-&gt;findActiveUsers(); 
    // do something with $usersCollection
}</pre>
<p>At the implementation stage, we knew that only UserRepository instances were passed to this method. But at some random point in time, someone can pass another implementation of EntityRepository, and we have a problem. In Java, the compiler wouldn&#8217;t allow you to compile such code, but PHP accepts it at the interpretation stage and it fails only at runtime.</p>
<p>To sum things up: Type Hinting is an essential concept in OOP and should be used whenever you pass an object as a method parameter. The most flexible way is to use interfaces, but base classes are often also sufficient. One should never check or rely on the subclass of the object passed to a method, because the type declared in the method signature should be the only type we operate on within the method&#8217;s scope. Applying those straightforward rules allows us to create robust and polymorphic OOP designs.</p>
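<p>The cure for the violation shown earlier is to type-hint the narrowest abstraction that actually provides the behavior the method needs. The examples in this post are PHP, but the shape of the fix is language-agnostic; here is a sketch of it in Python (the ActiveUsersProvider name and the sample data are mine, purely illustrative):</p>

```python
from abc import ABC, abstractmethod

class ActiveUsersProvider(ABC):
    """The one capability notify_active_users actually needs."""
    @abstractmethod
    def find_active_users(self):
        ...

class UserRepository(ActiveUsersProvider):
    def find_active_users(self):
        return ["alice", "bob"]  # stand-in for a real query

class ManagersRepository(ActiveUsersProvider):
    def find_active_users(self):
        return ["carol"]

def notify_active_users(repo: ActiveUsersProvider):
    # No instanceof/isinstance checks: any implementation is
    # substitutable, so the LSP holds.
    return [f"notified {user}" for user in repo.find_active_users()]

print(notify_active_users(UserRepository()))      # ['notified alice', 'notified bob']
print(notify_active_users(ManagersRepository()))  # ['notified carol']
```

<p>The method body now depends only on the declared type, so new repository implementations can be added without ever touching notify_active_users.</p>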
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/type-hinting-is-important/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
		<item>
		<title>Immutable value objects in PHP</title>
		<link>https://blog.sznapka.pl/immutable-value-objects-in-php/</link>
					<comments>https://blog.sznapka.pl/immutable-value-objects-in-php/#comments</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Thu, 15 May 2014 21:07:50 +0000</pubDate>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[webdev]]></category>
		<category><![CDATA[ddd]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[software architecture]]></category>
		<category><![CDATA[value object]]></category>
		<guid isPermaLink="false">http://blog.sznapka.pl/?p=760</guid>

					<description><![CDATA[Value objects are one of building blocks in Domain Driven Design. They represents a value and does not have an identity. That said, two value objects are equal if their values are equal. Other important feature is that Value Objects are immutable, i.e. they can not be modified after creation. Only valid way to create [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Value objects are one of building blocks in Domain Driven Design. They represents a value and does not have an identity. That said, two value objects are equal if their values are equal.</p>
<p>Another important feature is that Value Objects are immutable, i.e. they cannot be modified after creation. The only valid way to create a Value Object is to pass all required information to the constructor (where it should also be validated). No setter methods should take place.<span id="more-760"></span></p>
<p>This post isn&#8217;t about the obvious advantages of representing domain logic with the support of Value Objects. Nor will we elaborate here on the pros and cons of immutable objects. I&#8217;d rather like to show an attempt to change a Value Object while keeping it immutable, using one of the most bizarre, in my opinion, features of the PHP language, which is <a href="http://blog.sznapka.pl/accessing-private-object-property-from-other-object/">accessing private fields from outside an object</a>.</p>
<p>Oftentimes you would like to alter a Value Object by creating a new one based on the current one (which is the only valid way). The altering logic conceptually belongs to the Value Object class, so it should be located there. In such a method, you clone the current instance, set the new information on the given field of the clone and return the copy. And this is where accessing private fields of the same class makes sense: you can do it without additional setters, which would break the design.</p>
<p><script src="https://gist.github.com/wowo/b49ac45b975d5c489214.js"></script></p>
<p>As illustrated above, all Value Object features are in place. This example is of course trivial, but you can imagine much more complicated VOs, like values shared between bounded contexts.</p>
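<p>Since the embedded gist may not render in every feed reader, here is the same &#8222;with-er&#8221; idea sketched in Python for comparison (the Money type and its fields are mine, not taken from the gist): a frozen dataclass plays the role of the immutable Value Object, and the altering method returns a modified copy instead of mutating the instance.</p>

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Money:
    amount: int
    currency: str

    def with_amount(self, amount: int) -> "Money":
        # Return a modified copy; the original instance is never mutated.
        return replace(self, amount=amount)

price = Money(100, "EUR")
discounted = price.with_amount(80)
print(price)                       # Money(amount=100, currency='EUR')
print(discounted)                  # Money(amount=80, currency='EUR')
print(price == Money(100, "EUR"))  # True: equality by value, not identity
```

<p>Attempting to assign to a field of a frozen dataclass raises an error, which mirrors the no-setters rule of Value Objects.</p>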
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/immutable-value-objects-in-php/feed/</wfw:commentRss>
			<slash:comments>9</slash:comments>
		
		
			</item>
		<item>
		<title>Software developers care too much about tools</title>
		<link>https://blog.sznapka.pl/software-developers-care-too-much-about-tools/</link>
					<comments>https://blog.sznapka.pl/software-developers-care-too-much-about-tools/#comments</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Sun, 27 Apr 2014 16:20:22 +0000</pubDate>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[webdev]]></category>
		<category><![CDATA[design patterns]]></category>
		<category><![CDATA[framework]]></category>
		<category><![CDATA[mvc]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[software development]]></category>
		<guid isPermaLink="false">http://blog.sznapka.pl/?p=754</guid>

					<description><![CDATA[Lately I see perilous situation in software development area. There are plenty of good devs so much bounded to tools. By tools, I mean mostly frameworks. I would like to elaborate a bit about that, but those are my personal opinions and they aren&#8217;t here to offend anyone. First of all, we all need to admit, [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Lately I see perilous situation in software development area. There are plenty of good devs so much bounded to tools. By tools, I mean mostly frameworks. I would like to elaborate a bit about that, but those are my personal opinions and they aren&#8217;t here to offend anyone.</p>
<p>First of all, we all need to admit that the quality of modern MVC frameworks has risen a lot compared with the state of things a few years ago. Speaking about PHP &#8211; at the time when I turned my attention to this language, it was pure wilderness. We did not have any strong framework (unlike Ruby on Rails, which was the sine qua non choice for Ruby web development). That caused the development of multiple projects; some of them are dead now (or should be), some haven&#8217;t got good market adoption, and some of them are industry leaders at the moment (Symfony and Zend).<br />
<span id="more-754"></span><br />
On the other hand, there&#8217;s a huge temptation to write your own framework, ignoring the great work of the community. That has some advantages, in case you know <span style="text-decoration: underline;">exactly</span> what you&#8217;re doing. The only good reason for me is performance concerns. But still, doing everything by hand usually proves a lack of understanding of the tools and leads to giant problems with system maintenance. For me, it&#8217;s hard to imagine how one could create a complex system without the use of a good framework. What&#8217;s more, it&#8217;s economically unreasonable to recreate code that already exists.</p>
<p>Alright, it&#8217;s clear &#8211; applications which will serve their purpose are much easier to create with a modern framework. The choice isn&#8217;t easy (nor is the choice of language), but if you ask me, I&#8217;ll say: pick the one you feel the most comfortable with and which is built on top of the best design patterns. A framework won&#8217;t do the job on its own, though. And this is the point I&#8217;d like to make: don&#8217;t be bound to the framework. The best quote to reflect this point of view is:</p>
<blockquote><p>The architecture of an accounting app should scream &#8222;accounting&#8221; not Spring &amp; Hibernate.<br />
Robert C. Martin via <a href="https://twitter.com/unclebobmartin/status/118365858581069824">https://twitter.com/unclebobmartin/status/118365858581069824</a></p></blockquote>
<p>By decoupling from the framework (see <a href="https://twitter.com/jakub_zalas">Jakub Zalas</a>&#8217;s <a href="https://speakerdeck.com/jakzal/decoupling-from-the-framework">slides</a>) you&#8217;ll benefit in multiple ways: your code will be loosely coupled, easier to understand, readable, testable and, most importantly, robust. If for some reason you have to change frameworks (because yours isn&#8217;t supported any more and the super third edition of a famous framework reaches general availability), you&#8217;ll spend considerably less time migrating to new libraries.</p>
<p>A thing to remember is that good software design practices, such as design patterns or the SOLID principles, have existed for years now. They are applied in all programming languages and you&#8217;ll find similar concepts both in Java&#8217;s Spring and PHP&#8217;s Symfony. Frameworks, on the other hand, come and go. Three years from now, there won&#8217;t be a Symfony2 or a Zend Framework 2, but your code will still be alive and will need to be maintained. It&#8217;s your choice whether it will be completely dependent on a framework or will rely on proven patterns.</p>
<p>I strongly encourage you to read about and apply the philosophy of Domain Driven Design. It&#8217;s better to focus on the core domain and reflect business needs by modelling them in code. Once you&#8217;re focused on the domain, you&#8217;ll start to see that the framework is only an implementation detail, and you&#8217;ll stop calling yourself a Symfony Developer or a Zend Developer, but rather a Software Developer.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/software-developers-care-too-much-about-tools/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Testing in isolation with Symfony2 and WebTestCase</title>
		<link>https://blog.sznapka.pl/testing-in-isolation-with-symfony2-and-webtestcase/</link>
					<comments>https://blog.sznapka.pl/testing-in-isolation-with-symfony2-and-webtestcase/#comments</comments>
		
		<dc:creator><![CDATA[Wojciech Sznapka]]></dc:creator>
		<pubDate>Thu, 24 Oct 2013 13:54:25 +0000</pubDate>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Symfony]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[test]]></category>
		<category><![CDATA[unit tests]]></category>
		<guid isPermaLink="false">http://blog.sznapka.pl/?p=735</guid>

					<description><![CDATA[It&#8217;s extremely important to have same state of the System Under Test. In most of the cases it will be possible by having same contents in a database for every test. I&#8217;ve decribed how to achieve it in Fully isolated tests in Symfony2 blog post about two years ago (btw. it&#8217;s most popular post on [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>It&#8217;s extremely important to have same state of the System Under Test. In most of the cases it will be possible by having same contents in a database for every test. I&#8217;ve decribed how to achieve it in <a href="http://blog.sznapka.pl/fully-isolated-tests-in-symfony2/">Fully isolated tests in Symfony2</a> blog post about two years ago (btw. it&#8217;s most popular post on this blog). It was the time, when PHP&#8217;s Traits weren&#8217;t that popular.<br />
<span id="more-735"></span><br />
In IsolatedTestsTrait I introduced the idea of rebuilding the schema with fixtures from scratch into an SQLite file database. This is done once, at the beginning of the test suite; the resulting file is then copied and reused for each test in the given suite. In the end, the file is removed in the tearDownAfterClass method. This significantly increases performance, since you don&#8217;t rebuild the schema and don&#8217;t load fixtures for every test. It is also clean and non-intrusive, because you only type <em>use IsolatedTestsTrait;</em> in test cases that test something related to database state. Now you can easily conduct functional or integration tests within a consistent, isolated environment.</p>
<p>PS. LiipFunctionalTestBundle comes with similar concepts, maybe it&#8217;s worth having a look.</p>
<p>IsolatedTestsTrait is available here as a <a title="IsolatedTestsTrait" href="https://gist.github.com/wowo/7137331" target="_blank" rel="noopener noreferrer">Gist</a>.</p>
<p>For your convenience I&#8217;m also putting the code below. Feel free to comment and propose improvements!<br />
[gist id=&#8221;7137331&#8243;]</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blog.sznapka.pl/testing-in-isolation-with-symfony2-and-webtestcase/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
	</channel>
</rss>
