<?xml version="1.0" encoding="utf-8" standalone="no"?><rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0"><channel><title>Cristian</title><description>I have experience utilizing data science tools to create and maintain robust data processing pipelines. I am familiar with building web applications using modern front-end JavaScript frameworks. I also have a background in managing IT infrastructure for large compute servers. In my spare time, I like to build Internet of Things (IoT) applications.
</description><managingEditor>noemail@noemail.org (Cristian Brokate)</managingEditor><pubDate>Thu, 19 Feb 2026 20:22:45 GMT</pubDate><generator>Jekyll https://jekyllrb.com/</generator><link>https://cristianpb.github.io/</link><language>en-us</language><itunes:explicit>no</itunes:explicit><itunes:summary>I have experience utilizing data science tools to create and maintain robust data processing pipelines. I am familiar with building web applications using modern front-end JavaScript frameworks. I also have a background in managing IT infrastructure for large compute servers. In my spare time, I like to build Internet of Things (IoT) applications. </itunes:summary><itunes:subtitle>I have experience utilizing data science tools to create and maintain robust data processing pipelines. I am familiar with building web applications using modern front-end JavaScript frameworks. I also have a background in managing IT infrastructure for l</itunes:subtitle><itunes:owner><itunes:email>noemail@noemail.org</itunes:email></itunes:owner><item><title>Artificial Content Generation</title><link>https://cristianpb.github.io/blog/automatic-content-generation</link><category>data science</category><category>python</category><category>llm</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Sat, 7 Dec 2024 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/automatic-content-generation</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In today’s digital age, creating artificial content has become simpler than ever before, thanks to advancements in technology and AI tools. This kind of content can be an effective way to capture the attention of users and convey a particular message or idea.</p>

<p>For example, businesses and marketers can use AI-generated images or videos to create engaging content that stands out on social media platforms or other digital channels. These visuals can be tailored to specific audiences or trends, making them more likely to resonate with users and generate engagement.</p>

<p>However, it’s important to note that while artificial content can be effective in capturing attention, it should always be used ethically and responsibly. Businesses should ensure that their use of AI tools is transparent and clearly disclosed to users, and that they are not attempting to deceive or mislead their audience in any way.</p>

<p>Furthermore, while artificial content can be an effective tool for capturing attention, it’s only one piece of the puzzle when it comes to creating a successful digital marketing strategy. Businesses should also focus on creating high-quality content that provides value and relevance to their audience, as well as optimizing their site’s user experience and overall performance.</p>

<h2 id="text-generation">Text Generation</h2>

<p>When it comes to utilizing large language models (LLMs), there are typically two routes you can take: self-hosted models or hosted APIs. Self-hosting allows for greater customization and control, but also requires more technical expertise and resources. Using a hosted API, such as OpenAI’s, is often more convenient: it exposes many different models behind a single, pre-built interface.</p>

<p>Another option for those looking to utilize LLMs without self-hosting or paying for access is Groq, which offers a free tier for hosted open models. This can be an attractive option for anyone who wants to experiment with LLMs without incurring any costs.</p>

<p>Once you have selected your preferred method of accessing an LLM, the following Python example demonstrates how to query the model and obtain a text response. Since Groq and local servers such as Ollama expose OpenAI-compatible endpoints, the same client can target any of them by changing <code class="language-plaintext highlighter-rouge">base_url</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
    <span class="c1"># Groq API
</span>    <span class="n">base_url</span><span class="o">=</span><span class="s">"https://api.groq.com/openai/v1"</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="n">GROQ_API_KEY</span>
    <span class="c1"># local instance
</span>    <span class="c1"># base_url = 'http://localhost:11434/v1',
</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">get_message</span><span class="p">(</span><span class="n">publication_date</span><span class="p">):</span>
    <span class="n">content</span> <span class="o">=</span> <span class="p">(</span>
        <span class="s">"Generate content for a web post about NERF; the output has to be in Markdown format, "</span>
        <span class="s">"the text starts with a title, a number sign (#), then a newline (</span><span class="se">\n</span><span class="s">), for example </span><span class="se">\'</span><span class="s"># Celebrate Winter with Nerf</span><span class="se">\n\'</span><span class="s">, "</span>
        <span class="sa">f</span><span class="s">"take into account important events that happen on </span><span class="si">{</span><span class="n">publication_date</span><span class="si">}</span><span class="s"> or during that week or month"</span>
    <span class="p">)</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="s">"llama-3.3-70b-versatile"</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
        <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"You are a helpful assistant that writes content."</span><span class="p">},</span>
        <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">content</span><span class="p">},</span>
      <span class="p">]</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>
</code></pre></div></div>

<p>If you want to generate one post per date across the year, you can query the model once for each publication date so that every input produces a unique output. Here’s an example that uses Python and the Pandas library to build a schedule of roughly two randomly jittered posts per week, then generate a message for each date:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">date</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">timedelta</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="n">tqdm</span><span class="p">.</span><span class="n">pandas</span><span class="p">()</span>  <span class="c1"># registers the progress_apply method used below</span>

<span class="n">n_posts</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">df</span> <span class="o">=</span> <span class="p">(</span>
  <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
  <span class="p">.</span><span class="n">assign</span><span class="p">(</span>
    <span class="n">date</span> <span class="o">=</span> <span class="p">(</span>
      <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span>
          <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">([</span>
            <span class="n">datetime</span><span class="p">.</span><span class="n">combine</span><span class="p">(</span><span class="n">date</span><span class="p">(</span><span class="mi">2024</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span> \
            <span class="o">+</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="n">i</span><span class="o">*</span><span class="mf">3.5</span> <span class="o">+</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">*</span> <span class="mi">1</span><span class="p">,</span> <span class="n">hours</span><span class="o">=</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span><span class="o">*</span><span class="mi">24</span><span class="p">)</span> \
            <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_posts</span><span class="p">)</span>
          <span class="p">])</span>
      <span class="p">)</span>
      <span class="p">.</span><span class="n">dt</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d %H:%M'</span><span class="p">)</span>
    <span class="p">),</span>
    <span class="n">message</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span>
      <span class="n">x</span><span class="p">[</span><span class="s">'date'</span><span class="p">].</span><span class="n">progress_apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> \
        <span class="n">get_message</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%B %-dth'</span><span class="p">))</span>
        <span class="p">)</span>
    <span class="p">)</span>
  <span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
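<p>One caveat with the snippet above: the <code class="language-plaintext highlighter-rouge">'%B %-dth'</code> format blindly appends “th” to every day (yielding “May 1th”), and the <code class="language-plaintext highlighter-rouge">%-d</code> flag is platform-specific. A small helper, not part of the original snippet, produces correct English ordinals portably:</p>

```python
from datetime import date

def ordinal(n):
    # 11th, 12th and 13th are exceptions to the 1st/2nd/3rd rule
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def human_date(d):
    # Portable alternative to strftime('%B %-dth'), e.g. "December 7th"
    return f"{d.strftime('%B')} {ordinal(d.day)}"
```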

<p>Below are some results of the generated text.
It is interesting to notice that the models are able to relate dates to events such as winter, Women’s Day and Halloween, among others.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">january</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Get</span><span class="nv"> </span><span class="s">Ready</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">NERF</span><span class="nv"> </span><span class="s">This</span><span class="nv"> </span><span class="s">Winter</span><span class="nv"> </span><span class="s">Season!'</span><span class="err">,</span>
<span class="na">february</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Celebrate</span><span class="nv"> </span><span class="s">Love</span><span class="nv"> </span><span class="s">and</span><span class="nv"> </span><span class="s">NERF</span><span class="nv"> </span><span class="s">this</span><span class="nv"> </span><span class="s">February'</span><span class="err">,</span>
<span class="na">march</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Celebrate</span><span class="nv"> </span><span class="s">International</span><span class="nv"> </span><span class="s">Women's</span><span class="nv"> </span><span class="s">Day</span><span class="nv"> </span><span class="s">with</span><span class="nv"> </span><span class="s">NERF"</span><span class="err">,</span>
<span class="na">april</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Celebrate</span><span class="nv"> </span><span class="s">Earth</span><span class="nv"> </span><span class="s">Day</span><span class="nv"> </span><span class="s">with</span><span class="nv"> </span><span class="s">NERF'</span><span class="err">,</span>
<span class="na">may</span><span class="pi">:</span> <span class="s1">'</span><span class="s">NERF:</span><span class="nv"> </span><span class="s">Celebrating</span><span class="nv"> </span><span class="s">Community</span><span class="nv"> </span><span class="s">and</span><span class="nv"> </span><span class="s">Creativity</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">May'</span><span class="err">,</span>
<span class="na">june</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Celebrate</span><span class="nv"> </span><span class="s">Summer</span><span class="nv"> </span><span class="s">with</span><span class="nv"> </span><span class="s">NERF:</span><span class="nv"> </span><span class="s">Fun</span><span class="nv"> </span><span class="s">and</span><span class="nv"> </span><span class="s">Games</span><span class="nv"> </span><span class="s">in</span><span class="nv"> </span><span class="s">June!'</span><span class="err">,</span>
<span class="na">july</span><span class="pi">:</span> <span class="s1">'</span><span class="s">7</span><span class="nv"> </span><span class="s">Epic</span><span class="nv"> </span><span class="s">Ways</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">NERF</span><span class="nv"> </span><span class="s">Your</span><span class="nv"> </span><span class="s">Way</span><span class="nv"> </span><span class="s">Through</span><span class="nv"> </span><span class="s">Summer</span><span class="nv"> </span><span class="s">Fun'</span><span class="err">,</span>
<span class="na">august</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Celebrate</span><span class="nv"> </span><span class="s">National</span><span class="nv"> </span><span class="s">Watermelon</span><span class="nv"> </span><span class="s">Day</span><span class="nv"> </span><span class="s">with</span><span class="nv"> </span><span class="s">NERF</span><span class="nv"> </span><span class="s">Blasters!'</span><span class="err">,</span>
<span class="na">september</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Celebrate</span><span class="nv"> </span><span class="s">World</span><span class="nv"> </span><span class="s">Noodle</span><span class="nv"> </span><span class="s">Day</span><span class="nv"> </span><span class="s">with</span><span class="nv"> </span><span class="s">NERF'</span><span class="err">,</span>
<span class="na">october</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Make</span><span class="nv"> </span><span class="s">your</span><span class="nv"> </span><span class="s">Halloween</span><span class="nv"> </span><span class="s">Party</span><span class="nv"> </span><span class="s">Spook-tacular</span><span class="nv"> </span><span class="s">with</span><span class="nv"> </span><span class="s">NERF!'</span><span class="err">,</span>
<span class="na">november</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Experience</span><span class="nv"> </span><span class="s">Winter</span><span class="nv"> </span><span class="s">Magic</span><span class="nv"> </span><span class="s">with</span><span class="nv"> </span><span class="s">NERF</span><span class="nv"> </span><span class="s">this</span><span class="nv"> </span><span class="s">November!'</span><span class="err">,</span>
<span class="na">december</span><span class="pi">:</span> <span class="s1">'</span><span class="s">Celebrate</span><span class="nv"> </span><span class="s">Winter</span><span class="nv"> </span><span class="s">with</span><span class="nv"> </span><span class="s">NERF:</span><span class="nv"> </span><span class="s">Fun</span><span class="nv"> </span><span class="s">Activities</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">Enjoy</span><span class="nv"> </span><span class="s">During</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">Holiday</span><span class="nv"> </span><span class="s">Season!'</span>
</code></pre></div></div>

<h2 id="image-generation">Image Generation</h2>

<p>There are various API-based and self-hosted solutions available for generating images from text inputs. These tools offer powerful capabilities for content creation and can be used for a wide range of applications, such as generating product images, creating art, and visualizing data.</p>

<p>One popular option is Midjourney, which generates images from text prompts. There are also developer-friendly APIs like <a href="https://stability.ai/">StabilityAI</a>, which allow generating images with a few lines of Python code. Users can input text and receive an image in response, making it easy to create custom graphics for websites, social media, and other digital platforms. Additionally, solutions like <a href="https://github.com/AUTOMATIC1111/stable-diffusion-webui">Automatic 1111</a> or <a href="https://github.com/mcmonkeyprojects/SwarmUI">SwarmUI</a> allow you to self-host a model, giving more advanced control over the model checkpoints and generation parameters.</p>

<p>OpenAI also offers a powerful API for generating images from text inputs. Its image generation endpoint takes a text prompt as input and generates a corresponding image using state-of-the-art models. OpenAI’s API is widely used in research and industry and has been applied to a variety of applications, such as generating realistic product images for e-commerce sites and creating custom artwork for digital marketing campaigns.</p>

<p>Another interesting application of these tools is image2image translation, where the input is an image instead of text. By providing some instructions or guidance to the model, users can determine the output image result. For example, a user might provide an image of a landscape and ask the model to add a sunset or change the season. This capability has many potential applications in fields such as graphic design, gaming, and virtual reality.</p>
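<p>As a sketch of what such an image2image request can look like, the snippet below builds a payload in the style of AUTOMATIC1111’s <code class="language-plaintext highlighter-rouge">/sdapi/v1/img2img</code> endpoint, which accepts the source image base64-encoded in an <code class="language-plaintext highlighter-rouge">init_images</code> list; the field names and defaults here are assumptions to check against the documentation of whichever tool you use:</p>

```python
import base64

def build_img2img_payload(image_bytes, prompt, denoising_strength=0.6):
    # The source image travels base64-encoded in `init_images`;
    # denoising_strength controls how far the output may drift from
    # the input (0 keeps it identical, 1 ignores it entirely).
    return {
        "init_images": [base64.b64encode(image_bytes).decode("ascii")],
        "prompt": prompt,
        "denoising_strength": denoising_strength,
        "steps": 50,
    }
```

<p>The resulting dictionary can then be POSTed with <code class="language-plaintext highlighter-rouge">requests.post</code>, just like the text-to-image payload shown below.</p>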

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">base64</span>

<span class="kn">import</span> <span class="nn">requests</span>

<span class="k">def</span> <span class="nf">gen_image</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">output_name</span><span class="p">,</span> <span class="n">seed</span><span class="p">):</span>
    <span class="n">prompt</span> <span class="o">=</span> <span class="s">"(best quality:1.2), (masterpiece:1.2) (realistic:1.2), (intricately detailed:1.1) "</span> <span class="o">+</span>  <span class="n">prompt</span>
    <span class="n">payload</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"prompt"</span><span class="p">:</span> <span class="n">prompt</span><span class="p">,</span>
        <span class="s">"seed"</span><span class="p">:</span> <span class="n">seed</span><span class="p">,</span>
        <span class="s">"batch_size"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
        <span class="s">"width"</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span>
        <span class="s">"height"</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span>
        <span class="s">"steps"</span> <span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
        <span class="s">"hr_scale"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">"refiner_switch_at"</span><span class="p">:</span> <span class="mf">0.8</span><span class="p">,</span>
        <span class="s">"refiner_checkpoint"</span><span class="p">:</span> <span class="s">"sd_xl_refiner_1.0.safetensors"</span><span class="p">,</span>
        <span class="s">"negative_prompt"</span><span class="p">:</span> <span class="s">"bad quality, blur, anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured"</span><span class="p">,</span>
    <span class="p">}</span>

    <span class="c1"># For StabilityAI:
</span>    <span class="c1"># url = f"https://api.stability.ai/v1/generation/{engine_id}/text-to-image"
</span>    <span class="c1"># For self-hosted AUTOMATIC1111:
</span>    <span class="n">url</span> <span class="o">=</span> <span class="s">"http://localhost:7860/sdapi/v1/txt2img"</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="n">url</span><span class="o">=</span><span class="n">url</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

    <span class="k">if</span> <span class="s">'images'</span> <span class="ow">in</span> <span class="n">r</span><span class="p">:</span>
        <span class="c1"># Decode and save the image.
</span>        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"images/raw/</span><span class="si">{</span><span class="n">output_name</span><span class="si">}</span><span class="s">.png"</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">base64</span><span class="p">.</span><span class="n">b64decode</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="s">'images'</span><span class="p">][</span><span class="mi">0</span><span class="p">]))</span>
</code></pre></div></div>

<p>When generating images from text inputs, it’s important to keep in mind that the results may not always match your expectations. While generative models can capture relevant information and use it to produce coherent and contextually appropriate images, they are not perfect and may sometimes produce unexpected output.</p>

<p>For example, when attempting to generate cute content containing little puppies and Nerf toys using a simple prompt, the results may vary depending on the specific model used and the input provided. While some models may capture the desired concept and generate images that include both puppies and Nerf toys, others may only incorporate one or the other, or produce output that is unrelated to the prompt altogether.</p>

<p>To increase the likelihood of generating images that match your desired concept, it’s important to provide clear and specific instructions to the model. This might involve breaking down the concept into smaller components or providing multiple prompts to ensure that all relevant elements are included. Additionally, experimenting with different models and input parameters can help improve the quality and consistency of the generated images.</p>
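<p>Decomposing a concept into components can also be done programmatically. The sketch below (the component lists are made up for illustration) enumerates prompt variants so that every required element appears explicitly in at least one prompt:</p>

```python
import itertools

# Hypothetical component lists for the puppies-and-Nerf concept
subjects = ["a little puppy", "two playful puppies"]
props = ["a Nerf blaster", "foam Nerf darts"]
settings = ["in a snowy park", "in a sunny garden"]

# Enumerate every combination so each required element appears
# explicitly in at least one prompt
prompts = [
    f"{subject} playing with {prop} {setting}"
    for subject, prop, setting in itertools.product(subjects, props, settings)
]
```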

<p>It’s also worth noting that generative models may not always capture the nuances of human creativity and imagination. While they can produce coherent and contextually appropriate text or images from a given prompt, they may struggle to produce truly unique or unexpected content. Nonetheless, these tools can be a powerful resource for generating creative and engaging content, particularly when used in conjunction with other design and editing tools.</p>

<div class="columns is-mobile is-multiline is-horizontal-center">
  <div class="column is-6-desktop is-12-mobile">
    <amp-image-lightbox id="lightbox1" layout="nodisplay"></amp-image-lightbox>
    <amp-img on="tap:lightbox1" role="button" tabindex="0" aria-describedby="puppynonerf" alt="Prompt: On Nov 22nd a puppy dodges Nerf blasts in the cool autumn air" title="Prompt: On Nov 22nd a puppy dodges Nerf blasts in the cool autumn air" src="/assets/img/automatic-content-generation/20241122OnNov.22ndasprypupnimblydodgesNerfblastsinthecoolautumnair._0.jpg" layout="responsive" width="737" height="697"></amp-img>
    <div id="puppynonerf">
      <p>Prompt: On Nov 22nd a puppy dodges Nerf blasts in the cool autumn air</p>
    </div>
  </div>
  <div class="column is-6-desktop is-12-mobile">
    <amp-img on="tap:lightbox1" role="button" tabindex="0" aria-describedby="puppycollar" alt="Prompt: On Feb 19th a mischievous puppy evades multiple Nerf blaster shots in the snow covered park" title="Prompt: On Feb 19th a mischievous puppy evades multiple Nerf blaster shots in the snow covered park" src="/assets/img/automatic-content-generation/20240219OnFeb19thamischievouspuppyevadesmultipleNerfblastershotsinthesnowcoveredparkleaving_1.jpg" layout="responsive" width="737" height="697"></amp-img>
    <div id="puppycollar">
      <p>Prompt: On Feb 19th a mischievous puppy evades multiple Nerf blaster shots in the snow covered park</p>
    </div>
  </div>
</div>

<p>To improve the quality of generated images, attention emphasis can be employed. This technique involves adjusting the input prompt to emphasize or de-emphasize certain elements, resulting in output that better matches the desired concept.</p>

<p>One approach is to reorder the words in the input prompt, with those appearing first having the greatest impact on the generated image. For example, if generating an image of a dog playing with a ball, placing “dog” before “ball” in the prompt may result in an image that features the dog more prominently than the ball. This method is highly flexible and can be used to emphasize any element of the prompt, but it does not lend itself to algorithmic modification, as changing the order of words requires manual input.</p>

<p>Another approach is to use parenthetical tokens to adjust the attention by a given factor. For example, “(dog:2)” might result in an image that features the dog more prominently than other elements of the prompt, while “(ball:0.5)” might de-emphasize the ball, since factors below 1 reduce attention. This method allows for a great deal of nuance and fine-tuning, but it comes with some caveats. Specifically, using too many parenthetical tokens or values that are too large can introduce artifacts in the generated image, making it less coherent or realistic.</p>

<p>It’s also possible to use extra parentheses to strengthen a subject, or square brackets to weaken it, instead of providing an explicit factor. For example, “((dog))” might make the dog more prominent in the image, while “[ball]” might de-emphasize the ball. However, this method can also introduce artifacts if the modifiers are stacked excessively.</p>
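<p>Because this emphasis syntax is plain text, it is easy to apply programmatically. The helper below is a hypothetical utility, not part of the original workflow, that wraps a token in AUTOMATIC1111-style attention markers:</p>

```python
def emphasize(token, weight=None):
    # (token:1.5) scales attention by an explicit factor; a bare
    # (token) multiplies it by ~1.1, while [token] divides by ~1.1
    if weight is None:
        return f"({token})"
    return f"({token}:{weight})"

prompt = f"A playful puppy dodges colourful {emphasize('Nerf', 1.3)} darts"
# prompt == "A playful puppy dodges colourful (Nerf:1.3) darts"
```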

<div class="columns is-mobile is-multiline is-horizontal-center">
  <div class="column is-6-desktop is-12-mobile">
    <amp-img on="tap:lightbox1" role="button" tabindex="0" aria-describedby="nerfpuppy" alt="Prompt: A playful puppy dodges colourful Nerf darts in the back of a garden" title="Prompt: A playful puppy dodges colourful Nerf darts in the back of a garden" src="/assets/img/automatic-content-generation/2025-10-24-AplayfulpuppydeftlydodgescolorfulNerfdartsinthebac_0.jpg" layout="responsive" width="737" height="697"></amp-img>
    <div id="nerfpuppy">
      <p>Prompt: A playful puppy dodges colourful ((Nerf)) darts in the back of a garden</p>
    </div>
  </div>
  <div class="column is-6-desktop is-12-mobile">
    <amp-img on="tap:lightbox1" role="button" tabindex="0" aria-describedby="nerfaccent" alt="Prompt: On an afternoon of September a puppy plays with a Nerf blaster" title="Prompt: On an afternoon of September a puppy plays with a Nerf blaster" src="/assets/img/automatic-content-generation/20240826CelebrateSummerwithNERF_1.jpg" layout="responsive" width="737" height="697"></amp-img>
    <div id="nerfaccent">
      <p>Prompt: On an afternoon of September a puppy plays with a ((Nerf)) blaster</p>
    </div>
  </div>
</div>

<h2 id="writing-markdown-text">Writing Markdown Text</h2>

<p>When generating content using language models, it’s often desirable to format that content for inclusion on a static website generator. One way to do this is by converting the content to Markdown format, which allows for easy formatting and customization.</p>

<p>Markdown is a lightweight markup language that enables users to add formatting such as headers, lists, and links using simple syntax. By converting generated content to Markdown format, users can ensure that it displays consistently across different platforms and devices, making it ideal for use on static website generators.</p>

<p>To convert content to Markdown format, users can simply wrap the text in Markdown syntax, such as using “#” for headers or “-” for bullet points. This enables easy formatting and customization of the generated content, allowing users to add links, images, and other elements as needed.</p>

<p>Additionally, many language models offer built-in support for generating Markdown-formatted text directly, eliminating the need for manual conversion. By specifying the desired output format as Markdown, users can generate content that is ready for inclusion on a static website generator with minimal additional formatting required.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_header</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">post_date</span><span class="p">,</span> <span class="n">coverimage</span><span class="p">):</span>
    <span class="n">header</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""---
title: "</span><span class="si">{</span><span class="n">title</span><span class="si">}</span><span class="s">"
date: "</span><span class="si">{</span><span class="n">post_date</span><span class="si">}</span><span class="s">"
updated: "</span><span class="si">{</span><span class="n">post_date</span><span class="si">}</span><span class="s">"
categories:
  - "nerfs"
coverImage: "/images/posts/</span><span class="si">{</span><span class="n">coverimage</span><span class="si">}</span><span class="s">"
coverWidth: 16
coverHeight: 16
excerpt: Check out how heading links work with this starter in this post.
---
"""</span>
    <span class="n">base_dep</span> <span class="o">=</span> <span class="s">"""
&lt;script&gt;
  import { base } from '$app/paths';
&lt;/script&gt;
"""</span>
    <span class="k">return</span> <span class="n">header</span> <span class="o">+</span> <span class="n">base_dep</span>

<span class="k">def</span> <span class="nf">write_markdown</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="n">post_date</span><span class="p">,</span> <span class="n">message</span><span class="p">,</span> <span class="n">coverimage</span><span class="p">):</span>
    <span class="n">markdown_text</span> <span class="o">=</span> <span class="n">get_header</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">post_date</span><span class="p">,</span> <span class="n">coverimage</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'-'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span> <span class="o">+</span> <span class="s">"_1.jpg"</span><span class="p">)</span> <span class="o">+</span> <span class="n">message</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s">"../src/lib/posts/</span><span class="si">{</span><span class="n">coverimage</span><span class="si">}</span><span class="s">.md"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">markdown_text</span><span class="p">)</span>
</code></pre></div></div>

<p>Once content has been generated with language models and formatted for display, it can be rendered as HTML for use with static site generators or other platforms. This involves converting the content into HTML code with elements such as headings, paragraphs, and images to create a visually appealing layout.</p>

<p>To capture the viewer’s attention and provide context for the content, it’s common to include a main image at the beginning of the generated content. This might involve selecting an image that is both relevant to the topic and visually engaging, as well as ensuring that it displays correctly on different devices and screen sizes.</p>

<p>Following the main image, text content can be developed with additional images included throughout the text to break up the content and provide visual interest. Including multiple images in this way helps to keep the viewer engaged and interested in the generated content, while also providing context and information through accompanying text.</p>

<p>When creating HTML code from generated content, it’s important to consider factors such as responsive design, which ensures that the layout adapts to different screen sizes and devices. Additionally, using semantic markup can help search engines understand the structure and meaning of the content, improving its visibility and discoverability.</p>
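<p>To make the markdown-to-HTML step above concrete, here is a deliberately naive, illustrative sketch (not the converter used for this site, which relies on a full markdown renderer); the function name and the <code class="language-plaintext highlighter-rouge">amp-img</code> output shape are assumptions for this example:</p>

```python
import html
import re

def md_to_html(md_text: str) -> str:
    """Very naive markdown-to-HTML sketch: handles only headings,
    standalone images and plain paragraphs. Illustrative only."""
    out = []
    # split the markdown into blocks separated by blank lines
    for block in re.split(r"\n\s*\n", md_text.strip()):
        line = block.strip()
        heading = re.match(r"^(#{1,6})\s+(.*)$", line)
        image = re.match(r"^!\[(.*?)\]\((.*?)\)$", line)
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{html.escape(heading.group(2))}</h{level}>")
        elif image:
            # emit a responsive AMP image, as used throughout this site
            out.append(
                f'<amp-img src="{image.group(2)}" alt="{html.escape(image.group(1))}" '
                f'layout="responsive"></amp-img>'
            )
        else:
            out.append(f"<p>{html.escape(line)}</p>")
    return "\n".join(out)
```

<p>A real pipeline would hand this job to the static site generator's own markdown renderer; the sketch only shows why the conversion is mechanical once the content is generated.</p>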

<center>
<amp-img src="/assets/img/automatic-content-generation/Screenshot_nerf_website.jpg" alt="nerf application webpage" height="562" width="323" layout="intrinsic"></amp-img>
<br /><i>Nerf application webpage; the website is 👉</i>  <a href="https://cristianpb.github.io/nerf">here</a>
</center>

<h2 id="posting-content-in-social-networks">Posting content in social networks</h2>

<p>Twitter offers a free API (Application Programming Interface) that enables users to programmatically post messages to the platform. This API is a powerful tool for developers, as it provides access to various features and functionalities of Twitter’s platform.</p>

<p>One key feature of the Twitter API is its rate limit, which specifies the number of requests that can be made within a given time period. For the free version of the API, the posting limit is 17 requests per 24 hours. This means that developers must carefully manage their use of the API to avoid exceeding the limit and facing restrictions or penalties.</p>

<p>To make the most of the Twitter API’s rate limit, it’s important to optimize requests by combining multiple actions into a single request where possible. Additionally, scheduling requests during off-peak hours can help ensure that the rate limit is not exceeded and that messages are posted successfully.</p>
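<p>A simple way to stay under such a quota is a client-side sliding-window throttle. The following <code class="language-plaintext highlighter-rouge">RateLimiter</code> helper is a hypothetical sketch (not part of the original script); the injectable clock exists only to make it testable:</p>

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `max_calls` within a sliding window of `period` seconds.

    The defaults mirror the free posting quota of 17 requests per 24 hours
    discussed above. Illustrative sketch, not an official API client feature.
    """

    def __init__(self, max_calls=17, period=24 * 3600, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # injectable for testing
        self.calls = deque()

    def try_acquire(self) -> bool:
        """Return True (and record the call) if a request may be made now."""
        now = self.clock()
        # drop timestamps that have fallen out of the sliding window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False
```

<p>Before each post, call <code class="language-plaintext highlighter-rouge">try_acquire()</code> and skip or postpone the post when it returns <code class="language-plaintext highlighter-rouge">False</code>.</p>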

<p>In some cases, developers may need to upgrade to a paid version of the Twitter API to access higher rate limits and more advanced features. However, for many use cases, the free version of the API provides sufficient functionality and flexibility.</p>

<p>Remember to configure write permissions in order to use the posting endpoints of the API.</p>

<center>
<amp-img src="/assets/img/automatic-content-generation/twitter_app_permissions.jpg" alt="twitter developer app permissions" height="407" width="548" layout="intrinsic"></amp-img>
<br /><i>X developer app permissions</i>
</center>

<p>To increase visibility and reach on social media platforms, it’s important to stay up-to-date with current trends and topics that are popular among users. One strategy for doing this is to use trending subjects as inspiration for creating content, such as short messages or posts that incorporate relevant keywords and hashtags.</p>

<p>For example, the following code demonstrates how to create short messages using Google Trends subjects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pytrends.request</span> <span class="kn">import</span> <span class="n">TrendReq</span>

<span class="n">pytrend</span> <span class="o">=</span> <span class="n">TrendReq</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pytrend</span><span class="p">.</span><span class="n">trending_searches</span><span class="p">(</span><span class="n">pn</span><span class="o">=</span><span class="s">'france'</span><span class="p">).</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s">'daily trends'</span><span class="p">})</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">gen_prompt</span><span class="p">(</span><span class="n">subject</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">(</span>
        <span class="sa">f</span><span class="s">"Describe a controversial situation about </span><span class="si">{</span><span class="n">subject</span><span class="si">}</span><span class="s">, involving nerf blasters, shooting nerf darts.  "</span>
    <span class="p">)</span>
<span class="c1"># progress_apply assumes tqdm has been registered: from tqdm import tqdm; tqdm.pandas()
</span><span class="n">df</span><span class="p">[</span><span class="s">'message'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'daily trends'</span><span class="p">].</span><span class="n">progress_apply</span><span class="p">(</span>
    <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">get_message</span><span class="p">(</span>
        <span class="n">prompt</span><span class="o">=</span><span class="n">gen_prompt</span><span class="p">(</span><span class="n">x</span><span class="p">),</span>
        <span class="n">assistant_instructions</span><span class="o">=</span><span class="p">(</span>
            <span class="s">"You are a helpful assistant that writes prompts to generate realistic images. "</span>
            <span class="s">"Use only simple words, no hashtags. Detailed description of the situation but keep it short, no more than 150 characters."</span>
        <span class="p">)</span>
    <span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Once content has been generated using trending subjects or other sources, it’s important to distribute and promote that content on relevant social media platforms. This can help increase visibility, engagement, and reach among target audiences.</p>

<p>The following code demonstrates how to use the Twitter API to post a message with an image:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tweepy</span>

<span class="k">def</span> <span class="nf">post_on_twitter</span><span class="p">(</span><span class="n">tweet</span><span class="p">,</span> <span class="n">image</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">auth</span> <span class="o">=</span> <span class="n">tweepy</span><span class="p">.</span><span class="n">OAuthHandler</span><span class="p">(</span><span class="n">api_key</span><span class="p">,</span> <span class="n">api_key_secret</span><span class="p">)</span>
    <span class="n">auth</span><span class="p">.</span><span class="n">set_access_token</span><span class="p">(</span><span class="n">access_token</span><span class="p">,</span> <span class="n">access_token_secret</span><span class="p">)</span>
    <span class="n">api</span> <span class="o">=</span> <span class="n">tweepy</span><span class="p">.</span><span class="n">API</span><span class="p">(</span><span class="n">auth</span><span class="p">,</span> <span class="n">wait_on_rate_limit</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">client</span> <span class="o">=</span> <span class="n">tweepy</span><span class="p">.</span><span class="n">Client</span><span class="p">(</span>
            <span class="n">bearer_token</span><span class="o">=</span><span class="n">bearer_token</span><span class="p">,</span>
            <span class="n">consumer_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">,</span>
            <span class="n">consumer_secret</span><span class="o">=</span><span class="n">api_key_secret</span><span class="p">,</span>
            <span class="n">access_token</span><span class="o">=</span><span class="n">access_token</span><span class="p">,</span>
            <span class="n">access_token_secret</span><span class="o">=</span><span class="n">access_token_secret</span>
            <span class="p">)</span>

    <span class="c1"># Upload image
</span>    <span class="k">if</span> <span class="n">image</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">media</span> <span class="o">=</span> <span class="n">api</span><span class="p">.</span><span class="n">media_upload</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>
        <span class="c1"># Create a tweet with the v2 endpoint (v1.1 update_status is restricted on the free tier)
</span>        <span class="n">post_result</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">create_tweet</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">tweet</span><span class="p">,</span> <span class="n">media_ids</span><span class="o">=</span><span class="p">[</span><span class="n">media</span><span class="p">.</span><span class="n">media_id</span><span class="p">])</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">post_result</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">create_tweet</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">tweet</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Tweet posted successfully."</span><span class="p">)</span>
</code></pre></div></div>

<p>Here are some examples of the content that can be produced.
While AI-generated images may not be perfect, they can still be effective in capturing attention and conveying a message. However, there may be some small artifacts or imperfections that suggest the image is not entirely real, such as excessive numbers of limbs or unusual body positions.</p>

<p>Despite these minor imperfections, AI-generated images can still be an effective tool for content creation and distribution on social media platforms. By leveraging tools like DALL-E, brands and marketers can quickly generate high-quality visuals that help capture attention and convey a message in a unique and engaging way.</p>

<div class="columns is-mobile is-multiline is-horizontal-center">
  <div class="column is-6-desktop is-12-mobile">
    <amp-image-lightbox id="lightbox1" layout="nodisplay"></amp-image-lightbox>
    <amp-img on="tap:lightbox1" role="button" tabindex="0" alt="Prompt: On nov 22nd a puppy dodge Nerf blast in the cool autumn air" title="Prompt: On nov 22nd a puppy dodge Nerf blast in the cool autumn air" src="/assets/img/automatic-content-generation/post_x1.jpg" layout="responsive" width="535" height="598"></amp-img>
  </div>

  <div class="column is-6-desktop is-12-mobile">
    <amp-img on="tap:lightbox1" role="button" tabindex="0" alt="Prompt: On nov 22nd a puppy dodge Nerf blast in the cool autumn air" title="Prompt: On nov 22nd a puppy dodge Nerf blast in the cool autumn air" src="/assets/img/automatic-content-generation/post_x3.jpg" layout="responsive" width="535" height="598"></amp-img>
  </div>

</div>
<div class="columns is-mobile is-multiline is-horizontal-center">

  <div class="column is-6-desktop is-12-mobile">
    <amp-img on="tap:lightbox1" role="button" tabindex="0" alt="markdown html output" title="markdown html output" src="/assets/img/automatic-content-generation/post_x2.jpg" layout="responsive" width="535" height="598"></amp-img>
  </div>

  <div class="column is-6-desktop is-12-mobile">
    <amp-img on="tap:lightbox1" role="button" tabindex="0" alt="markdown html output" title="markdown html output" src="/assets/img/automatic-content-generation/post_x4.jpg" layout="responsive" width="535" height="598"></amp-img>
  </div>

</div>

<h2 id="artifical-traffic-augmentation">Artificial Traffic Augmentation</h2>

<p>To improve website referencing and increase search engine rankings, there are several strategies that businesses and marketers can employ. One of the most effective is to focus on driving traffic to the website using reputable search engines like Google or Bing. By increasing the volume and quality of traffic to the site, businesses can signal to search engines that their content is valuable and relevant to users.</p>

<p>One strategy for simulating traffic to a website is to use Python tools like Selenium. Selenium is an open-source web automation framework that allows developers to control web browsers programmatically and automate tasks. With Selenium, businesses can execute analytics code like Matomo or Google Analytics, which can help track user behavior and provide insights into how to improve the site’s performance and user experience.</p>

<p>One of the key benefits of using Selenium for web automation is its ability to execute client-side content like JavaScript. This can be particularly useful for websites that rely heavily on JavaScript for functionality or dynamic content. By controlling a web browser programmatically, Selenium can simulate real user behavior and help ensure that all aspects of the site are functioning properly.</p>

<p>In addition to improving website referencing and analytics, Selenium can also be used for a variety of other tasks related to web automation. For example, businesses can use Selenium to automate repetitive tasks like form filling or data entry, which can help save time and reduce errors. They can also use Selenium to test their website’s functionality across different browsers and devices, ensuring that it is accessible and user-friendly for all visitors.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">selenium</span> <span class="kn">import</span> <span class="n">webdriver</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
    <span class="n">driver</span> <span class="o">=</span> <span class="n">webdriver</span><span class="p">.</span><span class="n">Firefox</span><span class="p">()</span>
    <span class="n">driver</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://url.com"</span><span class="p">)</span>
    <span class="n">scheight</span> <span class="o">=</span> <span class="p">.</span><span class="mi">1</span>
    <span class="k">while</span> <span class="n">scheight</span> <span class="o">&lt;</span> <span class="mf">9.9</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0, document.body.scrollHeight/%s);"</span> <span class="o">%</span> <span class="n">scheight</span><span class="p">)</span>
        <span class="n">scheight</span> <span class="o">+=</span> <span class="p">.</span><span class="mi">01</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
    <span class="n">driver</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
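<p>The fixed-step scroll loop above is easy to make less mechanical. The following helper is an illustrative sketch (not part of the original script): it generates a randomized, monotonically descending scroll/pause schedule in pure Python, which can then be fed to <code class="language-plaintext highlighter-rouge">driver.execute_script</code> and <code class="language-plaintext highlighter-rouge">time.sleep</code>:</p>

```python
import random

def scroll_schedule(steps=20, min_pause=0.5, max_pause=3.0, seed=None):
    """Return a list of (scroll_fraction, pause_seconds) pairs that move
    monotonically down the page with randomized, human-like pauses.
    Illustrative helper; names and defaults are assumptions."""
    rng = random.Random(seed)
    # random increments, normalised so the fractions end at 1.0 (page bottom)
    increments = [rng.random() for _ in range(steps)]
    total = sum(increments)
    schedule, position = [], 0.0
    for inc in increments:
        position += inc / total
        schedule.append((min(position, 1.0), rng.uniform(min_pause, max_pause)))
    return schedule
```

<p>Each fraction can then be applied with <code class="language-plaintext highlighter-rouge">driver.execute_script("window.scrollTo(0, arguments[0] * document.body.scrollHeight);", frac)</code> followed by the corresponding pause.</p>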

<p>Tracking the analytics of a website is an essential task for businesses and marketers looking to optimize their online presence and drive more traffic to their site. There are many tools available for tracking website analytics, but one of the most popular and widely used is Google Analytics. By integrating Google Analytics into their website, businesses can gain valuable insights into user behavior, demographics, and engagement metrics.</p>

<p>The following image shows an example of how simulated traffic can impact website metrics in Google Analytics. In this example, the traffic to the site increased from zero to several thousand visitors within a short period of time, thanks to the use of a web automation tool like Selenium. This type of sudden increase in traffic can be a strong signal to search engines that the site’s content is valuable and relevant to users, which can help improve its search engine rankings and visibility.</p>

<p>It’s important to note, however, that simulated traffic should be used responsibly and ethically. While it can be an effective way to improve website analytics and search engine rankings, it’s not a substitute for genuine user engagement and interaction. Businesses should focus on creating high-quality content and optimizing their site’s user experience to encourage real users to visit and engage with their site.</p>

<p>In addition to tracking traffic and user behavior, Google Analytics can also provide valuable insights into other metrics like bounce rate, conversion rate, and demographics. By analyzing these metrics over time, businesses can identify trends and patterns in user behavior and make data-driven decisions about how to improve their site’s performance and user experience.</p>

<center>
<amp-img src="/assets/img/automatic-content-generation/google_analytics_stats.jpg" alt="google analytics results" height="220" width="562" layout="responsive"></amp-img>
<br /><i>Google Analytics results showing the traffic increase</i>
</center>

<h2 id="conclusion">Conclusion</h2>

<p>The strategies outlined in this article can be useful for creating artificial content, which can be employed for a variety of purposes, including commercial use or spreading awareness about a particular ideology. However, it’s important to be aware that such content exists in the digital realm and to exercise caution when engaging with it.</p>

<p>With the rise of AI tools and other technology, generating artificial content has become more accessible than ever before. While this can present opportunities for businesses and marketers looking to create engaging content, it also raises ethical concerns about transparency and honesty in digital communication.</p>

<p>It’s crucial for users to be able to distinguish between what is real and what is not when interacting with digital content. This means that businesses and marketers should be transparent about their use of AI tools and ensure that their messaging is clear and accurate.</p>

<p>Moreover, while artificial content can be an effective tool for capturing attention, it’s only one aspect of a successful digital marketing strategy. Businesses should also focus on creating high-quality content that provides value and relevance to their audience, as well as optimizing their site’s user experience and overall performance.</p>]]></content:encoded><description>This article presents how to generate content using large language and vision models. The content is then shared on websites and social networks.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/automatic-content-generation/main-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/automatic-content-generation/main-16x9.jpg"/></item><item><title>Application revenue prediction</title><link>https://cristianpb.github.io/blog/application-revenue-prediction</link><category>data science</category><category>kaggle</category><category>python</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Mon, 7 Oct 2024 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/application-revenue-prediction</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In the competitive landscape of the digital applications industry, maximizing revenue generation is a paramount objective. This exercise aims to develop a predictive model capable of forecasting application-generated revenue by leveraging a combination of user features and engagement metrics. By understanding user characteristics and their interactions with the application, we can gain valuable insights into revenue potential and optimize monetization strategies.</p>

<h2 id="data">Data</h2>

<p>The dataset provides information about users’ application installation features, as well as their engagement features over 120 days.
It is divided into two parts:</p>
<ul>
  <li>The train data, which includes 1.7 million rows.</li>
  <li>The test data, which includes 12.5 thousand rows.</li>
</ul>

<p>Each row contains information about a user. The train dataset covers users who installed the application from February until the end of November. The test dataset covers users who installed the application on December 1st.</p>

<h3 id="individual-user-features">Individual user features</h3>

<p>This information is gathered on the first day, when the user installs the application.</p>

<h4 id="install-date">Install date</h4>

<p>This represents the day when the user installed the application. There is an increasing trend from February until December, with some periods, such as early May or October, showing more installations than others. However, there are some outlier days where installations were extremely low, such as August 16th, September 22nd, or November 2nd, 4th and 12th.</p>

<amp-img src="/assets/img/application-revenue-prediction/install_date.png" alt="Application install date" height="289" width="862" layout="responsive"></amp-img>
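<p>Such abnormally low days can be flagged programmatically. Below is a hedged, stdlib-only sketch (the function name and the <code class="language-plaintext highlighter-rouge">k</code> cutoff are illustrative, not part of the original analysis) that uses the median absolute deviation as a robust spread measure:</p>

```python
from statistics import median

def low_outlier_days(daily_counts, k=5.0):
    """Return the indices of days whose install count sits far below the
    series median, measured in units of the median absolute deviation (MAD).
    Illustrative sketch; the threshold k is arbitrary."""
    med = median(daily_counts)
    # MAD is robust to the outliers we are trying to detect;
    # fall back to 1.0 when the series is almost constant (MAD == 0)
    mad = median(abs(x - med) for x in daily_counts) or 1.0
    return [i for i, x in enumerate(daily_counts) if (med - x) / mad > k]
```

<p>Only dips below the median are flagged, matching the "extremely low" days described above; unusually high days pass through untouched.</p>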

<p>One can also notice that the install date has a weekly seasonality: more installs occur during the weekend.</p>

<amp-img src="/assets/img/application-revenue-prediction/week_seasonality.png" alt="Weekly seasonality of application install date" height="289" width="862" layout="responsive"></amp-img>
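<p>The weekday aggregation behind this figure can be sketched with the standard library alone, assuming the install dates have been parsed into <code class="language-plaintext highlighter-rouge">date</code> objects (the function name is illustrative):</p>

```python
from collections import Counter
from datetime import date

def weekday_profile(install_dates):
    """Count installs per weekday: index 0 = Monday .. 6 = Sunday.
    Illustrative sketch of the seasonality aggregation."""
    counts = Counter(d.weekday() for d in install_dates)
    return [counts.get(i, 0) for i in range(7)]
```

<p>A higher count at indices 5 and 6 (Saturday, Sunday) is exactly the weekend effect visible in the plot.</p>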

<h4 id="platform">Platform</h4>

<p>In this dataset, applications are mainly installed on iOS systems.</p>

<amp-img src="/assets/img/application-revenue-prediction/platform.png" alt="Platforms devices where applications are installed" height="289" width="862" layout="responsive"></amp-img>

<h4 id="personalised-ads">Personalised ads</h4>

<p>This represents whether the user opted in for personalized ads or other services. The train dataset has a higher number of users who did not opt in for personalized ads, while in the test dataset the split is roughly even.</p>

<amp-img src="/assets/img/application-revenue-prediction/is_optim.png" alt="The application have been optimized or not" height="289" width="862" layout="responsive"></amp-img>

<h4 id="app-id">App Id</h4>

<p>There are two applications, both present in the train and the test sets.</p>

<amp-img src="/assets/img/application-revenue-prediction/app_id.png" alt="The id of the installed apps" height="289" width="862" layout="responsive"></amp-img>

<h4 id="country">Country</h4>

<p>This represents the country where the app was downloaded. We can see that installations come mainly from the US market.</p>

<amp-img src="/assets/img/application-revenue-prediction/country.png" alt="Country where the app have been downloaded" height="289" width="1062" layout="responsive"></amp-img>

<h4 id="ad-network">Ad Network</h4>

<p>This represents the ID of the ad network that displayed the ads to the user.
There are 3 main ad networks in both the train and the test sets.</p>

<amp-img src="/assets/img/application-revenue-prediction/ad_network.png" alt="ad network that displayed the ads to the user" height="289" width="1062" layout="responsive"></amp-img>

<h4 id="campaign-type">Campaign type</h4>

<p>In marketing terms, this is the category of the ad campaign that was used to acquire the user.</p>

<amp-img src="/assets/img/application-revenue-prediction/campaign_type.png" alt="Type of ad campaign that acquired the user" height="289" width="1062" layout="responsive"></amp-img>

<h4 id="campaign-id">Campaign id</h4>

<p>There are more than 178 distinct campaign_ids, some of which occur only once.</p>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>index</th>
      <th>campaign_id</th>
      <th>count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>...</td>
      <td>559244</td>
    </tr>
    <tr>
      <th>1</th>
      <td>da2...</td>
      <td>317648</td>
    </tr>
    <tr>
      <th>2</th>
      <td>99a...</td>
      <td>203638</td>
    </tr>
    <tr>
      <th>3</th>
      <td>c6d...</td>
      <td>114357</td>
    </tr>
    <tr>
      <th>4</th>
      <td>281...</td>
      <td>70904</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>174</th>
      <td>blo...</td>
      <td>1</td>
    </tr>
    <tr>
      <th>175</th>
      <td>gam...</td>
      <td>1</td>
    </tr>
    <tr>
      <th>176</th>
      <td>blo...</td>
      <td>1</td>
    </tr>
    <tr>
      <th>177</th>
      <td>dow...</td>
      <td>1</td>
    </tr>
    <tr>
      <th>178</th>
      <td>ぶろっ...</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<p>A common technique for dealing with this is to group the less common values together.
This can be achieved with the following Python function.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_most_popular</span><span class="p">(</span><span class="n">series</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">top_values</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">threshold</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">popular_idx</span> <span class="o">=</span> <span class="n">series</span><span class="p">.</span><span class="n">value_counts</span><span class="p">().</span><span class="n">to_frame</span><span class="p">().</span><span class="n">query</span><span class="p">(</span><span class="s">'count &gt; @threshold'</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
        <span class="n">new_series</span> <span class="o">=</span> <span class="n">series</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
        <span class="n">new_series</span><span class="p">.</span><span class="n">loc</span><span class="p">[:]</span> <span class="o">=</span> <span class="s">"other"</span>
        <span class="n">new_series</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">series</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">popular_idx</span><span class="p">)]</span> <span class="o">=</span> <span class="n">series</span>
        <span class="k">return</span> <span class="n">new_series</span>
    <span class="k">if</span> <span class="n">top_values</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">popular_idx</span> <span class="o">=</span> <span class="n">series</span><span class="p">.</span><span class="n">value_counts</span><span class="p">().</span><span class="n">to_frame</span><span class="p">().</span><span class="n">head</span><span class="p">(</span><span class="n">top_values</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
        <span class="n">new_series</span> <span class="o">=</span> <span class="n">series</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
        <span class="n">new_series</span><span class="p">.</span><span class="n">loc</span><span class="p">[:]</span> <span class="o">=</span> <span class="s">"other"</span>
        <span class="n">new_series</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">series</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">popular_idx</span><span class="p">)]</span> <span class="o">=</span> <span class="n">series</span>
        <span class="k">return</span> <span class="n">new_series</span>
</code></pre></div></div>

<p>The threshold can be tuned so that each remaining category has enough observations to be representative.</p>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>campaign_id</th>
      <th>count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>...</td>
      <td>559244</td>
    </tr>
    <tr>
      <th>1</th>
      <td>da...</td>
      <td>317648</td>
    </tr>
    <tr>
      <th>2</th>
      <td>99...</td>
      <td>203638</td>
    </tr>
    <tr>
      <th>3</th>
      <td>c6...</td>
      <td>114357</td>
    </tr>
    <tr>
      <th>4</th>
      <td>28...</td>
      <td>70904</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>96</th>
      <td>20...</td>
      <td>11</td>
    </tr>
    <tr>
      <th>97</th>
      <td>ブロ...</td>
      <td>11</td>
    </tr>
    <tr>
      <th>98</th>
      <td>MA...</td>
      <td>10</td>
    </tr>
    <tr>
      <th>99</th>
      <td>xx...</td>
      <td>9</td>
    </tr>
    <tr>
      <th>100</th>
      <td>17...</td>
      <td>8</td>
    </tr>
  </tbody>
</table>
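<p>For illustration, the same rare-category grouping can be expressed without pandas; this stdlib sketch (function name assumed, not from the original notebook) mirrors the <code class="language-plaintext highlighter-rouge">threshold</code> branch of <code class="language-plaintext highlighter-rouge">get_most_popular</code>:</p>

```python
from collections import Counter

def group_rare(values, threshold):
    """Replace values occurring `threshold` times or fewer with 'other',
    keeping the rest unchanged. Stdlib equivalent of the pandas grouping."""
    counts = Counter(values)
    return ["other" if counts[v] <= threshold else v for v in values]
```

<p>As with the pandas version, raising the threshold collapses more of the long tail into the single <code class="language-plaintext highlighter-rouge">other</code> bucket.</p>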

<h4 id="model">Model</h4>

<p>The device model also has many distinct values, some of which occur only once, so applying the same grouping function one can keep the top 100 models.
In line with the platform distribution, iPhones and iPads are among the top devices.
One should notice that <code class="language-plaintext highlighter-rouge">IPhoneUnkown</code> and <code class="language-plaintext highlighter-rouge">IPadUnknown</code> are not in the test set.</p>

<amp-img src="/assets/img/application-revenue-prediction/model.png" alt="Device model" height="289" width="862" layout="responsive"></amp-img>

<h4 id="manufacturer">Manufacturer</h4>

<p>This represents the manufacturer of the user’s device. The main manufacturer is Apple, followed by Samsung and Google. The manufacturer distribution is the same for the train and test datasets.</p>

<amp-img src="/assets/img/application-revenue-prediction/manufacturer.png" alt="Device manufacturer" height="289" width="862" layout="responsive"></amp-img>

<h4 id="mobile-classification">Mobile classification</h4>

<p>This classification relates to the monetary value of the device. High-end devices, such as the latest iPhones, receive the best classification, <strong>Tier 1</strong>. The cheapest phones fall into <strong>Tier 5</strong>, and some devices have no classification at all.</p>

<amp-img src="/assets/img/application-revenue-prediction/mobile_classification.png" alt="Mobile classification" height="289" width="862" layout="responsive"></amp-img>

<h4 id="city">City</h4>

<p>The city where the user downloaded the app.
The most popular cities are Tokyo, Chicago and Houston.
Some cities, such as Otemae, Tacoma or Nishikicho, don’t appear in the test dataset.</p>

<amp-img src="/assets/img/application-revenue-prediction/city.png" alt="City" height="289" width="862" layout="responsive"></amp-img>

<h4 id="other-variables">Other variables</h4>

<p>Some variables, such as <code class="language-plaintext highlighter-rouge">game_type</code> and <code class="language-plaintext highlighter-rouge">user_id</code>, are not useful for revenue prediction since their values are all unique.</p>

<h3 id="user-engagement-features">User engagement features</h3>

<p>During the life of the application, the user may interact by clicking on ads or by buying items on the platform. These features are measured as cumulative values at days 0, 3, 7, 14, 30, 60, 90 and 120.</p>

<h4 id="revenues-generated-by-the-application">Revenues generated by the application</h4>

<p>The main revenue is measured by the variable <code class="language-plaintext highlighter-rouge">dX_rev</code>, where <code class="language-plaintext highlighter-rouge">X</code> is the number of days over which it is measured. This is the variable to be predicted. The distribution is similar for the test and train datasets.
This variable is treated in <code class="language-plaintext highlighter-rouge">log</code> form since it varies mostly between 0 and 1.
70.0% of users produce less than $1 of revenue.</p>
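<p>As a sketch of this transformation (with hypothetical revenue values), <code class="language-plaintext highlighter-rouge">log1p</code> handles the zero-revenue users that a plain <code class="language-plaintext highlighter-rouge">log</code> cannot:</p>

```python
import numpy as np
import pandas as pd

# Illustrative revenue values in dollars (hypothetical data).
rev = pd.Series([0.0, 0.2, 0.5, 0.9, 3.0])

# log1p maps [0, inf) to [0, inf), spreading out the mass
# concentrated between 0 and 1 while handling zeros safely.
log_rev = np.log1p(rev)

# Share of users producing less than $1 of revenue.
share_below_1 = (rev < 1).mean()
```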

<amp-img src="/assets/img/application-revenue-prediction/d0_rev.png" alt="Total revenu at day zero" height="289" width="862" layout="responsive"></amp-img>

<p>This variable is decomposed into <code class="language-plaintext highlighter-rouge">dX_iap_rev</code> and <code class="language-plaintext highlighter-rouge">dX_ad_rev</code>, the revenues from in-app purchases and ads respectively. Notice that the total revenue is mainly driven by ads.</p>

<amp-img src="/assets/img/application-revenue-prediction/d0_ad_iap_rev.png" alt="Decomposed total revenue at day zero" height="489" width="862" layout="responsive"></amp-img>

<h4 id="correlation-of-values">Correlation of values</h4>

<p>The correlation matrix makes it possible to identify correlated features.</p>

<amp-img src="/assets/img/application-revenue-prediction/correlation_matrix.png" alt="Correlation matrix" height="489" width="862" layout="responsive"></amp-img>

<p>The following variables are correlated:</p>
<ul>
  <li>“iap_ads_rev_d0” and “iap_ads_count_d0”: the revenue from ads is correlated with the number of ads.</li>
  <li>“iap_coins_count_d0” and “iap_count_d0”: the number of coins bought by the user is correlated with the number of items bought by the user.</li>
  <li>“iap_coins_rev_d0”, “d0_iap_rev” and “d0_rev”: the revenue from coin purchases is correlated with the revenue from in-app purchases and the total revenue.</li>
</ul>

<p>Correlated features can be removed before the modelling stage.</p>
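<p>A common heuristic for this step (not necessarily the exact rule used here) is to drop one column of every highly correlated pair, scanning the upper triangle of the correlation matrix:</p>

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one column of each pair whose absolute Pearson
    correlation exceeds `threshold` (common heuristic, sketch only)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)
```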

<h4 id="evolution-over-time">Evolution over time</h4>

<p>The cumulative value changes from day to day. However, the following figure shows that the revenue distribution remains the same. The distribution shrinks because there is less data with <code class="language-plaintext highlighter-rouge">d120_rev</code>.</p>

<amp-img src="/assets/img/application-revenue-prediction/d0_rev_evolution.png" alt="Evolution of total revenue over time" height="389" width="862" layout="responsive"></amp-img>

<h4 id="missing-values">Missing values</h4>

<p>The objective of the problem is to predict revenue at day 120, but this value is not available for every line in the dataset, since we don’t have information from the future. For instance, the test dataset is from 1 December, so we only have information for day zero.</p>

<amp-img src="/assets/img/application-revenue-prediction/missing_target.png" alt="Missing data for revenue at day 120" height="389" width="862" layout="responsive"></amp-img>

<h2 id="modelling">Modelling</h2>

<p>To predict revenue at day 120, a traditional approach can be employed using structured data as the training dataset and the <code class="language-plaintext highlighter-rouge">d120_rev</code> column as the target variable.
This approach leverages the existing information in the dataset to forecast future revenue.</p>

<h3 id="feature-selection">Feature selection</h3>

<p>Based on variable exploration, the following modifications were made to user installation features:</p>

<ul>
  <li>Categorical values such as campaign_id, model, manufacturer, and city were consolidated by grouping less frequent categories together.</li>
  <li>Utilizing the installation_date for each user, additional features were derived, including the number of installations per day, the day of the week, and the month.</li>
  <li>To highlight the distinct nature of Apple devices, a binary column was introduced to indicate whether the installation was made on an Apple device.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span>
        <span class="n">campaign_id</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">get_most_popular</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">'campaign_id'</span><span class="p">],</span> <span class="n">threshold</span><span class="o">=</span><span class="n">THRESHOLD_POPULAR_CAMPAING</span><span class="p">),</span>
        <span class="n">model</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">get_most_popular</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">'model'</span><span class="p">],</span> <span class="n">threshold</span><span class="o">=</span><span class="n">THRESHOLD_POPULAR_MODEL</span><span class="p">),</span>
        <span class="n">manufacturer</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">get_most_popular</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">'manufacturer'</span><span class="p">],</span> <span class="n">top_values</span><span class="o">=</span><span class="n">TOP_MANUFACTURER</span><span class="p">),</span>
        <span class="n">city</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">get_most_popular</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">'city'</span><span class="p">],</span> <span class="n">top_values</span><span class="o">=</span><span class="n">TOP_CITIES</span><span class="p">),</span>
        <span class="n">is_optin</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'is_optin'</span><span class="p">].</span><span class="n">replace</span><span class="p">({</span><span class="mi">1</span><span class="p">:</span><span class="s">"optin"</span><span class="p">,</span> <span class="mi">0</span><span class="p">:</span> <span class="s">"not_optin"</span><span class="p">}),</span>
        <span class="n">installations_perday</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'install_date'</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d'</span><span class="p">).</span><span class="n">to_frame</span><span class="p">().</span><span class="n">merge</span><span class="p">(</span><span class="n">installations_perday</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="s">'install_date'</span><span class="p">,</span> <span class="n">right_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span>
        <span class="n">installation_day</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'install_date'</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">day_name</span><span class="p">(),</span>
        <span class="n">installation_month</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'install_date'</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">month</span><span class="p">,</span>
        <span class="n">is_apple</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'manufacturer'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'apple'</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="p">.</span><span class="n">IGNORECASE</span><span class="p">,</span> <span class="n">na</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">replace</span><span class="p">({</span><span class="bp">True</span><span class="p">:</span><span class="s">"is_apple"</span><span class="p">,</span> <span class="bp">False</span><span class="p">:</span> <span class="s">"not_apple"</span><span class="p">}),</span>
        <span class="n">mobile_classification</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'mobile_classification'</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="sa">r</span><span class="s">'^\s*$'</span><span class="p">,</span> <span class="s">'unkown_mobile_classification'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
    <span class="p">)</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="o">**</span><span class="p">{</span><span class="n">col</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">col</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="s">'category'</span><span class="p">)</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">user_cols</span><span class="p">})</span>
    <span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">engagement_cols</span> <span class="o">+</span> <span class="n">user_cols</span> <span class="o">+</span> <span class="p">[</span><span class="n">target_col</span><span class="p">,</span> <span class="s">'dataset'</span><span class="p">])</span>
<span class="p">)</span>
</code></pre></div></div>

<p>To ensure that each class is treated equally and to accommodate the requirements of models that cannot handle categorical variables directly, certain columns were transformed using one-hot encoding.
This technique converts categorical values into numerical representations, enabling the model to process them effectively.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">one_hot_columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'is_optin'</span><span class="p">,</span> <span class="s">'platform'</span><span class="p">,</span> <span class="s">'app_id'</span><span class="p">,</span> <span class="s">'country'</span><span class="p">,</span> <span class="s">'ad_network_id'</span><span class="p">,</span> <span class="s">'campaign_type'</span><span class="p">,</span> <span class="s">'mobile_classification'</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">X</span><span class="p">,</span> <span class="n">prefix</span> <span class="o">=</span> <span class="s">'OHE'</span><span class="p">,</span> <span class="n">prefix_sep</span><span class="o">=</span><span class="s">'_'</span><span class="p">,</span>
               <span class="n">columns</span> <span class="o">=</span> <span class="n">one_hot_columns</span><span class="p">,</span>
               <span class="n">drop_first</span> <span class="o">=</span><span class="bp">False</span><span class="p">,</span>
              <span class="n">dtype</span><span class="o">=</span><span class="s">'int8'</span><span class="p">)</span>
</code></pre></div></div>

<p>While the numerical values were left intact, normalization might be considered to improve model performance.
Correlated features were removed based on exploratory data analysis.
To ensure realistic predictions on day zero, information from features like (d3, d7, d14, …) was excluded from the training dataset.
This prevents the model from over-relying on these features and improves its ability to make accurate predictions in cases where such information is unavailable.</p>
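<p>This exclusion can be sketched as a pattern match on the column names (hypothetical column list; the target <code class="language-plaintext highlighter-rouge">d120_rev</code> is kept):</p>

```python
import re

# Hypothetical column names following the dX_ pattern from the post.
cols = ['d0_rev', 'd0_ad_rev', 'd3_rev', 'd7_rev', 'd14_rev',
        'city', 'model', 'd120_rev']

# Match engagement features measured after day zero (d3_, d7_, ...),
# but keep the target column d120_rev for training.
future_pattern = re.compile(r'^d(?!0_)\d+_')
train_cols = [c for c in cols
              if not future_pattern.match(c) or c == 'd120_rev']
```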

<h3 id="train-test-split">Train test split</h3>

<p>The dataset was randomly divided into training, testing, and validation sets, with proportions of 60%, 20%, and 20%, respectively. The training and validation sets were used during the model training phase to optimize hyperparameters and evaluate performance. The testing set was reserved for final model evaluation and comparison with other models.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
</code></pre></div></div>

<h3 id="model-selection">Model selection</h3>

<p>Gradient boosting models have demonstrated exceptional performance in similar Kaggle competitions, making them well-suited for this task.
I selected three popular gradient boosting models: XGBoost, CatBoost, and LightGBM.
To predict the revenue value, I employed the regression version of these models.
The RMSE metric was chosen to monitor the models’ performance throughout the training process.</p>

<p>The following figure illustrates the evolution of the loss function for both the training and validation datasets. Training was halted when the validation loss stopped decreasing to prevent overfitting.</p>

<amp-img src="/assets/img/application-revenue-prediction/train_val.png" alt="Train and validation loss evolution" height="489" width="862" layout="responsive"></amp-img>

<p>In order to find the best parameters of the model I used the open-source python library <a href="https://hyperopt.github.io/hyperopt/">hyperopt</a>. 
Hyperopt is used for hyperparameter optimization of machine learning models. It allows you to define the search space for your parameters and the optimization strategy. Hyperopt then tries a certain number of configurations to find the best possible set of parameters for your model.</p>

<p>In the context of gradient boosting models, some of the parameters you might want to tune using Hyperopt include:</p>
<ul>
  <li>The number of leaves and the maximum depth of the decision trees.</li>
  <li>The <code class="language-plaintext highlighter-rouge">min_child_weight</code> parameter, which stops the algorithm from splitting a node further if the number of samples in that node falls below a certain threshold.</li>
  <li>The learning rate and the number of estimators used by the model.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">space</span><span class="o">=</span><span class="p">{</span>
    <span class="s">'enable_categorical'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="s">'learning_rate'</span><span class="p">:</span> <span class="n">hp</span><span class="p">.</span><span class="n">quniform</span><span class="p">(</span><span class="s">"learning_rate"</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">),</span>
    <span class="s">'max_depth'</span><span class="p">:</span> <span class="n">hp</span><span class="p">.</span><span class="n">quniform</span><span class="p">(</span><span class="s">"max_depth"</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
    <span class="s">'max_leaves'</span><span class="p">:</span> <span class="n">hp</span><span class="p">.</span><span class="n">quniform</span> <span class="p">(</span><span class="s">'max_leaves'</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
    <span class="s">'min_child_weight'</span> <span class="p">:</span> <span class="n">hp</span><span class="p">.</span><span class="n">quniform</span><span class="p">(</span><span class="s">'min_child_weight'</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
    <span class="s">'n_estimators'</span><span class="p">:</span> <span class="mi">2000</span><span class="p">,</span>
    <span class="s">'early_stopping_rounds'</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>
    <span class="s">'device'</span><span class="p">:</span> <span class="s">"cuda:0"</span><span class="p">,</span>
    <span class="s">'seed'</span><span class="p">:</span> <span class="mi">0</span>
    <span class="p">}</span>


<span class="k">def</span> <span class="nf">objective</span><span class="p">(</span><span class="n">space</span><span class="p">):</span>
    <span class="n">clf</span><span class="o">=</span><span class="n">XGBRegressor</span><span class="p">(</span>
         <span class="n">enable_categorical</span><span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
        <span class="n">learning_rate</span><span class="o">=</span><span class="n">space</span><span class="p">[</span><span class="s">"learning_rate"</span><span class="p">],</span>
        <span class="n">max_depth</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">space</span><span class="p">[</span><span class="s">"max_depth"</span><span class="p">]),</span>
        <span class="n">max_leaves</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">space</span><span class="p">[</span><span class="s">'max_leaves'</span><span class="p">]),</span>
        <span class="n">min_child_weight</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">space</span><span class="p">[</span><span class="s">'min_child_weight'</span><span class="p">]),</span>
        <span class="n">n_estimators</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">space</span><span class="p">[</span><span class="s">'n_estimators'</span><span class="p">]),</span>
        <span class="n">early_stopping_rounds</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">space</span><span class="p">[</span><span class="s">'early_stopping_rounds'</span><span class="p">]),</span>
        <span class="n">device</span><span class="o">=</span> <span class="s">"cuda:0"</span><span class="p">,</span>
        <span class="n">seed</span><span class="o">=</span><span class="mi">0</span>
    <span class="p">)</span>
        

    <span class="n">clf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> 
            <span class="n">eval_set</span><span class="o">=</span><span class="p">[(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">X_val</span><span class="p">,</span> <span class="n">y_val</span><span class="p">)],</span>
            <span class="n">verbose</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    
    <span class="n">pred</span> <span class="o">=</span> <span class="n">clf</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
    <span class="n">mae</span> <span class="o">=</span> <span class="n">mean_absolute_error</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
    <span class="n">rmse</span> <span class="o">=</span> <span class="n">root_mean_squared_error</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"mae: </span><span class="si">{</span><span class="n">mae</span><span class="si">}</span><span class="s">, rmse: </span><span class="si">{</span><span class="n">rmse</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">'loss'</span><span class="p">:</span> <span class="n">rmse</span><span class="p">,</span> <span class="s">'status'</span><span class="p">:</span> <span class="n">STATUS_OK</span> <span class="p">}</span>
    
<span class="n">trials</span> <span class="o">=</span> <span class="n">Trials</span><span class="p">()</span>
<span class="n">best_hyperparams</span> <span class="o">=</span> <span class="n">fmin</span><span class="p">(</span><span class="n">fn</span> <span class="o">=</span> <span class="n">objective</span><span class="p">,</span>
                        <span class="n">space</span> <span class="o">=</span> <span class="n">space</span><span class="p">,</span>
                        <span class="n">algo</span> <span class="o">=</span> <span class="n">tpe</span><span class="p">.</span><span class="n">suggest</span><span class="p">,</span>
                        <span class="n">max_evals</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span>
                        <span class="n">trials</span> <span class="o">=</span> <span class="n">trials</span><span class="p">)</span>
</code></pre></div></div>

<p>The predictions from the three models have been combined using a weighted average.</p>
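<p>A minimal sketch of this ensemble step, with placeholder predictions and weights (the actual weights are not given in the post):</p>

```python
import numpy as np

# Hypothetical per-model predictions on the same three samples.
pred_xgb = np.array([0.30, 0.10, 0.80])
pred_cat = np.array([0.25, 0.15, 0.70])
pred_lgb = np.array([0.35, 0.05, 0.90])

# Placeholder weights summing to 1, e.g. favoring the best model.
weights = np.array([0.4, 0.3, 0.3])
ensemble = np.average([pred_xgb, pred_cat, pred_lgb],
                      axis=0, weights=weights)
```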

<h3 id="results">Results</h3>

<p>One of the advantages of gradient boosting models is their interpretability.
These models can identify the most influential variables in making predictions.
The following figure presents the feature importance for the XGBoost model.
The <code class="language-plaintext highlighter-rouge">installations per day</code> feature is the most significant, providing valuable context about the day the user installs the application.
Another important variable is the city where the user downloaded the app, as it implies information about the user’s demographics.
Engagement variables, such as total revenue from ads, are also crucial for the final decision.</p>

<amp-img src="/assets/img/application-revenue-prediction/feature_importance.png" alt="Feature importance for the model predictions" height="489" width="862" layout="responsive"></amp-img>

<p>The RMSE of the model on the test dataset is 0.748. Since the target variable was log-transformed, this RMSE corresponds to a typical error of approximately $1.11 per user prediction.
The following figure compares the distribution of the actual target values in the test dataset with the values predicted by the model.
The model tends to predict values between 0.1 and 1, and struggles to capture the behavior of the target variable in the range between 0.001 and 0.01.</p>
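<p>The $1.11 figure can be recovered from the log-space RMSE, assuming a <code class="language-plaintext highlighter-rouge">log1p</code> transform of the revenue (an assumption, but consistent with the number quoted above):</p>

```python
import numpy as np

rmse_log = 0.748
# Back-transform an error of 0.748 in log1p space to dollars.
typical_error_usd = np.expm1(rmse_log)  # ≈ 1.11
```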

<amp-img src="/assets/img/application-revenue-prediction/predictions.png" alt="Distribution of the predictions of the model" height="489" width="862" layout="responsive"></amp-img>

<h2 id="conclusion">Conclusion</h2>

<p>This study aimed to develop a model to predict user-generated revenue within a digital application. By leveraging user installation features and engagement metrics, the model can provide valuable insights for optimizing monetization strategies.</p>

<p>The analysis revealed several key factors influencing revenue generation. Features like installation day, user city, and total ad revenue were identified as the most impactful on the model’s predictions.</p>

<p>The XGBoost model achieved an RMSE of 0.748 on the test dataset, which translates to a predicted error of approximately $1.11 for most user cases. However, the model exhibits limitations in accurately predicting low-revenue users (target values between 0.001 and 0.01).</p>

<p>Overall, this study demonstrates the potential of gradient boosting models for user revenue prediction within digital applications. By incorporating additional features and refining the model further, one can potentially improve prediction accuracy and gain deeper insights into user behavior that can be harnessed for effective revenue generation strategies.</p>]]></content:encoded><description>This article presents a study on predicting user-generated revenue within a digital application. Using gradient boosting models and a dataset containing user installation features and engagement metrics, the study aims to forecast revenue at day 120. The model's performance and key influencing factors are analyzed, providing insights for optimizing monetization strategies in the digital applications industry.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/application-revenue-prediction/main-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/application-revenue-prediction/main-16x9.jpg"/></item><item><title>Personalized Plan Care Information</title><link>https://cristianpb.github.io/blog/plant-care-information</link><category>programming</category><category>ecology</category><category>python</category><category>llm</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Sun, 28 Jul 2024 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/plant-care-information</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"/><description>Using RAG and LLM to provide accurate information about plant care.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/external-articles-responsive/plant-care-information-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" 
url="https://cristianpb.github.io/assets/img/external-articles-responsive/plant-care-information-16x9.jpg"/></item><item><title>Music playlists dashboard</title><link>https://cristianpb.github.io/blog/playlists-dashboard</link><category>visualization</category><category>programming</category><category>d3.js</category><category>python</category><category>opendata</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Fri, 23 Feb 2024 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/playlists-dashboard</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Music has a significant impact on the world in various ways.
Gaining insight into the patterns of popular music can be a fascinating endeavor.
In this post, we will demonstrate how to utilize Spotify’s trending music to stay up-to-date with current trends in a self-hosted manner.</p>

<center>
<amp-img src="/assets/img/playlists-dashboard/main.jpg" width="450" height="450" layout="intrinsic" alt="realistic photo of yoda as a DJ from behind listening music in front of a playlist dashboard in a big screen"></amp-img>
<br /><i>Stable diffusion: realistic photo of yoda as a DJ from behind listening music in front of a playlist dashboard in a big screen</i>
</center>

<h2 id="data">Data</h2>

<p>Spotify is a widely used music service, and its playlist data is publicly available on the internet.
There are several popular trending playlists that reflect current music preferences.
By utilizing GitHub Actions, you can automatically fetch this data at regular intervals and store it without needing to manage a database.
This data can then be easily accessed using straightforward HTTP requests directly to GitHub.
I utilize the project called <a href="https://github.com/spotDL/spotify-downloader">spotify-downloader</a> to download the playlist data and save it as a file.
The following code snippet does it:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">download</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
    <span class="n">cmd</span><span class="o">=</span><span class="sa">f</span><span class="s">"docker run --rm -v </span><span class="si">{</span><span class="n">CWD</span><span class="si">}</span><span class="s">/tmpplaylists:/music spotdl/spotify-downloader save </span><span class="si">{</span><span class="n">url</span><span class="p">.</span><span class="n">strip</span><span class="p">()</span><span class="si">}</span><span class="s"> --save-file </span><span class="si">{</span><span class="n">key</span><span class="si">}</span><span class="s">.spotdl"</span>
    <span class="n">p</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span><span class="n">cmd</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">),</span>
                             <span class="n">stderr</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">STDOUT</span><span class="p">,</span>
                             <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">PIPE</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">iter</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">stdout</span><span class="p">.</span><span class="n">readline</span><span class="p">,</span> <span class="sa">b</span><span class="s">''</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"&gt;&gt;&gt; </span><span class="si">{</span><span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">().</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Then I parse every playlist file and do simple preprocessing on the artists column in order to obtain a list of every artist that participates in each song.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">read_data</span><span class="p">():</span>
    <span class="n">appended_data</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'artists'</span><span class="p">,</span> <span class="s">'album_name'</span><span class="p">,</span> <span class="s">'date'</span><span class="p">,</span> <span class="s">'song_id'</span><span class="p">,</span> <span class="s">'cover_url'</span><span class="p">,</span> <span class="s">'playlist'</span><span class="p">,</span> <span class="s">'position'</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">glob</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">'tmpplaylists/*.spotdl'</span><span class="p">):</span>
        <span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_json</span><span class="p">(</span><span class="n">f</span><span class="p">).</span><span class="n">assign</span><span class="p">(</span>
                <span class="n">artists</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'artists'</span><span class="p">].</span><span class="n">explode</span><span class="p">().</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"'"</span><span class="p">,</span><span class="s">""</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\"</span><span class="s">"</span><span class="p">,</span> <span class="s">""</span><span class="p">).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">groupby</span><span class="p">(</span><span class="s">'index'</span><span class="p">).</span><span class="n">agg</span><span class="p">({</span><span class="s">'artists'</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="n">y</span><span class="p">.</span><span class="n">tolist</span><span class="p">()}),</span>
                <span class="n">playlist</span><span class="o">=</span><span class="n">f</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"/"</span><span class="p">)[</span><span class="mi">1</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"."</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span>
                <span class="n">position</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
                <span class="p">)</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">cols</span><span class="p">).</span><span class="n">difference</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">columns</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="sa">f</span><span class="s">'Columns: </span><span class="si">{</span><span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span><span class="si">}</span><span class="s">'</span>
        <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="sa">f</span><span class="s">"Shape </span><span class="si">{</span><span class="n">data</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s"> and </span><span class="si">{</span><span class="n">data</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s"> columns"</span>
        <span class="n">appended_data</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="p">(</span>
            <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">appended_data</span><span class="p">,</span> <span class="n">ignore_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span>
            <span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'static/data/data.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">";"</span><span class="p">)</span>
    <span class="p">)</span>
</code></pre></div></div>

<p>By employing scheduled GitHub Actions, it is possible to save playlist positions every week, enabling further processing of this data with other tools.</p>
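<p>Since the snapshot is committed to the repository, any client can read it with a plain HTTP request to GitHub’s raw content endpoint; no API key or database is required. A minimal sketch (the <code>main</code> branch name is an assumption):</p>

```python
# Read the committed playlist snapshot straight from GitHub over HTTP.
# The branch name ("main") is an assumption; adjust to the repository default.
import io
import urllib.request

import pandas as pd

RAW_URL = (
    "https://raw.githubusercontent.com/"
    "cristianpb/playlists/main/static/data/data.csv"
)

def parse_snapshot(csv_bytes: bytes) -> pd.DataFrame:
    """Parse the semicolon-separated CSV written by read_data()."""
    return pd.read_csv(io.BytesIO(csv_bytes), sep=";")

def fetch_snapshot(url: str = RAW_URL) -> pd.DataFrame:
    """Download and parse the latest committed snapshot."""
    with urllib.request.urlopen(url) as resp:
        return parse_snapshot(resp.read())
```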

<h2 id="observable-dashboard">Observable dashboard</h2>

<p>I utilize the <a href="https://observablehq.com/">Observable framework</a>, which incorporates the D3 JavaScript library for generating swift and adaptable visualizations.</p>

<p>Observable Notebook combines the features of conventional text editors, code editors, and document processors into a unified interface, simplifying the creation of rich and dynamic documents that integrate text, code, data visualization, and other multimedia elements.</p>

<p>Observable employs the concept of “cells” to arrange content within a notebook, where each cell can either contain plain text or executable code written in JavaScript or any other supported language. Cells can be rearranged, grouped, and nested, enabling the creation of hierarchical structures that reflect the logical organization of the document.</p>

<p>One can write a markdown notebook and import data produced in multiple languages: for example, I use a Python preprocessing pipeline, then import the data into the notebook and plot it using the available visualization functions.</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Playlist details</span>

const commit_date_old = Array.from(new Set(diffData.map(i =&gt; i.commit_date)))[1];
const commit_date_recent = Array.from(new Set(diffData.map(i =&gt; i.commit_date)))[0];

From ${commit_date_old} to ${commit_date_recent} new songs have been added to the playlist.

const playlistsNames = bestArtists.map(i =&gt; i.playlist)
const playlistChoosen = view(Inputs.select(new Set(playlistsNames), {value: playlistsNames[0], label: "Playlists"}));
const artistsNames = bestArtists.map(i =&gt; i.artists)

const tableRows = RecentSongAdds(diffData, playlistChoosen, commit_date_old, commit_date_recent)

<span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"card"</span> <span class="na">style=</span><span class="s">"margin: 1rem 0 2rem 0; padding: 0;"</span><span class="nt">&gt;</span>
  ${Inputs.table(tableRows, {
  columns: ["position", "artists", "name", "album_name", "attribute"],
  align: {"position": "left"},
  format: {
    attribute: (x) =&gt; x == "+" ? "New!" : x == "-" ? "🗑" : x &gt; 0 ? <span class="sb">`⬆${x}`</span> : x == 0 ? '--' : <span class="sb">`⬇${Math.abs(x)}`</span>
  }
})}
<span class="nt">&lt;/div&gt;</span>

<span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"grid grid-cols-1"</span> <span class="na">style=</span><span class="s">"grid-auto-rows: 560px;"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"card"</span><span class="nt">&gt;</span>
    ${BestArtistsPlot(bestArtists, playlistChoosen)}
  <span class="nt">&lt;/div&gt;</span>
<span class="nt">&lt;/div&gt;</span>

const mostPopularArtists = view(Inputs.select(mostFrequent(bestArtists.filter(i =&gt; i.playlist == playlistChoosen).map(i =&gt; i.artists)).slice(0,10), {value: artistsNames[0], label: "Popular artists"}));

<span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"grid grid-cols-1"</span> <span class="na">style=</span><span class="s">"grid-auto-rows: 560px;"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"card"</span><span class="nt">&gt;</span>
    ${BestSongsPlot(bestArtists, playlistChoosen, mostPopularArtists)}
  <span class="nt">&lt;/div&gt;</span>
<span class="nt">&lt;/div&gt;</span>
</code></pre></div></div>
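<p>In the Observable framework, a data loader is simply a script whose standard output becomes a file served to the notebook. A hypothetical Python loader named <code>data.csv.py</code> could therefore re-emit the snapshot produced by the preprocessing pipeline:</p>

```python
# data.csv.py -- a hypothetical Observable framework data loader.
# The framework runs this script at build time and serves whatever is
# written to stdout as "data.csv".
import os
import sys

import pandas as pd

def load_snapshot(path: str) -> pd.DataFrame:
    """Read the semicolon-separated snapshot written by the pipeline."""
    return pd.read_csv(path, sep=";")

# Guarded so the sketch can also be imported outside the repository.
if __name__ == "__main__" and os.path.exists("static/data/data.csv"):
    load_snapshot("static/data/data.csv").to_csv(sys.stdout, index=False)
```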

<p>The dashboard is hosted on GitHub Pages and is available at <a href="https://cristianpb.github.io/playlists">cristianpb.github.io/playlists</a>.</p>

<h2 id="analysis">Analysis</h2>

<p>The dashboard allows for the identification of patterns in the development of Spotify playlists over time.
The <a href="https://open.spotify.com/playlist/37i9dQZF1DXcBWIGoYBM5M">Today Top Hits</a> playlist reflects global music trends, having garnered more than 34 million likes at the time of writing this article.</p>

<center>
<amp-img src="/assets/img/playlists-dashboard/songs-popular-artists.png" width="901" height="450" layout="intrinsic" alt="realistic photo of yoda as a DJ from behind listening music in front of a playlist dashboard in a big screen"></amp-img>
<br />
</center>

<p>We can observe artists such as Olivia Rodrigo, who has multiple tracks featured in the “Today’s Top Hits” playlist. Some songs exhibit a consistent pattern, indicating that they have maintained popularity and catchiness over time, for example, “The Vampire Song,” which remained among the top 35 songs for more than four months. Conversely, other tracks like “Catch Me Now” may initially appear in the playlist due to the artist’s popularity but subsequently decline in ranking during subsequent weeks.</p>

<center>
<amp-img src="/assets/img/playlists-dashboard/artist-constant-radio.png" width="901" height="450" layout="intrinsic" alt="realistic photo of yoda as a DJ from behind listening music in front of a playlist dashboard in a big screen"></amp-img>
<br />
</center>

<p>One might also observe that artist-specific radio playlists exhibit minimal fluctuations; for instance, the “Muse Radio,” “Coldplay Radio,” and “The Strokes” radio playlists undergo infrequent changes.</p>

<h2 id="discusion">Discussion</h2>

<p>Observable is a practical platform for crafting data analyses, offering
versatile connectors and support for multiple programming languages. The
variety of available visualizations is crucial, and comprehensive documentation
plays a significant role in guiding users to create effective visualizations.</p>

<p>However, incorporating reactive filters or reusing variables within an
Observable notebook necessitates writing JavaScript code, which may be a
drawback for some users. Although the reactivity of Observable notebooks is
functional, it might not be the most advanced option available.</p>

<p>The code to process the data and build the dashboard is available at <a href="https://github.com/cristianpb/playlists">github.com/cristianpb/playlists</a>.</p>]]></content:encoded><description>This article shows how to collect Spotify playlist data with GitHub Actions and build a self-hosted dashboard with the Observable framework to track music trends over time.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/playlists-dashboard/main-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/playlists-dashboard/main-16x9.jpg"/></item><item><title>Log analysis using Fluentbit Elasticsearch Kibana</title><link>https://cristianpb.github.io/blog/fluentbit-elasticsearch-kibana</link><category>visualization</category><category>system management</category><category>elasticsearch</category><category>docker</category><category>traefik</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Mon, 11 Sep 2023 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/fluentbit-elasticsearch-kibana</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Logs are a valuable source of information about the health and performance of
an application or system. By analyzing logs, you can identify problems early on
and take corrective action before they cause outages or other disruptions.</p>

<p>One way to analyze logs is to use a tool like Fluent Bit to collect them from
different sources and send them to a central repository like Elasticsearch.
Elasticsearch is a distributed search and analytics engine that can store and
search large amounts of data quickly and efficiently.</p>

<p>Once the logs are stored in Elasticsearch, you can use Kibana to visualize and
analyze them. Kibana provides a variety of tools for exploring and
understanding log data, including charts, tables, and dashboards.</p>

<p>By analyzing logs using Fluent Bit, Elasticsearch, and Kibana, you can gain
valuable insights into the health and performance of your applications and
systems. This information can help you to identify and troubleshoot problems,
improve performance, and ensure the availability of your applications.</p>

<center>
<amp-img src="/assets/img/fluentbit-elasticsearch-kibana/drawing.png" width="901" height="450" layout="intrinsic" alt="log injection architecture"></amp-img>
<br /><i>Log injection architecture</i>
</center>

<h2 id="fluent-bit-log-injection">Fluent-bit: log injection</h2>

<p>Traefik, a modern reverse proxy and load balancer, generates access logs for
every HTTP request. These logs can be stored as plain text files and compressed
using the logrotate Unix utility. Fluent Bit, a lightweight log collector,
provides a simple way to insert logs into Elasticsearch. In fact, it provides
several input connectors for other sources, such as syslog logs, and output
connectors, such as Datadog or New Relic.</p>

<p>To send Traefik access logs to Elasticsearch using Fluent Bit, you will need to:</p>
<ul>
  <li>Install Fluent Bit on the machine where Traefik is running.</li>
  <li>Configure Fluent Bit to collect the Traefik access logs.</li>
  <li>Configure Elasticsearch to receive the logs from Fluent Bit</li>
</ul>

<p>The Fluent Bit configuration file has the following sections:</p>
<ul>
  <li>Input: I use the <code class="language-plaintext highlighter-rouge">tail</code> connector to fetch data from access.log file</li>
  <li>Filter: I use the MaxMind GeoIP2 plugin to geocode IP addresses.</li>
  <li>Output: points directly to the Elasticsearch database.</li>
</ul>

<p>The following configuration file shows how to collect Traefik logs and send
them to Elasticsearch:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># fluentbit.conf</span>
<span class="nn">[SERVICE]</span>
    <span class="err">flush</span>             <span class="mi">5</span>
    <span class="err">daemon</span>            <span class="err">off</span>
    <span class="err">http_server</span>       <span class="err">off</span>
    <span class="err">log_level</span>         <span class="mf">inf</span><span class="err">o</span>
    <span class="err">parsers_file</span>      <span class="err">parsers.conf</span>

<span class="nn">[INPUT]</span>
    <span class="err">name</span>              <span class="err">tail</span>
    <span class="err">path</span>              <span class="err">/var/log/traefik/access.log,/var/log/traefik/access.log.</span><span class="mi">1</span>
    <span class="err">Parser</span>            <span class="err">traefik</span>
    <span class="err">Skip_Long_Lines</span>   <span class="err">On</span>

<span class="nn">[FILTER]</span>
    <span class="err">Name</span>                  <span class="err">geoip</span><span class="mi">2</span>
    <span class="err">Match</span>                 <span class="err">*</span>
    <span class="err">Database</span>              <span class="err">/fluent-bit/etc/GeoLite</span><span class="mi">2</span><span class="err">-City.mmdb</span>
    <span class="err">Lookup_key</span>            <span class="err">host</span>
    <span class="err">Record</span> <span class="err">country</span> <span class="err">host</span>   <span class="err">%{country.names.en}</span>
    <span class="err">Record</span> <span class="err">isocode</span> <span class="err">host</span>   <span class="err">%{country.iso_code}</span>
    <span class="err">Record</span> <span class="err">latitude</span> <span class="err">host</span>  <span class="err">%{location.latitude}</span>
    <span class="err">Record</span> <span class="err">longitude</span> <span class="err">host</span> <span class="err">%{location.longitude}</span>
    
<span class="nn">[FILTER]</span>
    <span class="err">Name</span>                <span class="err">lua</span>
    <span class="err">Match</span>               <span class="err">*</span>
    <span class="err">Script</span>              <span class="err">/fluent-bit/etc/geopoint.lua</span>
    <span class="err">call</span>                <span class="err">geohash_gen</span>

<span class="nn">[OUTPUT]</span>
    <span class="err">Name</span>                <span class="err">es</span>
    <span class="err">Match</span>               <span class="err">*</span>
    <span class="err">Host</span>                <span class="err">esurl.com</span>
    <span class="err">Port</span>                <span class="mi">443</span>
    <span class="err">HTTP_User</span>           <span class="err">username</span>
    <span class="err">HTTP_Passwd</span>         <span class="err">password</span>
    <span class="err">tls</span>                 <span class="err">On</span>
    <span class="err">tls.verify</span>          <span class="err">On</span>
    <span class="err">Logstash_Format</span>     <span class="err">On</span>
    <span class="err">Replace_Dots</span>        <span class="err">On</span>
    <span class="err">Retry_Limit</span>         <span class="err">False</span>
    <span class="err">Suppress_Type_Name</span>  <span class="err">On</span>
    <span class="err">Logstash_DateFormat</span> <span class="err">all</span>
    <span class="err">Generate_ID</span>         <span class="err">On</span>
</code></pre></div></div>

<p>I use an additional filter function to produce a geohash record, which will
then be used in Kibana for geo map plots.</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- geopoint.lua
function geohash_gen(tag, timestamp, record)
    local new_record = record
    local lat = record["latitude"]
    local lon = record["longitude"]
    new_record["geohash"] = lat .. "," .. lon
    return 1, timestamp, new_record
end
</code></pre></div></div>

<p>The parser uses a regular expression to extract the different fields of each record.
By default all fields are parsed as strings, but you can assign other types, such as integer for fields like <em>request size</em>, <em>request duration</em> and <em>number of requests</em>.</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># parsers.conf</span>
<span class="nn">[PARSER]</span>
    <span class="err">Name</span>   <span class="err">traefik</span>
    <span class="err">Format</span> <span class="err">regex</span>
    <span class="err">Regex</span>  <span class="err">^(?&lt;host&gt;</span><span class="nn">[\S]</span><span class="err">*)</span> <span class="err">[^</span> <span class="err">]*</span> <span class="err">(?&lt;user&gt;[^</span> <span class="err">]*)</span> <span class="err">\</span><span class="nn">[(?&lt;time&gt;[^\]]*)\]</span> <span class="err">"(?&lt;method&gt;\S+)(?:</span> <span class="err">+(?&lt;path&gt;</span><span class="nn">[^\"]</span><span class="err">*?)(?&lt;protocol&gt;\S*)?)?"</span> <span class="err">(?&lt;code&gt;[^</span> <span class="err">]*)</span> <span class="err">(?&lt;size&gt;[^</span> <span class="err">]*)(?:</span> <span class="err">"(?&lt;referer&gt;</span><span class="nn">[^\"]</span><span class="err">*)"</span> <span class="err">"(?&lt;agent&gt;</span><span class="nn">[^\"]</span><span class="err">*)")?</span> <span class="err">(?&lt;number_requests&gt;[^</span> <span class="err">]*)</span> <span class="err">"(?&lt;router_name&gt;</span><span class="nn">[^\"]</span><span class="err">*)"</span> <span class="err">"(?&lt;router_url&gt;</span><span class="nn">[^\"]</span><span class="err">*)"</span> <span class="err">(?&lt;request_duration&gt;</span><span class="nn">[\d]</span><span class="err">*)ms$</span>
    <span class="err">Time_Key</span> <span class="err">time</span>
    <span class="err">Time_Format</span> <span class="err">%d/%b/%Y:%H:%M:%S</span> <span class="err">%z</span>
    <span class="err">Types</span> <span class="err">request_duration:integer</span> <span class="err">size:integer</span> <span class="err">number_requests:integer</span>
</code></pre></div></div>
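<p>A quick way to sanity-check the parser is to run the same regular expression, translated to Python’s named-group syntax, against a made-up sample line in Traefik’s default access-log format:</p>

```python
# Sanity-check the Traefik parser regex from parsers.conf. Python uses
# (?P<name>...) where Fluent Bit writes (?<name>...).
import re

TRAEFIK_RE = re.compile(
    r'^(?P<host>\S*) [^ ]* (?P<user>[^ ]*) \[(?P<time>[^\]]*)\] '
    r'"(?P<method>\S+)(?: +(?P<path>[^"]*?)(?P<protocol>\S*)?)?" '
    r'(?P<code>[^ ]*) (?P<size>[^ ]*)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")? '
    r'(?P<number_requests>[^ ]*) "(?P<router_name>[^"]*)" '
    r'"(?P<router_url>[^"]*)" (?P<request_duration>\d*)ms$'
)

# A made-up sample line (IP, router and timings are placeholders).
SAMPLE = (
    '203.0.113.7 - - [11/Sep/2023:10:15:32 +0000] '
    '"GET /blog HTTP/1.1" 200 5134 "-" "Mozilla/5.0" '
    '42 "blog@docker" "http://172.18.0.2:80" 12ms'
)

match = TRAEFIK_RE.match(SAMPLE)
assert match is not None, "the parser regex should accept the sample line"
```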

<p>Once you have configured Fluent Bit, you can start it by running <code class="language-plaintext highlighter-rouge">fluent-bit -c fluent-bit.conf</code> or by using Docker Compose:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># docker-compose.yml</span>
<span class="na">version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.7"</span>

<span class="na">services</span><span class="pi">:</span>
  <span class="na">fluent-bit</span><span class="pi">:</span>
    <span class="na">container_name</span><span class="pi">:</span> <span class="s">fluent-bit</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">fluent/fluent-bit</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">./parsers.conf:/fluent-bit/etc/parsers.conf</span>
      <span class="pi">-</span> <span class="s">./fluentbit.conf:/fluent-bit/etc/fluent-bit.conf</span>
      <span class="pi">-</span> <span class="s">./geopoint.lua:/fluent-bit/etc/geopoint.lua</span>
      <span class="pi">-</span> <span class="s">./GeoLite2-City.mmdb:/fluent-bit/etc/GeoLite2-City.mmdb</span>
      <span class="pi">-</span> <span class="s">/var/log/traefik:/var/log/traefik</span>
</code></pre></div></div>

<h2 id="elasticsearch-log-indexing">Elasticsearch: Log indexing</h2>

<p>Elasticsearch is a popular open-source search and analytics engine that can be
used for a variety of tasks, including log analysis. It is a good choice for
log analysis because it supports complex queries and provides a REST API that
accepts them directly in readable JSON format.</p>

<p>Elasticsearch uses a distributed architecture, which means that it can be
scaled to handle large amounts of data. It also supports a variety of data
types, including text, numbers, and dates, which makes it a versatile tool for
log analysis.</p>

<p>To use Elasticsearch for log analysis, you would first need to index the logs
into Elasticsearch. This can be done using a variety of tools, such as Logstash
or Fluent Bit. Once the logs are indexed, you can then query them using
Elasticsearch’s powerful query language.</p>

<p>Elasticsearch’s query language is based on JSON, which makes it easy to read
and write. It also supports a variety of features, such as full-text search,
regular expressions, and aggregations.</p>
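<p>For example, counting requests per HTTP status code over the last 24 hours is a single terms aggregation. The sketch below only builds the JSON body; the host name and index in the comment are placeholders taken from the examples in this post:</p>

```python
# Build an aggregation query: requests per HTTP status code, last N hours.
# Send it with: curl -XPOST "https://hostname/logstash-all/_search" -d "$BODY"
import json

def status_code_histogram(hours: int = 24) -> dict:
    """Terms aggregation on code.keyword, restricted to a time window."""
    return {
        "size": 0,  # only the aggregation buckets, no individual hits
        "query": {"range": {"@timestamp": {"gte": f"now-{hours}h"}}},
        "aggs": {
            "by_code": {"terms": {"field": "code.keyword", "size": 10}}
        },
    }

body = json.dumps(status_code_histogram())
```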

<h3 id="mappings">Mappings</h3>

<p>Elasticsearch creates a mapping for new indices by default, guessing the type
of each field. However, it is better to provide an explicit mapping to the
index. This will allow you to control the type of each field and the operations
that can be performed on it. For example, you can specify that a field is of
type ip so that it can be used to filter for IP address groups, or you can
specify that a field is of type geo_point so that it can be used to filter by
a specific location.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-XPUT</span> <span class="s2">"https://hostname/logstash-all"</span> <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="nt">-d</span> <span class="s1">'{ "mappings": { "properties": { "@timestamp": { "type": "date" }, "agent": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "code": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "country": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "geohash": { "type": "geo_point" }, "host": { "type": "ip" }, "isocode": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "latitude": { "type": "float" }, "longitude": { "type": "float" }, "method": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "number_requests": { "type": "long" }, "path": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "protocol": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "referer": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "request_duration": { "type": "long" }, "router_name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "router_url": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "size": { "type": "long" }, "user": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } }'</span>
</code></pre></div></div>

<h3 id="managed-database">Managed database</h3>

<p><a href="https://bonsai.io/">Bonsai.io</a> is a managed Elasticsearch service that
provides high availability and scalability without the need to manage or deploy
the underlying infrastructure. Bonsai offers a variety of plans to suit
different project requirements.</p>

<center>
<amp-img src="/assets/img/fluentbit-elasticsearch-kibana/bonsai-io.jpg" width="901" height="450" layout="intrinsic" alt="bonsai.io overview dashboard"></amp-img>
<br /><i>Bonsai.io dashboard</i>
</center>

<p>The hobbyist tier is more than enough for this kind of use case; it comes
with a maximum of 35k documents, 125 MB of data and 10 shards. At the time of
writing this article it is free, and you don’t have to enter a credit card to
use it.</p>

<p>To stay within the limits of the hobbyist tier, I use the following cronjob
to remove old documents:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> POST <span class="s2">"https://hostname/logstash-all/_delete_by_query"</span> <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="nt">-d</span> <span class="s1">'{ "query": { "bool": { "filter": [ { "range": { "@timestamp": { "lt": "now-10d" } } } ] } } }'</span>
</code></pre></div></div>

<p>For my use case, a 10-day retention window is enough to stay within the
plan limits.</p>
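<p>The same delete-by-query body can be built in Python if you prefer to parameterize the retention window instead of hard-coding it in the cronjob:</p>

```python
# Build the retention delete-by-query body; POST it to
# https://hostname/logstash-all/_delete_by_query (hostname is a placeholder).
import json

def retention_body(days: int = 10) -> dict:
    """Match every document older than the given number of days."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"lt": f"now-{days}d"}}}
                ]
            }
        }
    }

payload = json.dumps(retention_body())
```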

<h2 id="kibana-log-analysis">Kibana: Log analysis</h2>

<p><a href="https://bonsai.io">Bonsai.io</a> also provides a managed Kibana service connected to the Elasticsearch cluster.</p>

<p>There are certain limitations in stack management: for example, there is no way to manage the index life cycle or configure alerting.</p>

<p>Nevertheless, it provides basic functionality to create useful dashboards and
discover patterns inside the logs.</p>

<center>
<amp-img src="/assets/img/fluentbit-elasticsearch-kibana/kibana.jpg" width="901" height="450" layout="intrinsic" alt="kibana dashboard provided by bonsai.io"></amp-img>
<br /><i>Kibana dashboard</i>
</center>

<p>It’s interesting to see bot requests trying to exploit vulnerabilities in
services like WordPress, as well as bots scraping content.</p>

<h2 id="discusion">Discussion</h2>

<p>The following stack provides a simple and cost-effective way to analyze logs.
The computational footprint on your server is very low because most of the
infrastructure is in the cloud. There are many freemium services, such as
Bonsai.io and New Relic, that can be used to ingest and analyze logs.</p>

<p>Observability is important for infrastructure management, but it is also
important to have alerting capabilities to detect and respond to threats.
Unfortunately, these plugins are not typically included in the free plan, so
you will need to upgrade to a paid plan to get them.</p>]]></content:encoded><description>This article shows how to analyze logs using Kibana dashboards. Fluentbit is used for injecting logs to elasticsearch, then it is connected to kibana to get some insights.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/fluentbit-elasticsearch-kibana/main-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/fluentbit-elasticsearch-kibana/main-16x9.jpg"/></item><item><title>Magic wand gesture recognition using Tensorflow and SensiML</title><link>https://cristianpb.github.io/blog/magic-wand</link><category>data science</category><category>programming</category><category>fpga</category><category>python</category><category>sensiml</category><category>quickfeather</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Fri, 4 Jun 2021 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/magic-wand</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>During the last decade, IoT devices have become very popular. 
Their small form factor makes them optimal for all kinds of applications.
Their technology has also improved over the last decade, and nowadays they
are able to do machine learning at the edge.</p>

<p>I recently received a QuickFeather microcontroller from a
<a href="https://www.hackster.io/contests/quickfeather">Hackster.IO</a> contest. One of
the main features of this device is its built-in eFPGA, which can optimize
parallel computations on the edge.</p>

<p>This post will explore the capabilities of this little beast and show how to
run a machine learning model that was trained using Tensorflow.
The use case will focus on gesture recognition, so the device will be
able to detect whether a movement corresponds to a letter of the alphabet.</p>

<h2 id="quickfeather">QuickFeather</h2>

<p>The QuickFeather is a very powerful device with a small form factor (58mm
x 22mm). It’s the first FPGA-enabled microcontroller to be fully supported with
Zephyr RTOS. Additionally, it includes an MC3635 accelerometer, a pressure
sensor, a microphone and an integrated Li-Po battery charger.</p>

<p>Unlike other development kits which are based on proprietary hardware and
software tools, QuickFeather is based on open source hardware and is built
around 100% open source software. QuickLogic provides <a href="https://github.com/QuickLogic-Corp/qorc-sdk">a nice
SDK</a> to flash some FreeRTOS
software and get started. There is plenty of documentation, along with
examples, in their GitHub repository.</p>

<p>Since the QuickFeather is optimized for battery-saving use cases, it includes
neither Wi-Fi nor Bluetooth connectivity. Therefore, data can only be
transferred over a UART serial connection.</p>

<h2 id="capture-data">Capture data</h2>

<p>The on-board accelerometer is the main sensor for this use case. I use a
USB-serial converter in order to read data directly from the accelerometer and
transfer it to another host that is connected to the other end of the USB
cable.</p>
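<p>As an illustration of what reading that serial stream could look like on the host side, here is a Python sketch using pyserial; the port name, baud rate, and CSV sample format are assumptions for this sketch, not the exact protocol used by the firmware:</p>

```python
def parse_sample(line):
    """Parse one CSV accelerometer sample such as '12,-3,45' into ints."""
    x, y, z = (int(v) for v in line.strip().split(","))
    return x, y, z

def read_samples(port="/dev/ttyUSB0", baudrate=115200, n=100):
    """Yield up to n (x, y, z) samples read from the serial port."""
    import serial  # pyserial; imported lazily so parse_sample has no dependency
    with serial.Serial(port, baudrate, timeout=1) as ser:
        for _ in range(n):
            raw = ser.readline().decode("ascii", errors="ignore")
            if raw.strip():
                yield parse_sample(raw)
```

<p>In practice the SensiML open-gateway application handles this capture step, so the sketch above is only there to show what travels over the wire.</p>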

<p>Data is captured and analysed using another machine. I personally connected a
Raspberry Pi, which also has a small form factor, in order to have flexibility
when performing the different gestures.</p>

<p>SensiML provides a <a href="https://github.com/sensiml/open-gateway">web application</a> to visualize and save data.
It is a Python application that runs a Flask webserver and
provides nice features such as capturing video at the same time, in order
to correlate it with the saved data.
The code is available on GitHub, so one can see how it works and even
<a href="https://github.com/sensiml/open-gateway/pull/29">propose some modifications</a>, like I did.</p>

<p>I captured data from <em>O</em>, <em>W</em> and <em>Z</em> gestures as you can see in the following picture:</p>

<center>
<amp-img src="/assets/img/magic-wand/data-capture.gif" width="640" height="360" layout="intrinsic" alt="data capture using open-gateway application"></amp-img>
<br /><i>Data capture using open-gateway application</i>
</center>

<h2 id="label-data-with-label-studio">Label data with Label Studio</h2>

<p>Once data is collected, one needs to label it so that a machine
learning model can be taught to associate a certain movement with a gesture.
I used <a href="https://labelstud.io/">Label Studio</a>, which is an open source data
labelling tool. It can be used to label different kinds of data such as images,
audio, text, time series, or a combination of the above.</p>

<p>It can be deployed on-premise using a Docker image, which is very handy if you
want to get started quickly.</p>
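<p>For reference, a typical invocation looks like the following; the image name and paths are taken from the Label Studio documentation at the time of writing, so check the current docs before relying on them:</p>

```shell
# Start Label Studio on port 8080, persisting projects and uploads in ./mydata
docker run -it -p 8080:8080 -v $(pwd)/mydata:/label-studio/data heartexlabs/label-studio:latest
```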

<h3 id="setup">Setup</h3>

<p>Once Label Studio starts, it has to be configured for a labelling task. In this case, the
task deals with time series data. One can choose a graphical
configuration using preconfigured templates, or customize it oneself
with XML-like configuration code. Here is the code I use to configure the data
coming from the <em>X</em>, <em>Y</em> and <em>Z</em> accelerometer axes.</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;View&gt;</span>
  <span class="c">&lt;!-- Control tag for labels --&gt;</span>
  <span class="nt">&lt;TimeSeriesLabels</span> <span class="na">name=</span><span class="s">"label"</span> <span class="na">toName=</span><span class="s">"ts"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;Label</span> <span class="na">value=</span><span class="s">"O"</span> <span class="na">background=</span><span class="s">"red"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;Label</span> <span class="na">value=</span><span class="s">"Z"</span> <span class="na">background=</span><span class="s">"green"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;Label</span> <span class="na">value=</span><span class="s">"W"</span> <span class="na">background=</span><span class="s">"blue"</span><span class="nt">/&gt;</span>
  <span class="nt">&lt;/TimeSeriesLabels&gt;</span>
  <span class="c">&lt;!-- Object tag for time series data source --&gt;</span>
  <span class="nt">&lt;TimeSeries</span> <span class="na">name=</span><span class="s">"ts"</span> <span class="na">valueType=</span><span class="s">"url"</span> <span class="na">value=</span><span class="s">"$timeseriesUrl"</span> <span class="na">sep=</span><span class="s">","</span> <span class="nt">&gt;</span>
    <span class="nt">&lt;Channel</span> <span class="na">column=</span><span class="s">"AccelerometerX"</span> <span class="na">strokeColor=</span><span class="s">"#1f77b4"</span> <span class="na">legend=</span><span class="s">"AccelerometerX"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;Channel</span> <span class="na">column=</span><span class="s">"AccelerometerY"</span> <span class="na">strokeColor=</span><span class="s">"#ff7f0e"</span> <span class="na">legend=</span><span class="s">"AccelerometerY"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;Channel</span> <span class="na">column=</span><span class="s">"AccelerometerZ"</span>  <span class="na">strokeColor=</span><span class="s">"#111111"</span> <span class="na">legend=</span><span class="s">"AccelerometerZ"</span><span class="nt">/&gt;</span>
  <span class="nt">&lt;/TimeSeries&gt;</span>
<span class="nt">&lt;/View&gt;</span>
</code></pre></div></div>

<p>Label Studio has a nice preview feature, which shows how the labelling task will look with
the supplied configuration. The following screenshot shows how the interface
looks during the setup process.</p>

<center>
<amp-img src="/assets/img/magic-wand/label-studio1.png" width="901" height="450" layout="intrinsic" alt="label studio setup configuration"></amp-img>
<br /><i>Label Studio setup configuration</i>
</center>

<h3 id="labelling">Labelling</h3>

<p>One of the nicest things about Label Studio is that one can go really
fast using the keyboard shortcuts. It also provides some machine learning
plugins which make predictions from the partially labelled data.
The following screenshot shows the interface with some labelled data.</p>

<center>
<amp-img src="/assets/img/magic-wand/label-studio2.png" width="901" height="450" layout="intrinsic" alt="label studio data labelled"></amp-img>
<br /><i>Data labelled using Label Studio</i>
</center>

<p>From a machine learning perspective, the exported data should be a CSV file with four different columns. Even if Label Studio is able to export CSV, it didn’t have the right format for me; instead it looks like the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>timeseriesUrl,id,label,annotator,annotation_id
/data/upload/W.csv,3,"[{""start"": 156, ""end"": 422, ""instant"": false, ""timeserieslabels"": [""W""]}, ... ]",admin@admin.com,3
/data/upload/Z.csv,2,"[{""start"": 141, ""end"": 419, ""instant"": false, ""timeserieslabels"": [""Z""]}, ...]",admin@admin.com,2
/data/upload/O.csv,1,"[{""start"": 77, ""end"": 389, ""instant"": false, ""timeserieslabels"": [""O""]}, ...]",admin@admin.com,1
</code></pre></div></div>

<p>So I decided to export the labels in JSON format and then build a Python script to
transform and combine them all. The following script transforms three JSON files
from <em>Label Studio</em> into a single file with 4 columns: <em>AccelerometerX</em>,
<em>AccelerometerY</em>, <em>AccelerometerZ</em> and <em>Label</em>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">df_all</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">LABELS</span> <span class="o">=</span> <span class="p">[</span><span class="s">'W'</span><span class="p">,</span> <span class="s">'Z'</span><span class="p">,</span> <span class="s">'O'</span><span class="p">]</span>
<span class="n">sensor_columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'AccelerometerX'</span><span class="p">,</span><span class="s">'AccelerometerY'</span><span class="p">,</span> <span class="s">'AccelerometerZ'</span><span class="p">,</span> <span class="s">'Label'</span><span class="p">]</span>

<span class="k">for</span> <span class="n">ind</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">LABELS</span><span class="p">):</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">label</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">label</span><span class="si">}</span><span class="s">.csv'</span><span class="p">)</span>
    <span class="n">events</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">read_json</span><span class="p">(</span><span class="s">'WOZ.json'</span><span class="p">)[</span><span class="s">'label'</span><span class="p">][</span><span class="n">ind</span><span class="p">])</span>

    <span class="n">df</span><span class="p">[</span><span class="s">'Label'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">k</span><span class="p">,</span><span class="n">v</span> <span class="ow">in</span> <span class="n">events</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="s">'start'</span><span class="p">],</span> <span class="n">v</span><span class="p">[</span><span class="s">'end'</span><span class="p">]):</span>
            <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s">'Label'</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span><span class="p">[</span><span class="s">'timeserieslabels'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">df</span><span class="p">[</span><span class="s">'LabelNumerical'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Categorical</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">Label</span><span class="p">)</span>

    <span class="n">df</span><span class="p">[</span><span class="n">sensor_columns</span><span class="p">].</span><span class="n">to_csv</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">label</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">label</span><span class="si">}</span><span class="s">_label.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">df_all</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df_all</span><span class="p">,</span> <span class="n">df</span><span class="p">],</span> <span class="n">sort</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="n">df_all</span><span class="p">[</span><span class="n">sensor_columns</span><span class="p">].</span><span class="n">to_csv</span><span class="p">(</span><span class="sa">f</span><span class="s">'WOZ_label.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>The resulting data can be directly used as a time series data and a machine
learning model can be trained in order to recognise the patterns automatically.
The following picture shows data for W, O and Z patterns.</p>

<center>
<amp-img src="/assets/img/magic-wand/matplotlib-data.png" width="1200" height="350" layout="intrinsic" alt="labelled data plotted using matplotlib"></amp-img>
<br /><i>Labelled data for W, O and Z gestures</i>
</center>

<h2 id="training-a-model-using-sensiml">Training a model using SensiML</h2>

<p>SensiML provides a python package to build a data pipeline which can be used to
train a machine learning model. One needs to create a free account in order to
use it. There is a lot of <a href="https://sensiml.com/tensorflow-lite/">documentation and examples</a>
available online.</p>

<h3 id="pipeline">Pipeline</h3>

<p>Pipelines are a key component of the SensiML workflow. Pipelines store the
preprocessing, feature extraction, and model building steps.</p>

<p>Model training can be done either in the SensiML cloud, or by using Tensorflow to
train the model locally and then uploading it to SensiML in order to obtain the
firmware code to run on the embedded device.</p>

<p>In order to train the model locally, one needs to build a data pipeline to process data and calculate the feature vector.
This is done using the following pipeline:</p>
<ul>
  <li>The <strong>Input Query</strong> function, which specifies what data is being fed into the model</li>
  <li>The <strong>Segmentation</strong>, which specifies how the data should be fed to the classifier.</li>
  <li>The <strong>Windowing</strong> segmenter, which captures data depending on the expected gesture length.</li>
  <li>The <strong>Feature Generator</strong>, which specifies which features should be extracted from the raw time-series data</li>
  <li>The <strong>Feature Selector</strong> which selects the best features. In this case, we are using the custom feature selector to downsample the data.</li>
  <li>The <strong>Feature Transform</strong> which specifies how to transform the features after extraction. In this case, it is to scale them to 1 byte each</li>
</ul>
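<p>To make the segmentation step concrete, here is a minimal numpy sketch of how a sliding window with <code>window_size</code> 350 and <code>delta</code> 25 cuts the signal into overlapping segments; this is an illustration of the idea, not SensiML’s implementation:</p>

```python
import numpy as np

def sliding_windows(data, window_size=350, delta=25):
    """Return overlapping segments of data (shape: samples x channels)."""
    starts = range(0, len(data) - window_size + 1, delta)
    return np.stack([data[s:s + window_size] for s in starts])

signal = np.random.rand(1000, 3)  # stand-in for X, Y, Z accelerometer data
windows = sliding_windows(signal)
# windows.shape == (27, 350, 3): 27 segments, each 350 samples, starting 25 apart
```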

<p>Here is the Python code for the pipeline:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dsk</span><span class="p">.</span><span class="n">pipeline</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span>
<span class="n">dsk</span><span class="p">.</span><span class="n">pipeline</span><span class="p">.</span><span class="n">set_input_data</span><span class="p">(</span><span class="s">'wand_10_movements.csv'</span><span class="p">,</span> <span class="n">group_columns</span><span class="o">=</span><span class="p">[</span><span class="s">'Label'</span><span class="p">],</span> <span class="n">label_column</span><span class="o">=</span><span class="s">'Label'</span><span class="p">,</span> <span class="n">data_columns</span><span class="o">=</span><span class="n">sensor_columns</span><span class="p">)</span>

<span class="n">dsk</span><span class="p">.</span><span class="n">pipeline</span><span class="p">.</span><span class="n">add_segmenter</span><span class="p">(</span><span class="s">"Windowing"</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s">"window_size"</span><span class="p">:</span> <span class="mi">350</span><span class="p">,</span> <span class="s">"delta"</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span> <span class="s">"train_delta"</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span> <span class="s">"return_segment_index"</span><span class="p">:</span> <span class="bp">False</span><span class="p">})</span>

<span class="n">dsk</span><span class="p">.</span><span class="n">pipeline</span><span class="p">.</span><span class="n">add_feature_generator</span><span class="p">(</span>
    <span class="p">[</span>
        <span class="p">{</span><span class="s">'subtype_call'</span><span class="p">:</span> <span class="s">'Statistical'</span><span class="p">},</span>
        <span class="p">{</span><span class="s">'subtype_call'</span><span class="p">:</span> <span class="s">'Shape'</span><span class="p">},</span>
        <span class="p">{</span><span class="s">'subtype_call'</span><span class="p">:</span> <span class="s">'Column Fusion'</span><span class="p">},</span>
        <span class="p">{</span><span class="s">'subtype_call'</span><span class="p">:</span> <span class="s">'Area'</span><span class="p">},</span>
        <span class="p">{</span><span class="s">'subtype_call'</span><span class="p">:</span> <span class="s">'Rate of Change'</span><span class="p">},</span>
    <span class="p">],</span>
    <span class="n">function_defaults</span><span class="o">=</span><span class="p">{</span><span class="s">'columns'</span><span class="p">:</span> <span class="n">sensor_columns</span><span class="p">},</span>
<span class="p">)</span>

<span class="n">dsk</span><span class="p">.</span><span class="n">pipeline</span><span class="p">.</span><span class="n">add_feature_selector</span><span class="p">([{</span><span class="s">'name'</span><span class="p">:</span><span class="s">'Tree-based Selection'</span><span class="p">,</span> <span class="s">'params'</span><span class="p">:{</span><span class="s">"number_of_features"</span><span class="p">:</span><span class="mi">12</span><span class="p">}},])</span>

<span class="n">dsk</span><span class="p">.</span><span class="n">pipeline</span><span class="p">.</span><span class="n">add_transform</span><span class="p">(</span><span class="s">"Min Max Scale"</span><span class="p">)</span> <span class="c1"># Scale the features to 1-byte
</span></code></pre></div></div>

<h3 id="tensorflow-model">TensorFlow model</h3>

<p>I use the TensorFlow Keras API to create a neural network. The model is kept very simple because not all Tensorflow functions and layers are available in the microcontroller version. I use a fully connected network to efficiently classify the gestures. It takes as input the 12-element feature vector created previously by the pipeline.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">layers</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>

<span class="n">tf_model</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">Sequential</span><span class="p">()</span>

<span class="n">tf_model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span><span class="n">kernel_regularizer</span><span class="o">=</span><span class="s">'l1'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">x_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],)))</span>
<span class="n">tf_model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.1</span><span class="p">))</span>
<span class="n">tf_model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">x_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],)))</span>
<span class="n">tf_model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.1</span><span class="p">))</span>
<span class="n">tf_model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="n">y_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">))</span>

<span class="c1"># Compile the model using a standard optimizer and loss function for classification
</span><span class="n">tf_model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s">'adam'</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
</code></pre></div></div>

<h3 id="training">Training</h3>

<p>Training is performed by feeding the dataset to the neural network in
batches. For each batch, a loss function is computed and the
weights of the network are adjusted. Each full pass through the entire
training set is called an epoch. In the following picture:</p>
<ul>
  <li>at the top left we can see the evolution of the loss function: it decreases, meaning that the model converges towards an optimal solution.</li>
  <li>at the bottom left we can see the evolution of the accuracy of the model, it increases!</li>
  <li>at the right we have the confusion matrix for the validation and train set.</li>
</ul>

<center>
<amp-img src="/assets/img/magic-wand/train-loss.png" width="1200" height="450" layout="intrinsic" alt="Model training performance"></amp-img>
<br /><i>Model training performance</i>
</center>
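<p>For reference, the training loop itself is a standard Keras <code>fit</code> call. The sketch below is self-contained: synthetic random data stands in for the real 12-element feature vectors and the three gesture classes, so only the mechanics are meaningful, not the metric values:</p>

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Synthetic stand-ins: 120 feature vectors of length 12, 3 gesture classes
x_train = np.random.rand(120, 12).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 3, 120), 3)

tf_model = tf.keras.Sequential([
    layers.Dense(12, activation="relu", input_shape=(12,)),
    layers.Dense(3, activation="softmax"),
])
tf_model.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy"])

# One epoch is one full pass over the training set; the loss is computed
# and the weights adjusted once per batch of 32 samples
history = tf_model.fit(x_train, y_train, epochs=5, batch_size=32,
                       validation_split=0.2, verbose=0)
```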

<p>The confusion matrix provides information not only about the accuracy but also
about the kind of errors of the model. It’s often the best way to understand
which classes are difficult to distinguish.</p>

<p>Once you are satisfied with the model results, it can be optimized using the
Tensorflow quantization tools. Quantization reduces the model size by
converting the network weights from 4-byte floating point values to 1-byte
integers. Tensorflow provides the following built-in tool:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Quantized Model
</span><span class="n">converter</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">lite</span><span class="p">.</span><span class="n">TFLiteConverter</span><span class="p">.</span><span class="n">from_keras_model</span><span class="p">(</span><span class="n">tf_model</span><span class="p">)</span>
<span class="n">converter</span><span class="p">.</span><span class="n">optimizations</span> <span class="o">=</span> <span class="p">[</span><span class="n">tf</span><span class="p">.</span><span class="n">lite</span><span class="p">.</span><span class="n">Optimize</span><span class="p">.</span><span class="n">OPTIMIZE_FOR_SIZE</span><span class="p">]</span>
<span class="n">converter</span><span class="p">.</span><span class="n">representative_dataset</span> <span class="o">=</span> <span class="n">representative_dataset_generator</span>
<span class="n">tflite_model_quant</span> <span class="o">=</span> <span class="n">converter</span><span class="p">.</span><span class="n">convert</span><span class="p">()</span>
</code></pre></div></div>
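<p>To build intuition for what the converter does, here is a minimal numpy sketch of affine integer quantization, mapping float weights to 1-byte values via a scale and zero point; this illustrates the idea only and is not the actual TFLite implementation:</p>

```python
import numpy as np

def quantize_affine(w, num_bits=8):
    """Map float weights to signed 8-bit integers with a scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(round(qmin - w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the quantized values."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.linspace(-1.0, 1.0, 16).astype(np.float32)
q, scale, zp = quantize_affine(w)
w_hat = dequantize(q, scale, zp)
# w_hat approximates w to within about one quantization step (scale)
```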

<p>Quantizing the model brings additional benefits on Cortex-M processors like
the one in the QuickFeather, which provide instructions that give integer arithmetic a performance boost.</p>

<p>The quantized model can be uploaded to SensiML in order to obtain a firmware to
flash to the QuickFeather. One can download the model using the Jupyter
notebook widget or from the <a href="https://app.sensiml.cloud/">SensiML cloud application</a>.
There are two available formats:</p>
<ul>
  <li><strong>binary</strong>: this can be flashed directly to the QuickFeather. The results
are transferred using serial output.</li>
  <li><strong>library</strong>: this is a <em>knowledgepack</em>, which can be compiled
with the <em>Qorc SDK</em>. This option offers more flexibility, because one can
modify the source code before compiling.</li>
</ul>

<h2 id="export-model-to-quickfeather">Export model to Quickfeather</h2>

<p>The knowledgepack can be customized in order to light the QuickFeather LED with
a different colour depending on the prediction made.
This can be done by adding the following function to the <em>src/sml_output.c</em> file.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// src/sml_output.c</span>
<span class="k">static</span> <span class="kt">intptr_t</span> <span class="n">last_output</span><span class="p">;</span>

<span class="kt">uint32_t</span> <span class="nf">sml_output_results</span><span class="p">(</span><span class="kt">uint16_t</span> <span class="n">model</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="n">classification</span><span class="p">)</span>
<span class="p">{</span>

    <span class="c1">//kb_get_feature_vector(model, recent_fv_result.feature_vector, &amp;recent_fv_result.fv_len);</span>

    <span class="cm">/* LIMIT output to 100hz */</span>

    <span class="k">if</span><span class="p">(</span> <span class="n">last_output</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">){</span>
        <span class="n">last_output</span> <span class="o">=</span> <span class="n">ql_lw_timer_start</span><span class="p">();</span>
    <span class="p">}</span>

    <span class="k">if</span><span class="p">(</span> <span class="n">ql_lw_timer_is_expired</span><span class="p">(</span> <span class="n">last_output</span><span class="p">,</span> <span class="mi">10</span> <span class="p">)</span> <span class="p">){</span>
        <span class="n">last_output</span> <span class="o">=</span> <span class="n">ql_lw_timer_start</span><span class="p">();</span>

        <span class="k">if</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">classification</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">HAL_GPIO_Write</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">HAL_GPIO_Write</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">classification</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">HAL_GPIO_Write</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">HAL_GPIO_Write</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">classification</span> <span class="o">==</span> <span class="mi">3</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">HAL_GPIO_Write</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">HAL_GPIO_Write</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="p">}</span>
    	<span class="n">sml_output_serial</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">classification</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, the model can be compiled using the <em>Qorc SDK</em> and flashed to the QuickFeather again.</p>

<h2 id="test-model-using-quickfeather">Test the model using the QuickFeather</h2>

<p>One can plug a Li-Po battery into the battery connector of the QuickFeather to make the board fully autonomous.
Then, with a nice spoon like the following, one can improvise a magic wand 🪄:</p>

<center>
<amp-img src="/assets/img/magic-wand/magic-wand.jpg" width="337" height="600" layout="intrinsic" alt="quickfeather as a magic wand"></amp-img>
<br /><i>QuickFeather as a magic wand</i>
</center>

<p>The following video shows the recognition system in action; the colours mean the following:</p>
<ul>
  <li><strong>red</strong> for O gesture</li>
  <li><strong>green</strong> for W gesture</li>
  <li><strong>blue</strong> for Z gesture</li>
</ul>

<center>
<amp-video width="360" height="640" src="/assets/img/magic-wand/magic-wand.mp4" poster="/assets/img/magic-wand/magic-wand.jpg" layout="intrinsic" controls="" loop="" autoplay="">
  <div fallback="">
    <p>Your browser doesn't support HTML5 video.</p>
  </div>
</amp-video>
</center>

<h2 id="conclusions">Conclusions</h2>

<p>The QuickFeather is a device well suited for tiny machine learning models.
This use case provides a simple example that demystifies the whole workflow for
implementing machine learning algorithms on microcontrollers, but it can be
extended to more complex use cases, like the one presented in the <a href="https://www.hackster.io/climate-change-challengers/hydroponic-agriculture-learning-with-sensiml-ai-framework-5289ea">Hackster.io
Climate Change
Challenge</a>.</p>

<p>SensiML provides nice tools to simplify machine learning
implementation on microcontrollers. Its software, such as the <a href="https://sensiml.com/services/toolkit/open-gateway/">Data Capture
Lab</a>, captures data and
also provides a labelling module. However, for this case I preferred Label
Studio, which is a more generic tool that works for most use cases.</p>

<p>The notebook with the complete details about the model training can be found in <a href="https://gist.github.com/cristianpb/4b86c5176d9d305aaa2974ce9c3c83f8">this gist</a>.</p>]]></content:encoded><description>This article aims to demystify the implementation of machine learning algorithms into microcontrollers. It uses runs a TensorflowLite model for gesture recognition in a QuickFeather microcontroller.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/magic-wand/main-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/magic-wand/main-16x9.jpg"/></item><item><title>TheFifthDriver: Machine learning driving assistance on FPGA</title><link>https://cristianpb.github.io/blog/fifth-drive-machine-learning-fpga</link><category>programming</category><category>fpga</category><category>python</category><category>opencv</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Mon, 7 Dec 2020 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/fifth-drive-machine-learning-fpga</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"/><description>FPGA implementation of a highly efficient real-time machine learning driving assistance application using a camera circuit.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/external-articles-responsive/the-fifth-driver-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/external-articles-responsive/the-fifth-driver-16x9.jpg"/></item><item><title>Hydroponic Agriculture Learning with SensiML AI 
Framework</title><link>https://cristianpb.github.io/blog/hydroponic-agriculture-learning</link><category>programming</category><category>fpga</category><category>sensiml</category><category>quickfeather</category><category>ecology</category><category>python</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Mon, 7 Dec 2020 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/hydroponic-agriculture-learning</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"/><description>New methodologies of horticulture based-on high-end technology are urgently required to transform the way in which the world is fed. In this project, we present the results of a hydroponic agriculture PoC, which was developed using Quicklogic's QuickFeather in conjuntion with SensiML to highlight the enormous benefits that the growth of crops without soil brings to the climate change.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/external-articles-responsive/hydroponic-agriculture-learning-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/external-articles-responsive/hydroponic-agriculture-learning-16x9.jpg"/></item><item><title>Writing notes with Vimwiki and Hugo static generator</title><link>https://cristianpb.github.io/blog/vimwiki-hugo</link><category>programming</category><category>vim</category><category>hugo</category><category>jekyll</category><category>github-actions</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Wed, 1 Jul 2020 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/vimwiki-hugo</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Taking notes is important when you want to remember things. I used to have
notebooks, which worked fine, but searching through them becomes complicated
once you have a lot of notes. As a die-hard Vim user, I decided to give
<a href="https://github.com/vimwiki/vimwiki">Vimwiki</a> a try to help me organize my notes and ideas.</p>

<p>Vimwiki makes it easy to create a personal wiki using the Vim text editor.
A wiki is a collection of text documents linked together and formatted with
plain text syntax that can be highlighted for readability using Vim’s
syntax highlighting feature.</p>

<p>The plain text notes can be exported to HTML, which improves readability. In
addition, it’s possible to connect external HTML converters like Jekyll or
Hugo.</p>

<p>In this post I will show the main functionalities of Vimwiki and how to
connect the Hugo fast markdown static generator.</p>

<div class="columns is-mobile is-multiline is-horizontal-center">
  <div class="column is-6-desktop is-12-mobile">
    <amp-image-lightbox id="lightbox1" layout="nodisplay"></amp-image-lightbox>
    <amp-img on="tap:lightbox1" role="button" tabindex="0" aria-describedby="vim1" alt="Markdown writing in vim" title="Markdown writing in vim" src="/assets/img/vimwiki-hugo/main.png" layout="responsive" width="737" height="697"></amp-img>
    <div id="vim1">
      <p>Vimwiki notes writing in Vim</p>
    </div>
  </div>
  <div class="column is-6-desktop is-12-mobile">
    <amp-img on="tap:lightbox1" role="button" tabindex="0" aria-describedby="markdown1" alt="markdown html output" title="markdown html output" src="/assets/img/vimwiki-hugo/markdown.png" layout="responsive" width="650" height="633"></amp-img>
    <div id="markdown1">
      <p>Markdown notes converted into HTML</p>
    </div>
  </div>
</div>

<h2 id="vimwiki">Vimwiki</h2>

<p>With Vimwiki you can:</p>
<ul>
  <li>organize your notes and ideas in files that are linked together;</li>
  <li>manage todo lists;</li>
  <li>maintain a diary, writing notes for every day;</li>
  <li>write documentation in simple markdown files;</li>
  <li>export your documents to HTML.</li>
</ul>

<h3 id="vim-shortcuts">Vim shortcuts</h3>

<p>One of the main advantages of Vim is the fact that it’s a modal editor,
which means that it has different editing modes.
Each editing mode gives each key a different function.
This multiplies the number of available shortcuts without requiring multi-key combinations. In Vimwiki this allows you to write notes with ease.</p>

<p>When I want to write some notes, I just open Vim and use
<code class="language-plaintext highlighter-rouge">&lt;Leader&gt;w&lt;Leader&gt;w</code> to create a new note for today, named after the
current date. <code class="language-plaintext highlighter-rouge">&lt;Leader&gt;</code> is a key that can be configured in Vim; in my case it’s the <em>comma</em> character (,).</p>

<p>If I want to look at my notes I can use <code class="language-plaintext highlighter-rouge">&lt;Leader&gt;ww</code> to open the wiki index
file. I can use the Enter key to follow links in the index, and Backspace acts as a return
to the previous page.</p>

<p>I use <a href="https://github.com/neoclide/coc-snippets">CoC snippets</a> to improve autocompletion. In markdown, I find this plugin very useful to create tables, code blocks and links. You can use snippets for almost every programming language, just take a look at the documentation.</p>

<p>When I want to preview the markdown file, I use <code class="language-plaintext highlighter-rouge">&lt;Leader&gt;wh</code> to convert the current
wiki page to HTML, and I also added a shortcut to open the HTML in the browser.</p>
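<p>As an illustration, a mapping along these lines could serve as that browser shortcut (the <code class="language-plaintext highlighter-rouge">&lt;Leader&gt;wb</code> key, the <code class="language-plaintext highlighter-rouge">_site</code> output directory and the use of <code class="language-plaintext highlighter-rouge">xdg-open</code> are my assumptions here, not Vimwiki defaults):</p>

```vim
" Hypothetical mapping: open the HTML export of the current note
" in the default browser (assumes Linux's xdg-open and the _site
" directory configured as path_html in g:vimwiki_list)
nmap <Leader>wb :silent !xdg-open ~/Documents/vimwiki/_site/%:t:r.html &<CR>
```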

<p>In the following video you can see an example of this workflow in action.</p>

<amp-video width="1024" height="610" src="/assets/img/vimwiki-hugo/video.mp4" poster="/assets/img/vimwiki-hugo/main.png" layout="responsive" controls="" loop="" autoplay="">
  <div fallback="">
    <p>Your browser doesn't support HTML5 video.</p>
  </div>
</amp-video>

<h3 id="searching-in-your-notes">Searching in your notes</h3>

<p>One of the advantages of digital notes is that you can quickly search across multiple files using queries.</p>

<p>Vimwiki comes with a <em>VimwikiSearch</em> command (<code class="language-plaintext highlighter-rouge">VWS</code>), which is simply a wrapper
around the Unix <em>grep</em> command. It can search for patterns in case-insensitive mode across all your notes.</p>
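<p>The same search can be reproduced from a plain shell, which helps to understand what <code class="language-plaintext highlighter-rouge">VWS</code> does under the hood. A minimal sketch, using a throwaway wiki directory:</p>

```shell
# Build a tiny throwaway wiki and search it the way :VWS does:
# a recursive, case-insensitive grep over every note.
wiki=$(mktemp -d)
printf '# Meeting\nDiscussed the Hugo migration\n' > "$wiki/2020-07-01.md"
printf '# Ideas\nTry vimwiki tags\n' > "$wiki/ideas.md"

# -r recurse, -i ignore case, -n show line numbers
grep -rin "hugo" "$wiki"
```

Inside Vim, the equivalent would be <code class="language-plaintext highlighter-rouge">:VWS /hugo/</code>, with the matches landing in the location list instead of standard output.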

<p>An excellent way to implement labels and contexts for cross-correlating
information is to assign tags to headlines. If you add tags to your Vimwiki
notes, you can also use a <em>VimwikiSearchTags</em> command.</p>

<p>In both cases, the search results populate
the location list, which you can navigate using the Vim commands <code class="language-plaintext highlighter-rouge">lopen</code> to open the list, <code class="language-plaintext highlighter-rouge">lnext</code> to go to the next occurrence and <code class="language-plaintext highlighter-rouge">lprevious</code> for the previous occurrence.</p>

<h2 id="hugo">Hugo</h2>

<p>Vimwiki has a custom filetype called <em>wiki</em>, which is a little different
from markdown. The native vimwiki2html command only works for the <em>wiki</em>
filetype. If you want to transform files of other filetypes, like markdown, into HTML, you have to use a custom parser. Even though I can’t use Vimwiki’s native parser, I prefer the markdown format because it’s very popular and simple.</p>

<p>These are some options to use as an external markdown parser:</p>
<ul>
  <li><em>Pandoc</em>, which I think works pretty well, but requires a lot of Haskell dependencies.</li>
  <li>The <a href="https://github.com/WnP/vimwiki_markdown/">Python vimwiki markdown</a> library, which I think has a lot of potential.</li>
  <li>Static website generators like Jekyll, Hugo or Hexo.</li>
</ul>

<p>I started using static website generators because they also make it easy to publish the notes as static webpages, which I wanted to host on Github Pages.</p>

<p>My first option was Jekyll, which is the static website generator natively supported by Github. It’s easy to use and the syntax is very straightforward, but I started to regret it once I accumulated a lot of notes. Then I decided to use Hugo, which is claimed to be faster and, since it’s written in Go, has no dependencies. The following table shows my compile-time results for both:</p>

<table>
  <thead>
    <tr>
      <th>Compiling time</th>
      <th>Jekyll</th>
      <th>Hugo</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>10 pages</td>
      <td>0.1 seconds</td>
      <td>0.1 seconds</td>
    </tr>
    <tr>
      <td>150 pages</td>
      <td>48 seconds</td>
      <td>0.5 seconds</td>
    </tr>
  </tbody>
</table>

<p>I should say that I used the Jekyll Github gems, which include some unnecessary
Ruby gems, so I think Jekyll’s performance could be improved. It’s nice software that I use to publish this post, but Hugo is still faster.</p>

<h3 id="building-vimiki-with-hugo">Building Vimwiki with Hugo</h3>

<p>The <code class="language-plaintext highlighter-rouge">.vimrc</code> file contains the Vim configuration and is the place where one can
define the Vimwiki syntax and the writing directory. As
you can see in my configuration, I use markdown syntax and save my files under
<code class="language-plaintext highlighter-rouge">~/Documents/vimwiki/</code>.</p>

<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">" ~/.vimrc</span>

<span class="k">let</span> <span class="nv">g:vimwiki_list</span> <span class="p">=</span> <span class="p">[{</span>
<span class="se">  \</span> <span class="s1">'auto_export'</span><span class="p">:</span> <span class="m">1</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'automatic_nested_syntaxes'</span><span class="p">:</span><span class="m">1</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'path_html'</span><span class="p">:</span> <span class="s1">'$HOME/Documents/vimwiki/_site'</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'path'</span><span class="p">:</span> <span class="s1">'$HOME/Documents/vimwiki/content'</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'template_path'</span><span class="p">:</span> <span class="s1">'$HOME/Documents/vimwiki/templates/'</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'syntax'</span><span class="p">:</span> <span class="s1">'markdown'</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'ext'</span><span class="p">:</span><span class="s1">'.md'</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'template_default'</span><span class="p">:</span><span class="s1">'markdown'</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'custom_wiki2html'</span><span class="p">:</span> <span class="s1">'$HOME/.dotfiles/wiki2html.sh'</span><span class="p">,</span>
<span class="se">  \</span> <span class="s1">'template_ext'</span><span class="p">:</span><span class="s1">'.html'</span>
<span class="se">\</span><span class="p">}]</span>
</code></pre></div></div>

<p>The custom wiki2html entry corresponds to a script that is executed to transform
markdown into HTML. This script calls the Hugo executable and tells Hugo to
use the Vimwiki file path as the base URL in order to keep links working.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># ~/.dotfiles/wiki2html.sh</span>

<span class="nb">env </span><span class="nv">HUGO_baseURL</span><span class="o">=</span><span class="s2">"file:///home/</span><span class="k">${</span><span class="nv">USER</span><span class="k">}</span><span class="s2">/Documents/vimwiki/_site/"</span> <span class="se">\</span>
    hugo <span class="nt">--themesDir</span> ~/Documents/ <span class="nt">-t</span> vimwiki <span class="se">\</span>
    <span class="nt">--config</span> ~/Documents/vimwiki/config.toml <span class="se">\</span>
    <span class="nt">--contentDir</span> ~/Documents/vimwiki/content <span class="se">\</span>
    <span class="nt">-d</span> ~/Documents/vimwiki/_site <span class="nt">--quiet</span> <span class="o">&gt;</span> /dev/null
</code></pre></div></div>

<p>The complete version of my <code class="language-plaintext highlighter-rouge">~/.vimrc</code> can be found in <a href="https://github.com/cristianpb/dotfiles">my dotfiles repository</a>.</p>

<h3 id="deploy-vimwiki-to-github-pages">Deploy Vimwiki to Github Pages</h3>

<p>Hugo projects can be easily published to Github using Github Actions. The
following script tells the GitHub worker to build the HTML with Hugo on each push
and publish the HTML files to Github Pages.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">🚀 Publish Github Pages</span>

<span class="na">on</span><span class="pi">:</span> <span class="s">push</span>
<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">deploy</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Git checkout</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v2</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Setup hugo</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">peaceiris/actions-hugo@v2</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">hugo-version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">latest'</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">hugo --config config-gh.toml</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">🚀 Deploy to GitHub pages</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">peaceiris/actions-gh-pages@v3</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">deploy_key</span><span class="pi">:</span> <span class="s">${{ secrets.ACTIONS_DEPLOY_KEY }}</span>
          <span class="na">publish_branch</span><span class="pi">:</span> <span class="s">gh-pages</span>
          <span class="na">publish_dir</span><span class="pi">:</span> <span class="s">./public</span>
          <span class="na">force_orphan</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<p>I like having part of my notes published on Github Pages, at least the
configuration notes, which can be found on <a href="https://cristianpb.github.io/vimwiki">my Github
page</a>. But there is also a part of my notes
that I keep private, for example my diary notes, where I may have some sensitive
information, so I keep them away from publication simply by adding them to the
<code class="language-plaintext highlighter-rouge">.gitignore</code> file. Here you can find <a href="https://github.com/cristianpb/vimwiki">my Github notes
repository</a>.</p>
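<p>As a sketch, assuming the diary lives in the default <code class="language-plaintext highlighter-rouge">diary</code> subdirectory of the content folder (my actual layout may differ), the <code class="language-plaintext highlighter-rouge">.gitignore</code> could be as simple as:</p>

```
# keep private diary notes and the generated site out of the repository
content/diary/
_site/
```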

<h2 id="conclusion">Conclusion</h2>

<p>I like the fact that Hugo has no dependencies since it’s written in Go, so
it’s very easy to install: just download it from the github project releases
page. In addition, it’s a blazing-fast static website converter; you can find
benchmarks on the internet.</p>

<p>I have been using Vimwiki very often. It lets me take notes very easily
and also find information about things that happened in the past. When people ask
about last month’s meeting, I can easily find what I wrote by
searching by dates, tags or just words.</p>

<p>Publishing my notes to github gives me a place where I can keep
track of my vimwiki configuration and also publish simple notes that are not
meant to be a blog post, like my Arch Linux install notes or my text editor
configuration.</p>]]></content:encoded><description>Vim is a simple and ubiquitous text editor that I use daily. In this article I show how to use Vim to take and publish diary notes using Vimwiki and Hugo.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/vimwiki-hugo/main-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/vimwiki-hugo/main-16x9.jpg"/></item><item><title>Reverse proxy management with Traefik and GoAccess</title><link>https://cristianpb.github.io/blog/traefik-goaccess</link><category>system management</category><category>traefik</category><category>goaccess</category><category>docker</category><author>noemail@noemail.org (Cristian Brokate)</author><pubDate>Sun, 7 Jun 2020 00:00:00 GMT</pubDate><guid isPermaLink="false">https://cristianpb.github.io/blog/traefik-goaccess</guid><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The microservices philosophy consists in dividing applications into simple components.
This approach increases maintainability: since each application is not
coupled to the others, it can be tested, replaced and deployed independently.
It has become popular since the adoption of Docker container technology.</p>

<p>A micro service project typically includes multiple docker containers, where
each container includes a separated functionality.</p>

<p>These containers can communicate in a private network and map ports with
the external network in order to expose services.</p>

<p>However, not every service includes a security layer, so it’s better to expose
a single application that serves as a router, which controls every incoming request
and sends it to the right service.
This avoids exposing a sensitive service, such as a database, directly.
Some applications that can act as such a router are Nginx, Apache server,
Caddy and Traefik.</p>

<ul>
  <li>Apache server is the oldest one, but it has been losing followers since the
arrival of Nginx.</li>
  <li>Nginx is a very popular and powerful web server, which can be adapted to many
kinds of situations.</li>
  <li>Traefik is the new kid on the block. It has native docker
support, which means that you don’t have to define custom routing
configurations by hand: it connects directly to the docker socket to
automatically detect changes in containers.</li>
</ul>

<p>In this article I will show how to set up Traefik using file system
configuration, and also how to implement offline metric analysis using the GoAccess
tool.</p>

<h2 id="traefik">Traefik</h2>

<p>Traefik is a reverse proxy, which routes incoming requests to microservices.
It has been conceived for environments with multiple microservices, where a main
configuration sets up Traefik, which then dynamically detects
new services coming from docker, kubernetes, rancher or a plain file system.
More information about <a href="https://docs.traefik.io/providers/overview/">traefik automatic discovery is available here</a>.</p>
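<p>To illustrate the plain-file provider, a dynamic configuration file along these lines declares one router and its backing service (the host name, entrypoint name and backend URL below are placeholders, not values from my setup):</p>

```yaml
# dynamic.yml - watched by Traefik's file provider
http:
  routers:
    myapp:
      rule: "Host(`myapp.example.com`)"
      entryPoints:
        - websecure
      service: myapp
  services:
    myapp:
      loadBalancer:
        servers:
          - url: "http://127.0.0.1:3000"
```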

<p>This automatic discovery behaviour was the main thing that attracted me to
Traefik, unlike Nginx, which refuses to start if a declared service is not available.
Traefik, on the other hand, can run even if a declared service is down, and
when the container starts it is automatically detected by Traefik.</p>

<h3 id="traefik-api">Traefik API</h3>

<p>Traefik has a modern web interface to graphically inspect the configuration.
It shows information about:</p>
<ul>
  <li>the entrypoints, which are the ports that Traefik is listening on;</li>
  <li>the running services; and</li>
  <li>the routing rules, which define how to direct incoming requests.</li>
</ul>

<p>The Traefik interface can be easily enabled in the configuration file. The following
lines tell Traefik to serve the interface on the Traefik entrypoint (8080 by
default). The debug option is useful for profiling performance and debugging Traefik.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">api</span><span class="pi">:</span>
  <span class="na">insecure</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">dashboard</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">debug</span><span class="pi">:</span> <span class="no">true</span> 
</code></pre></div></div>

<p>Here is a screenshot of the web interface, where one can see how a service is
configured.</p>

<center>
<amp-img src="/assets/img/traefik-goaccess/traefik.png" width="640" height="400" layout="intrinsic" alt="goaccess"></amp-img>
<br /><i>Traefik API web interface. Mopidy service is encrypted using TLS.</i>
</center>

<p>In the following gist you can find the complete configuration file for Traefik.
The basic parameters to define are the entrypoints, on which Traefik should be
listening, and the encryption method.
The providers configuration can be done in another plain file, or by adding
labels to docker, kubernetes, rancher, etc. In either case Traefik dynamically detects
changes in the providers.</p>

<amp-gist data-gistid="1d77f178884569da6a3b904ef867a30a" data-file="traefik.yml" layout="fixed-height" height="1123">
</amp-gist>

<p>This configuration can be written in a plain file when Traefik runs outside a docker
container, but it can also be done by setting labels on the Traefik docker
container.</p>

<h3 id="traefik-security">Traefik security</h3>

<p>By default Traefik will watch for all containers running on the Docker daemon,
and attempt to automatically configure routes and services for each container.
If you’d like to have more refined control, you can pass the
<code class="language-plaintext highlighter-rouge">--providers.docker.exposedByDefault=false</code> option and selectively enable
routing for your containers by adding a <code class="language-plaintext highlighter-rouge">traefik.enable=true</code> label.</p>
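<p>For example, with <code class="language-plaintext highlighter-rouge">exposedByDefault</code> disabled, a container opts in through its labels; a docker-compose sketch (the service name and host rule are placeholders) might look like:</p>

```yaml
services:
  whoami:
    image: traefik/whoami
    labels:
      # opt this container in to Traefik routing
      - "traefik.enable=true"
      - "traefik.http.routers.whoami.rule=Host(`whoami.example.com`)"
```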

<p>Regarding HTTPS security, SSL connections can be easily configured in Traefik:
one can use a self-signed certificate or connect automatically to <a href="https://letsencrypt.org/">Let’s
Encrypt</a> in order to get an SSL certificate, whose
renewal is also handled by Traefik. HTTPS redirection is likewise
available through Traefik’s parameters.</p>
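<p>As a sketch of the static configuration involved (the resolver name, e-mail address and entrypoint below are placeholders; check the Traefik ACME documentation for your version):</p>

```yaml
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@example.com   # placeholder contact address
      storage: acme.json         # where certificates are persisted
      httpChallenge:
        entryPoint: web
```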

<h3 id="traefik-as-a-service">Traefik as a service</h3>

<p>Traefik has been conceived to run as a docker container, but since it’s written
in Go, it’s also possible to run the compiled version as a standalone binary on
several operating systems.</p>

<p>In the docker version, Traefik runs automatically when the container is powered
on and the logs go to the standard output.
However, if you run the standalone binary, you have to configure Traefik as a
system service. I used the excellent information from <a href="https://gist.github.com/ubergesundheit/7c9d875befc2d7bfd0bf43d8b3862d85">this Gerald Pape
gist</a>
to configure the Traefik service.</p>

<p>I prefer the standalone version in development environments like the Raspberry Pi
or Jetson Nano, where building docker images can take a while.</p>

<h2 id="goaccess-to-monitor-logs">GoAccess to monitor logs</h2>

<p>GoAccess is a simple tool for analysing logs. It provides fast and valuable HTTP
statistics for system administrators who need a visual server report on the
fly. It can generate reports in terminal format, which is nice if you are
connected over SSH, but it can also generate CSV, JSON or HTML reports.</p>

<p>One alternative to this kind of service is Matomo, which has the advantage of being
self-hostable and open source. With it you can be sure about how your collected
data is used and that it is not being sold to third parties and advertisers.
However, Matomo requires an extra client-side javascript library
in order to collect data, which is another dependency that I don’t want for
internal offline environments.</p>

<p>Another popular alternative is Google Analytics, which has very powerful
reports and a multitude of options that go beyond the scope of this article. The
only problem is that it’s not privacy compliant.</p>

<p>What makes GoAccess interesting is that it generates detailed analytics based
purely on the access logs of a web server, such as Apache, Nginx or, in my case,
Traefik. It’s written in C, and features both a terminal interface and
a web interface. The way it’s designed to be used is by piping the <em>access.log</em>
contents into the GoAccess binary and providing any number of switches to
customize the output, such as which log format you’re sending it or
how to parse geolocation from IP addresses.</p>

<p>In the following image you can see an example for GoAccess HTML dashboard. On
the top there is global information about the number of total requests, the
number of unique visitors, the log size, the bandwidth, etc.</p>

<center>
<amp-img src="/assets/img/traefik-goaccess/goaccess.png" width="640" height="400" layout="intrinsic" alt="goaccess"></amp-img>
<br /><i>GoAccess HTML dashboard</i>
</center>

<h3 id="real-time-dashboard">Real-time dashboard</h3>

<p>GoAccess can be called from the command line; you can configure the log format
using a command line parameter or a configuration file. The default
configuration file can be found at <code class="language-plaintext highlighter-rouge">/etc/goaccess.conf</code>, but you can also pass
another configuration file using the <code class="language-plaintext highlighter-rouge">--config-file</code> option.</p>

<p>The default output goes to the command line, but one can produce an <em>html</em>
report by specifying an output file. This option creates a static html report,
which can be continuously updated using the <code class="language-plaintext highlighter-rouge">--real-time-html</code> option.</p>

<p>The following code shows the systemctl file that I use to configure GoAccess as
a service for real time use.</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[Unit]</span>
<span class="py">Description</span><span class="p">=</span><span class="err">Goaccess</span> <span class="err">Web</span> <span class="err">log</span> <span class="err">report.</span>
<span class="py">After</span><span class="p">=</span><span class="err">network.target</span>

<span class="nn">[Service]</span>
<span class="py">Type</span><span class="p">=</span><span class="err">simple</span>
<span class="py">User</span><span class="p">=</span><span class="err">root</span>
<span class="py">Group</span><span class="p">=</span><span class="err">root</span>
<span class="py">Restart</span><span class="p">=</span><span class="err">always</span>
<span class="py">ExecStart</span><span class="p">=</span><span class="err">/usr/bin/goaccess</span> <span class="err">-a</span> <span class="err">-g</span> <span class="err">-f</span> <span class="err">/var/log/traefik/access.log</span> <span class="err">-o</span> <span class="err">/var/www/html/report.html</span> <span class="err">--real-time-html</span>
<span class="py">StandardOutput</span><span class="p">=</span><span class="err">null</span>
<span class="py">StandardError</span><span class="p">=</span><span class="err">null</span>

<span class="nn">[Install]</span>
<span class="py">WantedBy</span><span class="p">=</span><span class="err">multi-user.target</span>
</code></pre></div></div>
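<p>Assuming the unit above is saved as <code class="language-plaintext highlighter-rouge">/etc/systemd/system/goaccess.service</code> (the
service name is your choice), it can be registered and started with the usual
systemd commands:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Make systemd pick up the new unit file
sudo systemctl daemon-reload

# Start the service now and enable it at boot
sudo systemctl enable --now goaccess

# Check that GoAccess is running and tailing the log
systemctl status goaccess
</code></pre></div></div>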

<p>GoAccess doesn’t include a static web server, so it cannot expose the produced
<em>html</em> by itself. However, one can easily configure an Nginx static server to
expose the static files, as shown in the following Nginx virtual server:</p>

<div class="language-nginx highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">server</span> <span class="p">{</span>
    <span class="kn">listen</span> <span class="mi">8082</span><span class="p">;</span>
    <span class="kn">listen</span> <span class="s">[::]:8082</span><span class="p">;</span>
    <span class="kn">server_name</span>  <span class="s">localhost</span><span class="p">;</span>

    <span class="kn">gzip</span> <span class="no">on</span><span class="p">;</span>
    <span class="kn">gzip_types</span>      <span class="nc">text/plain</span> <span class="nc">application/xml</span> <span class="nc">image/jpeg</span><span class="p">;</span>
    <span class="kn">gzip_proxied</span>    <span class="s">no-cache</span> <span class="s">no-store</span> <span class="s">private</span> <span class="s">expired</span> <span class="s">auth</span><span class="p">;</span>
    <span class="kn">gzip_min_length</span> <span class="mi">1000</span><span class="p">;</span>

    <span class="kn">root</span> <span class="n">/var/www/html</span><span class="p">;</span>

    <span class="c1"># Add index.php to the list if you are using PHP</span>
    <span class="kn">index</span> <span class="s">index.html</span> <span class="s">index.htm</span> <span class="s">index.nginx-debian.html</span><span class="p">;</span>

    <span class="kn">location</span> <span class="n">/</span> <span class="p">{</span>
            <span class="kn">try_files</span> <span class="nv">$uri</span> <span class="nv">$uri</span><span class="n">/</span> <span class="p">=</span><span class="mi">404</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
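<p>After saving this virtual server (for example as
<code class="language-plaintext highlighter-rouge">/etc/nginx/sites-available/goaccess</code>, a name chosen here for illustration), you
can enable it, validate the configuration, and reload Nginx:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Enable the site on Debian-style layouts
sudo ln -s /etc/nginx/sites-available/goaccess /etc/nginx/sites-enabled/

# Check the configuration syntax before reloading
sudo nginx -t

# Apply the change without dropping connections
sudo systemctl reload nginx
</code></pre></div></div>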

<h2 id="conclusion">Conclusion</h2>

<p>Traefik is a modern reverse proxy which is well adapted for dynamic
configurations. Even if it is still a young project and not as performant as
Nginx, it has an interesting approach and some nice features.
For example, with Docker applications it automatically discovers the internal
IP address of a service and routes incoming requests to it.</p>

<p>GoAccess is a very good tool for gaining insights from logs in a closed
environment where you cannot share your stats with external services. Since it
is written in C, its parsing performance is very good: it can process
400 million hits in 1 hour and 20 minutes, according to the <a href="https://goaccess.io/faq">GoAccess
FAQ</a>.</p>]]></content:encoded><description>Traefik is a modern and dynamic reverse proxy with native support for docker containers. This article compares Traefik with existing solutions and shows how to set up a privacy compliant monitoring tool with GoAccess.</description><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://cristianpb.github.io/assets/img/traefik-goaccess/main-16x9.jpg"/><media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" url="https://cristianpb.github.io/assets/img/traefik-goaccess/main-16x9.jpg"/></item></channel></rss>