<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Site-Server v6.0.0-30097-30097 (http://www.squarespace.com) on Sun, 29 Aug 2021 01:45:15 GMT
--><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://www.rssboard.org/media-rss" version="2.0"><channel><title>Thoughts on Analytics - Conaxon</title><link>https://www.conaxon.org/projects/</link><lastBuildDate>Sun, 29 Aug 2021 00:22:28 +0000</lastBuildDate><language>en-US</language><generator>Site-Server v6.0.0-30097-30097 (http://www.squarespace.com)</generator><description><![CDATA[]]></description><item><title>How to: Parse Android Logs for Analytics and Machine Learning Applications</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Sun, 29 Aug 2021 00:53:22 +0000</pubDate><link>https://www.conaxon.org/projects/how-to-parse-android-logs-for-analytics-and-machine-learning-applications</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:612a9869931e9073047295cd</guid><description><![CDATA[“Where are the logs? I can’t do anything if there aren’t logs!”

A software engineer’s most useful debugging tool is the analysis 
of logs. Logs are the ledger that keeps track of the state of a 
system at any given time. This historical record must be 
analyzed properly to get the most out of performance monitoring, for 
simple and complex systems alike. In this post, Conaxon goes over how to parse 
Android LogCat Logs for use in analytics and machine learning applications.]]></description><content:encoded><![CDATA[<h2>Introduction: What are Logs?</h2><p class="">Building Android-based apps, or any software for that matter, eventually means having to understand why a bug is occurring. Bugs are a natural part of software development. Logs are a key tool for understanding the state of your software at the time an issue happens. Think of logs as a ledger of what is happening while the code is running. Engineers can print almost anything to the logs that might help them understand problems that pop up in the future.</p><p class="">Because logs are often structured, contain a ton of useful data, are easy to acquire, and are key to software development, they are ripe for sophisticated analysis and maybe even machine learning. There are lots of tools for log analytics, such as Scalyr, Logz.io, Sematext, GrayLog, Nagios, and many others (<a href="https://opensource.com/article/19/4/log-analysis-tools">https://opensource.com/article/19/4/log-analysis-tools</a>). In many cases, an open-source, pre-built tool will work in a pinch and be pretty reliable when a mission-critical bug plagues the backlog. However, it can be useful to have a way of building your own customized solution. </p><h2>Android LogCat Logs:</h2><p class="">The structure of the Android logs is as follows:</p>








  

    
  
    

      

      
<figure class="sqs-block-image-figure intrinsic">
  <a class="sqs-block-image-link" href="https://budhdisharma.medium.com/android-log-analysis-176f9b9dafaf">
    <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1630191424994-1KYYA7OUJTB8ET9MC4WY/1_OKc8XLG9VIuDVMn3EZhdqg.png" data-image-dimensions="515x399" data-image-focal-point="0.5,0.5" alt="1_OKc8XLG9VIuDVMn3EZhdqg.png" data-load="false" data-image-id="612abf405d3d1a3a1416f799" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1630191424994-1KYYA7OUJTB8ET9MC4WY/1_OKc8XLG9VIuDVMn3EZhdqg.png?format=1000w" />
  </a>
</figure>
      

    
  


  


<p class="">The main files that can be analyzed are the radio, main, event, and system logs. Each log file captures different characteristics of the system at any given time.</p><p class="">Each message in the log consists of the following elements:</p><ul data-rte-list="default"><li><p class="">A tag indicating the part of the system or application that the message came from</p></li><li><p class="">A timestamp (the time at which the message was logged)</p></li><li><p class="">The message log level (the priority of the event represented by the message)</p></li><li><p class="">The log message itself (a detailed description of the error, exception, or information)</p></li></ul><p class="">There are a few different log types:</p><p class="">Application log - </p><ul data-rte-list="default"><li><p class="">Applications use the <em>android.util.Log</em> class methods to write messages of different priorities to the log</p></li><li><p class="">Java classes declare their tag statically as a string and can be many layers deep</p></li></ul><p class="">System log - </p><ul data-rte-list="default"><li><p class="">Written with the <em>android.util.Slog</em> class</p></li><li><p class="">Many frameworks use the system log to separate certain messages from a potentially messy application log</p></li></ul><p class="">Event log - </p><ul data-rte-list="default"><li><p class="">Event log messages are created using the <em>android.util.EventLog</em> class</p></li><li><p class="">Log entries consist of binary tag codes followed by binary parameters</p></li><li><p class="">The message tag codes are stored on the system at: <em>/system/etc/event-log-tags</em></p></li></ul><p class="">Radio log - </p><ul data-rte-list="default"><li><p class="">Used for radio and phone (modem) related information</p></li><li><p class="">Log entries consist of a binary tag code and a message carrying network information</p></li></ul><h2>Android Log Structure:</h2><blockquote><p class="">tv_sec tv_nsec priority pid tid tag messageLen 
Message</p></blockquote><ul data-rte-list="default"><li><p class="">tag: the log tag</p></li><li><p class="">tv_sec &amp; tv_nsec: the timestamp of the log message</p><ul data-rte-list="default"><li><p class="">In the text logs parsed below, this appears as a date and a time (down to the millisecond)</p></li></ul></li><li><p class="">pid: process ID</p></li><li><p class="">tid: thread ID</p></li><li><p class="">priority: one of the following character values:</p><ul data-rte-list="default"><li><p class="">V: Verbose (lowest priority)</p></li><li><p class="">D: Debug</p></li><li><p class="">I: Info</p></li><li><p class="">W: Warning</p></li><li><p class="">E: Error</p></li><li><p class="">F: Fatal</p></li><li><p class="">S: Silent (highest priority, on which nothing is ever printed)</p></li></ul></li></ul><h2>Code for Parsing:</h2><p class="">Parsing the files is fairly straightforward, especially because the fields in the text files are delimited by simple whitespace.</p><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">pandas</span> <span class="cm-keyword">as</span> <span class="cm-def">pd</span>
<span class="cm-keyword">import</span> <span class="cm-def">numpy</span> <span class="cm-keyword">as</span> <span class="cm-def">np</span>
<span class="cm-keyword">import</span> <span class="cm-def">seaborn</span> <span class="cm-keyword">as</span> <span class="cm-def">sns</span>
<span class="cm-keyword">import</span> <span class="cm-def">re</span>
<span class="cm-keyword">import</span> <span class="cm-def">os</span>, <span class="cm-def">zipfile</span>
<span class="cm-keyword">import</span> <span class="cm-def">gzip</span>
<span class="cm-keyword">import</span> <span class="cm-def">shutil</span>
<span class="cm-keyword">import</span> <span class="cm-def">datetime</span>
<span class="cm-keyword">import</span> <span class="cm-def">matplotlib</span>.<span class="cm-variable">pyplot</span> <span class="cm-keyword">as</span> <span class="cm-def">plt</span></pre><p class="">After importing the key libraries, check the working directory and assign it to a variable. This allows the script to be placed in the same directory as the log files:</p><pre class="source-code"><span class="cm-error"># define the current working directory as a variable for extracting all the log files that ar</span>
<span class="cm-variable">cwd</span> <span class="cm-operator">=</span> <span class="cm-variable">os</span>.<span class="cm-property">getcwd</span>()
<span class="cm-error"># define the search path for the rest of the script to reference</span>
<span class="cm-variable">search_path</span> <span class="cm-operator">=</span> <span class="cm-variable">os</span>.<span class="cm-property">getcwd</span>()
<span class="cm-error">#print</span>
<span class="cm-variable">print</span>(<span class="cm-variable">cwd</span>)</pre><p class="">The cwd should be the folder where the log files are located. Next, we define a function, used later, that pads the lists so they all have the same length. Then we decompress the log files so everything ends up as a text file:</p><pre class="source-code"><span class="cm-error"># Function to make the array lengths the same later</span>
<span class="cm-variable">def</span> <span class="cm-variable">pad_dict_list</span>(<span class="cm-variable">dict_list</span>, <span class="cm-variable">padel</span>):
    <span class="cm-variable">lmax</span> <span class="cm-operator">=</span> <span class="cm-number">0</span>
    <span class="cm-keyword">for</span> <span class="cm-variable">lname</span> <span class="cm-keyword">in</span> <span class="cm-variable">dict_list</span>.<span class="cm-property">keys</span>():
        <span class="cm-variable">lmax</span> <span class="cm-operator">=</span> <span class="cm-variable">max</span>(<span class="cm-variable">lmax</span>, <span class="cm-variable">len</span>(<span class="cm-variable">dict_list</span>[<span class="cm-variable">lname</span>]))
    <span class="cm-keyword">for</span> <span class="cm-variable">lname</span> <span class="cm-keyword">in</span> <span class="cm-variable">dict_list</span>.<span class="cm-property">keys</span>():
        <span class="cm-variable">ll</span> <span class="cm-operator">=</span> <span class="cm-variable">len</span>(<span class="cm-variable">dict_list</span>[<span class="cm-variable">lname</span>])
        <span class="cm-keyword">if</span>  <span class="cm-variable">ll</span> <span class="cm-operator">&lt;</span> <span class="cm-variable">lmax</span>:
            <span class="cm-variable">dict_list</span>[<span class="cm-variable">lname</span>] <span class="cm-operator">+=</span> [<span class="cm-variable">padel</span>] <span class="cm-operator">*</span> (<span class="cm-variable">lmax</span> <span class="cm-operator">-</span> <span class="cm-variable">ll</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">dict_list</span>

<span class="cm-variable">file_type</span> <span class="cm-operator">=</span> <span class="cm-string">".gz"</span>
<span class="cm-keyword">for</span> <span class="cm-variable">fname</span> <span class="cm-keyword">in</span> <span class="cm-variable">os</span>.<span class="cm-property">listdir</span>(<span class="cm-variable">path</span><span class="cm-operator">=</span><span class="cm-variable">search_path</span>):
    <span class="cm-keyword">if</span> <span class="cm-variable">fname</span>.<span class="cm-property">endswith</span>(<span class="cm-variable">file_type</span>):
        <span class="cm-error"># build the full path so this also works when search_path is not the cwd</span>
        <span class="cm-variable">src</span> <span class="cm-operator">=</span> <span class="cm-variable">os</span>.<span class="cm-property">path</span>.<span class="cm-property">join</span>(<span class="cm-variable">search_path</span>, <span class="cm-variable">fname</span>)
        <span class="cm-error"># decompress name.gz next to the archive as name.gz.log</span>
        <span class="cm-keyword">with</span> <span class="cm-variable">gzip</span>.<span class="cm-property">open</span>(<span class="cm-variable">src</span>,<span class="cm-string">'rb'</span>) <span class="cm-variable">as</span> <span class="cm-variable">f_in</span>:
            <span class="cm-keyword">with</span> <span class="cm-variable">open</span>(<span class="cm-variable">src</span><span class="cm-operator">+</span><span class="cm-string">'.log'</span>,<span class="cm-string">'wb'</span>) <span class="cm-variable">as</span> <span class="cm-variable">f_out</span>:
                <span class="cm-variable">shutil</span>.<span class="cm-property">copyfileobj</span>(<span class="cm-variable">f_in</span>,<span class="cm-variable">f_out</span>)</pre><p class="">The next lines do the following:</p><ul data-rte-list="default"><li><p class="">build a list of all the main log files</p></li><li><p class="">remove the still-compressed .gz files from that list</p></li><li><p class="">loop through the remaining files</p></li><li><p class="">read and parse each file line by line</p></li><li><p class="">append each parsed field to the appropriate list</p></li></ul><pre class="source-code"><span class="cm-variable">mainLogs</span> <span class="cm-operator">=</span> []        
<span class="cm-variable">keyword</span> <span class="cm-operator">=</span> <span class="cm-string">'main'</span>
<span class="cm-keyword">for</span> <span class="cm-variable">fname</span> <span class="cm-keyword">in</span> <span class="cm-variable">os</span>.<span class="cm-property">listdir</span>(<span class="cm-variable">cwd</span>):
    <span class="cm-keyword">if</span> <span class="cm-variable">keyword</span> <span class="cm-keyword">in</span> <span class="cm-variable">fname</span>:
        <span class="cm-variable">mainLogs</span>.<span class="cm-property">append</span>(<span class="cm-variable">fname</span>)  
        
<span class="cm-variable">mainLogs</span> <span class="cm-operator">=</span> [<span class="cm-variable">item</span> <span class="cm-keyword">for</span> <span class="cm-variable">item</span> <span class="cm-keyword">in</span> <span class="cm-variable">mainLogs</span> <span class="cm-keyword">if</span> <span class="cm-variable">not</span> <span class="cm-variable">item</span>.<span class="cm-variable">endswith</span>(<span class="cm-string">'.gz'</span>)]
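
# --- Illustrative addition (not part of the original script) ---
# A minimal sketch, assuming the single-letter priority codes listed earlier
# in this post. PRIORITY_RANK and is_at_least are hypothetical names introduced
# here; they could be used later to keep only rows at or above a severity.
PRIORITY_RANK = {'V': 0, 'D': 1, 'I': 2, 'W': 3, 'E': 4, 'F': 5, 'S': 6}

def is_at_least(level, threshold):
    # True when `level` is as severe as, or more severe than, `threshold`.
    # Unknown codes (for example the 'x' padding value) rank below everything.
    return PRIORITY_RANK.get(level, -1) >= PRIORITY_RANK[threshold]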
    
<span class="cm-variable">date</span> <span class="cm-operator">=</span> []
<span class="cm-variable">time</span> <span class="cm-operator">=</span> []
<span class="cm-variable">processID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">threadID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">priority</span> <span class="cm-operator">=</span> []
<span class="cm-variable">app</span> <span class="cm-operator">=</span> []
<span class="cm-variable">tagsText</span> <span class="cm-operator">=</span> []
<span class="cm-variable">readLine</span> <span class="cm-operator">=</span> []
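
# --- Illustrative addition (not part of the original script) ---
# A minimal sketch, assuming every line follows the whitespace-delimited layout
# used by the loop in this post: date, time, PID, TID, priority, tag, message.
# parse_log_line is a hypothetical helper; it returns None for lines with too
# few fields, so a caller can skip them and keep every column the same length.
def parse_log_line(line):
    fields = line.split(None, 6)  # at most 7 parts; the message keeps its spaces
    if len(fields) not in (6, 7):
        return None
    message = fields[6].rstrip() if len(fields) == 7 else ''
    return {'date': fields[0], 'time': fields[1], 'processID': fields[2],
            'threadID': fields[3], 'priority': fields[4], 'app': fields[5],
            'tagsText': message}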

<span class="cm-keyword">for</span> <span class="cm-variable">main</span> <span class="cm-keyword">in</span> <span class="cm-variable">mainLogs</span>:
    <span class="cm-keyword">with</span> <span class="cm-variable">open</span>(<span class="cm-variable">main</span>,<span class="cm-variable">encoding</span><span class="cm-operator">=</span><span class="cm-string">'utf8'</span>,<span class="cm-variable">errors</span><span class="cm-operator">=</span><span class="cm-string">'surrogateescape'</span>,<span class="cm-variable">newline</span><span class="cm-operator">=</span><span class="cm-string">'\n'</span>) <span class="cm-variable">as</span> <span class="cm-variable">logs</span>:
        <span class="cm-keyword">try</span>:
            <span class="cm-keyword">for</span> <span class="cm-variable">line</span> <span class="cm-keyword">in</span> <span class="cm-variable">logs</span>:
                <span class="cm-error"># split on whitespace: date, time, PID, TID, priority, tag, then the message words</span>
                <span class="cm-variable">lines</span> <span class="cm-operator">=</span> <span class="cm-variable">line</span>.<span class="cm-property">split</span>()
                <span class="cm-error"># keep the raw split line for debugging</span>
                <span class="cm-variable">readLine</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>)
                <span class="cm-variable">date</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">0</span>])
                <span class="cm-variable">time</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">1</span>])
                <span class="cm-variable">processID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">2</span>])
                <span class="cm-variable">threadID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">3</span>])
                <span class="cm-variable">priority</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">4</span>])
                <span class="cm-variable">app</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">5</span>])
                <span class="cm-variable">tagsText</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">6</span>:])
        <span class="cm-variable">except</span> <span class="cm-variable">IndexError</span>:
            <span class="cm-error"># a line with too few fields ends the parsing of this file and can leave the lists with unequal lengths</span>
            <span class="cm-variable">pass</span></pre><p class="">After the parsed fields have been written to the lists, the message words need to be joined back together, since each line was split on whitespace. This next little piece of code recombines them into a human-readable string:</p><pre class="source-code"><span class="cm-variable">tagsTextComb</span> <span class="cm-operator">=</span> []
<span class="cm-keyword">for</span> <span class="cm-variable">innerlist</span> <span class="cm-keyword">in</span> <span class="cm-variable">tagsText</span>:
    <span class="cm-variable">tagsTextComb</span>.<span class="cm-property">append</span>(<span class="cm-string">' '</span>.<span class="cm-property">join</span>(<span class="cm-variable">innerlist</span>)<span class="cm-operator">+</span><span class="cm-string">" "</span>)</pre><p class="">The next lines of code check the length of each list. For a dictionary of lists to be turned into a pandas DataFrame, every list must be the same length. </p><pre class="source-code"><span class="cm-variable">print</span>(<span class="cm-string">"length of Date"</span><span class="cm-operator">+</span><span class="cm-string">' '</span><span class="cm-operator">+</span><span class="cm-variable">str</span>(<span class="cm-variable">len</span>(<span class="cm-variable">date</span>)))
<span class="cm-variable">print</span>(<span class="cm-string">"length of Time"</span><span class="cm-operator">+</span><span class="cm-string">' '</span><span class="cm-operator">+</span><span class="cm-variable">str</span>(<span class="cm-variable">len</span>(<span class="cm-variable">time</span>)))
<span class="cm-variable">print</span>(<span class="cm-string">"length of processID"</span><span class="cm-operator">+</span><span class="cm-string">' '</span><span class="cm-operator">+</span><span class="cm-variable">str</span>(<span class="cm-variable">len</span>(<span class="cm-variable">processID</span>)))
<span class="cm-variable">print</span>(<span class="cm-string">"length of threadID"</span><span class="cm-operator">+</span><span class="cm-string">' '</span><span class="cm-operator">+</span><span class="cm-variable">str</span>(<span class="cm-variable">len</span>(<span class="cm-variable">threadID</span>)))
<span class="cm-variable">print</span>(<span class="cm-string">"length of priority"</span><span class="cm-operator">+</span><span class="cm-string">' '</span><span class="cm-operator">+</span><span class="cm-variable">str</span>(<span class="cm-variable">len</span>(<span class="cm-variable">priority</span>)))
<span class="cm-variable">print</span>(<span class="cm-string">"length of app"</span><span class="cm-operator">+</span><span class="cm-string">' '</span><span class="cm-operator">+</span><span class="cm-variable">str</span>(<span class="cm-variable">len</span>(<span class="cm-variable">app</span>)))
<span class="cm-variable">print</span>(<span class="cm-string">"length of tagsText"</span><span class="cm-operator">+</span><span class="cm-string">' '</span><span class="cm-operator">+</span><span class="cm-variable">str</span>(<span class="cm-variable">len</span>(<span class="cm-variable">tagsText</span>)))
<span class="cm-variable">print</span>(<span class="cm-string">"length of tagsTextComb"</span><span class="cm-operator">+</span><span class="cm-string">' '</span><span class="cm-operator">+</span><span class="cm-variable">str</span>(<span class="cm-variable">len</span>(<span class="cm-variable">tagsTextComb</span>)))

<span class="cm-error"># Output of the print statements above:</span>
<span class="cm-variable">length</span> <span class="cm-variable">of</span> <span class="cm-variable">Date</span> <span class="cm-number">3829775</span>
<span class="cm-variable">length</span> <span class="cm-variable">of</span> <span class="cm-variable">Time</span> <span class="cm-number">3829775</span>
<span class="cm-variable">length</span> <span class="cm-variable">of</span> <span class="cm-variable">processID</span> <span class="cm-number">3829775</span>
<span class="cm-variable">length</span> <span class="cm-variable">of</span> <span class="cm-variable">threadID</span> <span class="cm-number">3829775</span>
<span class="cm-variable">length</span> <span class="cm-variable">of</span> <span class="cm-variable">priority</span> <span class="cm-number">3829775</span>
<span class="cm-variable">length</span> <span class="cm-variable">of</span> <span class="cm-variable">app</span> <span class="cm-number">3829770</span>
<span class="cm-variable">length</span> <span class="cm-variable">of</span> <span class="cm-variable">tagsText</span> <span class="cm-number">3829770</span>
<span class="cm-variable">length</span> <span class="cm-variable">of</span> <span class="cm-variable">tagsTextComb</span> <span class="cm-number">3829770</span></pre><p class="">The following code finalizes the processing of the main log:</p><ul data-rte-list="default"><li><p class="">Combine the lists into a dictionary</p></li><li><p class="">Call the function that pads the lists and evens them out</p></li><li><p class="">Create the dataframe for the main log</p></li></ul><pre class="source-code"><span class="cm-variable">mainDict</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">'date'</span>: <span class="cm-variable">date</span>, <span class="cm-string cm-property">'time'</span>: <span class="cm-variable">time</span>,<span class="cm-string cm-property">'processID'</span>:<span class="cm-variable">processID</span>,<span class="cm-string cm-property">'threadID'</span>:<span class="cm-variable">threadID</span>,<span class="cm-string cm-property">'priority'</span>:<span class="cm-variable">priority</span>,<span class="cm-string cm-property">'app'</span>:<span class="cm-variable">app</span>,<span class="cm-string cm-property">'tagsText'</span>:<span class="cm-variable">tagsTextComb</span>}

<span class="cm-variable">pad_dict_list</span>(<span class="cm-variable">mainDict</span>,<span class="cm-string">'x'</span>)
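
# --- Illustrative addition (not part of the original script) ---
# A minimal sketch, assuming dates look like '08-28' and times look like
# '14:03:22.123'. No year is stored in that layout, so the caller supplies one.
# to_timestamp is a hypothetical helper for combining the date and time columns.
import datetime

def to_timestamp(date_str, time_str, year):
    # Returns a datetime with millisecond precision, or None if parsing fails
    # (for example on the 'x' padding value).
    try:
        return datetime.datetime.strptime(
            f"{year}-{date_str} {time_str}", "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None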

<span class="cm-variable">dfMain</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">mainDict</span>)</pre><p class="">For the remainder of this post, we will process the other log files, combine them, and clean them for a bit of analysis:</p><pre class="source-code"><span class="cm-variable">crashLogs</span> <span class="cm-operator">=</span> []        
<span class="cm-variable">keyword</span> <span class="cm-operator">=</span> <span class="cm-string">'crash'</span>
<span class="cm-keyword">for</span> <span class="cm-variable">fname</span> <span class="cm-keyword">in</span> <span class="cm-variable">os</span>.<span class="cm-property">listdir</span>(<span class="cm-variable">cwd</span>):
    <span class="cm-keyword">if</span> <span class="cm-variable">keyword</span> <span class="cm-keyword">in</span> <span class="cm-variable">fname</span>:
        <span class="cm-variable">crashLogs</span>.<span class="cm-property">append</span>(<span class="cm-variable">fname</span>)
        
<span class="cm-variable">crashLogs</span> <span class="cm-operator">=</span> [<span class="cm-variable">item</span> <span class="cm-keyword">for</span> <span class="cm-variable">item</span> <span class="cm-keyword">in</span> <span class="cm-variable">crashLogs</span> <span class="cm-keyword">if</span> <span class="cm-variable">item</span>.<span class="cm-variable">endswith</span>(<span class="cm-string">'.log'</span>)]

<span class="cm-variable">crashDate</span> <span class="cm-operator">=</span> []
<span class="cm-variable">crashTime</span> <span class="cm-operator">=</span> []
<span class="cm-variable">crashProcessID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">crashThreadID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">crashPriority</span> <span class="cm-operator">=</span> []
<span class="cm-variable">crashApp</span> <span class="cm-operator">=</span> []
<span class="cm-variable">crashTagsText</span> <span class="cm-operator">=</span> []
<span class="cm-variable">crashReadLine</span> <span class="cm-operator">=</span> []

<span class="cm-keyword">for</span> <span class="cm-variable">crash</span> <span class="cm-keyword">in</span> <span class="cm-variable">crashLogs</span>:
    <span class="cm-keyword">with</span> <span class="cm-variable">open</span>(<span class="cm-variable">crash</span>,<span class="cm-variable">encoding</span><span class="cm-operator">=</span><span class="cm-string">'utf8'</span>,<span class="cm-variable">errors</span><span class="cm-operator">=</span><span class="cm-string">'surrogateescape'</span>,<span class="cm-variable">newline</span><span class="cm-operator">=</span><span class="cm-string">'\n'</span>) <span class="cm-variable">as</span> <span class="cm-variable">logs</span>:
        <span class="cm-variable">next</span>(<span class="cm-variable">logs</span>)
        <span class="cm-keyword">try</span>:
            <span class="cm-keyword">for</span> <span class="cm-variable">line</span> <span class="cm-keyword">in</span> <span class="cm-variable">logs</span>:
                <span class="cm-variable">lines</span> <span class="cm-operator">=</span> <span class="cm-variable">line</span>.<span class="cm-property">split</span>()
                <span class="cm-error">#for debugging</span>
                <span class="cm-variable">crashReadLine</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>)
                <span class="cm-variable">crashDate</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">0</span>])
                <span class="cm-variable">crashTime</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">1</span>])
                <span class="cm-variable">crashProcessID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">2</span>])
                <span class="cm-variable">crashThreadID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">3</span>])
                <span class="cm-variable">crashPriority</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">4</span>])
                <span class="cm-variable">crashApp</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">5</span>])
                <span class="cm-variable">crashTagsText</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">6</span>:])
        <span class="cm-variable">except</span> <span class="cm-variable">IndexError</span>:
             <span class="cm-variable">pass</span>

<span class="cm-variable">crashTagsTextComb</span> <span class="cm-operator">=</span> []
<span class="cm-keyword">for</span> <span class="cm-variable">innerlist</span> <span class="cm-keyword">in</span> <span class="cm-variable">crashTagsText</span>:
    <span class="cm-variable">crashTagsTextComb</span>.<span class="cm-property">append</span>(<span class="cm-string">' '</span>.<span class="cm-property">join</span>(<span class="cm-variable">innerlist</span>)<span class="cm-operator">+</span><span class="cm-string">" "</span>)

<span class="cm-variable">crashDict</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">'date'</span>:<span class="cm-variable">crashDate</span>,<span class="cm-string cm-property">'time'</span>:<span class="cm-variable">crashTime</span>,<span class="cm-string cm-property">'processID'</span>:<span class="cm-variable">crashProcessID</span>,<span class="cm-string cm-property">'threadID'</span>:<span class="cm-variable">crashThreadID</span>,<span class="cm-string cm-property">'priority'</span>:<span class="cm-variable">crashPriority</span>,<span class="cm-string cm-property">'app'</span>:<span class="cm-variable">crashApp</span>,<span class="cm-string cm-property">'tagsText'</span>:<span class="cm-variable">crashTagsTextComb</span>}
<span class="cm-variable">pad_dict_list</span>(<span class="cm-variable">crashDict</span>,<span class="cm-string">'x'</span>)
<span class="cm-variable">dfCrash</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">crashDict</span>)
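
# --- Illustrative addition (not part of the original script) ---
# A minimal sketch: the "list files by keyword, drop the .gz archives" steps
# are repeated for each log type, so they could be folded into one helper.
# collect_log_files is a hypothetical name, not something this post defines.
import os

def collect_log_files(directory, keyword, skip_suffix='.gz'):
    # Names in `directory` that contain `keyword` and are not compressed.
    return sorted(
        name for name in os.listdir(directory)
        if keyword in name and not name.endswith(skip_suffix)
    )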

<span class="cm-variable">eventsLogs</span> <span class="cm-operator">=</span> []
<span class="cm-variable">keyword</span> <span class="cm-operator">=</span> <span class="cm-string">'event'</span>
<span class="cm-keyword">for</span> <span class="cm-variable">fname</span> <span class="cm-keyword">in</span> <span class="cm-variable">os</span>.<span class="cm-property">listdir</span>(<span class="cm-variable">cwd</span>):
    <span class="cm-keyword">if</span> <span class="cm-variable">keyword</span> <span class="cm-keyword">in</span> <span class="cm-variable">fname</span>:
        <span class="cm-variable">eventsLogs</span>.<span class="cm-property">append</span>(<span class="cm-variable">fname</span>)

<span class="cm-variable">eventsLogs</span> <span class="cm-operator">=</span> [<span class="cm-variable">item</span> <span class="cm-keyword">for</span> <span class="cm-variable">item</span> <span class="cm-keyword">in</span> <span class="cm-variable">eventsLogs</span> <span class="cm-keyword">if</span> <span class="cm-variable">not</span> <span class="cm-variable">item</span>.<span class="cm-variable">endswith</span>(<span class="cm-string">'.gz'</span>)]

<span class="cm-variable">date</span> <span class="cm-operator">=</span> []
<span class="cm-variable">time</span> <span class="cm-operator">=</span> []
<span class="cm-variable">processID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">threadID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">priority</span> <span class="cm-operator">=</span> []
<span class="cm-variable">app</span> <span class="cm-operator">=</span> []
<span class="cm-variable">tagsText</span> <span class="cm-operator">=</span> []
<span class="cm-variable">readLine</span> <span class="cm-operator">=</span> []

<span class="cm-keyword">for</span> <span class="cm-variable">event</span> <span class="cm-keyword">in</span> <span class="cm-variable">eventsLogs</span>:
    <span class="cm-keyword">with</span> <span class="cm-variable">open</span>(<span class="cm-variable">event</span>,<span class="cm-variable">encoding</span><span class="cm-operator">=</span><span class="cm-string">'utf8'</span>,<span class="cm-variable">errors</span><span class="cm-operator">=</span><span class="cm-string">'surrogateescape'</span>,<span class="cm-variable">newline</span><span class="cm-operator">=</span><span class="cm-string">'\n'</span>) <span class="cm-variable">as</span> <span class="cm-variable">logs</span>:
        <span class="cm-variable">next</span>(<span class="cm-variable">logs</span>)
        <span class="cm-keyword">try</span>:
            <span class="cm-keyword">for</span> <span class="cm-variable">line</span> <span class="cm-keyword">in</span> <span class="cm-variable">logs</span>:
                <span class="cm-variable">lines</span> <span class="cm-operator">=</span> <span class="cm-variable">line</span>.<span class="cm-property">split</span>()
                <span class="cm-error">#for debugging</span>
                <span class="cm-variable">readLine</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>)
                <span class="cm-variable">date</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">0</span>])
                <span class="cm-variable">time</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">1</span>])
                <span class="cm-variable">processID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">2</span>])
                <span class="cm-variable">threadID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">3</span>])
                <span class="cm-variable">priority</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">4</span>])
                <span class="cm-variable">app</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">5</span>])
                <span class="cm-variable">tagsText</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">6</span>:])
        <span class="cm-variable">except</span> <span class="cm-variable">IndexError</span>:
             <span class="cm-variable">pass</span>
             
<span class="cm-variable">tagsTextComb</span> <span class="cm-operator">=</span> []
<span class="cm-keyword">for</span> <span class="cm-variable">innerlist</span> <span class="cm-keyword">in</span> <span class="cm-variable">tagsText</span>:
    <span class="cm-variable">tagsTextComb</span>.<span class="cm-property">append</span>(<span class="cm-string">' '</span>.<span class="cm-property">join</span>(<span class="cm-variable">innerlist</span>)<span class="cm-operator">+</span><span class="cm-string">" "</span>)

<span class="cm-variable">eventsDict</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">'date'</span>:<span class="cm-variable">date</span>,<span class="cm-string cm-property">'time'</span>:<span class="cm-variable">time</span>,<span class="cm-string cm-property">'processID'</span>:<span class="cm-variable">processID</span>,<span class="cm-string cm-property">'threadID'</span>:<span class="cm-variable">threadID</span>,<span class="cm-string cm-property">'priority'</span>:<span class="cm-variable">priority</span>,<span class="cm-string cm-property">'app'</span>:<span class="cm-variable">app</span>,<span class="cm-string cm-property">'tagsText'</span>:<span class="cm-variable">tagsTextComb</span>}
<span class="cm-variable">pad_dict_list</span>(<span class="cm-variable">eventsDict</span>,<span class="cm-string">'x'</span>)
<span class="cm-variable">dfEvents</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">eventsDict</span>)

<span class="cm-variable">sysLogs</span> <span class="cm-operator">=</span> []
<span class="cm-variable">keyword</span> <span class="cm-operator">=</span> <span class="cm-string">'system'</span>
<span class="cm-keyword">for</span> <span class="cm-variable">fname</span> <span class="cm-keyword">in</span> <span class="cm-variable">os</span>.<span class="cm-property">listdir</span>(<span class="cm-variable">cwd</span>):
    <span class="cm-keyword">if</span> <span class="cm-variable">keyword</span> <span class="cm-keyword">in</span> <span class="cm-variable">fname</span>:
        <span class="cm-variable">sysLogs</span>.<span class="cm-property">append</span>(<span class="cm-variable">fname</span>)

<span class="cm-variable">sysLogs</span> <span class="cm-operator">=</span> [<span class="cm-variable">item</span> <span class="cm-keyword">for</span> <span class="cm-variable">item</span> <span class="cm-keyword">in</span> <span class="cm-variable">sysLogs</span> <span class="cm-keyword">if</span> <span class="cm-variable">not</span> <span class="cm-variable">item</span>.<span class="cm-variable">endswith</span>(<span class="cm-string">'.gz'</span>)]       

<span class="cm-variable">date</span> <span class="cm-operator">=</span> []
<span class="cm-variable">time</span> <span class="cm-operator">=</span> []
<span class="cm-variable">processID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">threadID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">priority</span> <span class="cm-operator">=</span> []
<span class="cm-variable">app</span> <span class="cm-operator">=</span> []
<span class="cm-variable">tagsText</span> <span class="cm-operator">=</span> []
<span class="cm-variable">readLine</span> <span class="cm-operator">=</span> []

<span class="cm-keyword">for</span> <span class="cm-variable">sys</span> <span class="cm-keyword">in</span> <span class="cm-variable">sysLogs</span>:
    <span class="cm-keyword">with</span> <span class="cm-variable">open</span>(<span class="cm-variable">sys</span>,<span class="cm-variable">encoding</span><span class="cm-operator">=</span><span class="cm-string">'utf8'</span>,<span class="cm-variable">errors</span><span class="cm-operator">=</span><span class="cm-string">'surrogateescape'</span>,<span class="cm-variable">newline</span><span class="cm-operator">=</span><span class="cm-string">'\n'</span>) <span class="cm-variable">as</span> <span class="cm-variable">logs</span>:
        <span class="cm-keyword">try</span>:
            <span class="cm-keyword">for</span> <span class="cm-variable">line</span> <span class="cm-keyword">in</span> <span class="cm-variable">logs</span>:
                <span class="cm-variable">lines</span> <span class="cm-operator">=</span> <span class="cm-variable">line</span>.<span class="cm-property">split</span>()
                <span class="cm-error">#for debugging</span>
                <span class="cm-variable">readLine</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>)
                <span class="cm-variable">date</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">0</span>])
                <span class="cm-variable">time</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">1</span>])
                <span class="cm-variable">processID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">2</span>])
                <span class="cm-variable">threadID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">3</span>])
                <span class="cm-variable">priority</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">4</span>])
                <span class="cm-variable">app</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">5</span>])
                <span class="cm-variable">tagsText</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">6</span>:])
        <span class="cm-variable">except</span> <span class="cm-variable">IndexError</span>:
             <span class="cm-variable">pass</span>
             
<span class="cm-variable">tagsTextComb</span> <span class="cm-operator">=</span> []
<span class="cm-keyword">for</span> <span class="cm-variable">innerlist</span> <span class="cm-keyword">in</span> <span class="cm-variable">tagsText</span>:
    <span class="cm-variable">tagsTextComb</span>.<span class="cm-property">append</span>(<span class="cm-string">' '</span>.<span class="cm-property">join</span>(<span class="cm-variable">innerlist</span>)<span class="cm-operator">+</span><span class="cm-string">" "</span>)

<span class="cm-variable">sysDicts</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">'date'</span>:<span class="cm-variable">date</span>,<span class="cm-string cm-property">'time'</span>:<span class="cm-variable">time</span>,<span class="cm-string cm-property">'processID'</span>:<span class="cm-variable">processID</span>,<span class="cm-string cm-property">'threadID'</span>:<span class="cm-variable">threadID</span>,<span class="cm-string cm-property">'priority'</span>:<span class="cm-variable">priority</span>,<span class="cm-string cm-property">'app'</span>:<span class="cm-variable">app</span>,<span class="cm-string cm-property">'tagsText'</span>:<span class="cm-variable">tagsTextComb</span>}
<span class="cm-variable">pad_dict_list</span>(<span class="cm-variable">sysDicts</span>,<span class="cm-string">'x'</span>)
<span class="cm-variable">dfSys</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">sysDicts</span>)

<span class="cm-variable">radioLogs</span> <span class="cm-operator">=</span> []
<span class="cm-variable">keyword</span> <span class="cm-operator">=</span> <span class="cm-string">'radio'</span>
<span class="cm-keyword">for</span> <span class="cm-variable">fname</span> <span class="cm-keyword">in</span> <span class="cm-variable">os</span>.<span class="cm-property">listdir</span>(<span class="cm-variable">cwd</span>):
    <span class="cm-keyword">if</span> <span class="cm-variable">keyword</span> <span class="cm-keyword">in</span> <span class="cm-variable">fname</span>:
        <span class="cm-variable">radioLogs</span>.<span class="cm-property">append</span>(<span class="cm-variable">fname</span>)

<span class="cm-variable">radioLogs</span> <span class="cm-operator">=</span> [<span class="cm-variable">item</span> <span class="cm-keyword">for</span> <span class="cm-variable">item</span> <span class="cm-keyword">in</span> <span class="cm-variable">radioLogs</span> <span class="cm-keyword">if</span> <span class="cm-variable">not</span> <span class="cm-variable">item</span>.<span class="cm-variable">endswith</span>(<span class="cm-string">'.gz'</span>)]

<span class="cm-variable">date</span> <span class="cm-operator">=</span> []
<span class="cm-variable">time</span> <span class="cm-operator">=</span> []
<span class="cm-variable">processID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">threadID</span> <span class="cm-operator">=</span> []
<span class="cm-variable">priority</span> <span class="cm-operator">=</span> []
<span class="cm-variable">app</span> <span class="cm-operator">=</span> []
<span class="cm-variable">tagsText</span> <span class="cm-operator">=</span> []
<span class="cm-variable">readLine</span> <span class="cm-operator">=</span> []

<span class="cm-keyword">for</span> <span class="cm-variable">radio</span> <span class="cm-keyword">in</span> <span class="cm-variable">radioLogs</span>:
    <span class="cm-keyword">with</span> <span class="cm-variable">open</span>(<span class="cm-variable">radio</span>,<span class="cm-variable">encoding</span><span class="cm-operator">=</span><span class="cm-string">'utf8'</span>,<span class="cm-variable">errors</span><span class="cm-operator">=</span><span class="cm-string">'surrogateescape'</span>,<span class="cm-variable">newline</span><span class="cm-operator">=</span><span class="cm-string">'\n'</span>) <span class="cm-variable">as</span> <span class="cm-variable">logs</span>:
        <span class="cm-keyword">try</span>:
            <span class="cm-keyword">for</span> <span class="cm-variable">line</span> <span class="cm-keyword">in</span> <span class="cm-variable">logs</span>:
                <span class="cm-variable">lines</span> <span class="cm-operator">=</span> <span class="cm-variable">line</span>.<span class="cm-property">split</span>()
                <span class="cm-error">#for debugging</span>
                <span class="cm-variable">readLine</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>)
                <span class="cm-variable">date</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">0</span>])
                <span class="cm-variable">time</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">1</span>])
                <span class="cm-variable">processID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">2</span>])
                <span class="cm-variable">threadID</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">3</span>])
                <span class="cm-variable">priority</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">4</span>])
                <span class="cm-variable">app</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">5</span>])
                <span class="cm-variable">tagsText</span>.<span class="cm-property">append</span>(<span class="cm-variable">lines</span>[<span class="cm-number">6</span>:])
        <span class="cm-variable">except</span> <span class="cm-variable">IndexError</span>:
             <span class="cm-variable">pass</span>
             
<span class="cm-variable">tagsTextComb</span> <span class="cm-operator">=</span> []
<span class="cm-keyword">for</span> <span class="cm-variable">innerlist</span> <span class="cm-keyword">in</span> <span class="cm-variable">tagsText</span>:
    <span class="cm-variable">tagsTextComb</span>.<span class="cm-property">append</span>(<span class="cm-string">' '</span>.<span class="cm-property">join</span>(<span class="cm-variable">innerlist</span>)<span class="cm-operator">+</span><span class="cm-string">" "</span>)

<span class="cm-variable">radioDicts</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">'date'</span>:<span class="cm-variable">date</span>,<span class="cm-string cm-property">'time'</span>:<span class="cm-variable">time</span>,<span class="cm-string cm-property">'processID'</span>:<span class="cm-variable">processID</span>,<span class="cm-string cm-property">'threadID'</span>:<span class="cm-variable">threadID</span>,<span class="cm-string cm-property">'priority'</span>:<span class="cm-variable">priority</span>,<span class="cm-string cm-property">'app'</span>:<span class="cm-variable">app</span>,<span class="cm-string cm-property">'tagsText'</span>:<span class="cm-variable">tagsTextComb</span>}
<span class="cm-variable">pad_dict_list</span>(<span class="cm-variable">radioDicts</span>,<span class="cm-string">'x'</span>)
<span class="cm-variable">dfRadio</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">radioDicts</span>)
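
# A possible refactor (a sketch, not part of the original script): the events, system,
# and radio blocks above repeat the same parsing steps, so one helper could serve all
# three. parse_logcat_line and LOG_COLUMNS are hypothetical names, and the helper
# assumes the same whitespace-separated field layout used above.
LOG_COLUMNS = ['date', 'time', 'processID', 'threadID', 'priority', 'app', 'tagsText']

def parse_logcat_line(line):
    fields = line.split()
    if not fields[5:]:
        return None  # no sixth field, so the line cannot fill the leading columns
    # Keep the six leading fields and rejoin the remainder as the message text
    return dict(zip(LOG_COLUMNS, fields[:6] + [' '.join(fields[6:]) + ' ']))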

<span class="cm-variable">frames</span> <span class="cm-operator">=</span> [<span class="cm-variable">dfRadio</span>, <span class="cm-variable">dfSys</span>, <span class="cm-variable">dfMain</span>, <span class="cm-variable">dfCrash</span>, <span class="cm-variable">dfEvents</span>]
<span class="cm-variable">df</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">concat</span>(<span class="cm-variable">frames</span>)</pre><p class="">This code should help you get started! In a follow up piece, we’ll go over some basic analytics, cleaning, and applications.</p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1630181615929-EDX1K2OQG275XHYMSOG2/unsplash-image-XmZ4GDAp9G0.jpg?format=1500w" medium="image" isDefault="true" width="1500" height="1000"><media:title type="plain">How to: Parse Android Logs for Analytics and Machine Learning Applications</media:title></media:content></item><item><title>Keep Analytics, Machine Learning, and Artificial Intelligence to Simple Use-Cases with Collective Vantage</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Tue, 18 May 2021 00:27:49 +0000</pubDate><link>https://www.conaxon.org/projects/sales-forecast-gbr</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:60a25d53b0905607d71eec48</guid><description><![CDATA[Conaxon takes the reader through a simple sales forecasting project for a 
retail store. We cover data cleaning (pandas), feature engineering 
(encoding categorical/cyclical features), building a base 
GradientBoostedRegressor Model (Sklearn), and hyper-parameter tuning 
(GridSearchCV, RandomizedSearchCV, KFold).]]></description><content:encoded><![CDATA[<h2>Keep it Simple, Stupid (KISS):</h2><p class="">We’ve talked to businesses and economic development organizations about digitization, feelings about data, typical challenges with implementation of analytics, and the future adoption of machine learning / artificial intelligence. Previously, these struggles were discussed:</p><ul data-rte-list="default"><li><p class="">Incentives to digitalize early (or at all) in a small or micro business are weak, especially if there is little realized return on investment due to an inability to derive insights from the data collected.</p></li><li><p class="">Data is expensive</p></li><li><p class="">Data is hard to collect and synthesize correctly</p></li><li><p class="">There isn't enough data</p></li><li><p class="">Data is not timely, or is difficult to keep timely</p></li></ul><p class="">We set up Collective Vantage to combat these challenges with an easy, compelling, and affordable technology that spreads the load across networks of onboarded businesses. This post details a project that takes a couple of hours and demonstrates just how effective a simple machine learning model can be for a micro/small business looking to understand how their future sales might look.</p><h2>The Dataset:</h2><p class="">For this project, we are using a retail dataset from Kaggle: <a href="https://www.kaggle.com/manjeetsingh/retaildataset">https://www.kaggle.com/manjeetsingh/retaildataset</a>. Since retail organizations are among the most plentiful small/micro businesses, at 2.6 million as of 2020, it makes sense to work on something related to retail.</p><h3>What’s in this data?</h3><p class="">We are given historical sales data for 45 stores located in different regions. The company runs several promotional markdown events throughout the year in their stores. 
These markdowns precede prominent holidays, the four largest of which are the&nbsp;Super Bowl, Labor Day, Thanksgiving, and Christmas. Store data and macro-economic data are also provided.</p><h3>What kind of Features are Included?</h3><p class="">The features table contains additional data related to the store, department, and regional activity for the given dates.</p><ul data-rte-list="default"><li><p class="">Store Number</p></li><li><p class="">Date</p></li><li><p class="">Temperature</p></li><li><p class="">Fuel_Prices</p></li><li><p class="">MarkDown1-5 - % markdown amounts</p></li><li><p class="">CPI (Consumer Price Index)</p></li><li><p class="">Unemployment Rate</p></li><li><p class="">IsHoliday - True/False indicator of Holiday</p></li></ul><h3>What does the Sales Data Look Like?</h3><p class="">The historical sales data covers 2010-02-05 to 2012-11-01. Within this tab you will find the following fields:</p><ul data-rte-list="default"><li><p class="">Store Number</p></li><li><p class="">Department Number</p></li><li><p class="">Date</p></li><li><p class="">Sales Number</p></li></ul><h2>Yep, you can Collect this Data Yourself (and maybe already do)!</h2><p class="">A recurring theme when talking with small and micro business mentors is that most businesses do not have reliable methods in place to collect operational and financial data. If you are reading this and head up a retail establishment, we hope to convey that tools like QuickBooks, Salesforce, Ecommerce Software, and many others can hold all of this very basic data. It is also quite easy to extract the data out of these systems for analysis. Even Excel can be a reliable option when first starting out.</p><p class="">A key advantage of Collective Vantage is that the technology is designed around making data collection and aggregation easier across the various tools businesses use. Attempting this key step is often a daunting task.</p><h2>On to Some Code! 
</h2><p class="">Import the libraries we need</p><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">pandas</span> <span class="cm-keyword">as</span> <span class="cm-def">pd</span>
<span class="cm-keyword">import</span> <span class="cm-def">numpy</span> <span class="cm-keyword">as</span> <span class="cm-def">np</span>
<span class="cm-keyword">import</span> <span class="cm-def">seaborn</span> <span class="cm-keyword">as</span> <span class="cm-def">sns</span>
<span class="cm-keyword">import</span> <span class="cm-def">sklearn</span>
<span class="cm-keyword">import</span> <span class="cm-def">matplotlib</span>.<span class="cm-variable">pyplot</span> <span class="cm-variable">as</span> <span class="cm-variable">plt</span>
<span class="cm-keyword">import</span> <span class="cm-def">datetime</span>
<span class="cm-operator">%</span><span class="cm-variable">matplotlib</span> <span class="cm-variable">inline</span>
<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">ensemble</span> <span class="cm-keyword">import</span> <span class="cm-def">GradientBoostingRegressor</span>,<span class="cm-def">AdaBoostRegressor</span>,<span class="cm-def">RandomForestRegressor</span></pre><p class="">Read in the data</p><pre class="source-code"><span class="cm-variable">store</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">read_csv</span>(<span class="cm-string">"store.csv"</span>)
<span class="cm-variable">feature</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">read_csv</span>(<span class="cm-string">"features.csv"</span>)
<span class="cm-variable">sales</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">read_csv</span>(<span class="cm-string">"sales.csv"</span>)</pre><p class="">First, get our bearings on the structure of the store data. We can see the store ID, Type, and Size. Given that this is anonymized data, the values are somewhat nonsensical. However, the main point is that these high-level data points can still be useful in prediction tasks. Plus, they can be very easy for a business to collect.</p><pre class="source-code"><span class="cm-variable">store</span>.<span class="cm-property">describe</span>().<span class="cm-property">transpose</span>()</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621266665424-OI4MGE5HAD61DT9RRXKN/blog+post+1.PNG" data-image-dimensions="635x369" data-image-focal-point="0.5,0.5" alt="blog post 1.PNG" data-load="false" data-image-id="60a290e963de2f0a1366f431" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621266665424-OI4MGE5HAD61DT9RRXKN/blog+post+1.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">The features table holds good information about the store, markdowns, holidays, and macro-economic data. Again, this is all data that can easily be acquired, stored, and used for analytics. The main thing to notice is that we have to clean up some of the blank fields.</p><pre class="source-code"><span class="cm-variable">feature</span>.<span class="cm-property">describe</span>().<span class="cm-property">transpose</span>()</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621267267095-GY6L965H80ON16RVG54O/blog+post+2.PNG" data-image-dimensions="790x746" data-image-focal-point="0.5,0.5" alt="blog post 2.PNG" data-load="false" data-image-id="60a29343a6ff1a19855c01d1" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621267267095-GY6L965H80ON16RVG54O/blog+post+2.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
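<p class="">One quick way to see which fields are blank is to count the missing values per column. The sketch below is illustrative only: it builds a tiny stand-in table with made-up values (<code>feature_demo</code> is a hypothetical name) rather than reading the real features file.</p><pre class="source-code">import numpy as np
import pandas as pd

# Tiny made-up stand-in for the features table; column names follow the list above
feature_demo = pd.DataFrame({
    'Store': [1, 1, 2],
    'MarkDown1': [np.nan, 120.5, np.nan],
    'CPI': [211.1, np.nan, 210.8],
})

# Count blank (NaN) entries in each column
missing_per_column = feature_demo.isna().sum()
print(missing_per_column)</pre><p class="">Columns with a non-zero count are the ones that need a decision: fill them, drop them, or leave them for the model to handle.</p>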
      

    
  


  


<p class="">Finally, a cursory look at the sales data:</p>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621270222261-E5MAHSKQ7YWMBLZ3WCPL/blog+post+3.PNG" data-image-dimensions="777x427" data-image-focal-point="0.5,0.5" alt="blog post 3.PNG" data-load="false" data-image-id="60a29ece4d8c624d35f1164b" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621270222261-E5MAHSKQ7YWMBLZ3WCPL/blog+post+3.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
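<p class="">The call that produced the summary above is not shown; it presumably mirrors the earlier <code>describe()</code> calls. A minimal sketch on a made-up stand-in table, where <code>sales_demo</code> is a hypothetical name and the <code>Dept</code> and <code>Weekly_Sales</code> column names are assumptions about the file:</p><pre class="source-code">import pandas as pd

# Made-up rows; Dept and Weekly_Sales are assumed column names
sales_demo = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Dept': [1, 2, 1, 2],
    'Date': ['05/02/2010', '05/02/2010', '05/02/2010', '05/02/2010'],
    'Weekly_Sales': [100.0, 200.0, 300.0, 400.0],
})

# Summary statistics for the numeric columns, one row per column
sales_summary = sales_demo.describe().transpose()
print(sales_summary)</pre>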
      

    
  


  


<p class="">The next step is to combine all the tables into a single view:</p><pre class="source-code"><span class="cm-variable">store_feat</span> <span class="cm-operator">=</span> <span class="cm-variable">store</span>.<span class="cm-property">merge</span>(<span class="cm-variable">right</span> <span class="cm-operator">=</span> <span class="cm-variable">feature</span>, <span class="cm-variable">on</span> <span class="cm-operator">=</span> <span class="cm-string">'Store'</span>)
<span class="cm-variable">df</span> <span class="cm-operator">=</span> <span class="cm-variable">store_feat</span>.<span class="cm-property">merge</span>(<span class="cm-variable">right</span> <span class="cm-operator">=</span> <span class="cm-variable">sales</span>, <span class="cm-variable">on</span> <span class="cm-operator">=</span> [<span class="cm-string">'Store'</span>, <span class="cm-string">'Date'</span>, <span class="cm-string">'IsHoliday'</span>])
<span class="cm-variable">df</span>.<span class="cm-property">sample</span>(<span class="cm-number">10</span>)</pre><p class="">As mentioned previously, the first thing to tackle is making sure there are no blank records. In this dataset, if a week did not have a markdown, the record is blank instead of 0. Since there is likely information in the weeks both with and without markdowns, we will simply fill the blanks with 0’s.</p><pre class="source-code"><span class="cm-variable">df</span>.<span class="cm-property">isna</span>().<span class="cm-property">sum</span>()
<span class="cm-variable">df</span>[<span class="cm-string">'MarkDown1'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'MarkDown1'</span>].<span class="cm-property">fillna</span>(<span class="cm-number">0</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'MarkDown2'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'MarkDown2'</span>].<span class="cm-property">fillna</span>(<span class="cm-number">0</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'MarkDown3'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'MarkDown3'</span>].<span class="cm-property">fillna</span>(<span class="cm-number">0</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'MarkDown4'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'MarkDown4'</span>].<span class="cm-property">fillna</span>(<span class="cm-number">0</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'MarkDown5'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'MarkDown5'</span>].<span class="cm-property">fillna</span>(<span class="cm-number">0</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621270806251-50AJRAMR23UG48VUOIUW/blog+post+4.PNG" data-image-dimensions="291x338" data-image-focal-point="0.5,0.5" alt="blog post 4.PNG" data-load="false" data-image-id="60a2a116c16bc25dd8d763c5" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621270806251-50AJRAMR23UG48VUOIUW/blog+post+4.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
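<p class="">The five <code>fillna</code> assignments shown earlier could also be written as a single step. A minimal sketch on made-up data, assuming only the <code>MarkDown1</code> to <code>MarkDown5</code> column names already used (<code>demo</code> and <code>markdown_cols</code> are hypothetical names):</p><pre class="source-code">import numpy as np
import pandas as pd

markdown_cols = ['MarkDown' + str(i) for i in range(1, 6)]

# Made-up rows: NaN marks a week with no markdown
demo = pd.DataFrame({col: [np.nan, 10.0] for col in markdown_cols})

# Fill every markdown column in one assignment
demo[markdown_cols] = demo[markdown_cols].fillna(0)
print(demo.isna().sum().sum())</pre><p class="">Selecting the columns by a list keeps the fill limited to the markdown fields, so blanks in other columns are not silently replaced.</p>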
      

    
  


  


<p class="">Next, some features will be created to derive more information from the dimensions that already exist:</p><pre class="source-code"><span class="cm-variable">def</span> <span class="cm-variable">day_of_year</span>(<span class="cm-variable">date_str</span>):
    <span class="cm-variable">date</span> <span class="cm-operator">=</span> <span class="cm-variable">datetime</span>.<span class="cm-property">datetime</span>.<span class="cm-property">strptime</span>(<span class="cm-variable">date_str</span>, <span class="cm-string">'%d/%m/%Y'</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">date</span>.<span class="cm-property">timetuple</span>().<span class="cm-property">tm_yday</span>

<span class="cm-comment"># Note: despite its name, day() returns the month number; it feeds 'MonthOfYear' below</span>
<span class="cm-variable">def</span> <span class="cm-variable">day</span>(<span class="cm-variable">date_str</span>):
    <span class="cm-variable">date</span> <span class="cm-operator">=</span> <span class="cm-variable">datetime</span>.<span class="cm-property">datetime</span>.<span class="cm-property">strptime</span>(<span class="cm-variable">date_str</span>, <span class="cm-string">'%d/%m/%Y'</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">date</span>.<span class="cm-property">timetuple</span>().<span class="cm-property">tm_mon</span>

<span class="cm-variable">def</span> <span class="cm-variable">year</span>(<span class="cm-variable">date_str</span>):
    <span class="cm-variable">date</span> <span class="cm-operator">=</span> <span class="cm-variable">datetime</span>.<span class="cm-property">datetime</span>.<span class="cm-property">strptime</span>(<span class="cm-variable">date_str</span>, <span class="cm-string">'%d/%m/%Y'</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">date</span>.<span class="cm-property">timetuple</span>().<span class="cm-property">tm_year</span>

<span class="cm-variable">def</span> <span class="cm-variable">woy</span>(<span class="cm-variable">date_str</span>):
    <span class="cm-variable">date</span> <span class="cm-operator">=</span> <span class="cm-variable">datetime</span>.<span class="cm-property">datetime</span>.<span class="cm-property">strptime</span>(<span class="cm-variable">date_str</span>, <span class="cm-string">'%d/%m/%Y'</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">date</span>.<span class="cm-property">isocalendar</span>()[<span class="cm-number">1</span>]  <span class="cm-comment"># ISO week of the year</span>

<span class="cm-variable">df</span>[<span class="cm-string">'DayOfYear'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'Date'</span>].<span class="cm-property">map</span>(<span class="cm-variable">day_of_year</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'MonthOfYear'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'Date'</span>].<span class="cm-property">map</span>(<span class="cm-variable">day</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'Year'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'Date'</span>].<span class="cm-property">map</span>(<span class="cm-variable">year</span>)
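<span class="cm-comment"># Scale the day to one full cycle per year so the encoding wraps around at year end</span>
<span class="cm-variable">df</span>[<span class="cm-string">'DayOfYearCos'</span>] <span class="cm-operator">=</span> <span class="cm-variable">np</span>.<span class="cm-property">cos</span>(<span class="cm-number">2</span> <span class="cm-operator">*</span> <span class="cm-variable">np</span>.<span class="cm-property">pi</span> <span class="cm-operator">*</span> <span class="cm-variable">df</span>[<span class="cm-string">'DayOfYear'</span>] <span class="cm-operator">/</span> <span class="cm-number">365.25</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'DayOfYearSin'</span>] <span class="cm-operator">=</span> <span class="cm-variable">np</span>.<span class="cm-property">sin</span>(<span class="cm-number">2</span> <span class="cm-operator">*</span> <span class="cm-variable">np</span>.<span class="cm-property">pi</span> <span class="cm-operator">*</span> <span class="cm-variable">df</span>[<span class="cm-string">'DayOfYear'</span>] <span class="cm-operator">/</span> <span class="cm-number">365.25</span>)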
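<span class="cm-comment"># Parse with the same day-first format used by the helpers above</span>
<span class="cm-variable">df</span>[<span class="cm-string">'Date'</span>] <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">to_datetime</span>(<span class="cm-variable">df</span>[<span class="cm-string">'Date'</span>], <span class="cm-variable">format</span><span class="cm-operator">=</span><span class="cm-string">'%d/%m/%Y'</span>)
<span class="cm-variable">df</span>[<span class="cm-string">"WeekofYear"</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>.<span class="cm-property">Date</span>.<span class="cm-property">dt</span>.<span class="cm-property">isocalendar</span>().<span class="cm-property">week</span>.<span class="cm-property">astype</span>(<span class="cm-variable">int</span>)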

<span class="cm-variable">df</span>[<span class="cm-string">'IsHoliday'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'IsHoliday'</span>].<span class="cm-property">astype</span>(<span class="cm-string">'category'</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'Dept'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'Dept'</span>].<span class="cm-property">astype</span>(<span class="cm-string">'category'</span>)
<span class="cm-variable">df</span>[<span class="cm-string">'Store'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'Store'</span>].<span class="cm-property">astype</span>(<span class="cm-string">'category'</span>)</pre><p class="">Some key things to note on the features created with the code above:</p><ul data-rte-list="default"><li><p class="">Deconstruct the date into the year, month, day of the year, and week of the year to increase the opportunity for our model to pick up on trends, hopefully increasing predictive power</p></li><li><p class="">Calculate the sine and cosine of the day of the year, as these features help maximize our ability to fit the cyclical nature of the retail data</p><ul data-rte-list="default"><li><p class="">https://medium.com/swlh/time-series-forecasting-with-a-twist-27350e97a2cb</p></li><li><p class="">https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca</p></li><li><p class="">https://towardsdatascience.com/taking-seasonality-into-consideration-for-time-series-analysis-4e1f4fbb768f</p></li></ul></li><li><p class="">Convert the IsHoliday, Dept, and Store dimensions to categories so they may be encoded using pd.get_dummies. Department and Store are stored as integer codes in the dataset. Because these codes represent distinct categories, we do not want the algorithm to pick up on their numeric order</p></li></ul><p class="">After cleaning the data and creating some additional features, visualizing the data highlights some key things to keep in mind later. 
First, a correlation plot is generated to assess how the variables are correlated with each other (correlation, not to be confused with causation).</p><pre class="source-code"><span class="cm-variable">df_temp</span> <span class="cm-operator">=</span> <span class="cm-variable">df</span>.<span class="cm-property">copy</span>(<span class="cm-variable">deep</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">df_temp</span>[<span class="cm-string">'tot_MarkDown'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df_temp</span>[<span class="cm-string">'MarkDown1'</span>] <span class="cm-operator">+</span> <span class="cm-variable">df_temp</span>[<span class="cm-string">'MarkDown2'</span>] <span class="cm-operator">+</span><span class="cm-variable">df_temp</span>[<span class="cm-string">'MarkDown3'</span>] <span class="cm-operator">+</span><span class="cm-variable">df_temp</span>[<span class="cm-string">'MarkDown4'</span>] <span class="cm-operator">+</span> <span class="cm-variable">df_temp</span>[<span class="cm-string">'MarkDown5'</span>]
<span class="cm-variable">df_temp</span>.<span class="cm-property">drop</span>([<span class="cm-string">'MarkDown1'</span>, <span class="cm-string">'MarkDown2'</span>, <span class="cm-string">'MarkDown3'</span>, <span class="cm-string">'MarkDown4'</span>, <span class="cm-string">'MarkDown5'</span>,<span class="cm-string">'Year'</span>], <span class="cm-variable">inplace</span> <span class="cm-operator">=</span> <span class="cm-variable">True</span>, <span class="cm-variable">axis</span> <span class="cm-operator">=</span> <span class="cm-number">1</span>)
<span class="cm-variable">fig</span>, <span class="cm-variable">ax</span> <span class="cm-operator">=</span> <span class="cm-variable">plt</span>.<span class="cm-property">subplots</span>(<span class="cm-variable">figsize</span><span class="cm-operator">=</span>(<span class="cm-number">20</span>,<span class="cm-number">12</span>))
<span class="cm-variable">sns</span>.<span class="cm-property">heatmap</span>(<span class="cm-variable">df_temp</span>.<span class="cm-property">corr</span>(),<span class="cm-variable">annot</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621274175339-2ZW5N5BL687JWL39S6X4/blog+post+5.PNG" data-image-dimensions="1145x706" data-image-focal-point="0.5,0.5" alt="blog post 5.PNG" data-load="false" data-image-id="60a2ae3faa09d140dc339327" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621274175339-2ZW5N5BL687JWL39S6X4/blog+post+5.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
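<p class="">For reference, each cell in a heatmap like this is a pairwise Pearson correlation coefficient, which is what the pandas corr() method computes by default. The sketch below works through that formula using only the Python standard library. The function name pearson and the sample numbers are hypothetical, made up for illustration, and are not taken from the retail dataset.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (the variance numerators)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, for illustration only
day_of_year = [15, 46, 74, 105, 135, 166]
month_of_year = [1, 2, 3, 4, 5, 6]
temperature = [30.0, 28.5, 45.0, 39.0, 61.0, 58.0]

print(round(pearson(day_of_year, month_of_year), 3))
print(round(pearson(day_of_year, temperature), 3))
```

<p class="">Two columns derived from the same date, such as the day and the month in this toy example, come out almost perfectly correlated, so a strong cell between date-derived features says little about sales.</p>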
      

    
  


  


<p class="">Few features in this dataset are highly correlated. MonthOfYear and DayOfYear are, unsurprisingly, somewhat correlated with each other.</p><p class="">The next chart looks at the time series data to establish an understanding of how the trends relate:</p><pre class="source-code"><span class="cm-variable">df</span>[[<span class="cm-string">'Date'</span>, <span class="cm-string">'Temperature'</span>, <span class="cm-string">'Fuel_Price'</span>, <span class="cm-string">'CPI'</span>, <span class="cm-string">'Unemployment'</span>, 
    <span class="cm-string">'MarkDown1'</span>, <span class="cm-string">'MarkDown2'</span>, <span class="cm-string">'MarkDown3'</span>, <span class="cm-string">'MarkDown4'</span>, <span class="cm-string">'MarkDown5'</span>]].<span class="cm-property">plot</span>(<span class="cm-variable">x</span><span class="cm-operator">=</span><span class="cm-string">'Date'</span>, <span class="cm-variable">subplots</span><span class="cm-operator">=</span><span class="cm-variable">True</span>, <span class="cm-variable">figsize</span><span class="cm-operator">=</span>(<span class="cm-number">20</span>,<span class="cm-number">15</span>))
<span class="cm-variable">plt</span>.<span class="cm-property">show</span>()</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621275123635-R5TSBBAVBZ1OZC7CS62X/blog+post+6.PNG" data-image-dimensions="1099x738" data-image-focal-point="0.5,0.5" alt="blog post 6.PNG" data-load="false" data-image-id="60a2b1f3e2d1e010d13d4062" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621275123635-R5TSBBAVBZ1OZC7CS62X/blog+post+6.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">Few of these variables visibly trend along with the sales cycle. A lack of observable trend is not necessarily a problem, but it does make the model more abstract and potentially less interpretable. What is great about machine learning is that obscure connections between various inputs can be found and exploited to produce useful insights, a level of inference that human intuition can’t reach without extensive time and effort.</p><p class="">Next, plot total weekly sales over time for a closer look:</p><pre class="source-code"><span class="cm-variable">df_time</span> <span class="cm-operator">=</span> <span class="cm-variable">df</span>.<span class="cm-property">groupby</span>(<span class="cm-string">'Date'</span>)[<span class="cm-string">'Weekly_Sales'</span>].<span class="cm-property">sum</span>().<span class="cm-property">reset_index</span>()
<span class="cm-variable">fig</span>, <span class="cm-variable">ax</span> <span class="cm-operator">=</span> <span class="cm-variable">plt</span>.<span class="cm-property">subplots</span>(<span class="cm-variable">figsize</span><span class="cm-operator">=</span>(<span class="cm-number">20</span>,<span class="cm-number">12</span>))
<span class="cm-variable">ax</span>.<span class="cm-property">plot</span>(<span class="cm-string">'Date'</span>, <span class="cm-string">'Weekly_Sales'</span>, <span class="cm-variable">data</span><span class="cm-operator">=</span><span class="cm-variable">df_time</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621276224062-JUIPUHN3K69ACRRXZT1I/blog+post+7.PNG" data-image-dimensions="1141x681" data-image-focal-point="0.5,0.5" alt="blog post 7.PNG" data-load="false" data-image-id="60a2b640d4e6c932771187ca" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621276224062-JUIPUHN3K69ACRRXZT1I/blog+post+7.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">A closer look reveals more of the cyclical nature of sales over time. Sales seem to dip just after the New Year and peak around Christmas, Thanksgiving, Memorial Day, and the 4th of July. So, pretty typical sales behavior for a retail establishment. Funnily enough, the trend also looks like a WAVE!</p><p class="">The seasonality can be visualized further by looking at the sales split by month:</p><pre class="source-code"><span class="cm-variable">df_seas</span> <span class="cm-operator">=</span> <span class="cm-variable">df</span>.<span class="cm-property">groupby</span>(<span class="cm-variable">df</span>.<span class="cm-property">Date</span>.<span class="cm-property">dt</span>.<span class="cm-property">month</span>)[<span class="cm-string">'Weekly_Sales'</span>].<span class="cm-property">sum</span>().<span class="cm-property">reset_index</span>()
<span class="cm-variable">plt</span>.<span class="cm-property">figure</span>(<span class="cm-variable">figsize</span><span class="cm-operator">=</span>(<span class="cm-number">10</span>, <span class="cm-number">5</span>))
<span class="cm-variable">sns</span>.<span class="cm-property">barplot</span>(<span class="cm-variable">x</span><span class="cm-operator">=</span><span class="cm-variable">df_seas</span>.<span class="cm-property">Date</span>,<span class="cm-variable">y</span><span class="cm-operator">=</span><span class="cm-variable">df_seas</span>.<span class="cm-property">Weekly_Sales</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621279500286-GYQ6GGFR7XQP6IVUPHCF/blog+post+8.PNG" data-image-dimensions="612x354" data-image-focal-point="0.5,0.5" alt="blog post 8.PNG" data-load="false" data-image-id="60a2c30cd4e6c9327713094f" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621279500286-GYQ6GGFR7XQP6IVUPHCF/blog+post+8.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
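<p class="">Seasonality like this is the reason for the sine and cosine features created earlier. One common way to build such features, sketched below under the assumption of a 365.25-day cycle, is to map the day of the year onto one full turn of a circle so that December 31 and January 1 end up next to each other. The helper names encode_day_of_year and distance are hypothetical and exist only for this illustration.</p>

```python
import math

PERIOD_DAYS = 365.25  # assumed length of one seasonal cycle


def encode_day_of_year(day):
    """Map a day of the year (1-366) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * day / PERIOD_DAYS
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded days."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


new_year = encode_day_of_year(1)
year_end = encode_day_of_year(365)
mid_year = encode_day_of_year(183)

# As raw day numbers, January 1 and December 31 are 364 apart.
# On the circle they are close neighbours, while early July is far away.
print(distance(new_year, year_end))
print(distance(new_year, mid_year))
```

<p class="">Both the sine and the cosine are needed: either one alone maps two different days of the year to the same value, while the pair identifies each day uniquely.</p>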
      

    
  


  


<p class="">In the monthly chart, the peaks in April and December are interesting. The dips in January and November are also interesting, given the large peaks that the weekly sales chart shows around the holidays.</p><p class="">Finally, the sales by store type will be visualized:</p><pre class="source-code"><span class="cm-variable">df_store_type</span> <span class="cm-operator">=</span> <span class="cm-variable">df</span>.<span class="cm-property">groupby</span>(<span class="cm-string">'Type'</span>)[<span class="cm-string">'Weekly_Sales'</span>].<span class="cm-property">sum</span>().<span class="cm-property">reset_index</span>()
<span class="cm-variable">fig</span>, <span class="cm-variable">ax</span> <span class="cm-operator">=</span> <span class="cm-variable">plt</span>.<span class="cm-property">subplots</span>(<span class="cm-variable">figsize</span><span class="cm-operator">=</span>(<span class="cm-number">20</span>,<span class="cm-number">12</span>))
<span class="cm-variable">ax</span>.<span class="cm-property">bar</span>(<span class="cm-string">'Type'</span>, <span class="cm-string">'Weekly_Sales'</span>, <span class="cm-variable">data</span><span class="cm-operator">=</span><span class="cm-variable">df_store_type</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621280388428-SDUVQZ59D41IY7NJZVHQ/blog+post+9.PNG" data-image-dimensions="1097x680" data-image-focal-point="0.5,0.5" alt="blog post 9.PNG" data-load="false" data-image-id="60a2c684376a6e7ddadb5faa" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621280388428-SDUVQZ59D41IY7NJZVHQ/blog+post+9.PNG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">Note that sales are not distributed evenly across store types: Type A accounts for far more sales than Types B and C, which gives it a significant advantage in terms of predictive capability. The model we develop will include sales from all store types. In the future, it might be useful to develop a model specific to each store type.</p><p class="">Data preparation for modeling is next:</p><pre class="source-code"><span class="cm-variable">model</span> <span class="cm-operator">=</span> <span class="cm-variable">df</span>.<span class="cm-property">set_index</span>([<span class="cm-string">'Date'</span>, <span class="cm-string">'Store'</span>, <span class="cm-string">'Dept'</span>]).<span class="cm-property">sort_index</span>()
<span class="cm-variable">model_data</span> <span class="cm-operator">=</span> <span class="cm-variable">model</span>.<span class="cm-property">reset_index</span>()
<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">preprocessing</span> <span class="cm-keyword">import</span> <span class="cm-def">MinMaxScaler</span>

<span class="cm-variable">mms</span> <span class="cm-operator">=</span> <span class="cm-variable">MinMaxScaler</span>()
<span class="cm-variable">model_data</span>[[<span class="cm-string">'Temperature'</span>,<span class="cm-string">'Fuel_Price'</span>,<span class="cm-string">'MarkDown1'</span>,<span class="cm-string">'MarkDown2'</span>,
            <span class="cm-string">'MarkDown3'</span>,<span class="cm-string">'MarkDown4'</span>,<span class="cm-string">'MarkDown5'</span>,<span class="cm-string">'CPI'</span>,<span class="cm-string">'Unemployment'</span>,
            <span class="cm-string">'Size'</span>]] <span class="cm-operator">=</span> <span class="cm-variable">mms</span>.<span class="cm-property">fit_transform</span>(<span class="cm-variable">model_data</span>[[<span class="cm-string">'Temperature'</span>,<span class="cm-string">'Fuel_Price'</span>,<span class="cm-string">'MarkDown1'</span>,<span class="cm-string">'MarkDown2'</span>,
                                                     <span class="cm-string">'MarkDown3'</span>,<span class="cm-string">'MarkDown4'</span>,<span class="cm-string">'MarkDown5'</span>,<span class="cm-string">'CPI'</span>,
                                                     <span class="cm-string">'Unemployment'</span>,<span class="cm-string">'Size'</span>]])</pre><pre class="source-code"><span class="cm-variable">model_data</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">get_dummies</span>(<span class="cm-variable">model_data</span>,<span class="cm-variable">drop_first</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">final_model</span> <span class="cm-operator">=</span> <span class="cm-variable">model_data</span>.<span class="cm-property">set_index</span>(<span class="cm-string">'Date'</span>)</pre><p class="">The block of code does a few key things:</p><ul data-rte-list="default"><li><p class="">Sort the table by Date, Store, and Department</p></li><li><p class="">Scale the numeric features to between 0 and 1 so that everything is on the same scale</p></li><li><p class="">One-hot encode the categorical columns with pd.get_dummies, dropping the first level of each</p></li></ul><p class="">Splitting the data into a training set and a prediction set comes next, along with defining the features and the target we are actually trying to predict. The training set (dates up to the start of 2012) will be used to build and evaluate the basic model, and the prediction set (2012 onward) will be used to test how good the model actually is on unseen observations:</p><pre class="source-code"><span class="cm-variable">training_model</span> <span class="cm-operator">=</span> <span class="cm-variable">final_model</span>[:<span class="cm-string">'2012-01-01'</span>].<span class="cm-property">reset_index</span>()
<span class="cm-variable">pred</span> <span class="cm-operator">=</span> <span class="cm-variable">final_model</span>[<span class="cm-string">'2012-01-01'</span>:].<span class="cm-property">reset_index</span>()
<span class="cm-variable">X_model_train</span> <span class="cm-operator">=</span> <span class="cm-variable">training_model</span>.<span class="cm-property">drop</span>(<span class="cm-variable">columns</span><span class="cm-operator">=</span>[<span class="cm-string">'Weekly_Sales'</span>, <span class="cm-string">'Year'</span>, <span class="cm-string">'DayOfYear'</span>,<span class="cm-string">'Date'</span>])
<span class="cm-variable">y_model_train</span> <span class="cm-operator">=</span> <span class="cm-variable">training_model</span>[<span class="cm-string">'Weekly_Sales'</span>]
<span class="cm-variable">X_pred</span> <span class="cm-operator">=</span> <span class="cm-variable">pred</span>.<span class="cm-property">drop</span>(<span class="cm-variable">columns</span><span class="cm-operator">=</span>[<span class="cm-string">'Weekly_Sales'</span>, <span class="cm-string">'Year'</span>, <span class="cm-string">'DayOfYear'</span>,<span class="cm-string">'Date'</span>])
<span class="cm-variable">y_pred</span> <span class="cm-operator">=</span> <span class="cm-variable">pred</span>[<span class="cm-string">'Weekly_Sales'</span>]</pre><p class="">For this project, we are going to split the data a bit differently. Both training and prediction will be split into their own train and test sets. </p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">train_test_split</span>

<span class="cm-variable">X_train</span>, <span class="cm-variable">X_test</span>, <span class="cm-variable">y_train</span>, <span class="cm-variable">y_test</span> <span class="cm-operator">=</span> <span class="cm-variable">train_test_split</span>(<span class="cm-variable">X_model_train</span>, <span class="cm-variable">y_model_train</span>, <span class="cm-variable">test_size</span><span class="cm-operator">=</span><span class="cm-number">0.10</span>, <span class="cm-variable">random_state</span><span class="cm-operator">=</span><span class="cm-number">0</span>)</pre><p class="">Splitting the data allows us to then build the basic model that will be the benchmark. Gradient Boosting Regressors are one of my go-to algorithms from Sklearn because of the resilience to overfitting, tunability, and flexibility.</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">ensemble</span> <span class="cm-keyword">import</span> <span class="cm-def">GradientBoostingRegressor</span>

<span class="cm-variable">gbr_regressor</span> <span class="cm-operator">=</span> <span class="cm-variable">GradientBoostingRegressor</span>()

<span class="cm-variable">gbr_regressor</span> <span class="cm-operator">=</span> <span class="cm-variable">gbr_regressor</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train</span>, <span class="cm-variable">y_train</span>)
<span class="cm-variable">gbr_regressor</span>.<span class="cm-property">score</span>(<span class="cm-variable">X_test</span>, <span class="cm-variable">y_test</span>)

<span class="cm-variable">future_pred</span> <span class="cm-operator">=</span> <span class="cm-variable">gbr_regressor</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_pred</span>)
<span class="cm-variable">gbr_regressor</span>.<span class="cm-property">score</span>(<span class="cm-variable">X_pred</span>, <span class="cm-variable">y_pred</span>)</pre><ul data-rte-list="default"><li><p class="">Training score = 0.697 (about 70%)</p></li><li><p class="">Test score = 0.743 (74%)</p></li><li><p class="">The base model mean squared error (MSE) on test set: 166,493,765.3 (in squared dollars)</p></li><li><p class="">The base model mean absolute error (MAE) on test set: $8,201.4</p></li><li><p class="">The base model root mean squared error (RMSE) on test set: $12,903.2</p></li></ul><p class="">When looking at the scores for the Gradient Boosting Regressor, it is important to note that the score is actually an R^2 value. R^2 is, generally, a metric that provides insight into how well the model fits the data. There is a lot of room for interpretation, and domain experience will dictate whether or not the fit is good enough. In this project, the score is not terrible but can probably be improved. An area of concern is the mean squared error, mean absolute error, and root mean squared error. Because MSE squares each error, a very high value indicates that some predictions miss by a wide margin, which can be problematic in future predictions. A mean absolute error of $8,201.4 seems to be acceptable, but would need to be evaluated by the stakeholders in the retail shop. RMSE is a less interpretable metric than MAE. However, because it is the square root of MSE, it is back in dollars while still weighting the extreme errors more heavily than the more ‘normal’ prediction errors. There is definitely some room to improve.</p><p class="">Sklearn offers GridSearchCV and RandomizedSearchCV for testing lots of parameter combinations. GridSearchCV tries every combination and can be very slow. 
RandomizedSearchCV samples only a fixed number of combinations, so it returns results much faster:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">KFold</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">RandomizedSearchCV</span>

<span class="cm-variable">parameters</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">'learning_rate'</span>:[<span class="cm-number">0.05</span>, <span class="cm-number">0.1</span>, <span class="cm-number">0.5</span>, <span class="cm-number">1</span>], 
              <span class="cm-string cm-property">'min_samples_split'</span>:[<span class="cm-number">2</span>,<span class="cm-number">5</span>,<span class="cm-number">10</span>], 
              <span class="cm-string cm-property">'max_depth'</span>:[<span class="cm-number">2</span>,<span class="cm-number">3</span>,<span class="cm-number">5</span>],
             <span class="cm-string cm-property">'n_estimators'</span>:[<span class="cm-number">100</span>,<span class="cm-number">150</span>,<span class="cm-number">250</span>]}

<span class="cm-variable">gbr_regressor</span> <span class="cm-operator">=</span> <span class="cm-variable">GradientBoostingRegressor</span>()
<span class="cm-variable">cv_test</span><span class="cm-operator">=</span> <span class="cm-variable">KFold</span>(<span class="cm-variable">n_splits</span><span class="cm-operator">=</span><span class="cm-number">5</span>)
<span class="cm-variable">clf</span> <span class="cm-operator">=</span> <span class="cm-variable">RandomizedSearchCV</span>(<span class="cm-variable">gbr_regressor</span>, <span class="cm-variable">parameters</span>,<span class="cm-variable">cv</span><span class="cm-operator">=</span><span class="cm-variable">cv_test</span>,<span class="cm-variable">n_jobs</span><span class="cm-operator">=</span><span class="cm-operator">-</span><span class="cm-number">1</span>)
<span class="cm-variable">clf</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train</span>, <span class="cm-variable">y_train</span>)

<span class="cm-variable">clf</span>.<span class="cm-property">best_params_</span>
<span class="cm-comment"># {'n_estimators': 250, 'min_samples_split': 10, 'max_depth': 5, 'learning_rate': 0.5}</span>
</pre><p class="">After about 30 minutes, the search returns the best parameters. The model can be re-run with the new parameters to get an updated score:</p><pre class="source-code"><span class="cm-variable">gbr_regressor_tuned</span> <span class="cm-operator">=</span> <span class="cm-variable">GradientBoostingRegressor</span>(<span class="cm-variable">n_estimators</span> <span class="cm-operator">=</span> <span class="cm-number">250</span>, 
                                                <span class="cm-variable">min_samples_split</span> <span class="cm-operator">=</span> <span class="cm-number">10</span>, 
                                                <span class="cm-variable">max_depth</span> <span class="cm-operator">=</span> <span class="cm-number">5</span>, 
                                                <span class="cm-variable">learning_rate</span> <span class="cm-operator">=</span> <span class="cm-number">0.5</span>)

<span class="cm-variable">X_train_pred</span>, <span class="cm-variable">X_test_pred</span>, <span class="cm-variable">y_train_pred</span>, <span class="cm-variable">y_test_pred</span> <span class="cm-operator">=</span> <span class="cm-variable">train_test_split</span>(<span class="cm-variable">X_pred</span>, <span class="cm-variable">y_pred</span>, <span class="cm-variable">test_size</span><span class="cm-operator">=</span><span class="cm-number">0.10</span>, <span class="cm-variable">random_state</span><span class="cm-operator">=</span><span class="cm-number">0</span>)

<span class="cm-variable">gbr_regressor_tuned</span> <span class="cm-operator">=</span> <span class="cm-variable">gbr_regressor_tuned</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_pred</span>, <span class="cm-variable">y_train_pred</span>)
<span class="cm-variable">gbr_regressor_tuned</span>.<span class="cm-property">score</span>(<span class="cm-variable">X_test_pred</span>, <span class="cm-variable">y_test_pred</span>)

<span class="cm-variable">future_pred</span> <span class="cm-operator">=</span> <span class="cm-variable">gbr_regressor_tuned</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_pred</span>)
<span class="cm-variable">gbr_regressor_tuned</span>.<span class="cm-property">score</span>(<span class="cm-variable">X_test_pred</span>, <span class="cm-variable">y_test_pred</span>)

</pre><ul data-rte-list="default"><li><p class="">Tuned Prediction Set Score 96.7%</p></li><li><p class="">The tuned model mean squared error (MSE) on prediction set: 16,116,078.3 (in squared dollars)</p></li><li><p class="">The tuned model mean absolute error (MAE) on prediction set: $2,379.7</p></li><li><p class="">The tuned model root mean squared error (RMSE) on prediction set: $4,011.6</p></li></ul><p class="">Note here that our R^2 jumped by over 20 percentage points. Depending on the use case, that type of increase can indicate overfitting of the model. However, MSE, MAE, and RMSE all declined precipitously. The hyper-parameter tuning seems to have helped significantly with the predictive power of the model, although the tuned model was fit and scored on a split of the prediction set rather than the original training set, so the two sets of scores are not a strictly like-for-like comparison. </p><h2>Cool, but why does this matter?</h2><p class="">The ability, at least generally, to predict sales from week to week is hugely important in a tight-margin retail space, where the stakes are high:</p><ul data-rte-list="default"><li><p class="">Improve the ability to staff more efficiently</p></li><li><p class="">Purchase material in a more efficient way</p></li><li><p class="">Reduce the level of risk when planning financial and capital investments</p></li></ul><h2>Sorry, I just don’t have enough data to do this kind of Modeling!</h2><p class="">For many businesses, there just won’t be enough data to build a model; the vast majority of small and micro businesses simply will not have enough. So what are these businesses supposed to do? What if you have a new business trying to forecast potential sales within a business community? Collective Vantage wants to eliminate the need to worry about these hurdles. Even if you only have a small amount of data, there are probably businesses similar to yours that already have enough to build a forecast. </p><h2>Even if I had data to Forecast, I don’t know how to code or Productionize anything</h2><p class="">Collective Vantage condenses the technical know-how to a manageable level. 
Conaxon brings the tech and the users bring the domain experience. It’s as simple as asking a question like: “What might my sales look like next week, month, year?”</p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1621296923019-IKJFEPRFVWRFJKZIO9EM/unsplash-image-qwtCeJ5cLYs.jpg?format=1500w" medium="image" isDefault="true" width="1500" height="1079"><media:title type="plain">Keep Analytics, Machine Learning, and Artificial Intelligence to Simple Use-Cases with Collective Vantage</media:title></media:content></item><item><title>Be Wary of Second Hand Data: Here's How Collective Vantage Addresses that Fear</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Tue, 11 May 2021 17:38:57 +0000</pubDate><link>https://www.conaxon.org/projects/be-wary-of-inherited-data-heres-how-collective-vantage-addresses-that-fear</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:609aa7551e464b320698074b</guid><description><![CDATA[A brief discussion how Conaxon & Collective Vantage maintains trust and 
data quality. You SHOULD be wary of second hand data! But, there are ways 
to balance the costs and benefits to drive the most value.]]></description><content:encoded><![CDATA[<p class="">I have huge respect for Cassie Kozyrkov, Chief Decision Scientist at Google. Her succinct but content-rich articles are a real help in conceptualizing complex topics. Her most recent post on LinkedIn can be found here: <a href="https://bit.ly/3vSD2y0">https://bit.ly/3vSD2y0</a> </p><p class="">This most recent post certainly did not disappoint, and it is well worth checking out.</p><p class="">All summed up, the video details the dangers of relying too heavily on data collected, aggregated, and transformed by someone other than the analyst or data scientist doing the analysis or developing the algorithm. This tip is vastly underrated. There have been so many instances where I was simply handed data and assumed it contained ‘gold source’ transactions that could be trusted. Sometimes there was an option to collect my own data, and other times there was not. Each data scientist and analyst will need to conduct a cost-benefit analysis to identify the course of action that gets the most accurate result.</p><p class="">Cassie’s post got me thinking about how Conaxon &amp; Collective Vantage address this big assumption in the technology and culture we are building:</p><ul data-rte-list="default"><li><p class="">Collective Vantage is being built because of the ambiguity around where data comes from and how it is collected. How can you trust conclusions from analysis built on data that can’t be trusted? 
One of our key challenges is creating technologies that standardize data collection across a wide range of businesses with different levels of digitalization, reducing the chances of poor-quality data making its way into users’ and customers’ hands</p></li><li><p class="">Collective Vantage doesn’t just standardize data collection; it also makes data collection easy through custom integrations with popular platforms. “Set it and forget it,” as it were!</p></li><li><p class="">Trust through Expertise. Conaxon has experts devoted to understanding how data is collected and generated by users, and it acts as a digitalization partner. The more closely Conaxon can work with its data contributors, the better we can maintain the highest level of data quality possible</p></li></ul><p class="">Check out Collective Vantage (a product of Conaxon) and sign up to be one of a few companies to pilot the project: <a href="https://collective-vantage.crd.co">https://collective-vantage.crd.co</a></p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1620754476272-6LTT9HS8ZBP1TZGK2PFA/unsplash-image-NDfqqq_7QWM.jpg?format=1500w" medium="image" isDefault="true" width="1500" height="1125"><media:title type="plain">Be Wary of Second Hand Data: Here's How Collective Vantage Addresses that Fear</media:title></media:content></item><item><title>Key Struggles with Analytics, Decision Intelligence, and Machine Learning Micro-Business Communities</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Fri, 07 May 2021 03:33:52 +0000</pubDate><link>https://www.conaxon.org/projects/key-struggles-with-analytics-artificial-intelligence-and-machine-learning-micro-business-communities</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:60942bb1533e5f33f38251b4</guid><description><![CDATA[Conaxon talks about key problems in applying Machine Learning, Artificial 
Intelligence, Analytics, and Decision Intelligence within Small / Micro 
Businesses.]]></description><content:encoded><![CDATA[<h2>A Big Market has a ton of Untapped Potential:</h2><p class="">Conaxon has spent a lot of time interviewing small business owners and representatives from economic development organizations, chambers of commerce, and small-business development centers. Out of these conversations grew a passion for finding a way to make machine learning (ml), artificial intelligence (ai), analytics, and decision intelligence (DI) more attainable for the micro-businesses out there. By focusing on how to apply these technologies, Conaxon can help take the fear and risk out of key decision making for the small companies that could use the help.</p><p class="">Our mission is to optimize decision-making processes for small and micro businesses using a platform that aggregates shared data from contributors and recommends data to users that might help answer key business questions. </p><p class="">Nearly 99% of businesses in America, about 30 million, are considered small. A subset of these, called micro-businesses, makes up nearly 75% of small businesses. We think a vast majority of those micro businesses are underserved due to a focus on firms that can deliver huge contracts. </p><h2>What we’ve Found in our Research so far:</h2><p class="">Here are the key struggles and how our (pre-launch) product Collective Vantage attempts to address some of them:</p><h3>Cost</h3><p class="">For most micro-businesses, budgets are tight! Return on investment is of huge importance when making purchases, especially when it comes to technology. Given how hard it is to convey the value of analytics in terms of financial return even for a large organization, the uphill climb is steeper still for a small business. The constituent components for a production-level analytics, ai, and ml application are still quite expensive to build, maintain, and grow. 
A data engineer in most areas commands upward of $70,000. Many data scientists make upward of $80,000 to $90,000. Business intelligence professionals and analysts can make between $65,000 and $80,000. If a consultant is being considered, they can charge several hundred dollars per hour. Marketing databases that aid in customer segmentation can also cost many thousands of dollars, and they often go out of date quickly. It doesn’t take long to realize your typical small business will struggle to afford tools, human capital, and data.</p><p class="">With all these costs in mind, Conaxon realized that if the decision-making process could be optimized, a lot of good could be done for many businesses. Essentially, automating research, data aggregation, cleaning, enrichment, and basic presentation would save considerable time and create the economies of scale required to bring the cost down. Conaxon believes Collective Vantage has the technology to achieve these efficiencies and can deliver an extremely competitive product relative to other players in the space.</p><h3>There Just Isn’t Enough Data</h3><p class="">In all of our interviews, it was very obvious that small &amp; micro businesses struggle to collect and store enough structured data to derive insights. Most companies, logically, are focused on operations and delighting customers. Data isn’t really at the top of most small businesses’ to-do lists. </p><p class="">This feedback prompted a thought! Why not aggregate data across a network of similar businesses to create a large dataset that can be anonymized, cleaned, enriched, and made more useful? This type of model provides some unique advantages:</p><ul data-rte-list="default"><li><p class="">Users / contributors get a truer sense of the larger context of a market, customer segment, or any other subject matter that gets aggregated. 
The whole is greater than the sum of its constituent parts</p></li><li><p class="">Since the goal is to aggregate completely across a network, a user / contributor can ‘look across’ to other products and services being offered to find opportunities for innovation within their own local markets. Diversity in analytics is a key advantage we want to provide to contributors and users</p></li><li><p class="">Data quality is likely to be less of a concern because there is an incentive not to poison the well everyone is drinking from in terms of what gets contributed</p></li><li><p class="">Contributors do not need a ton of data to get started. In fact, we want to encourage large networks of businesses to start small and continue to grow their capabilities while still being able to adopt analytics, ai, ml, and DI in the short term. Costs can remain low with this approach as well.</p></li></ul><h3>Data can take a While to Collect</h3><p class="">Interviews with our target customers seem to indicate that, in most cases, there would be a pretty long lead time before a dataset is sizeable enough for analytics. If a retail establishment wanted to do any sort of sophisticated forecasting, it would take years to collect a sufficient number of observations. Consider a small antique shop that wants to do customer segmentation. Unless the shop brings in hundreds of customers in a short amount of time, the database will become outdated quickly and it will be impossible to detect drift.</p><p class="">Collective Vantage addresses the timeliness issue by taking small sets from many sources and continually supporting contributors in expanding their collection capabilities.</p><h3>Complexity</h3><p class="">We have been learning that many small and micro businesses simply can’t yet tackle these complex topics on their own. 
Many micro businesses consist of an owner-operator and a few employees who do not specialize in analytics, let alone ai or ml.</p><p class="">Our solution is focused on quickly tackling the decision-making process by being the analyst for the customer, but much faster and more cheaply than hiring someone or performing the work themselves. The simplicity comes from the customer/user entering a question they want answered in order to make a decision. Once the question has been asked, Collective Vantage handles the rest. The platform returns a set of recommended data and analysis that can help answer the query in an informed way. </p><h3>When cash is tight, ROI is king</h3><p class="">Many businesses we talked to expressed how tough it is to just pay the bills. Insufficient cashflow is the third leading reason entrepreneurs close their doors. For those contributing data to Collective Vantage, we want to show that data truly is an asset, each entity owns their data, and each entity can derive monetary value from their data assets. To address the concern over ROI, Collective Vantage will provide the opportunity to earn crypto or cash when a contributor’s data is sold (with permission, of course). Because we believe that each contributor owns their data and is contributing to a valuable dataset that outside entities will want access to, it makes sense to redistribute those revenues back to contributors.</p><h3>Join the Team</h3><p class="">Conaxon is looking to find 100 businesses to partner with on our launch. 
Please visit: <a href="https://collective-vantage.crd.co">https://collective-vantage.crd.co</a> to sign up and learn more about Collective Vantage.</p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1620355757504-KKOBCVIK557CEJHP71DA/unsplash-image-EKdM6EZ6HDY.jpg?format=1500w" medium="image" isDefault="true" width="1500" height="2250"><media:title type="plain">Key Struggles with Analytics, Decision Intelligence, and Machine Learning Micro-Business Communities</media:title></media:content></item><item><title>Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 2)</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Wed, 17 Mar 2021 00:27:23 +0000</pubDate><link>https://www.conaxon.org/projects/sentiment-analysis-to-drive-content-strategy-on-your-youtube-marketing-channel-part-2</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:6050bccc7701a065a27a9549</guid><description><![CDATA[Conaxon uses the YouTube API to extract stats and comments from a YouTube 
channel as a method of using data to drive content strategy that viewers 
and listeners like watching, and to find the content that tends to be 
engaged with more negatively. Deciding content strategy is not an exact 
science. But, there are tools out there to create a more efficient decision 
making process. We dive into a few different topics like using APIs, VADER 
Lexicon from NLTK, for and while loops, cleaning text data, and a dashboard 
concept to visualize a channel’s data.]]></description><content:encoded><![CDATA[<p class="">Last time, we looked at the process for setting up a project in Google Cloud, enabling the API, and utilizing the API to get data that we can analyze. In this article, we are going to do a bit of data cleaning, analysis, NLP/unsupervised sentiment analysis, and data visualization in PowerBI.</p><h3>Where we left off….</h3><p class="">Last time, we created a Pandas DataFrame that houses the commentary from a YouTube Channel:</p><pre class="source-code"><span class="cm-variable">data_threads</span><span class="cm-operator">=</span>{<span class="cm-string cm-property">'comment'</span>:<span class="cm-variable">comments_pop</span>,<span class="cm-string cm-property">'comment_id'</span>:<span class="cm-variable">comment_id_pop</span>,<span class="cm-string cm-property">'reply_count'</span>:<span class="cm-variable">reply_count_pop</span>,<span class="cm-string cm-property">'like_count'</span>:<span class="cm-variable">like_count_pop</span>,<span class="cm-string cm-property">'channel_id'</span>:<span class="cm-variable">channel_id_pop</span>,<span class="cm-string cm-property">'video_id'</span>:<span class="cm-variable">video_id_pop</span>}
<span class="cm-variable">threads</span><span class="cm-operator">=</span><span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">data_threads</span>)
<span class="cm-variable">threads</span>.<span class="cm-property">head</span>()</pre><p class="">After creating this table, I removed duplicates in the event that we actually DO have any. This is more of a best practice than an actual need, although previous versions of the script did generate duplicate comments. It also ensures that when we get to counting and calculating metrics, there is no risk of artificially inflating values.</p><pre class="source-code"><span class="cm-variable">threads</span>.<span class="cm-property">drop_duplicates</span>(<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
</pre><p class="">Next, we merge the high-level statistics with the comments:</p><pre class="source-code"><span class="cm-variable">result</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">merge</span>(<span class="cm-variable">threads</span>, <span class="cm-variable">df</span>, <span class="cm-variable">how</span><span class="cm-operator">=</span><span class="cm-string">"inner"</span>, <span class="cm-variable">on</span><span class="cm-operator">=</span>[<span class="cm-string">"video_id"</span>])</pre><h3>Cleaning the Comment Text:</h3><p class="">Before applying any sort of sentiment analysis, or analysis in general, we absolutely should clean the comments. We would not be able to scale an analysis to hundreds of thousands of comments without some sort of cleaning. Let us start by removing tags:</p><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">re</span>

<span class="cm-variable">def</span> <span class="cm-variable">remove_tags</span>(<span class="cm-variable">string</span>):
    <span class="cm-variable">result</span> <span class="cm-operator">=</span> <span class="cm-variable">re</span>.<span class="cm-property">sub</span>(<span class="cm-string">'&lt;.*?&gt;'</span>,<span class="cm-string">''</span>,<span class="cm-variable">string</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">result</span>

<span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>]<span class="cm-operator">=</span><span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">lambda</span> <span class="cm-variable">cw</span> : <span class="cm-variable">remove_tags</span>(<span class="cm-variable">cw</span>))</pre><p class="">There are lots of things we can do with emojis and emoticons. They convey sentiment via a pictogram. Unfortunately, a lexicon cannot interpret a picture. The emojis will need to be converted to a phrase. We can do that in this way:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">emot</span>.<span class="cm-property">emo_unicode</span> <span class="cm-keyword">import</span> <span class="cm-def">UNICODE_EMO</span>, <span class="cm-def">EMOTICONS</span>

<span class="cm-variable">def</span> <span class="cm-variable">convert_emojis</span>(<span class="cm-variable">text</span>):
    <span class="cm-keyword">for</span> <span class="cm-variable">emot</span> <span class="cm-keyword">in</span> <span class="cm-variable">UNICODE_EMO</span>:
        <span class="cm-variable">text</span> <span class="cm-operator">=</span> <span class="cm-variable">text</span>.<span class="cm-property">replace</span>(<span class="cm-variable">emot</span>, <span class="cm-string">"_"</span>.<span class="cm-property">join</span>(<span class="cm-variable">UNICODE_EMO</span>[<span class="cm-variable">emot</span>].<span class="cm-property">replace</span>(<span class="cm-string">","</span>,<span class="cm-string">""</span>).<span class="cm-property">replace</span>(<span class="cm-string">":"</span>,<span class="cm-string">""</span>).<span class="cm-property">split</span>()))
    <span class="cm-keyword">return</span> <span class="cm-variable">text</span>
  
<span class="cm-variable">def</span> <span class="cm-variable">convert_emoticons</span>(<span class="cm-variable">text</span>):
    <span class="cm-keyword">for</span> <span class="cm-variable">emot</span> <span class="cm-keyword">in</span> <span class="cm-variable">EMOTICONS</span>:
        <span class="cm-variable">text</span> <span class="cm-operator">=</span> <span class="cm-variable">re</span>.<span class="cm-property">sub</span>(<span class="cm-variable">u</span><span class="cm-string">'('</span><span class="cm-operator">+</span><span class="cm-variable">emot</span><span class="cm-operator">+</span><span class="cm-string">')'</span>, <span class="cm-string">"_"</span>.<span class="cm-variable">join</span>(<span class="cm-variable">EMOTICONS</span>[<span class="cm-variable">emot</span>].<span class="cm-variable">replace</span>(<span class="cm-string">","</span>,<span class="cm-string">""</span>).<span class="cm-property">split</span>()), <span class="cm-variable">text</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">text</span>

<span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>] <span class="cm-operator">=</span> <span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">convert_emoticons</span>)
<span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>] <span class="cm-operator">=</span> <span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">convert_emojis</span>)</pre><p class="">URLs cannot be interpreted as much of anything. We will remove them:</p><pre class="source-code"><span class="cm-variable">def</span> <span class="cm-variable">remove_urls</span>(<span class="cm-variable">text</span>):
    <span class="cm-variable">url_pattern</span> <span class="cm-operator">=</span> <span class="cm-variable">re</span>.<span class="cm-property">compile</span>(<span class="cm-variable">r</span><span class="cm-string">'https?://\S+|www\.\S+'</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">url_pattern</span>.<span class="cm-property">sub</span>(<span class="cm-variable">r</span><span class="cm-string">''</span>, <span class="cm-variable">text</span>)

<span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>] <span class="cm-operator">=</span> <span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">remove_urls</span>)</pre><p class="">HTML is also another piece that should be cleaned up:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">bs4</span> <span class="cm-keyword">import</span> <span class="cm-def">BeautifulSoup</span>

<span class="cm-variable">def</span> <span class="cm-variable">html</span>(<span class="cm-variable">text</span>):
    <span class="cm-keyword">return</span> <span class="cm-variable">BeautifulSoup</span>(<span class="cm-variable">text</span>, <span class="cm-string">"lxml"</span>).<span class="cm-property">text</span>

<span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>] <span class="cm-operator">=</span> <span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">html</span>)</pre><p class="">The next few lines were generated for future studies in Natural Language Processing (NLP), but not necessarily used here. However, they are useful functions to reference back to later if you happen to be on this journey as well:</p><ul data-rte-list="default"><li><p class="">Remove Punctuation</p></li><li><p class="">Tokenize</p></li><li><p class="">Remove Stop Words</p></li><li><p class="">Lemmatize</p></li><li><p class="">Generate the number  of words in a comment</p></li><li><p class="">Generate the number  of sentences in a comment</p></li></ul><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">string</span>
<span class="cm-variable">string</span>.<span class="cm-property">punctuation</span>
<span class="cm-variable">def</span> <span class="cm-variable">remove_punctuation</span>(<span class="cm-variable">text</span>):
    <span class="cm-variable">no_punct</span><span class="cm-operator">=</span>[<span class="cm-variable">words</span> <span class="cm-keyword">for</span> <span class="cm-variable">words</span> <span class="cm-keyword">in</span> <span class="cm-variable">text</span> <span class="cm-keyword">if</span> <span class="cm-variable">words</span> <span class="cm-variable">not</span> <span class="cm-keyword">in</span> <span class="cm-variable">string</span>.<span class="cm-variable">punctuation</span>]
    <span class="cm-variable">words_wo_punct</span><span class="cm-operator">=</span><span class="cm-string">''</span>.<span class="cm-property">join</span>(<span class="cm-variable">no_punct</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">words_wo_punct</span>
<span class="cm-variable">result</span>[<span class="cm-string">'comment_no_punc'</span>]<span class="cm-operator">=</span><span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">lambda</span> <span class="cm-variable">x</span>: <span class="cm-variable">remove_punctuation</span>(<span class="cm-variable">x</span>))

<span class="cm-variable">def</span> <span class="cm-variable">tokenize</span>(<span class="cm-variable">text</span>):
    <span class="cm-variable">split</span><span class="cm-operator">=</span><span class="cm-variable">re</span>.<span class="cm-property">split</span>(<span class="cm-variable">r</span><span class="cm-string">"\W+"</span>,<span class="cm-variable">text</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">split</span>
<span class="cm-variable">result</span>[<span class="cm-string">'comment_no_punc_tokens'</span>]<span class="cm-operator">=</span><span class="cm-variable">result</span>[<span class="cm-string">'comment_no_punc'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">lambda</span> <span class="cm-variable">x</span>: <span class="cm-variable">tokenize</span>(<span class="cm-variable">x</span>.<span class="cm-variable">lower</span>()))
<span class="cm-variable">result</span>.<span class="cm-property">head</span>(<span class="cm-number">1</span>)

<span class="cm-error">#Importing stopwords from nltk library</span>
<span class="cm-variable">from</span> <span class="cm-variable">nltk</span>.<span class="cm-property">corpus</span> <span class="cm-keyword">import</span> <span class="cm-def">stopwords</span>
<span class="cm-variable">STOPWORDS</span> <span class="cm-operator">=</span> <span class="cm-variable">set</span>(<span class="cm-variable">stopwords</span>.<span class="cm-property">words</span>(<span class="cm-string">'english'</span>))
<span class="cm-error"># Function to remove the stopwords</span>
<span class="cm-variable">def</span> <span class="cm-variable">remove_stopwords</span>(<span class="cm-variable">tokens</span>):
    <span class="cm-keyword">return</span> <span class="cm-string">" "</span>.<span class="cm-property">join</span>([<span class="cm-variable">word</span> <span class="cm-keyword">for</span> <span class="cm-variable">word</span> <span class="cm-keyword">in</span> <span class="cm-variable">tokens</span> <span class="cm-keyword">if</span> <span class="cm-variable">word</span> <span class="cm-variable">not</span> <span class="cm-keyword">in</span> <span class="cm-variable">STOPWORDS</span>])

<span class="cm-variable">result</span>[<span class="cm-string">'title_wo_punct_split_wo_stopwords'</span>] <span class="cm-operator">=</span> <span class="cm-variable">result</span>[<span class="cm-string">'comment_no_punc_tokens'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">remove_stopwords</span>)

<span class="cm-keyword">import</span> <span class="cm-def">nltk</span>
<span class="cm-variable">nltk</span>.<span class="cm-property">download</span>(<span class="cm-string">'wordnet'</span>)
<span class="cm-variable">nltk</span>.<span class="cm-property">download</span>(<span class="cm-string">'averaged_perceptron_tagger'</span>)
<span class="cm-variable">from</span> <span class="cm-variable">nltk</span>.<span class="cm-property">corpus</span> <span class="cm-keyword">import</span> <span class="cm-def">wordnet</span>
<span class="cm-keyword">from</span> <span class="cm-variable">nltk</span>.<span class="cm-property">stem</span> <span class="cm-keyword">import</span> <span class="cm-def">WordNetLemmatizer</span>
<span class="cm-variable">lemmatizer</span> <span class="cm-operator">=</span> <span class="cm-variable">WordNetLemmatizer</span>()
<span class="cm-variable">wordnet_map</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">"N"</span>:<span class="cm-variable">wordnet</span>.<span class="cm-property">NOUN</span>, <span class="cm-string cm-property">"V"</span>:<span class="cm-variable">wordnet</span>.<span class="cm-property">VERB</span>, <span class="cm-string cm-property">"J"</span>:<span class="cm-variable">wordnet</span>.<span class="cm-property">ADJ</span>, <span class="cm-string cm-property">"R"</span>:<span class="cm-variable">wordnet</span>.<span class="cm-property">ADV</span>} 

<span class="cm-variable">def</span> <span class="cm-variable">lemmatize_words</span>(<span class="cm-variable">text</span>):
    <span class="cm-variable">pos_tagged_text</span> <span class="cm-operator">=</span> <span class="cm-variable">nltk</span>.<span class="cm-property">pos_tag</span>(<span class="cm-variable">text</span>.<span class="cm-property">split</span>())
    <span class="cm-keyword">return</span> <span class="cm-string">" "</span>.<span class="cm-property">join</span>([<span class="cm-variable">lemmatizer</span>.<span class="cm-property">lemmatize</span>(<span class="cm-variable">word</span>, <span class="cm-variable">wordnet_map</span>.<span class="cm-property">get</span>(<span class="cm-variable">pos</span>[<span class="cm-number">0</span>], <span class="cm-variable">wordnet</span>.<span class="cm-property">NOUN</span>)) <span class="cm-keyword">for</span> <span class="cm-variable">word</span>, <span class="cm-variable">pos</span> <span class="cm-keyword">in</span> <span class="cm-variable">pos_tagged_text</span>])

<span class="cm-variable">result</span>[<span class="cm-string">'title_wo_punct_split_wo_stopwords_lemma'</span>] <span class="cm-operator">=</span> <span class="cm-variable">result</span>[<span class="cm-string">'title_wo_punct_split_wo_stopwords'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">lemmatize_words</span>)

<span class="cm-variable">result</span>[<span class="cm-string">'num_words'</span>] <span class="cm-operator">=</span> <span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">lambda</span> <span class="cm-variable">x</span>: <span class="cm-variable">len</span>(<span class="cm-variable">x</span>.<span class="cm-variable">split</span>()))
<span class="cm-variable">result</span>[<span class="cm-string">'num_sentences'</span>] <span class="cm-operator">=</span> <span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>].<span class="cm-property">apply</span>(<span class="cm-variable">lambda</span> <span class="cm-variable">x</span>: <span class="cm-variable">len</span>(<span class="cm-variable">re</span>.<span class="cm-variable">split</span>( <span class="cm-string">'~ ...'</span> ,<span class="cm-string">'~'</span>.<span class="cm-variable">join</span>(<span class="cm-variable">x</span>.<span class="cm-variable">split</span>(<span class="cm-string">'.'</span>)))))</pre><h3>Implement NLTK VADER Lexicon:</h3><p class="">One thing to note here is that the best approach to sentiment analysis is to LABEL the data manually BEFORE attempting it. By doing this, you ensure that the labeling is specific to your own use case, which will be explored a bit later. The code to score the comments with the NLTK VADER lexicon can be found below:</p><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">nltk</span>
<span class="cm-keyword">from</span> <span class="cm-variable">nltk</span>.<span class="cm-property">sentiment</span>.<span class="cm-property">vader</span> <span class="cm-keyword">import</span> <span class="cm-def">SentimentIntensityAnalyzer</span> <span class="cm-keyword">as</span> <span class="cm-def">SIA</span>
<span class="cm-variable">nltk</span>.<span class="cm-property">downloader</span>.<span class="cm-property">download</span>(<span class="cm-string">'vader_lexicon'</span>)

<span class="cm-variable">sid</span> <span class="cm-operator">=</span> <span class="cm-variable">SIA</span>()

<span class="cm-variable">sentiment</span> <span class="cm-operator">=</span> []

<span class="cm-keyword">for</span> <span class="cm-variable">comment</span> <span class="cm-keyword">in</span> <span class="cm-variable">result</span>[<span class="cm-string">'comment'</span>]:
    <span class="cm-variable">sentiment</span>.<span class="cm-property">append</span>(<span class="cm-variable">sid</span>.<span class="cm-property">polarity_scores</span>(<span class="cm-variable">comment</span>) )

<span class="cm-variable">result</span>[<span class="cm-string">'sentiment'</span>] <span class="cm-operator">=</span> <span class="cm-variable">sentiment</span>

</pre><p class="">This piece of the code will take a while, depending on the number of comments. In this block, we simply loop through the comments, append each result from SentimentIntensityAnalyzer to the empty sentiment list, and then add that list to the result/comments table as a new column. The sentiment analyzer returns a dictionary for each comment. We can split this dictionary into separate columns by executing the following code:</p><pre class="source-code"><span class="cm-variable">result</span> <span class="cm-operator">=</span> <span class="cm-variable">result</span>.<span class="cm-property">drop</span>(<span class="cm-string">'sentiment'</span>, <span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>).<span class="cm-property">assign</span>(<span class="cm-operator">**</span><span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">result</span>.<span class="cm-property">sentiment</span>.<span class="cm-property">values</span>.<span class="cm-property">tolist</span>()))</pre><p class="">Finally, we write the table to a CSV file so that Power BI can connect to it. We will now create a simple visualization that lets a user track the sentiment on their YouTube content.</p><h3>Visualizing the YouTube Data Collected:</h3>
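<p class="">The CSV step described just before this heading can be sketched as follows. This is a minimal, dependency-free sketch that uses Python’s standard csv module rather than pandas; the sample rows, the flatten_row and write_csv helpers, and the file name mentioned in the comment are all hypothetical stand-ins, not the code used for the dashboard. It assumes each comment carries the four-key dictionary that VADER’s polarity_scores() returns.</p>

```python
import csv
import io

# Hypothetical rows standing in for the comments table; each 'sentiment'
# value has the shape returned by VADER's polarity_scores().
rows = [
    {"comment": "Great video",
     "sentiment": {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.62}},
    {"comment": "Not for me",
     "sentiment": {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.3}},
]

SCORE_KEYS = ["neg", "neu", "pos", "compound"]


def flatten_row(row):
    """Copy a row, replacing the nested sentiment dict with one field per score."""
    flat = {key: value for key, value in row.items() if key != "sentiment"}
    for key in SCORE_KEYS:
        flat[key] = row["sentiment"][key]
    return flat


def write_csv(rows, handle):
    """Write the flattened rows to an open text handle, header row first."""
    flat_rows = [flatten_row(row) for row in rows]
    writer = csv.DictWriter(handle, fieldnames=list(flat_rows[0].keys()))
    writer.writeheader()
    writer.writerows(flat_rows)


# An in-memory buffer keeps the sketch free of side effects. For a real export,
# pass a file handle instead, e.g. open("comments_sentiment.csv", "w", newline="").
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

<p class="">Flattening before the export gives the dashboard one numeric column per score, so a value such as compound can be averaged or charted directly.</p>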








  

    
  
    

      

      
        <figure class="sqs-block-image-figure intrinsic">
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1615927321538-MAZCLOSMXU7TS8B3S41A/Dashboard+View.JPG" data-image-dimensions="995x560" data-image-focal-point="0.5,0.5" alt="Dashboard View.JPG" data-load="false" data-image-id="60511819d456cc21da947b0e" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1615927321538-MAZCLOSMXU7TS8B3S41A/Dashboard+View.JPG?format=1000w" />
        </figure>
<p class="">Since the goal was to measure engagement on the channel, the more basic stats are shown at the top. Comments, Likes, Dislikes, and Views measure a general level of engagement with the topic of the video. One could probably even create a weighted average calculation that includes the average sentiment, thereby producing a single engagement score. The bar chart simply shows the top 10 videos. Naturally, it makes sense to focus on the most-viewed videos so as to analyze why they may have been more successful. Comments are listed below along with the predicted sentiment values. NOTE: the compound score is a kind of ‘blended’ sentiment score. This Stack Overflow thread seems to explain the mathematics of the scoring well enough: <a href="https://stackoverflow.com/questions/40325980/how-is-the-vader-compound-polarity-score-calculated-in-python-nltk">https://stackoverflow.com/questions/40325980/how-is-the-vader-compound-polarity-score-calculated-in-python-nltk</a>. Finally, there is a dual-axis chart that shows the average compound score over time relative to the views. There is actually little correlation between the two, but I thought that juxtaposing these values would be a good way for an analyst to notice large swings in the sentiment over time. Large contrasts in sentiment would be key areas to investigate why the content was either well received or not.</p><h3>Considerations:</h3><ul data-rte-list="default"><li><p class="">The sentiment analyzer struggles with context, for example with things like ‘bad ass’. In most contexts, this colloquialism is actually a ‘positive’ for some content even though the words technically might be considered ‘negative’. The limitation of the sentiment analyzer is that people without domain expertise in the subject matter are the ones categorizing the words as negative, neutral, or positive. 
In aggregate, I think this methodology is reliable and the benefits outweigh the risks of mislabeling.</p></li><li><p class="">Take steps to do human labeling before attempting to do a thorough sentiment analysis. The methods we employ here are a somewhat ‘quick and dirty’ way of measuring sentiment.</p></li></ul>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1615927284303-H4RB1CLII7LHDMLYB02V/Dashboard+View.JPG?format=1500w" medium="image" isDefault="true" width="995" height="560"><media:title type="plain">Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 2)</media:title></media:content></item><item><title>Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 1)</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Sun, 14 Mar 2021 03:02:29 +0000</pubDate><link>https://www.conaxon.org/projects/sentiment-analysis-to-drive-content-strategy-on-your-youtube-marketing-channel</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:604ba6881fbcd33297344666</guid><description><![CDATA[Conaxon uses the YouTube API to extract stats and comments from a YouTube 
channel as a method of using data to drive content strategy that viewers 
and listeners like watching. It also helps find the content that tends to be 
engaged with more negatively. Deciding content strategy is not an exact 
science. But, there are tools out there to create a more efficient decision 
making process. We dive into a few different topics like using APIs, VADER 
Lexicon from NLTK, for and while loops, cleaning text data, and a dashboard 
concept to visualize a channel’s data.]]></description><content:encoded><![CDATA[<p class="">I haven’t spent much time managing marketing teams or content developers, creating campaigns, or working on any projects of that kind. But, from the outside looking in, I can imagine that there must be an immense amount of work and intuition required to come up with good marketing content. Some might say that the author should just ‘know their audience’ or ‘just create what you want to create’. In general, I agree. However, companies like YouTube have changed the game somewhat in recent years. Audiences have never been larger, more diverse, more targeted, or more accessible. It would be nearly impossible to grow a marketing strategy through intuition alone and maintain success. Creators and Authors should have tools that give them an opportunity to keep up with a very fickle audience. With a little data, a few lines of code, and some visualizations, these creatives just might have a chance to get ahead of the curve, and faster.</p><p class="">In this multi-part series, we are going to spin up a simple project to demonstrate a neat use case. Consider for a moment that you are a marketing leader or content creator responsible for generating campaigns, marketing videos, podcasts, etc. We’ll also assume that while this leader has a lot of experience, they are also quite pragmatic and aware there are some opportunities to start off on the right foot. We can try to understand the landscape for the company’s products and content before investing dollars into a project. Seeing as this company is heavily B2C, YouTube is a keystone in delivering good content to large audiences quickly. 
</p><p class="">In this series, we will cover the following:</p><ul data-rte-list="default"><li><p class="">Setting up a project in Google so as to activate the v3 data API for YouTube</p></li><li><p class="">Getting familiar with the APIs we will need to use and what the APIs provide</p></li><li><p class="">Making calls to the appropriate APIs and storing the results</p></li><li><p class="">Cleaning the comment data to be processed further by the VADER Lexicon from NLTK</p></li><li><p class="">Creating a simple Power BI dashboard to visualize/analyze some of the results</p></li><li><p class="">Discussing some future improvements that can be made</p></li></ul><h3>Setting up your project with Google:</h3><p class="">To begin, you’ll have to set up a project with Google so as to activate the API, get your API Keys, configure OAuth, etc.</p><ol data-rte-list="default"><li><p class="">If you don’t have an account with Google/Google Cloud, sign up for a Google account</p></li><li><p class="">Sign in to Google Cloud</p></li><li><p class="">Create the project and name it according to your needs</p></li><li><p class="">Once the project has been created, search for ‘youtube’ in the search bar at the top of the Google Cloud Workspace</p></li><li><p class="">One of the top results will be: <strong>YouTube Data API v3</strong></p></li><li><p class="">Select the API to move forward</p></li><li><p class="">Enable the API for your project and move on to the next screen</p></li><li><p class="">At the top right, click the button to create credentials</p></li><li><p class="">Choose the YouTube Data API v3 from the list under the question: “Which API are you using?”</p></li><li><p class="">For the next question: “Where will you be calling the API from?” use the answer: “Other UI (e.g. 
Windows, CLI tool)”</p></li><li><p class="">For the next question: “What data will you be accessing?” use the answer: “Public Data”</p><ol data-rte-list="default"><li><p class="">Unless, of course, you know the app will need personal information. For this use case, we will not need to use personal data. Plus, there are extra levels of scrutiny involved with accessing personal data</p></li></ol></li><li><p class="">After proceeding to the next page, you will be presented with an API Key. Make sure to save this key.</p></li><li><p class="">Next, click on the ‘CREATE CREDENTIALS’ button. You’ll want to create an OAuth Client ID</p></li><li><p class="">Click on ‘Configure Consent Screen’</p></li><li><p class="">I chose an External User Type and selected Create</p></li><li><p class="">Fill out the form and continue</p></li><li><p class="">Add the first three scopes. If you are only going after the comments, you won’t need to add most of these APIs. You will not need to add sensitive or restricted scopes</p></li><li><p class="">Add users as necessary</p></li><li><p class="">I published my app in the following screens because it seemed as though there were issues when running code if the project was not published. We are only testing anyway, so there is little risk of huge impacts</p></li><li><p class="">Under the Credentials tab, create an OAuth client ID</p></li><li><p class="">Set the application type to ‘desktop app’ and name the client any way you want</p></li><li><p class="">Save your client ID and Client Secret</p></li><li><p class="">Download the .json file from the ‘OAuth 2.0 Client IDs’ and save the file in the same directory that you will be developing within</p></li><li><p class="">Optional: Add a service account</p></li></ol><p class="">It takes a bit of time to set up the API and OAuth, but it is worth it in the end. 
There are tons of other walkthroughs on YouTube if you get stuck.</p><h3>Starting to Build the Script:</h3><p class="">Import the libraries and define some variables/functions that enable OAuth to operate:</p><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">os</span>
<span class="cm-keyword">import</span> <span class="cm-def">numpy</span> <span class="cm-keyword">as</span> <span class="cm-def">np</span>
<span class="cm-keyword">import</span> <span class="cm-def">re</span>

<span class="cm-variable">CLIENT_SECRETS_FILE</span> <span class="cm-operator">=</span> <span class="cm-string">"client_secret_2.json"</span>

<span class="cm-variable">SCOPES</span> <span class="cm-operator">=</span> [<span class="cm-string">'https://www.googleapis.com/auth/youtube.force-ssl'</span>]
<span class="cm-variable">API_SERVICE_NAME</span> <span class="cm-operator">=</span> <span class="cm-string">'youtube'</span>
<span class="cm-variable">API_VERSION</span> <span class="cm-operator">=</span> <span class="cm-string">'v3'</span>

<span class="cm-keyword">import</span> <span class="cm-def">google</span>.<span class="cm-variable">oauth2</span>.<span class="cm-property">credentials</span>
 
<span class="cm-variable">from</span> <span class="cm-variable">googleapiclient</span>.<span class="cm-property">discovery</span> <span class="cm-keyword">import</span> <span class="cm-def">build</span>
<span class="cm-keyword">from</span> <span class="cm-variable">googleapiclient</span>.<span class="cm-property">errors</span> <span class="cm-keyword">import</span> <span class="cm-def">HttpError</span>
<span class="cm-keyword">from</span> <span class="cm-variable">google_auth_oauthlib</span>.<span class="cm-property">flow</span> <span class="cm-keyword">import</span> <span class="cm-def">InstalledAppFlow</span>
<span class="cm-keyword">from</span> <span class="cm-variable">google</span>.<span class="cm-property">auth</span>.<span class="cm-property">transport</span>.<span class="cm-property">requests</span> <span class="cm-keyword">import</span> <span class="cm-def">Request</span>

<span class="cm-keyword">import</span> <span class="cm-def">pickle</span>

<span class="cm-variable">def</span> <span class="cm-variable">get_authenticated_service</span>():
    <span class="cm-variable">credentials</span> <span class="cm-operator">=</span> <span class="cm-variable">None</span>
    <span class="cm-keyword">if</span> <span class="cm-variable">os</span>.<span class="cm-property">path</span>.<span class="cm-property">exists</span>(<span class="cm-string">'token.pickle'</span>):
        <span class="cm-keyword">with</span> <span class="cm-variable">open</span>(<span class="cm-string">'token.pickle'</span>, <span class="cm-string">'rb'</span>) <span class="cm-variable">as</span> <span class="cm-variable">token</span>:
            <span class="cm-variable">credentials</span> <span class="cm-operator">=</span> <span class="cm-variable">pickle</span>.<span class="cm-property">load</span>(<span class="cm-variable">token</span>)
    <span class="cm-error">#  Check if the credentials are invalid or do not exist</span>
    <span class="cm-keyword">if</span> <span class="cm-variable">not</span> <span class="cm-variable">credentials</span> <span class="cm-variable">or</span> <span class="cm-variable">not</span> <span class="cm-variable">credentials</span>.<span class="cm-property">valid</span>:
        <span class="cm-error"># Check if the credentials have expired</span>
        <span class="cm-keyword">if</span> <span class="cm-variable">credentials</span> <span class="cm-variable">and</span> <span class="cm-variable">credentials</span>.<span class="cm-property">expired</span> <span class="cm-variable">and</span> <span class="cm-variable">credentials</span>.<span class="cm-property">refresh_token</span>:
            <span class="cm-variable">credentials</span>.<span class="cm-property">refresh</span>(<span class="cm-variable">Request</span>())
        <span class="cm-keyword">else</span>:
            <span class="cm-variable">flow</span> <span class="cm-operator">=</span> <span class="cm-variable">InstalledAppFlow</span>.<span class="cm-property">from_client_secrets_file</span>(
                <span class="cm-variable">CLIENT_SECRETS_FILE</span>, <span class="cm-variable">SCOPES</span>)
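            <span class="cm-error"># Open a browser window for consent and listen on a free local port</span>
            <span class="cm-variable">credentials</span> <span class="cm-operator">=</span> <span class="cm-variable">flow</span>.<span class="cm-property">run_local_server</span>(<span class="cm-variable">port</span><span class="cm-operator">=</span><span class="cm-number">0</span>)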
 
        <span class="cm-error"># Save the credentials for the next run</span>
        <span class="cm-keyword">with</span> <span class="cm-variable">open</span>(<span class="cm-string">'token.pickle'</span>, <span class="cm-string">'wb'</span>) <span class="cm-variable">as</span> <span class="cm-variable">token</span>:
            <span class="cm-variable">pickle</span>.<span class="cm-property">dump</span>(<span class="cm-variable">credentials</span>, <span class="cm-variable">token</span>)
 
    <span class="cm-keyword">return</span> <span class="cm-variable">build</span>(<span class="cm-variable">API_SERVICE_NAME</span>, <span class="cm-variable">API_VERSION</span>, <span class="cm-variable">credentials</span> <span class="cm-operator">=</span> <span class="cm-variable">credentials</span>)
 
<span class="cm-keyword">if</span> <span class="cm-variable">__name__</span> <span class="cm-operator">==</span> <span class="cm-string">'__main__'</span>:
    <span class="cm-variable">os</span>.<span class="cm-property">environ</span>[<span class="cm-string">'OAUTHLIB_INSECURE_TRANSPORT'</span>] <span class="cm-operator">=</span> <span class="cm-string">'1'</span>
    <span class="cm-variable">service</span> <span class="cm-operator">=</span> <span class="cm-variable">get_authenticated_service</span>()</pre><p class="">A key part of this block of code is saving the credentials for the authenticated service. You do not want to have to re-authenticate EVERY single time you run the script, and storing the credentials in token.pickle avoids that.</p><pre class="source-code"><span class="cm-variable">api_key</span> <span class="cm-operator">=</span> <span class="cm-string">'&lt;API KEY&gt;'</span>
<span class="cm-variable">youtube</span><span class="cm-operator">=</span><span class="cm-variable">build</span>(<span class="cm-string">'youtube'</span>,<span class="cm-string">'v3'</span>,<span class="cm-variable">developerKey</span><span class="cm-operator">=</span><span class="cm-variable">api_key</span>)</pre><p class="">Define your API key in these lines of code. The build() call returns a client object, a ‘key’ as it were, that is then used to call the different parts of the API such as comments and search.</p><h3>Using the Search Function to get the Channels we will analyze:</h3><p class="">The API’s search method returns the same kind of results as a search on YouTube itself. You should familiarize yourself with the documentation: https://developers.google.com/youtube/v3/getting-started</p><p class="">We will not be exploring the API functions in depth in this article. Here is an example of using the search method of the API:</p><pre class="source-code"><span class="cm-variable">snippets</span> <span class="cm-operator">=</span> <span class="cm-variable">youtube</span>.<span class="cm-property">search</span>().<span class="cm-property">list</span>(<span class="cm-variable">part</span><span class="cm-operator">=</span><span class="cm-string">'id,snippet'</span>,<span class="cm-variable">type</span><span class="cm-operator">=</span><span class="cm-string">'channel'</span>,<span class="cm-variable">q</span><span class="cm-operator">=</span><span class="cm-string">'t.rex arms'</span>).<span class="cm-property">execute</span>()

<span class="cm-variable">channelId</span> <span class="cm-operator">=</span> <span class="cm-variable">snippets</span>[<span class="cm-string">'items'</span>][<span class="cm-number">0</span>][<span class="cm-string">'snippet'</span>][<span class="cm-string">'channelId'</span>]
<span class="cm-variable">print</span>(<span class="cm-variable">channelId</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-variable">UCU</span><span class="cm-operator">-</span><span class="cm-variable">ljC8EvKZFhJ</span><span class="cm-operator">-</span><span class="cm-variable">pct_5rMQ</span></pre><p class="">I cut a bit of a corner here. I knew that the channel Id I wanted existed at the 0 index. This Channel Id will be used to kick off the rest of the script.  The concept is as follows:</p><p class="">Search YouTube for a Channel(s) (the Seed) &gt;&gt; Extract the Stats and the Playlist Id from channels.list() &gt;&gt; Get the list of videos and their Ids from the playlistItems() &gt;&gt; Use the channel/video ids to get the comments from each video. The next lines of code will demonstrate the results that the YouTube API will return:</p><pre class="source-code"><span class="cm-variable">stats</span> <span class="cm-operator">=</span> <span class="cm-variable">youtube</span>.<span class="cm-property">channels</span>().<span class="cm-property">list</span>(<span class="cm-variable">part</span><span class="cm-operator">=</span><span class="cm-string">'statistics'</span>,<span class="cm-variable">id</span><span class="cm-operator">=</span><span class="cm-variable">channelId</span>).<span class="cm-property">execute</span>()
<span class="cm-variable">stats</span>[<span class="cm-string">'items'</span>]

<span class="cm-operator">&gt;&gt;</span> [{<span class="cm-string cm-property">'kind'</span>: <span class="cm-string">'youtube#channel'</span>,
  <span class="cm-string cm-property">'etag'</span>: <span class="cm-string">'6p3MzT5MtiAPsl3LjZUa1Jrfp78'</span>,
  <span class="cm-string cm-property">'id'</span>: <span class="cm-string">'UCU-ljC8EvKZFhJ-pct_5rMQ'</span>,
  <span class="cm-string cm-property">'statistics'</span>: {<span class="cm-string cm-property">'viewCount'</span>: <span class="cm-string">'103419822'</span>,
   <span class="cm-string cm-property">'subscriberCount'</span>: <span class="cm-string">'975000'</span>,
   <span class="cm-string cm-property">'hiddenSubscriberCount'</span>: <span class="cm-variable">False</span>,
   <span class="cm-string cm-property">'videoCount'</span>: <span class="cm-string">'145'</span>}}]

<span class="cm-variable">content</span> <span class="cm-operator">=</span> <span class="cm-variable">youtube</span>.<span class="cm-property">channels</span>().<span class="cm-property">list</span>(<span class="cm-variable">id</span> <span class="cm-operator">=</span> <span class="cm-variable">channelId</span>, <span class="cm-variable">part</span><span class="cm-operator">=</span><span class="cm-string">'contentDetails'</span>).<span class="cm-property">execute</span>()
<span class="cm-variable">content</span>[<span class="cm-string">'items'</span>]

<span class="cm-operator">&gt;&gt;</span> [{<span class="cm-string cm-property">'kind'</span>: <span class="cm-string">'youtube#channel'</span>,
  <span class="cm-string cm-property">'etag'</span>: <span class="cm-string">'NHEVnfNtoeJIhQaZFf68M1xiH9c'</span>,
  <span class="cm-string cm-property">'id'</span>: <span class="cm-string">'UCU-ljC8EvKZFhJ-pct_5rMQ'</span>,
  <span class="cm-string cm-property">'contentDetails'</span>: {<span class="cm-string cm-property">'relatedPlaylists'</span>: {<span class="cm-string cm-property">'likes'</span>: <span class="cm-string">''</span>,
    <span class="cm-string cm-property">'favorites'</span>: <span class="cm-string">''</span>,
    <span class="cm-string cm-property">'uploads'</span>: <span class="cm-string">'UUU-ljC8EvKZFhJ-pct_5rMQ'</span>}}}]

<span class="cm-variable">uploadId</span> <span class="cm-operator">=</span> <span class="cm-variable">content</span>[<span class="cm-string">'items'</span>][<span class="cm-number">0</span>][<span class="cm-string">'contentDetails'</span>][<span class="cm-string">'relatedPlaylists'</span>][<span class="cm-string">'uploads'</span>]
<span class="cm-variable">uploadId</span>

<span class="cm-operator">&gt;&gt;</span> <span class="cm-string">'UUU-ljC8EvKZFhJ-pct_5rMQ'</span></pre><p class="">After getting the id of the uploads playlist, we can fetch the videos in that playlist. If there were more than one playlist, you could simply write the playlist ids to an empty list and loop through all of them to get the videos. Next, we get the videos from the playlist:</p><pre class="source-code"><span class="cm-error"># Start with an empty result list and no page token (first page)</span>
<span class="cm-variable">allVideos</span> <span class="cm-operator">=</span> []
<span class="cm-variable">nextPage_token</span> <span class="cm-operator">=</span> <span class="cm-variable">None</span>
<span class="cm-keyword">while</span> <span class="cm-number">1</span>:
    <span class="cm-variable">res</span><span class="cm-operator">=</span><span class="cm-variable">youtube</span>.<span class="cm-property">playlistItems</span>().<span class="cm-property">list</span>(<span class="cm-variable">playlistId</span><span class="cm-operator">=</span><span class="cm-variable">uploadId</span>,<span class="cm-variable">maxResults</span><span class="cm-operator">=</span><span class="cm-number">50</span>,<span class="cm-variable">part</span><span class="cm-operator">=</span><span class="cm-string">'snippet'</span>,<span class="cm-variable">pageToken</span><span class="cm-operator">=</span><span class="cm-variable">nextPage_token</span>).<span class="cm-property">execute</span>()
    <span class="cm-variable">allVideos</span> <span class="cm-operator">+=</span> <span class="cm-variable">res</span>[<span class="cm-string">'items'</span>]
    <span class="cm-variable">nextPage_token</span> <span class="cm-operator">=</span> <span class="cm-variable">res</span>.<span class="cm-property">get</span>(<span class="cm-string">'nextPageToken'</span>)
    <span class="cm-keyword">if</span> <span class="cm-variable">nextPage_token</span> <span class="cm-variable">is</span> <span class="cm-variable">None</span>:
        <span class="cm-keyword">break</span>

<span class="cm-variable">video_ids</span><span class="cm-operator">=</span>[]
<span class="cm-variable">channelId</span> <span class="cm-operator">=</span> []
<span class="cm-keyword">for</span> <span class="cm-variable">i</span> <span class="cm-keyword">in</span> <span class="cm-variable">range</span>(<span class="cm-number">0</span>,<span class="cm-number">143</span>):
    <span class="cm-variable">video_ids</span>.<span class="cm-property">append</span>(<span class="cm-variable">allVideos</span>[<span class="cm-variable">i</span>][<span class="cm-string">'snippet'</span>][<span class="cm-string">'resourceId'</span>][<span class="cm-string">'videoId'</span>])
    <span class="cm-variable">channelId</span>.<span class="cm-property">append</span>(<span class="cm-variable">allVideos</span>[<span class="cm-variable">i</span>][<span class="cm-string">'snippet'</span>][<span class="cm-string">'channelId'</span>])

<span class="cm-variable">stats</span> <span class="cm-operator">=</span> []
<span class="cm-keyword">for</span> <span class="cm-variable">i</span> <span class="cm-keyword">in</span> <span class="cm-variable">range</span>(<span class="cm-number">0</span>,<span class="cm-variable">len</span>(<span class="cm-variable">video_ids</span>),<span class="cm-number">40</span>):
    <span class="cm-variable">res</span> <span class="cm-operator">=</span> (<span class="cm-variable">youtube</span>).<span class="cm-property">videos</span>().<span class="cm-property">list</span>(<span class="cm-variable">id</span><span class="cm-operator">=</span><span class="cm-string">','</span>.<span class="cm-property">join</span>(<span class="cm-variable">video_ids</span>[<span class="cm-variable">i</span>:<span class="cm-variable">i</span><span class="cm-operator">+</span><span class="cm-number">40</span>]),<span class="cm-variable">part</span><span class="cm-operator">=</span><span class="cm-string">'statistics'</span>).<span class="cm-property">execute</span>()
    <span class="cm-variable">stats</span> <span class="cm-operator">+=</span> <span class="cm-variable">res</span>[<span class="cm-string">'items'</span>]</pre><p class="">The while loop grabs any and all videos, one page of up to 50 at a time. Depending on your own use case, there might be a need to stop after a set number of calls. Remember, the default quota is only 10,000 units per day, and different calls cost different numbers of units. The two other blocks simply append data to lists for post-processing later. I would probably not hard-code a range in production-level code since playlists will have different numbers of videos.</p><p class="">Next, we deconstruct the results of the other lists into separate ‘columns’ to be used in a dataframe/table (the code that fills lists such as title and views is not shown here):</p><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">pandas</span> <span class="cm-keyword">as</span> <span class="cm-def">pd</span>
<span class="cm-variable">data</span><span class="cm-operator">=</span>{<span class="cm-string cm-property">'title'</span>:<span class="cm-variable">title</span>,<span class="cm-string cm-property">'video_id'</span>:<span class="cm-variable">videoid</span>,<span class="cm-string cm-property">'video_description'</span>:<span class="cm-variable">video_description</span>,<span class="cm-string cm-property">'publishedDate'</span>:<span class="cm-variable">publishedDate</span>,<span class="cm-string cm-property">'likes'</span>:<span class="cm-variable">liked</span>,<span class="cm-string cm-property">'dislikes'</span>:<span class="cm-variable">disliked</span>,<span class="cm-string cm-property">'views'</span>:<span class="cm-variable">views</span>,<span class="cm-string cm-property">'comment_count'</span>:<span class="cm-variable">comment</span>}
<span class="cm-variable">df</span><span class="cm-operator">=</span><span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">data</span>)
<span class="cm-variable">df</span>.<span class="cm-property">head</span>()</pre><p class="">We go after the comments with the following lines of code:</p><pre class="source-code"><span class="cm-variable">channelId</span> <span class="cm-operator">=</span> <span class="cm-variable">list</span>(<span class="cm-variable">set</span>(<span class="cm-variable">channelId</span>))
<span class="cm-variable">allComments</span> <span class="cm-operator">=</span> []
<span class="cm-variable">video_id_pop</span> <span class="cm-operator">=</span> []
<span class="cm-variable">channel_id_pop</span> <span class="cm-operator">=</span> []
<span class="cm-variable">video_title_pop</span> <span class="cm-operator">=</span> []
<span class="cm-variable">video_desc_pop</span> <span class="cm-operator">=</span> []
<span class="cm-variable">comments_pop</span> <span class="cm-operator">=</span> []
<span class="cm-variable">comment_id_pop</span> <span class="cm-operator">=</span> []
<span class="cm-variable">reply_count_pop</span> <span class="cm-operator">=</span> []
<span class="cm-variable">like_count_pop</span> <span class="cm-operator">=</span> []

<span class="cm-keyword">for</span> <span class="cm-variable">channel</span> <span class="cm-keyword">in</span> <span class="cm-variable">channelId</span>:
    <span class="cm-variable">res</span><span class="cm-operator">=</span><span class="cm-variable">youtube</span>.<span class="cm-property">commentThreads</span>().<span class="cm-property">list</span>(<span class="cm-variable">allThreadsRelatedToChannelId</span><span class="cm-operator">=</span><span class="cm-variable">channel</span>,
                                      <span class="cm-variable">part</span><span class="cm-operator">=</span><span class="cm-string">'id,snippet'</span>,
                                      <span class="cm-variable">maxResults</span><span class="cm-operator">=</span><span class="cm-number">100</span>).<span class="cm-property">execute</span>()

    <span class="cm-keyword">try</span>:
        <span class="cm-variable">nextPageToken</span> <span class="cm-operator">=</span> <span class="cm-variable">res</span>[<span class="cm-string">'nextPageToken'</span>]

    <span class="cm-keyword">except</span> (<span class="cm-variable">KeyError</span>, <span class="cm-variable">TypeError</span>):
        <span class="cm-variable">nextPageToken</span> <span class="cm-operator">=</span> <span class="cm-variable">None</span>
    
    <span class="cm-variable">comments_temp</span> <span class="cm-operator">=</span> []
    <span class="cm-variable">comment_id_temp</span> <span class="cm-operator">=</span> []
    <span class="cm-variable">reply_count_temp</span> <span class="cm-operator">=</span> []
    <span class="cm-variable">like_count_temp</span> <span class="cm-operator">=</span> []
    <span class="cm-variable">channel_id_temp</span> <span class="cm-operator">=</span> []
    <span class="cm-variable">video_id_temp</span> <span class="cm-operator">=</span> []

    <span class="cm-keyword">for</span> <span class="cm-variable">item</span> <span class="cm-keyword">in</span> <span class="cm-variable">res</span>[<span class="cm-string">'items'</span>]:
        <span class="cm-variable">allComments</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>)
        <span class="cm-variable">comments_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'topLevelComment'</span>][<span class="cm-string">'snippet'</span>][<span class="cm-string">'textDisplay'</span>])
        <span class="cm-variable">comment_id_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'topLevelComment'</span>][<span class="cm-string">'id'</span>])
        <span class="cm-variable">reply_count_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'totalReplyCount'</span>])
        <span class="cm-variable">like_count_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'topLevelComment'</span>][<span class="cm-string">'snippet'</span>][<span class="cm-string">'likeCount'</span>])
        <span class="cm-variable">channel_id_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'channelId'</span>])
        <span class="cm-variable">video_id_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'videoId'</span>])

    <span class="cm-variable">comments_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">comments_temp</span>)
    <span class="cm-variable">comment_id_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">comment_id_temp</span>)
    <span class="cm-variable">reply_count_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">reply_count_temp</span>)
    <span class="cm-variable">like_count_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">like_count_temp</span>)
    <span class="cm-variable">channel_id_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">channel_id_temp</span>)
    <span class="cm-variable">video_id_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">video_id_temp</span>)
    
    <span class="cm-keyword">while</span> (<span class="cm-variable">nextPageToken</span>):
        <span class="cm-keyword">try</span>:
            <span class="cm-variable">res</span><span class="cm-operator">=</span><span class="cm-variable">youtube</span>.<span class="cm-property">commentThreads</span>().<span class="cm-property">list</span>(<span class="cm-variable">allThreadsRelatedToChannelId</span><span class="cm-operator">=</span><span class="cm-variable">channel</span>,
                                      <span class="cm-variable">part</span><span class="cm-operator">=</span><span class="cm-string">'id,snippet'</span>,
                                      <span class="cm-variable">maxResults</span><span class="cm-operator">=</span><span class="cm-number">100</span>,<span class="cm-variable">pageToken</span><span class="cm-operator">=</span><span class="cm-variable">nextPageToken</span>).<span class="cm-property">execute</span>()
            
            <span class="cm-variable">comments_temp</span> <span class="cm-operator">=</span> []
            <span class="cm-variable">comment_id_temp</span> <span class="cm-operator">=</span> []
            <span class="cm-variable">reply_count_temp</span> <span class="cm-operator">=</span> []
            <span class="cm-variable">like_count_temp</span> <span class="cm-operator">=</span> []
            <span class="cm-variable">channel_id_temp</span> <span class="cm-operator">=</span> []
            <span class="cm-variable">video_id_temp</span> <span class="cm-operator">=</span> []

            <span class="cm-keyword">for</span> <span class="cm-variable">item</span> <span class="cm-keyword">in</span> <span class="cm-variable">res</span>[<span class="cm-string">'items'</span>]:
                <span class="cm-variable">allComments</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>)
                <span class="cm-variable">comments_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'topLevelComment'</span>][<span class="cm-string">'snippet'</span>][<span class="cm-string">'textDisplay'</span>])
                <span class="cm-variable">comment_id_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'topLevelComment'</span>][<span class="cm-string">'id'</span>])
                <span class="cm-variable">reply_count_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'totalReplyCount'</span>])
                <span class="cm-variable">like_count_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'topLevelComment'</span>][<span class="cm-string">'snippet'</span>][<span class="cm-string">'likeCount'</span>])
                <span class="cm-variable">channel_id_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'channelId'</span>])
                <span class="cm-variable">video_id_temp</span>.<span class="cm-property">append</span>(<span class="cm-variable">item</span>[<span class="cm-string">'snippet'</span>][<span class="cm-string">'videoId'</span>])

            <span class="cm-variable">comments_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">comments_temp</span>)
            <span class="cm-variable">comment_id_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">comment_id_temp</span>)
            <span class="cm-variable">reply_count_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">reply_count_temp</span>)
            <span class="cm-variable">like_count_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">like_count_temp</span>)
            <span class="cm-variable">channel_id_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">channel_id_temp</span>)
            <span class="cm-variable">video_id_pop</span>.<span class="cm-property">extend</span>(<span class="cm-variable">video_id_temp</span>)
            
            <span class="cm-variable">nextPageToken</span> <span class="cm-operator">=</span> <span class="cm-variable">res</span>[<span class="cm-string">'nextPageToken'</span>]
            
        <span class="cm-keyword">except</span> <span class="cm-variable">KeyError</span>:
            <span class="cm-keyword">break</span>

<span class="cm-variable">data_threads</span><span class="cm-operator">=</span>{<span class="cm-string cm-property">'comment'</span>:<span class="cm-variable">comments_pop</span>,<span class="cm-string cm-property">'comment_id'</span>:<span class="cm-variable">comment_id_pop</span>,<span class="cm-string cm-property">'reply_count'</span>:<span class="cm-variable">reply_count_pop</span>,<span class="cm-string cm-property">'like_count'</span>:<span class="cm-variable">like_count_pop</span>,<span class="cm-string cm-property">'channel_id'</span>:<span class="cm-variable">channel_id_pop</span>,<span class="cm-string cm-property">'video_id'</span>:<span class="cm-variable">video_id_pop</span>}
<span class="cm-variable">threads</span><span class="cm-operator">=</span><span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">data_threads</span>)
<span class="cm-variable">threads</span>.<span class="cm-property">head</span>()</pre><p class="">The code above looks complicated, but there isn’t much to it. The YouTube API can return all the comment threads related to a channel ID (the allThreadsRelatedToChannelId parameter used above). Because the results are paginated, you need to pass the nextPageToken back in and loop through the pages until none is returned. In some applications, you may want to cut the calls off early, especially if the channel has a lot of engagement.</p><p class="">We’ll cover data cleaning, feature engineering, use of NLTK/VADER for sentiment analysis, and a simple dashboard in PowerBI in the next article!</p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1615570621511-IHVIPJTFUQ8Z6V6SFVE9/unsplash-image-NmGzVG5Wsg8.jpg?format=1500w" medium="image" isDefault="true" width="1500" height="1000"><media:title type="plain">Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 1)</media:title></media:content></item><item><title>Sales Forecasting: Predict Your Sales Cycle Using Machine Learning</title><dc:creator>Tyler Betthauser</dc:creator><pubDate>Sat, 27 Feb 2021 23:38:25 +0000</pubDate><link>https://www.conaxon.org/projects/sales-forecasting-predict-the-sales-cycle-using-machine-learning</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:60382f8fb3f1934ca14863b2</guid><description><![CDATA[We conceptualize a methodology for predicting the length of time it takes 
for the sales cycle to complete.]]></description><content:encoded><![CDATA[<h3>Business Context:</h3><p class="">There is no shortage of methods to forecast sales. To demonstrate one of those methods, we look back at the O-List data from Kaggle. The methodology we focus on analyzes the sales cycle to predict how long a sales lead might take to close. So, not only will we be able to predict whether a lead will close, but also how long it might take to close the deal.</p><p class="">The benefits of sales forecasting are pretty straightforward:</p><ul data-rte-list="default"><li><p class="">Improved financial planning</p></li><li><p class="">More precise workload balance at each level of the organization</p></li><li><p class="">Better insights into velocity or growth</p></li></ul><p class="">Note that this type of sales forecasting might not work for every business model, or for every data model for that matter.</p><p class="">This post covers the following:</p><ul data-rte-list="default"><li><p class="">Feature Engineering</p></li><li><p class="">Data Quality Improvements</p></li><li><p class="">Testing various models</p></li></ul><h3>Libraries &amp; Reading in the Data:</h3><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">pandas</span> <span class="cm-keyword">as</span> <span class="cm-def">pd</span>
<span class="cm-keyword">import</span> <span class="cm-def">seaborn</span> <span class="cm-keyword">as</span> <span class="cm-def">sns</span>
<span class="cm-keyword">import</span> <span class="cm-def">matplotlib</span>.<span class="cm-property">pyplot</span> <span class="cm-keyword">as</span> <span class="cm-def">plt</span>
<span class="cm-keyword">import</span> <span class="cm-def">numpy</span> <span class="cm-keyword">as</span> <span class="cm-def">np</span>
<span class="cm-keyword">from</span> <span class="cm-variable">datetime</span> <span class="cm-keyword">import</span> <span class="cm-def">date</span>
<span class="cm-keyword">from</span> <span class="cm-variable">datetime</span> <span class="cm-keyword">import</span> <span class="cm-def">datetime</span>
<span class="cm-keyword">import</span> <span class="cm-def">xgboost</span> <span class="cm-keyword">as</span> <span class="cm-def">xgb</span>
<span class="cm-keyword">from</span> <span class="cm-variable">xgboost</span> <span class="cm-keyword">import</span> <span class="cm-def">XGBClassifier</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">StratifiedKFold</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">roc_auc_score</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">RandomizedSearchCV</span>, <span class="cm-def">GridSearchCV</span>
<span class="cm-keyword">from</span> <span class="cm-variable">tensorflow</span> <span class="cm-keyword">import</span> <span class="cm-def">keras</span></pre><pre class="source-code"><span class="cm-variable">closed_deals</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">read_csv</span>(<span class="cm-string">'olist_closed_deals_dataset.csv'</span>)
<span class="cm-variable">olist_leads</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">read_csv</span>(<span class="cm-string">'olist_marketing_qualified_leads_dataset.csv'</span>)</pre><p class="">Next, we merge the closed deals onto the marketing qualified leads dataset to build the funnel:</p><pre class="source-code"><span class="cm-variable">funnel</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">merge</span>(<span class="cm-variable">olist_leads</span>,<span class="cm-variable">closed_deals</span>,<span class="cm-variable">how</span><span class="cm-operator">=</span><span class="cm-string">'left'</span>,<span class="cm-variable">on</span><span class="cm-operator">=</span><span class="cm-string">'mql_id'</span>)</pre><h3>Data Cleaning and Feature Engineering</h3><p class="">In the next section, we focus on some initial cleaning and feature engineering on the dataset. One thing to note with this investigation: not many leads actually closed. So, we built some simulated data to increase the amount of data that a model could be trained on.</p><pre class="source-code"><span class="cm-variable">funnel</span>[<span class="cm-string">'won_date'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>[<span class="cm-string">'won_date'</span>].<span class="cm-property">astype</span>(<span class="cm-string">'datetime64[ns]'</span>)
<span class="cm-variable">funnel</span>[<span class="cm-string">'first_contact_date'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>[<span class="cm-string">'first_contact_date'</span>].<span class="cm-property">astype</span>(<span class="cm-string">'datetime64[ns]'</span>)</pre><pre class="source-code"><span class="cm-error"># drop dimensions that are unlikely to be collected reliably within the business process</span>
<span class="cm-variable">funnel</span>.<span class="cm-property">drop</span>([<span class="cm-string">'declared_monthly_revenue'</span>,<span class="cm-string">'declared_product_catalog_size'</span>,<span class="cm-string">'average_stock'</span>,<span class="cm-string">'has_company'</span>,<span class="cm-string">'seller_id'</span>,<span class="cm-string">'mql_id'</span>],<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)</pre><p class="">An unfortunate deficiency of the O-List data is that there is no reliable source of revenue data per lead. While we can still complete the task, the case study would be closer to the real world if there were more samples with richer context. Next, we create a copy of the dataframe and add some time-based features:</p><pre class="source-code"><span class="cm-variable">funnel_model</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>.<span class="cm-property">copy</span>(<span class="cm-variable">deep</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)

<span class="cm-variable">funnel_model</span>[<span class="cm-string">'contact_day'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'first_contact_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">strftime</span>(<span class="cm-string">'%d'</span>)
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'contact_month'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'first_contact_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">strftime</span>(<span class="cm-string">'%m'</span>)
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'contact_year'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'first_contact_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">year</span></pre><p class="">Intuitively, the contact date information should help predict how long a deal takes to close. These features will be especially important if there is any seasonality present. Since we only have a year’s worth of data in the set, it would be tough to make a judgement on seasonality. The most important part of the data prep is addressing the NAs that exist:</p><pre class="source-code"><span class="cm-error">#count missing values (NAs)</span>
<span class="cm-variable">missing_count</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">isna</span>().<span class="cm-property">sum</span>(),<span class="cm-variable">columns</span><span class="cm-operator">=</span>[<span class="cm-string">'Number'</span>])
<span class="cm-variable">missing_count</span>[<span class="cm-string">'Percentage'</span>] <span class="cm-operator">=</span> <span class="cm-variable">round</span>(<span class="cm-variable">missing_count</span>[<span class="cm-string">'Number'</span>] <span class="cm-operator">/</span> <span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>),<span class="cm-number">2</span>) <span class="cm-operator">*</span> <span class="cm-number">100</span>
<span class="cm-variable">missing_count</span>
</pre>
        <figure class="sqs-block-image-figure intrinsic">
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1614312964353-C23J6CF4TPGH2VGIEK83/nas.JPG" data-image-dimensions="472x610" data-image-focal-point="0.5,0.5" alt="nas.JPG" data-load="false" data-image-id="6038760452c8593dd6d9b8fe" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1614312964353-C23J6CF4TPGH2VGIEK83/nas.JPG?format=1000w" />
        </figure>
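<p class="">The missing-value summary above can also be wrapped in a small reusable helper. The following is a minimal sketch rather than the code used for this post: the toy dataframe and the <code>missing_summary</code> function name are made up for illustration, and it rounds the percentage itself to two decimals instead of rounding the fraction first.</p>

```python
import pandas as pd

# Toy stand-in for funnel_model; the values are hypothetical.
toy = pd.DataFrame({
    "origin": ["social", None, "email", "social"],
    "lead_type": [None, None, "online_big", None],
    "contact_year": [2017, 2018, 2018, 2018],
})


def missing_summary(df):
    """Return the NA count and NA percentage for every column."""
    summary = pd.DataFrame({
        "Number": df.isna().sum(),
        # isna().mean() is the fraction of missing rows per column
        "Percentage": (df.isna().mean() * 100).round(2),
    })
    # Show the columns with the most missing values first
    return summary.sort_values("Number", ascending=False)


print(missing_summary(toy))
```

<p class="">Sorting by the count puts the dimensions that need the most attention at the top, which is handy when deciding which columns to fill and which to drop.</p>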
<p class="">The next block of code fills the NAs in the dimensions listed above. Quite simply, a list of unique values is pulled from each dimension, and values drawn at random from that list are applied wherever the record is NA. This approach was chosen to try to preserve as much of the data’s native distributions as possible, while also providing more data to train on. Note that because the draw is uniform over the unique values, rare categories are filled in as often as common ones.</p><pre class="source-code"><span class="cm-variable">origin_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"origin"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">origin_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">origin_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'origin'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'origin'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">origin_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">sdr_id_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"sdr_id"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">sdr_id_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">sdr_id_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'sdr_id'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'sdr_id'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">sdr_id_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">sr_id_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"sr_id"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">sr_id_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">sr_id_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'sr_id'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'sr_id'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">sr_id_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">bs_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"business_segment"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">bs_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">bs_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'business_segment'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'business_segment'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">bs_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">lead_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"lead_type"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">lead_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">lead_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'lead_type'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'lead_type'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">lead_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">lbp_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"lead_behaviour_profile"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">lbp_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">lbp_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'lead_behaviour_profile'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'lead_behaviour_profile'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">lbp_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">gtin_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"has_gtin"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">gtin_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">gtin_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'has_gtin'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'has_gtin'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">gtin_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">btype_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"business_type"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">btype_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">btype_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'business_type'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'business_type'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">btype_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))
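# Sketch only, not part of the original notebook. fillna() with a Series aligns on
# index labels, so if funnel_model's index is not exactly 0..n-1 (for example after
# rows were dropped), some NaN values can survive the fills above.
# fill_na_random is a hypothetical helper that fills by position instead.
import numpy as np
import pandas as pd

def fill_na_random(series, seed=None):
    rng = np.random.default_rng(seed)
    observed = series.dropna().drop_duplicates().to_numpy()
    filled = series.copy()
    missing = filled.isna().to_numpy()
    # draw one observed value per missing row, regardless of index labels
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

demo = pd.Series(['online', None, 'offline', None], index=[10, 11, 12, 13])
demo_filled = fill_na_random(demo, seed=0)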

<span class="cm-variable">dt_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"won_date"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_date'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_date'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">dt_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))</pre><p class="">The next block of code finishes off the feature engineering and data quality improvements:</p><pre class="source-code"><span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_day'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">strftime</span>(<span class="cm-string">'%d'</span>)
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_month'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">strftime</span>(<span class="cm-string">'%m'</span>)
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_year'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">year</span>

<span class="cm-variable">funnel_model</span>[<span class="cm-string">'time_to_close'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'won_date'</span>] <span class="cm-operator">-</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'first_contact_date'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'time_to_close'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'time_to_close'</span>].<span class="cm-property">dt</span>.<span class="cm-property">days</span>

<span class="cm-error">#we can create a combo feature now of sdr and sr</span>
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'sdr_sr'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'sdr_id'</span>] <span class="cm-operator">+</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'sr_id'</span>]</pre><p class="">The random fills above should populate our NA fields with values drawn from what was actually observed in each column, which is sufficient to demonstrate the use case. A few more lines of code clean up the dataset:</p><pre class="source-code"><span class="cm-error"># drop rows with suspect data, such as a negative time to close</span>
<span class="cm-variable">indexNames</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[ (<span class="cm-variable">funnel_model</span>[<span class="cm-string">'time_to_close'</span>] <span class="cm-operator">&lt;</span> <span class="cm-number">0</span>)].<span class="cm-property">index</span>
<span class="cm-variable">funnel_model</span>.<span class="cm-property">drop</span>(<span class="cm-variable">indexNames</span> , <span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)

<span class="cm-variable">funnel_model</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>.<span class="cm-property">dropna</span>(<span class="cm-variable">subset</span><span class="cm-operator">=</span>[<span class="cm-string">'won_date'</span>])

<span class="cm-variable">funnel_model</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>.<span class="cm-property">drop</span>([<span class="cm-string">'won_date'</span>,<span class="cm-string">'first_contact_date'</span>],<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>)</pre><h3>Model Development, Train/Test/Split, &amp; Defining X,y</h3><pre class="source-code"><span class="cm-variable">df</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>.<span class="cm-property">copy</span>(<span class="cm-variable">deep</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)

<span class="cm-error">#define the X, y variables</span>
<span class="cm-variable">X</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">get_dummies</span>(<span class="cm-variable">df</span>.<span class="cm-property">drop</span>(<span class="cm-string">'time_to_close'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>),<span class="cm-variable">drop_first</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">y</span> <span class="cm-operator">=</span> <span class="cm-variable">df</span>[<span class="cm-string">'time_to_close'</span>]

<span class="cm-error">#always split the data</span>
<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">train_test_split</span>
<span class="cm-variable">X_train</span>, <span class="cm-variable">X_test</span>, <span class="cm-variable">y_train</span>, <span class="cm-variable">y_test</span> <span class="cm-operator">=</span> <span class="cm-variable">train_test_split</span>(<span class="cm-variable">X</span>, <span class="cm-variable">y</span>, <span class="cm-variable">test_size</span><span class="cm-operator">=</span><span class="cm-number">0.33</span>, <span class="cm-variable">random_state</span><span class="cm-operator">=</span><span class="cm-number">101</span>)

<span class="cm-error"># prepare input data</span>
<span class="cm-variable">def</span> <span class="cm-variable">prepare_inputs</span>(<span class="cm-variable">X_train</span>, <span class="cm-variable">X_test</span>):
    <span class="cm-variable">ohe</span> <span class="cm-operator">=</span> <span class="cm-variable">OneHotEncoder</span>(<span class="cm-variable">handle_unknown</span><span class="cm-operator">=</span><span class="cm-string">'ignore'</span>)
    <span class="cm-variable">ohe</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train</span>)
    <span class="cm-variable">X_train_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">ohe</span>.<span class="cm-property">transform</span>(<span class="cm-variable">X_train</span>)
    <span class="cm-variable">X_test_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">ohe</span>.<span class="cm-property">transform</span>(<span class="cm-variable">X_test</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">X_train_enc</span>, <span class="cm-variable">X_test_enc</span>
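# Sketch with made-up data (not from the original dataset): what
# handle_unknown='ignore' does when the test split holds a category
# that the encoder never saw during fit.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_train = pd.DataFrame({'origin': ['email', 'social', 'email']})
demo_test = pd.DataFrame({'origin': ['social', 'referral']})
demo_ohe = OneHotEncoder(handle_unknown='ignore').fit(demo_train)
demo_encoded = demo_ohe.transform(demo_test).toarray()
# 'referral' was not in demo_train, so its encoded row is all zeros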

<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">preprocessing</span> <span class="cm-keyword">import</span> <span class="cm-def">OneHotEncoder</span>
<span class="cm-error"># prepare input data</span>
<span class="cm-variable">X_train_enc</span>, <span class="cm-variable">X_test_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">prepare_inputs</span>(<span class="cm-variable">X_train</span>, <span class="cm-variable">X_test</span>)</pre><p class="">In the code above, we one-hot encode the categorical variables. This is necessary because the algorithms we are going to use require numeric inputs, and one-hot encoding turns each category into its own binary column. I have found that it is best to split the data and THEN fit the encoder on the training split only, so that nothing about the test rows leaks into the encoding.</p><h3>Support Vector Regression Is Up First</h3><p class="">First, we use support vector regression to predict the sales cycle time. We had good accuracy with this algorithm in our classification exercise.</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">svm</span> <span class="cm-keyword">import</span> <span class="cm-def">SVR</span>,<span class="cm-def">LinearSVR</span>
<span class="cm-variable">base_model</span> <span class="cm-operator">=</span> <span class="cm-variable">SVR</span>()
<span class="cm-variable">base_model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>,<span class="cm-variable">y_train</span>)
<span class="cm-variable">base_preds</span> <span class="cm-operator">=</span> <span class="cm-variable">base_model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_enc</span>)
<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">mean_absolute_error</span>,<span class="cm-def">mean_squared_error</span>
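# Sketch: report_errors is a hypothetical helper (not from the original notebook)
# that bundles the three metrics this post keeps recomputing.
# MAE and RMSE are in days; MSE is in days squared.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def report_errors(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return {'MAE': mean_absolute_error(y_true, y_pred), 'MSE': mse, 'RMSE': float(np.sqrt(mse))}

# usage: report_errors(y_test, base_preds)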
<span class="cm-variable">mean_absolute_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">base_preds</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">73.15358191010293</span>
<span class="cm-variable">np</span>.<span class="cm-property">sqrt</span>(<span class="cm-variable">mean_squared_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">base_preds</span>))
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">93.54240378007681</span>
<span class="cm-variable">y_test</span>.<span class="cm-property">mean</span>()
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">112.10120240480961</span></pre><p class="">Some interesting things to note in this code:</p><ul data-rte-list="default"><li><p class="">MAE and RMSE are measured on the same scale as the target variable, which is days. So the baseline SVR model is off by about 73 days on average. Not ALL the predictions are off by 73 days, but an average miss of that size is not good!</p></li><li><p class="">Pay attention to the mean of y_test though: 112 days. The dataset itself has quite a bit of variation in the time to close, so an average error that is well below the y_test mean is actually a positive sign.</p></li></ul><p class="">Given this encouraging start, we can try to tune the hyper-parameters to improve accuracy:</p><pre class="source-code"><span class="cm-variable">param_grid</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">'C'</span>:[<span class="cm-number">0.001</span>,<span class="cm-number">0.01</span>,<span class="cm-number">0.1</span>,<span class="cm-number">0.5</span>,<span class="cm-number">1</span>],
             <span class="cm-string cm-property">'kernel'</span>:[<span class="cm-string">'linear'</span>,<span class="cm-string">'rbf'</span>,<span class="cm-string">'poly'</span>],
              <span class="cm-string cm-property">'gamma'</span>:[<span class="cm-string">'scale'</span>,<span class="cm-string">'auto'</span>],
              <span class="cm-string cm-property">'degree'</span>:[<span class="cm-number">2</span>,<span class="cm-number">3</span>,<span class="cm-number">4</span>],
              <span class="cm-string cm-property">'epsilon'</span>:[<span class="cm-number">0</span>,<span class="cm-number">0.01</span>,<span class="cm-number">0.1</span>,<span class="cm-number">0.5</span>,<span class="cm-number">1</span>,<span class="cm-number">2</span>]}
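# Sketch: count how many fits a grid implies before starting a long search.
# count_fits is a hypothetical helper, not part of scikit-learn.
from sklearn.model_selection import ParameterGrid

def count_fits(grid, cv):
    # number of parameter combinations times the number of CV folds
    return len(ParameterGrid(grid)) * cv

# usage: count_fits(param_grid, cv=3)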

<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">GridSearchCV</span>
<span class="cm-variable">svr</span> <span class="cm-operator">=</span> <span class="cm-variable">SVR</span>()
<span class="cm-variable">grid</span> <span class="cm-operator">=</span> <span class="cm-variable">GridSearchCV</span>(<span class="cm-variable">svr</span>,<span class="cm-variable">param_grid</span><span class="cm-operator">=</span><span class="cm-variable">param_grid</span>,<span class="cm-variable">cv</span><span class="cm-operator">=</span><span class="cm-number">3</span>, <span class="cm-variable">n_jobs</span><span class="cm-operator">=</span><span class="cm-operator">-</span><span class="cm-number">1</span>)
<span class="cm-variable">grid</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>,<span class="cm-variable">y_train</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-variable">SVR</span>(<span class="cm-variable">C</span><span class="cm-operator">=</span><span class="cm-number">1</span>, <span class="cm-variable">degree</span><span class="cm-operator">=</span><span class="cm-number">2</span>, <span class="cm-variable">epsilon</span><span class="cm-operator">=</span><span class="cm-number">2</span>, <span class="cm-variable">kernel</span><span class="cm-operator">=</span><span class="cm-string">'linear'</span>)
<span class="cm-variable">grid</span>.<span class="cm-property">best_params_</span>
<span class="cm-operator">&gt;&gt;</span> {<span class="cm-string cm-property">'C'</span>: <span class="cm-number">1</span>, <span class="cm-string cm-property">'degree'</span>: <span class="cm-number">2</span>, <span class="cm-string cm-property">'epsilon'</span>: <span class="cm-number">2</span>, <span class="cm-string cm-property">'gamma'</span>: <span class="cm-string">'scale'</span>, <span class="cm-string cm-property">'kernel'</span>: <span class="cm-string">'linear'</span>}
<span class="cm-variable">print</span>(<span class="cm-variable">grid</span>.<span class="cm-property">best_score_</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">0.9000862164349251</span>
<span class="cm-variable">preds</span> <span class="cm-operator">=</span> <span class="cm-variable">grid</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_enc</span>)

<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">mean_absolute_error</span>,<span class="cm-def">mean_squared_error</span>
<span class="cm-variable">MAE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_absolute_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">preds</span>)
<span class="cm-variable">MSE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_squared_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">preds</span>)
<span class="cm-variable">RMSE</span> <span class="cm-operator">=</span> <span class="cm-variable">np</span>.<span class="cm-property">sqrt</span>(<span class="cm-variable">MSE</span>)
<span class="cm-variable">print</span>(<span class="cm-variable">MAE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">9.492129595872589</span>
<span class="cm-variable">print</span>(<span class="cm-variable">MSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">853.6142931876532</span>
<span class="cm-variable">print</span>(<span class="cm-variable">RMSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">29.21667833939466</span></pre><p class="">Note here:</p><ul data-rte-list="default"><li><p class="">GridSearchCV is very slow given the number of features in the dataset (there are over 1500). We should probably use some sort of dimensionality reduction to cut the training time. SVR with GridSearchCV took several hours to run in this investigation, which may not be tenable for some applications.</p></li><li><p class="">GridSearchCV should be used smartly. Every value added to the grid multiplies the number of fits: the grid above has 5 × 3 × 2 × 3 × 6 = 540 combinations, and with cv=3 that means 1,620 model fits.</p></li><li><p class="">Setting n_jobs to -1 runs the fits on all available CPU cores, which shortens training time.</p></li><li><p class="">Large outliers in the data make accurate predictions difficult, hence the large mean squared error, which penalizes big misses heavily.</p></li></ul><h3>Try a Simple Linear Regression:</h3><p class="">After the long training time and less than wonderful performance of the Support Vector Regression, maybe a simple linear regression will be more effective:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">linear_model</span> <span class="cm-keyword">import</span> <span class="cm-def">LinearRegression</span>
<span class="cm-variable">model</span> <span class="cm-operator">=</span> <span class="cm-variable">LinearRegression</span>()
<span class="cm-variable">model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>,<span class="cm-variable">y_train</span>)
<span class="cm-error"># We only pass in test features</span>
<span class="cm-error"># The model predicts its own y hat</span>
<span class="cm-error"># We can then compare these results to the true y test label value</span>
<span class="cm-variable">test_predictions</span> <span class="cm-operator">=</span> <span class="cm-variable">model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_enc</span>)

<span class="cm-variable">MAE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_absolute_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">MSE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_squared_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">RMSE</span> <span class="cm-operator">=</span> <span class="cm-variable">np</span>.<span class="cm-property">sqrt</span>(<span class="cm-variable">MSE</span>)

<span class="cm-variable">print</span>(<span class="cm-variable">MAE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">0.057113830282917256</span>
<span class="cm-variable">print</span>(<span class="cm-variable">MSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">3.253882131975636</span>
<span class="cm-variable">print</span>(<span class="cm-variable">RMSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">1.803852026075209</span></pre><p class="">The Linear Regression model performs the best so far. An RMSE of about 1.8 days (an MSE of about 3.3 days squared) is accurate enough to be serviceable for a sales forecast. We are still curious whether Ridge Regression can fine-tune the results further.</p><h3>Ridge Regression Model Training:</h3><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">linear_model</span> <span class="cm-keyword">import</span> <span class="cm-def">Ridge</span>
<span class="cm-variable">ridge_model</span> <span class="cm-operator">=</span> <span class="cm-variable">Ridge</span>(<span class="cm-variable">alpha</span><span class="cm-operator">=</span><span class="cm-number">10</span>)
<span class="cm-variable">ridge_model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>,<span class="cm-variable">y_train</span>)
<span class="cm-variable">test_predictions</span> <span class="cm-operator">=</span> <span class="cm-variable">ridge_model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_enc</span>)

<span class="cm-variable">MAE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_absolute_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">MSE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_squared_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">RMSE</span> <span class="cm-operator">=</span> <span class="cm-variable">np</span>.<span class="cm-property">sqrt</span>(<span class="cm-variable">MSE</span>)

<span class="cm-variable">print</span>(<span class="cm-variable">MAE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">7.6568575927885645</span>
<span class="cm-variable">print</span>(<span class="cm-variable">MSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">147.21235108694535</span>
<span class="cm-variable">print</span>(<span class="cm-variable">RMSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">12.133109703903008</span></pre><p class="">The Ridge model with alpha=10 is not as performant! Using this model as a baseline, we can use RidgeCV to search over several alpha values and see whether the results improve. The RidgeCV model can be set up as such:</p><pre class="source-code"><span class="cm-error"># Choosing a scoring: https://scikit-learn.org/stable/modules/model_evaluation.html</span>
<span class="cm-error"># Negative RMSE so all metrics follow convention "Higher is better"</span>

<span class="cm-error"># See all options: sklearn.metrics.get_scorer_names() (SCORERS.keys() in older scikit-learn releases)</span>
<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">linear_model</span> <span class="cm-keyword">import</span> <span class="cm-def">RidgeCV</span>
<span class="cm-variable">ridge_cv_model</span> <span class="cm-operator">=</span> <span class="cm-variable">RidgeCV</span>(<span class="cm-variable">alphas</span><span class="cm-operator">=</span>(<span class="cm-number">0.1</span>, <span class="cm-number">1.0</span>, <span class="cm-number">10.0</span>),<span class="cm-variable">scoring</span><span class="cm-operator">=</span><span class="cm-string">'neg_root_mean_squared_error'</span>)

<span class="cm-error"># The more alpha options you pass, the longer this will take.</span>
<span class="cm-error"># Fortunately our data set is still pretty small</span>
<span class="cm-variable">ridge_cv_model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>,<span class="cm-variable">y_train</span>)

<span class="cm-variable">ridge_cv_model</span>.<span class="cm-property">alpha_</span>
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">0.1</span>

<span class="cm-variable">test_predictions</span> <span class="cm-operator">=</span> <span class="cm-variable">ridge_cv_model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_enc</span>)

<span class="cm-variable">MAE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_absolute_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">MSE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_squared_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">RMSE</span> <span class="cm-operator">=</span> <span class="cm-variable">np</span>.<span class="cm-property">sqrt</span>(<span class="cm-variable">MSE</span>)

<span class="cm-variable">print</span>(<span class="cm-variable">MAE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">0.22657748641706255</span>
<span class="cm-variable">print</span>(<span class="cm-variable">MSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">3.4295154423291936</span>
<span class="cm-variable">print</span>(<span class="cm-variable">RMSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">1.8518950948499198</span></pre><p class="">The improved RidgeCV model performs similarly to the more basic Simple Linear Regression.</p><h3>Attempt a LassoCV</h3><p class="">Next, we see how a LassoCV model does in terms of accuracy:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">linear_model</span> <span class="cm-keyword">import</span> <span class="cm-def">LassoCV</span>
<span class="cm-error"># https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html</span>
<span class="cm-variable">lasso_cv_model</span> <span class="cm-operator">=</span> <span class="cm-variable">LassoCV</span>(<span class="cm-variable">eps</span><span class="cm-operator">=</span><span class="cm-number">0.1</span>,<span class="cm-variable">n_alphas</span><span class="cm-operator">=</span><span class="cm-number">100</span>,<span class="cm-variable">cv</span><span class="cm-operator">=</span><span class="cm-number">5</span>)

<span class="cm-variable">lasso_cv_model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>,<span class="cm-variable">y_train</span>)

<span class="cm-variable">lasso_cv_model</span>.<span class="cm-property">alpha_</span>
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">3.0222656025281607</span>

<span class="cm-variable">test_predictions</span> <span class="cm-operator">=</span> <span class="cm-variable">lasso_cv_model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_enc</span>)
<span class="cm-variable">MAE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_absolute_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">MSE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_squared_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">RMSE</span> <span class="cm-operator">=</span> <span class="cm-variable">np</span>.<span class="cm-property">sqrt</span>(<span class="cm-variable">MSE</span>)

<span class="cm-variable">print</span>(<span class="cm-variable">MAE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">47.70224782216077</span>
<span class="cm-variable">print</span>(<span class="cm-variable">MSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">3692.4097816866115</span>
<span class="cm-variable">print</span>(<span class="cm-variable">RMSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">60.76520206241901</span></pre><p class="">Obviously, with an average error of almost 48 days, the model above is not a worthy candidate for something like a sales forecast, especially compared to the other models we have tried.</p><h3>Elastic Net Models:</h3><p class="">Elastic Net models attempt to keep the benefits of both the Lasso and Ridge models. A few lines of code will tell us how one performs:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">linear_model</span> <span class="cm-keyword">import</span> <span class="cm-def">ElasticNetCV</span>
<span class="cm-variable">elastic_model</span> <span class="cm-operator">=</span> <span class="cm-variable">ElasticNetCV</span>(<span class="cm-variable">l1_ratio</span><span class="cm-operator">=</span>[<span class="cm-number">.1</span>, <span class="cm-number">.5</span>, <span class="cm-number">.7</span>,<span class="cm-number">.9</span>, <span class="cm-number">.95</span>, <span class="cm-number">.99</span>, <span class="cm-number">1</span>],<span class="cm-variable">tol</span><span class="cm-operator">=</span><span class="cm-number">0.01</span>)
<span class="cm-variable">elastic_model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>,<span class="cm-variable">y_train</span>)
<span class="cm-variable">test_predictions</span> <span class="cm-operator">=</span> <span class="cm-variable">elastic_model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_enc</span>)
<span class="cm-variable">MAE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_absolute_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">MSE</span> <span class="cm-operator">=</span> <span class="cm-variable">mean_squared_error</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">test_predictions</span>)
<span class="cm-variable">RMSE</span> <span class="cm-operator">=</span> <span class="cm-variable">np</span>.<span class="cm-property">sqrt</span>(<span class="cm-variable">MSE</span>)

<span class="cm-variable">print</span>(<span class="cm-variable">MAE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">4.714205461929477</span>
<span class="cm-variable">print</span>(<span class="cm-variable">MSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">66.5452147436821</span>
<span class="cm-variable">print</span>(<span class="cm-variable">RMSE</span>)
<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">8.157525037882635</span></pre><p class="">Conclusions:</p><ul data-rte-list="default"><li><p class="">ElasticNet is a pretty happy middle ground between Ridge and Lasso, but still doesn’t perform nearly as well as Linear Regression or RidgeCV</p></li><li><p class="">Model training time was vastly quicker on Linear Regression and RidgeCV—this might be an important consideration in a production implementation</p></li></ul>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1614314314837-O23N5DRM1EKNF2F7XUO6/image-asset.jpeg?format=1500w" medium="image" isDefault="true" width="1500" height="1000"><media:title type="plain">Sales Forecasting: Predict Your Sales Cycle Using Machine Learning</media:title></media:content></item><item><title>Predict which Sales Leads Close Part 2</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Sat, 20 Feb 2021 19:46:00 +0000</pubDate><link>https://www.conaxon.org/projects/predict-which-sales-leads-close-part-2</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:6031118f37189d40a059e2b3</guid><description><![CDATA[propose a simple project to demonstrate a machine learning use case for 
optimizing which leads sales teams are predicted to be closed.]]></description><content:encoded><![CDATA[<h3>Introduction</h3><p class="">Last time we left off on this project, the technologies we chose weren’t particularly great at predicting one of the classes: Closed Lead versus Open Lead. In this article, a few different methods are employed to overcome the challenges of imbalanced classes, encoding all of the categorical variables, and hyper-parameter tuning.</p><h3>Gradient-Boosted / Ensemble Algorithms Might Help</h3><p class="">Firstly, we will import the GradientBoostingClassifier from sklearn. Catboost and XGBoost were considered for this investigation, but these algorithms are a little harder and more complicated to implement. Sklearn seemed to be more familiar and easier to understand and tune. This is not to say that Catboost and XGBoost are not good solutions—in literature, they are great! </p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">ensemble</span> <span class="cm-keyword">import</span> <span class="cm-def">GradientBoostingClassifier</span></pre><p class="">Since the feature engineering and data cleaning have already been completed, we can create a dataframe that includes all of the features we want to include in the model:</p><pre class="source-code"><span class="cm-variable">df5</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[[<span class="cm-string">'landing_page_id'</span>, <span class="cm-string">'origin'</span>, <span class="cm-string">'sdr_id'</span>,<span class="cm-string">'sr_id'</span>,<span class="cm-string">'business_segment'</span>,
                   <span class="cm-string">'lead_type'</span>,<span class="cm-string">'lead_behaviour_profile'</span>,<span class="cm-string">'has_gtin'</span>,<span class="cm-string">'business_type'</span>,
                  <span class="cm-string">'contact_day'</span>,<span class="cm-string">'contact_month'</span>,<span class="cm-string">'contact_year'</span>,<span class="cm-string">'sdr_sr'</span>,<span class="cm-string">'closed_deal'</span>]].<span class="cm-property">copy</span>()</pre><p class="">After creating the dataframe as the backbone of the model, we are going to employ the first potential fix for the imbalanced classes. The option we chose to use is called over-sampling. Over-sampling is when you duplicate records from the minority class. Because the algorithm treats each record as a unique instance, duplicated records aren’t a problem and help us synthetically enhance our dataset. Since the minority class is a closed lead, we will over-sample this class:</p><pre class="source-code"><span class="cm-variable">closed_dup</span> <span class="cm-operator">=</span> <span class="cm-variable">df5</span>[<span class="cm-string">'closed_deal'</span>] <span class="cm-operator">==</span> <span class="cm-variable">True</span>
<span class="cm-variable">df_try</span> <span class="cm-operator">=</span> <span class="cm-variable">df5</span>[<span class="cm-variable">closed_dup</span>]
<span class="cm-variable">df5</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">concat</span>([<span class="cm-variable">df5</span>] <span class="cm-operator">+</span> [<span class="cm-variable">df_try</span>]<span class="cm-operator">*</span><span class="cm-number">3</span>,<span class="cm-variable">ignore_index</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
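
# Aside, not from the original post: a minimal sketch, under assumptions, of
# applying the same duplication to the training rows only (after the
# train/test split), so that copies of one closed deal cannot end up in both
# the training set and the test set. The helper name oversample_minority and
# its parameters are hypothetical; train_df is assumed to be a training-only
# frame that still contains the label column.
import pandas as pd

def oversample_minority(train_df, label_col, extra_copies=3):
    # Select the minority-class rows (closed deals are labelled True here).
    minority_rows = train_df[train_df[label_col] == True]
    # Stack the original rows plus the extra copies into one new frame.
    return pd.concat([train_df] + [minority_rows] * extra_copies, ignore_index=True)
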
<span class="cm-variable">df5</span>.<span class="cm-property">shape</span></pre><p class="">This small piece of code appends three extra copies of every closed-deal record, so each one appears four times in the dataset. Something to note: over- or under-sampling <span><strong><em>should not</em></strong></span> be the first line of defence. Several process-based fixes should be tried first, such as:</p><ul data-rte-list="default"><li><p class="">Find more data to build predictions</p></li><li><p class="">Investigate whether bias is being introduced into the data collection in some way, then correct or reduce it</p></li><li><p class="">Gain some domain knowledge on the who, what, where, why, and how of the operations that build this data, and correct the prediction methodology where appropriate</p></li></ul><p class="">Next, functions will be prepared to one-hot encode the features and label encode the labels:</p><pre class="source-code"><span class="cm-error"># prepare input data</span>
<span class="cm-variable">def</span> <span class="cm-variable">prepare_inputs</span>(<span class="cm-variable">X_train</span>, <span class="cm-variable">X_test</span>):
    <span class="cm-variable">ohe</span> <span class="cm-operator">=</span> <span class="cm-variable">OneHotEncoder</span>(<span class="cm-variable">handle_unknown</span><span class="cm-operator">=</span><span class="cm-string">'ignore'</span>)
    <span class="cm-variable">ohe</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train</span>)
    <span class="cm-variable">X_train_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">ohe</span>.<span class="cm-property">transform</span>(<span class="cm-variable">X_train</span>)
    <span class="cm-variable">X_test_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">ohe</span>.<span class="cm-property">transform</span>(<span class="cm-variable">X_test</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">X_train_enc</span>, <span class="cm-variable">X_test_enc</span>
 
<span class="cm-error"># prepare target</span>
<span class="cm-variable">def</span> <span class="cm-variable">prepare_targets</span>(<span class="cm-variable">y_train</span>, <span class="cm-variable">y_test</span>):
    <span class="cm-variable">le</span> <span class="cm-operator">=</span> <span class="cm-variable">LabelEncoder</span>()
    <span class="cm-variable">le</span>.<span class="cm-property">fit</span>(<span class="cm-variable">y_train</span>)
    <span class="cm-variable">y_train_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">le</span>.<span class="cm-property">transform</span>(<span class="cm-variable">y_train</span>)
    <span class="cm-variable">y_test_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">le</span>.<span class="cm-property">transform</span>(<span class="cm-variable">y_test</span>)
    <span class="cm-keyword">return</span> <span class="cm-variable">y_train_enc</span>, <span class="cm-variable">y_test_enc</span></pre><p class="">Notice what is done inside the encoding functions: each encoder is fit on the training set only and then used to transform both the training and test sets, so nothing about the test set leaks into the encoding. Next, define X and y:</p><pre class="source-code"><span class="cm-variable">X</span> <span class="cm-operator">=</span> <span class="cm-variable">df5</span>.<span class="cm-property">drop</span>(<span class="cm-string">'closed_deal'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>)
<span class="cm-variable">y</span> <span class="cm-operator">=</span> <span class="cm-variable">df5</span>[<span class="cm-string">'closed_deal'</span>]</pre><p class="">X is everything but the target value; y is the target value. As always, perform your train/test split. You will notice that the ‘stratify’ parameter has been added to the split. It is a great tool for imbalanced datasets because it keeps the proportion of each class in y the same in the training and test data, which ensures that neither set monopolizes the minority class:</p><pre class="source-code"><span class="cm-variable">X_train</span>, <span class="cm-variable">X_test</span>, <span class="cm-variable">y_train</span>, <span class="cm-variable">y_test</span> <span class="cm-operator">=</span> <span class="cm-variable">train_test_split</span>(<span class="cm-variable">X</span>, <span class="cm-variable">y</span>, <span class="cm-variable">test_size</span><span class="cm-operator">=</span><span class="cm-number">0.33</span>, <span class="cm-variable">random_state</span><span class="cm-operator">=</span><span class="cm-number">101</span>, <span class="cm-variable">stratify</span> <span class="cm-operator">=</span> <span class="cm-variable">y</span>)</pre><p class="">Sklearn provides one-hot and label encoders. After splitting the data, we can use our data preparation functions:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">preprocessing</span> <span class="cm-keyword">import</span> <span class="cm-def">LabelEncoder</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">preprocessing</span> <span class="cm-keyword">import</span> <span class="cm-def">OneHotEncoder</span>
<span class="cm-error"># prepare input data</span>
<span class="cm-variable">X_train_enc</span>, <span class="cm-variable">X_test_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">prepare_inputs</span>(<span class="cm-variable">X_train</span>, <span class="cm-variable">X_test</span>)
<span class="cm-error"># prepare output data</span>
<span class="cm-variable">y_train_enc</span>, <span class="cm-variable">y_test_enc</span> <span class="cm-operator">=</span> <span class="cm-variable">prepare_targets</span>(<span class="cm-variable">y_train</span>, <span class="cm-variable">y_test</span>)</pre><p class="">The very basic model can be instantiated and fitted to X_train and y_train:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">ensemble</span> <span class="cm-keyword">import</span> <span class="cm-def">GradientBoostingClassifier</span>
<span class="cm-variable">model</span> <span class="cm-operator">=</span> <span class="cm-variable">GradientBoostingClassifier</span>()

<span class="cm-variable">model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>, <span class="cm-variable">y_train_enc</span>)

<span class="cm-variable">y_pred</span> <span class="cm-operator">=</span> <span class="cm-variable">model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test_enc</span>)
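
# Aside, not from the original post: roc_curve and auc are usually fed
# continuous scores rather than hard 0/1 predictions, so that more than one
# threshold is evaluated. A hedged sketch follows; the helper name
# auc_from_scores is hypothetical, and it assumes a fitted binary classifier
# that exposes predict_proba (GradientBoostingClassifier does).
from sklearn.metrics import roc_auc_score

def auc_from_scores(fitted_model, X_eval, y_eval):
    # Column 1 of predict_proba holds the probability of the positive class.
    positive_scores = fitted_model.predict_proba(X_eval)[:, 1]
    return roc_auc_score(y_eval, positive_scores)

# Possible usage with the objects defined above:
# auc_from_scores(model, X_test_enc, y_test_enc)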

<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">roc_curve</span>, <span class="cm-def">auc</span>
<span class="cm-variable">false_positive_rate</span>, <span class="cm-variable">true_positive_rate</span>, <span class="cm-variable">thresholds</span> <span class="cm-operator">=</span> <span class="cm-variable">roc_curve</span>(<span class="cm-variable">y_test_enc</span>, <span class="cm-variable">y_pred</span>)
<span class="cm-variable">roc_auc</span> <span class="cm-operator">=</span> <span class="cm-variable">auc</span>(<span class="cm-variable">false_positive_rate</span>, <span class="cm-variable">true_positive_rate</span>)
<span class="cm-variable">roc_auc</span>

<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">0.8874727398312781</span></pre><p class="">The basic model does OK, but not as well as the even more basic SupportVectorClassifier. With a baseline established, hyper-parameter tuning can be carried out using GridSearchCV:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">ensemble</span> <span class="cm-keyword">import</span> <span class="cm-def">GradientBoostingClassifier</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">GridSearchCV</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">accuracy_score</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">precision_score</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">recall_score</span>
<span class="cm-keyword">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">make_scorer</span>

<span class="cm-error"># A sample parameter</span>

<span class="cm-variable">parameters</span> <span class="cm-operator">=</span> {
    <span class="cm-string cm-property">"loss"</span>:[<span class="cm-string">"deviance"</span>],
    <span class="cm-string cm-property">"learning_rate"</span>: [<span class="cm-number">0.0001</span>, <span class="cm-number">0.001</span>, <span class="cm-number">0.01</span>, <span class="cm-number">0.1</span>, <span class="cm-number">0.5</span>],
    <span class="cm-string cm-property">"min_samples_split"</span>: <span class="cm-variable">np</span>.<span class="cm-property">linspace</span>(<span class="cm-number">0.1</span>, <span class="cm-number">1.0</span>, <span class="cm-number">5</span>),
    <span class="cm-string cm-property">"min_samples_leaf"</span>: <span class="cm-variable">np</span>.<span class="cm-property">linspace</span>(<span class="cm-number">0.1</span>, <span class="cm-number">0.5</span>, <span class="cm-number">5</span>,<span class="cm-variable">endpoint</span><span class="cm-operator">=</span><span class="cm-variable">True</span>),
    <span class="cm-string cm-property">"min_weight_fraction_leaf"</span>: <span class="cm-variable">np</span>.<span class="cm-property">linspace</span>(<span class="cm-number">0.1</span>, <span class="cm-number">1.0</span>, <span class="cm-number">10</span>),
    <span class="cm-string cm-property">"max_depth"</span>:[<span class="cm-number">3</span>,<span class="cm-number">5</span>,<span class="cm-number">8</span>],
    <span class="cm-string cm-property">"max_features"</span>:[<span class="cm-string">"log2"</span>,<span class="cm-string">"sqrt"</span>],
    <span class="cm-string cm-property">"criterion"</span>: [<span class="cm-string">"friedman_mse"</span>],
    <span class="cm-string cm-property">"subsample"</span>:[<span class="cm-number">0.8</span>, <span class="cm-number">0.9</span>, <span class="cm-number">0.95</span>, <span class="cm-number">1.0</span>],
    <span class="cm-string cm-property">"n_estimators"</span>:[<span class="cm-number">10</span>]
    }
<span class="cm-error">#passing the scoring function in the GridSearchCV</span>
<span class="cm-variable">grid</span> <span class="cm-operator">=</span> <span class="cm-variable">GridSearchCV</span>(<span class="cm-variable">GradientBoostingClassifier</span>(<span class="cm-variable">verbose</span><span class="cm-operator">=</span><span class="cm-number">2</span>), <span class="cm-variable">parameters</span>, <span class="cm-variable">cv</span><span class="cm-operator">=</span><span class="cm-number">3</span>, <span class="cm-variable">n_jobs</span><span class="cm-operator">=</span><span class="cm-operator">-</span><span class="cm-number">1</span>)
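
# Aside, not from the original post: GridSearchCV ranks classifiers by
# accuracy unless a scoring argument is passed. A hedged sketch of a search
# scored by ROC AUC instead, which would be measured on the same scale as the
# roc_auc baseline computed earlier. The helper name build_auc_grid is
# hypothetical; "roc_auc" is one of scikit-learn's built-in scorer names.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def build_auc_grid(param_grid, cv=3):
    # Same estimator and fold count as above; only the scoring metric differs.
    return GridSearchCV(GradientBoostingClassifier(), param_grid,
                        cv=cv, scoring="roc_auc", n_jobs=-1)

# Possible usage: auc_grid = build_auc_grid(parameters)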

<span class="cm-variable">grid</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train_enc</span>,<span class="cm-variable">y_train_enc</span>)</pre><p class="">After the GridSearchCV the best parameters and scores can be collected:</p><pre class="source-code"><span class="cm-variable">print</span>(<span class="cm-variable">grid</span>.<span class="cm-property">best_score_</span>)
<span class="cm-variable">print</span>(<span class="cm-variable">grid</span>.<span class="cm-property">best_params_</span>)

<span class="cm-operator">&gt;&gt;</span> <span class="cm-number">0.7847310912445011</span>
{<span class="cm-string">'criterion'</span>: <span class="cm-string">'friedman_mse'</span>, <span class="cm-string">'learning_rate'</span>: <span class="cm-number">0.5</span>, <span class="cm-string">'loss'</span>: <span class="cm-string">'deviance'</span>, <span class="cm-string">'max_depth'</span>: <span class="cm-number">8</span>, <span class="cm-string">'max_features'</span>: <span class="cm-string">'sqrt'</span>, <span class="cm-string">'min_samples_leaf'</span>: <span class="cm-number">0.1</span>, <span class="cm-string">'min_samples_split'</span>: <span class="cm-number">0.55</span>, <span class="cm-string">'min_weight_fraction_leaf'</span>: <span class="cm-number">0.1</span>, <span class="cm-string">'n_estimators'</span>: <span class="cm-number">10</span>, <span class="cm-string">'subsample'</span>: <span class="cm-number">0.9</span>}</pre><p class="">Strangely enough, the outputs were worse than the baseline. If I am honest, I have no idea exactly why this is occurring. An initial hypothesis is that we trained the grid on the training set and not the whole set. Alternatively, the grid does not include all of the GradientBoostingClassifier defaults; for example, it fixes n_estimators at 10, whereas the default is 100. The two numbers are also not directly comparable: best_score_ is the mean cross-validated accuracy, while the baseline figure is a ROC AUC on the test set. Somewhat discouraged, I moved on to a Tensorflow Sequential Model (Artificial Neural Network), which I thought was more fun to play with anyway.</p><h3>Artificial Neural Network Application—Much Better</h3><p class="">Since we already defined how we were going to prep the data for the model, the code will not be restated below. Development of the model will come first:</p><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">tensorflow</span> <span class="cm-keyword">as</span> <span class="cm-def">tf</span>
<span class="cm-keyword">from</span> <span class="cm-variable">tensorflow</span> <span class="cm-keyword">import</span> <span class="cm-def">keras</span>
<span class="cm-keyword">from</span> <span class="cm-variable">tensorflow</span>.<span class="cm-property">keras</span>.<span class="cm-property">models</span> <span class="cm-keyword">import</span> <span class="cm-def">Sequential</span>
<span class="cm-keyword">from</span> <span class="cm-variable">tensorflow</span>.<span class="cm-property">keras</span>.<span class="cm-property">layers</span> <span class="cm-keyword">import</span> <span class="cm-def">Dense</span>, <span class="cm-def">Activation</span>,<span class="cm-def">Dropout</span>
<span class="cm-keyword">from</span> <span class="cm-variable">tensorflow</span>.<span class="cm-property">keras</span>.<span class="cm-property">callbacks</span> <span class="cm-keyword">import</span> <span class="cm-def">EarlyStopping</span>
<span class="cm-keyword">from</span> <span class="cm-variable">tensorflow</span>.<span class="cm-property">keras</span>.<span class="cm-property">regularizers</span> <span class="cm-keyword">import</span> <span class="cm-def">l2</span>

<span class="cm-variable">early_stop</span> <span class="cm-operator">=</span> <span class="cm-variable">EarlyStopping</span>(<span class="cm-variable">monitor</span><span class="cm-operator">=</span><span class="cm-string">'val_loss'</span>, <span class="cm-variable">mode</span><span class="cm-operator">=</span><span class="cm-string">'min'</span>, <span class="cm-variable">verbose</span><span class="cm-operator">=</span><span class="cm-number">1</span>, <span class="cm-variable">patience</span><span class="cm-operator">=</span><span class="cm-number">10</span>)

<span class="cm-variable">model</span> <span class="cm-operator">=</span> <span class="cm-variable">Sequential</span>()

<span class="cm-error"># https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw</span>

<span class="cm-variable">model</span>.<span class="cm-property">add</span>(<span class="cm-variable">Dense</span>(<span class="cm-variable">units</span><span class="cm-operator">=</span><span class="cm-number">1350</span>,<span class="cm-variable">activation</span><span class="cm-operator">=</span><span class="cm-string">'relu'</span>))
<span class="cm-variable">model</span>.<span class="cm-property">add</span>(<span class="cm-variable">Dropout</span>(<span class="cm-number">0.5</span>))

<span class="cm-variable">model</span>.<span class="cm-property">add</span>(<span class="cm-variable">Dense</span>(<span class="cm-variable">units</span><span class="cm-operator">=</span><span class="cm-number">676</span>,<span class="cm-variable">activation</span><span class="cm-operator">=</span><span class="cm-string">'relu'</span>, <span class="cm-variable">kernel_regularizer</span><span class="cm-operator">=</span><span class="cm-variable">l2</span>(<span class="cm-number">0.001</span>)))
<span class="cm-variable">model</span>.<span class="cm-property">add</span>(<span class="cm-variable">Dropout</span>(<span class="cm-number">0.25</span>))

<span class="cm-variable">model</span>.<span class="cm-property">add</span>(<span class="cm-variable">Dense</span>(<span class="cm-variable">units</span><span class="cm-operator">=</span><span class="cm-number">338</span>,<span class="cm-variable">activation</span><span class="cm-operator">=</span><span class="cm-string">'relu'</span>, <span class="cm-variable">kernel_regularizer</span><span class="cm-operator">=</span><span class="cm-variable">l2</span>(<span class="cm-number">0.001</span>)))
<span class="cm-variable">model</span>.<span class="cm-property">add</span>(<span class="cm-variable">Dropout</span>(<span class="cm-number">0.125</span>))

<span class="cm-variable">model</span>.<span class="cm-property">add</span>(<span class="cm-variable">Dense</span>(<span class="cm-variable">units</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">activation</span><span class="cm-operator">=</span><span class="cm-string">'sigmoid'</span>))

<span class="cm-error"># For a binary classification problem</span>
<span class="cm-variable">model</span>.<span class="cm-property">compile</span>(<span class="cm-variable">loss</span><span class="cm-operator">=</span><span class="cm-string">'binary_crossentropy'</span>, <span class="cm-variable">optimizer</span><span class="cm-operator">=</span><span class="cm-string">'adam'</span>)</pre><p class="">A few notes here:</p><ul data-rte-list="default"><li><p class="">We are going to use 4 layers here, since the problem is fairly complex and has a lot of features (over 1,300, thanks to one-hot encoding)</p><ul data-rte-list="default"><li><p class="">The first layer will have 1,350 nodes; generally, this can be set to the number of features or columns. A dropout layer has been added, which randomly zeroes half of that layer's outputs at each training step to reduce overfitting</p></li><li><p class="">Two hidden layers have been added. In the first, I have halved the nodes and added a regularizer to manage overfitting. The dropout rate is halved as well</p></li><li><p class="">ReLU activation is used in the first layers because of a general consensus that ReLU is quite flexible. 
If we wanted to tune these values, we could do so later</p></li><li><p class="">In the final hidden layer all the values, except for regularization, are halved again to continue to simplify the model</p></li><li><p class="">Finally, the output layer is a single sigmoid node, because this is a binary classification problem</p></li></ul></li></ul><ul data-rte-list="default"><li><p class="">We will be adding early stopping to ensure we do not overfit</p></li><li><p class="">Our loss function is binary_crossentropy, which is appropriate for a binary classification problem</p></li><li><p class="">The optimizer is Adam, which is highly flexible and generally works quite well</p></li><li><p class="">This initial model was set up somewhat arbitrarily and should be tuned if it is to be used in some sort of production application</p></li></ul><p class="">Next, we can fit the model:</p><pre class="source-code"><span class="cm-variable">model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">x</span><span class="cm-operator">=</span><span class="cm-variable">X_train_enc</span>, 
          <span class="cm-variable">y</span><span class="cm-operator">=</span><span class="cm-variable">y_train_enc</span>, 
          <span class="cm-variable">epochs</span><span class="cm-operator">=</span><span class="cm-number">1000</span>,
          <span class="cm-variable">validation_data</span><span class="cm-operator">=</span>(<span class="cm-variable">X_test_enc</span>, <span class="cm-variable">y_test_enc</span>), <span class="cm-variable">verbose</span><span class="cm-operator">=</span><span class="cm-number">1</span>,
          <span class="cm-variable">callbacks</span><span class="cm-operator">=</span>[<span class="cm-variable">early_stop</span>]
          )</pre><p class="">1000 epochs is going to be overkill, but the early stopping will ensure we never get close to 1000 epochs.</p><pre class="source-code"><span class="cm-variable">Epoch</span> <span class="cm-number">1</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">8</span><span class="cm-variable">s</span> <span class="cm-number">26</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.5662</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.2128</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">2</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">24</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.1535</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.1206</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">3</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">23</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0809</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.1024</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">4</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">24</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0564</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.1522</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">5</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">8</span><span class="cm-variable">s</span> <span class="cm-number">25</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0412</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.0698</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">6</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">23</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0316</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.0833</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">7</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">8</span><span class="cm-variable">s</span> <span class="cm-number">25</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0317</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.0997</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">8</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">23</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0246</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.1217</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">9</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">8</span><span class="cm-variable">s</span> <span class="cm-number">25</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0187</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.0735</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">10</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">24</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0160</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.1116</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">11</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">8</span><span class="cm-variable">s</span> <span class="cm-number">25</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0166</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.0773</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">12</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">24</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0307</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.1051</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">13</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">24</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0242</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.1281</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">14</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">24</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0201</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.1222</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">15</span><span class="cm-operator">/</span><span class="cm-number">1000</span>
<span class="cm-number">305</span><span class="cm-operator">/</span><span class="cm-number">305</span> [<span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span><span class="cm-operator">===</span>] <span class="cm-operator">-</span> <span class="cm-number">7</span><span class="cm-variable">s</span> <span class="cm-number">23</span><span class="cm-variable">ms</span><span class="cm-operator">/</span><span class="cm-variable">step</span> <span class="cm-operator">-</span> <span class="cm-variable">loss</span>: <span class="cm-number">0.0155</span> <span class="cm-operator">-</span> <span class="cm-variable">val_loss</span>: <span class="cm-number">0.0852</span>
<span class="cm-variable">Epoch</span> <span class="cm-number">00015</span>: <span class="cm-variable">early</span> <span class="cm-variable">stopping</span></pre><p class="">Learning in 15 epochs is pretty good! Next, we can show the losses:</p><pre class="source-code"><span class="cm-variable">model_loss</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">model</span>.<span class="cm-property">history</span>.<span class="cm-property">history</span>)
<span class="cm-variable">model_loss</span>.<span class="cm-property">plot</span>()</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1613833535109-M9RBKS7NWG5P17D05PQI/blog_loss.JPG" data-image-dimensions="446x311" data-image-focal-point="0.5,0.5" alt="blog_loss.JPG" data-load="false" data-image-id="6031253fc0b3b22cdb926f91" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1613833535109-M9RBKS7NWG5P17D05PQI/blog_loss.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
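<p class="">The training log above ends with “early stopping” at epoch 15. As a minimal sketch of how a patience-based stopping rule can reach that decision, the plain-Python snippet below replays the validation losses logged for epochs 4 through 15. The function name early_stop_epoch and the patience value of 10 are assumptions chosen for illustration and may differ from the settings used for this model; earlier epochs are left out.</p>

```python
# Validation losses for epochs 4-15, copied from the training log above.
val_losses = {
    4: 0.1522, 5: 0.0698, 6: 0.0833, 7: 0.0997, 8: 0.1217, 9: 0.0735,
    10: 0.1116, 11: 0.0773, 12: 0.1051, 13: 0.1281, 14: 0.1222, 15: 0.0852,
}


def early_stop_epoch(losses, patience):
    """Return (stop_epoch, best_epoch) for a patience-based stopping rule.

    Training stops once `patience` consecutive epochs pass without the
    monitored loss improving on the best value seen so far.
    stop_epoch is None if the rule never triggers.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in sorted(losses.items()):
        if loss < best_loss:
            # New best value: remember it and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return None, best_epoch


# patience=10 is an assumption, not a setting taken from the model above.
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=10)
print(stop_epoch, best_epoch)
```

<p class="">Under that assumed patience, the rule stops at epoch 15 with epoch 5 as the best epoch, since none of the ten epochs after it improved on its validation loss of 0.0698.</p>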
      

    
  


  


<p class="">I am pretty happy with the losses in the chart. The scale is quite small, even though the validation loss is not perfect: it is quite ‘chunky’. I believe this behavior can occur when dropout layers are added. With additional hyper-parameter tuning, the gap between training and validation loss could be reduced further.</p><p class="">Finally, we can get the predictions and determine performance:</p><pre class="source-code"><span class="cm-variable">predictions</span> <span class="cm-operator">=</span> <span class="cm-variable">model</span>.<span class="cm-property">predict_classes</span>(<span class="cm-variable">X_test_enc</span>)

<span class="cm-error"># https://en.wikipedia.org/wiki/Precision_and_recall</span>
<span class="cm-variable">print</span>(<span class="cm-variable">classification_report</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">predictions</span>))

              <span class="cm-variable">precision</span>    <span class="cm-variable">recall</span>  <span class="cm-variable">f1</span><span class="cm-operator">-</span><span class="cm-variable">score</span>   <span class="cm-variable">support</span>

       <span class="cm-variable">False</span>       <span class="cm-number">1.00</span>      <span class="cm-number">0.97</span>      <span class="cm-number">0.98</span>      <span class="cm-number">1432</span>
        <span class="cm-variable">True</span>       <span class="cm-number">0.96</span>      <span class="cm-number">1.00</span>      <span class="cm-number">0.98</span>      <span class="cm-number">1008</span>

    <span class="cm-variable">accuracy</span>                           <span class="cm-number">0.98</span>      <span class="cm-number">2440</span>
   <span class="cm-variable">macro</span> <span class="cm-variable">avg</span>       <span class="cm-number">0.98</span>      <span class="cm-number">0.98</span>      <span class="cm-number">0.98</span>      <span class="cm-number">2440</span>
<span class="cm-variable">weighted</span> <span class="cm-variable">avg</span>       <span class="cm-number">0.98</span>      <span class="cm-number">0.98</span>      <span class="cm-number">0.98</span>      <span class="cm-number">2440</span></pre><p class="">Cool! The performance is quite good and without a ton of time spent training. By far, the neural network model gives a bigger bang for the buck. </p><p class="">Next time, we will fine-tune the parameters for the neural network.</p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1613833965559-4XNU6FP15TNUHBPIJFFV/image-asset.jpeg?format=1500w" medium="image" isDefault="true" width="1500" height="1000"><media:title type="plain">Predict which Sales Leads Close Part 2</media:title></media:content></item><item><title>A Vast Majority of AI &amp; ML Projects Fail, but they Don't have to</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Wed, 17 Feb 2021 01:48:38 +0000</pubDate><link>https://www.conaxon.org/projects/a-vast-majority-of-ai-amp-ml-projects-fail-but-they-dont-have-to</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:602c1663b2afaf41bc7d6e4c</guid><description><![CDATA[5 ways to improve AI, Machine Learning, and Business Intelligence project 
outcomes.]]></description><content:encoded><![CDATA[<p class="">I attended a virtual conference where a pretty interesting statistic was shared: 75% of AI &amp; ML projects fail or never realize their benefits. Gartner, in 2017 and 2019, seems to echo this sentiment: 80% of analytics insights will not deliver business outcomes through 2022 and 80% of AI projects will “remain alchemy, run by wizards” through 2020. Given the hype and the number of businesses investing in data, the risk of negative ROI is alarmingly high. For small consultancies like Conaxon, this is not good news, given that our goal is to create opportunities that allow small-to-midsize businesses to cash in on the benefits of AI and ML.</p><p class="">Here are the top 5 ways to drastically improve the success of AI and ML initiatives:</p><ul data-rte-list="default"><li><p class="">Talk to Stakeholders and Include them in Decision Making: </p><ul data-rte-list="default"><li><p class="">A general lack of understanding surrounding AI, ML, Business Intelligence, and Data Science can make well-intentioned projects dead on arrival. Naturally, we are uncertain about new technologies, change, and being left behind. These concerns are all valid! But business and data science leaders can ease them by spending time educating, socializing, and strategizing about how data literacy gets woven into the company culture. If your employees are in constant fear that AI and ML are going to replace them, then integration will be incredibly difficult. AI and ML are not going to automate away everything. They are tools to be used in symbiosis with people, tools that make human functions more precise and efficient.</p></li></ul></li><li><p class="">Start with Decision Intelligence:</p><ul data-rte-list="default"><li><p class="">Do not get caught up in the shiny gem that is data. It is so easy to overdo it early in the game. Start simple with AI and ML. Applied AI and ML are not yet advanced enough to easily interpret chaos. 
You need to collaborate with the various business functions and decide which <em>decisions</em> could be improved by having a piece (or pieces) of information; the more repeatable the decision, the better. AI and ML work best when the thing you are trying to make more efficient is repeatable and a pattern can be taught or identified. If the project does not meet those two very basic criteria, then your risk of failure increases sharply.</p></li></ul></li><li><p class="">Keep it Simple:</p><ul data-rte-list="default"><li><p class="">Don’t try to boil the ocean. Data can be overwhelming as well as liberating. Stay focused on a few initiatives that truly help make your team’s life easier. Putting dozens of dashboards with multitudes of charts and KPIs in front of executives isn’t effective.</p></li></ul></li><li><p class="">Spend a majority of the time on defining/measuring the problem:</p><ul data-rte-list="default"><li><p class="">If you start off your journey with ML and AI with a poorly defined, unmeasurable problem, then failure is imminent. Aimless, or poorly aimed, AI and ML development will result in output that is vastly different from what your stakeholders need. At the end of the project, your shiny new data product needs to be a tool that people use and integrate with, like a sword and an arm. A sword was an extension of the warrior’s arm, and AI and ML products need to be integrated in the same way. As mentioned above, engage with the end users early in the project. Interface with them regularly. Study how they work day-to-day. </p></li></ul></li><li><p class="">Put a good team together, with a kick-ass project manager:</p><ul data-rte-list="default"><li><p class="">The team you build around your data vision will be the keystone for success. Your decision maker should be an advocate and ambassador. They should be pragmatic. They should have solid domain experience in the space the team is operating within. You should probably find a customer champion. 
This person should be politically savvy, have very intimate knowledge of how the operations are performed, and be well respected by the end users. Of course, you need your data scientists, engineers, and analysts. Last but not least, spend some time and money on finding a really great project manager. A great many analytics, AI, and ML projects do not come to fruition because of project-management-related issues. This is not to say that project managers are all to blame! However, there is something to be said about the impact of a great project management professional on the outcome of an initiative.</p></li></ul></li></ul>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1613526207045-LG9EABT6HHFA6YVPKB1B/image-asset.jpeg?format=1500w" medium="image" isDefault="true" width="1500" height="1875"><media:title type="plain">A Vast Majority of AI &amp; ML Projects Fail, but they Don't have to</media:title></media:content></item><item><title>Predict which Sales Leads Close Part 1</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Tue, 09 Feb 2021 22:06:23 +0000</pubDate><link>https://www.conaxon.org/projects/predict-which-sales-leads-close</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:601ebc61f859850fa282074e</guid><description><![CDATA[Can you predict which leads you might close? We propose a simple project to 
demonstrate a machine learning use case for predicting which leads sales 
teams are likely to close.]]></description><content:encoded><![CDATA[<h3>Setting the Stage:</h3><p class="">Consider for a moment that your business is doing quite well. Sales are climbing quickly, the sales funnel is quite full, and customer service is top notch. But, because you are a responsible leader and manager, the future appears somewhat hazy! Growth is wonderful, to be sure. However, growth can only scale as well as the sales team and, by extension, the rest of your operations. At some point the sales funnel will become exceedingly top heavy, and business leaders will have to decide: do we hire more team members to support the increased demand, or do we attempt to lean out somewhat so as to preserve margin, customer service, and specialization? </p><p class="">A tough, but highly personal choice. </p><p class="">I am willing to bet that a great many businesses would choose to lean out, maintain margins, and continue to develop the productivity of their salespeople. There are a few pillars critical to projects where efficiency is the desired output, but maybe none more critical than tools. Having a diverse toolbox is essential. Data analytics and machine learning are quickly becoming essential tools in the toolbox. </p><h3>Proposed Solution:</h3><p class="">All that said, I propose that machine learning could be used to predict which sales leads might close, thereby giving salespeople some insight into which leads should be prioritized in the funnel. Secondarily, this algorithm could be used as a tool for identifying sales opportunities NOT being closed that may be critical now or in the future.</p><h3>Business Context:</h3><p class="">In order to demonstrate the capability of machine learning to address the aforementioned use case, we searched for a public dataset to perform tests. 
The team landed on a Kaggle dataset posted by a company called Olist, the largest department store in Brazilian marketplaces (link: <a href="https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv">https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv</a>). This is a marketing funnel dataset of sellers who filled in a form requesting to sell their products on the <a href="http://www.olist.com/">Olist Store</a>. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist’s supply chain partners. </p><p class="">The sales process is as follows:</p><ol data-rte-list="default"><li><p class="">Sign up at a landing page</p></li><li><p class="">Sales Development Representative (SDR) contacts the lead, collects some information, and schedules an additional consultancy</p></li><li><p class="">Consultancy is made by a Sales Representative (SR). The SR may close the deal or not</p></li><li><p class="">Lead becomes a seller and starts building their catalog on Olist</p></li><li><p class="">The products are published on Olist marketplaces and ready to sell!</p></li></ol><h3>The Dataset:</h3><p class="">The dataset has information related to 8,000 Marketing Qualified Leads (MQLs) that requested a contact. These MQLs were randomly sampled from a larger set of MQLs. </p>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          <a class="
                sqs-block-image-link
                
          
        
              " href="https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv"
              
          >
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612714366006-F8TU93NW6KO8KDM4MHGD/Olist+Data+Schema" data-image-dimensions="2232x1176" data-image-focal-point="0.5,0.5" alt="source: https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv" data-load="false" data-image-id="6020117dcdba670837a0cee6" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612714366006-F8TU93NW6KO8KDM4MHGD/Olist+Data+Schema?format=1000w" />
          
        
          </a>
        

        
          
          <figcaption class="image-caption-wrapper">
            <p class="">source: https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv</p>
          </figcaption>
        
      
        </figure>
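<p class="">In the schema above, the qualified leads and closed deals tables share the mql_id key. As a rough sketch of how that key can turn the two tables into one labeled dataset, the plain-Python snippet below keeps every lead and marks it as closed only when a matching deal exists. The rows, the landing_page_id values, and the helper name left_join_with_label are made up for illustration; the real data comes from the Kaggle CSV files.</p>

```python
# Hypothetical toy rows standing in for the two Kaggle tables.
leads = [
    {"mql_id": "a1", "landing_page_id": "page_1"},
    {"mql_id": "b2", "landing_page_id": "page_2"},
    {"mql_id": "c3", "landing_page_id": "page_1"},
]
closed_deals = [
    {"mql_id": "b2", "won_date": "2018-03-01"},
]


def left_join_with_label(leads, closed_deals):
    """Keep every lead and attach won_date when a matching deal exists."""
    won_by_id = {deal["mql_id"]: deal["won_date"] for deal in closed_deals}
    joined = []
    for lead in leads:
        # None when the lead never closed, mirroring a missing value.
        won_date = won_by_id.get(lead["mql_id"])
        joined.append(
            {**lead, "won_date": won_date, "closed_deal": won_date is not None}
        )
    return joined


funnel_rows = left_join_with_label(leads, closed_deals)
for row in funnel_rows:
    print(row["mql_id"], row["closed_deal"])
```

<p class="">Leads without a matching deal keep a missing won_date, which is what makes a simple TRUE or FALSE target possible.</p>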
      

    
  


  


<p class="">The algorithm will use the data from the qualified leads dataset and the closed deals dataset. A future project might be demand/sales forecasting using the sellers dataset and order items dataset.</p><h3>Jumping into the Data:</h3><p class="">When testing, I like to use Jupyter Lab. I find it supremely easy to work with, and it lends itself to iteration and agility. First, we will import the libraries we will be using:</p><pre class="source-code"><span class="cm-keyword">import</span> <span class="cm-def">pandas</span> <span class="cm-keyword">as</span> <span class="cm-def">pd</span>
<span class="cm-keyword">import</span> <span class="cm-def">seaborn</span> <span class="cm-keyword">as</span> <span class="cm-def">sns</span>
<span class="cm-keyword">import</span> <span class="cm-def">matplotlib</span>.<span class="cm-variable">pyplot</span> <span class="cm-variable">as</span> <span class="cm-variable">plt</span>
<span class="cm-keyword">import</span> <span class="cm-def">numpy</span> <span class="cm-keyword">as</span> <span class="cm-def">np</span>
<span class="cm-keyword">from</span> <span class="cm-variable">datetime</span> <span class="cm-keyword">import</span> <span class="cm-def">datetime</span></pre><p class="">Now, we are going to read in the data for the analysis:</p><pre class="source-code"><span class="cm-variable">closed_deals</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">read_csv</span>(<span class="cm-string">'olist_closed_deals_dataset.csv'</span>)
<span class="cm-variable">olist_leads</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">read_csv</span>(<span class="cm-string">'olist_marketing_qualified_leads_dataset.csv'</span>)</pre><p class="">Next, we are going to combine the qualified leads and the closed deals to create a single dataset for generating predictions. The documentation from Kaggle was really great, so there is no mystery as to how the join needs to be performed: a left join on mql_id keeps every lead, whether or not it closed.</p><pre class="source-code"><span class="cm-variable">funnel</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">merge</span>(<span class="cm-variable">olist_leads</span>,<span class="cm-variable">closed_deals</span>,<span class="cm-variable">how</span><span class="cm-operator">=</span><span class="cm-string">'left'</span>,<span class="cm-variable">on</span><span class="cm-operator">=</span><span class="cm-string">'mql_id'</span>)</pre><p class="">The next lines of code add some potentially useful features and remove others that won’t be useful in the prediction model. The code is pretty simple and self-explanatory. Initially, time-to-close was thought to be useful in the prediction, but at the time of writing this feature was not used in the model. Time-to-close would likely be more important as a business analysis task than as a prediction task. The code is left in this report for reference anyway:</p><pre class="source-code"><span class="cm-variable">funnel</span>[<span class="cm-string">'won_date'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>[<span class="cm-string">'won_date'</span>].<span class="cm-property">astype</span>(<span class="cm-string">'datetime64[ns]'</span>)
<span class="cm-variable">funnel</span>[<span class="cm-string">'first_contact_date'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>[<span class="cm-string">'first_contact_date'</span>].<span class="cm-property">astype</span>(<span class="cm-string">'datetime64[ns]'</span>)
<span class="cm-variable">funnel</span>[<span class="cm-string">'time_to_close'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>[<span class="cm-string">'won_date'</span>] <span class="cm-operator">-</span> <span class="cm-variable">funnel</span>[<span class="cm-string">'first_contact_date'</span>]
<span class="cm-variable">funnel</span>[<span class="cm-string">'time_to_close'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>[<span class="cm-string">'time_to_close'</span>].<span class="cm-property">dt</span>.<span class="cm-property">days</span>

<span class="cm-variable">funnel</span>.<span class="cm-property">drop</span>(<span class="cm-string">'declared_monthly_revenue'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel</span>.<span class="cm-property">drop</span>(<span class="cm-string">'declared_product_catalog_size'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel</span>.<span class="cm-property">drop</span>(<span class="cm-string">'average_stock'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel</span>.<span class="cm-property">drop</span>(<span class="cm-string">'has_company'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel</span>.<span class="cm-property">drop</span>(<span class="cm-string">'seller_id'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">indexNames</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>[ (<span class="cm-variable">funnel</span>[<span class="cm-string">'time_to_close'</span>] <span class="cm-operator">&lt;</span> <span class="cm-number">0</span>)].<span class="cm-property">index</span>
<span class="cm-variable">funnel</span>.<span class="cm-property">drop</span>(<span class="cm-variable">indexNames</span> , <span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)</pre><p class="">It is important to note that many of these dimensions were dropped because there was very little data to begin with. Imputation is used later in the project and could in principle have been applied to these dropped dimensions; however, there was too little data to impute from reliably.</p><p class="">The next line of code defines what we will ultimately try to predict, a binary TRUE or FALSE classification:</p><pre class="source-code"><span class="cm-variable">funnel</span>[<span class="cm-string">'closed_deal'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>[<span class="cm-string">'won_date'</span>].<span class="cm-property">notnull</span>()</pre><p class="">This particular project did not attempt to model time to close, but that could easily be revisited at a later time.</p><p class="">Each project typically starts with some basic exploratory data analysis. I want to have a decent understanding of the spread in time-to-close, which SRs and SDRs are closing most often, and which features might have the most importance in a prediction. Let’s start with a basic understanding of which landing pages seem to close the most deals:</p><pre class="source-code"><span class="cm-variable">pg_id</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>.<span class="cm-property">loc</span>[<span class="cm-variable">funnel</span>[<span class="cm-string">'closed_deal'</span>] <span class="cm-operator">==</span> <span class="cm-variable">True</span>]
<span class="cm-variable">pg_id</span> <span class="cm-operator">=</span> <span class="cm-variable">pg_id</span>.<span class="cm-property">landing_page_id</span>.<span class="cm-property">value_counts</span>()
<span class="cm-variable">pg_id</span>[<span class="cm-variable">pg_id</span>.<span class="cm-property">values</span> <span class="cm-operator">&gt;</span> <span class="cm-number">5</span>].<span class="cm-property">plot</span>(<span class="cm-variable">kind</span><span class="cm-operator">=</span><span class="cm-string">"bar"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">title</span>(<span class="cm-string">"Landing pages count - closed deals"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">savefig</span>(<span class="cm-string">"landing_page_counts.png"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">show</span>()</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612810357924-B41A25TJBLQ40MADUVJ8/landing_page_closure.JPG" data-image-dimensions="603x682" data-image-focal-point="0.5,0.5" alt="landing_page_closure.JPG" data-load="false" data-image-id="60218875e2e90911ca536c65" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612810357924-B41A25TJBLQ40MADUVJ8/landing_page_closure.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">Next, it might be interesting to see who the most effective SRs are:</p><pre class="source-code"><span class="cm-variable">sr</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>.<span class="cm-property">loc</span>[<span class="cm-variable">funnel</span>[<span class="cm-string">'closed_deal'</span>] <span class="cm-operator">==</span> <span class="cm-variable">True</span>]
<span class="cm-variable">sr</span> <span class="cm-operator">=</span> <span class="cm-variable">sr</span>.<span class="cm-property">sr_id</span>.<span class="cm-property">value_counts</span>()
<span class="cm-variable">sr</span>[<span class="cm-variable">sr</span>.<span class="cm-property">values</span> <span class="cm-operator">&gt;</span> <span class="cm-number">5</span>].<span class="cm-property">plot</span>(<span class="cm-variable">kind</span><span class="cm-operator">=</span><span class="cm-string">"bar"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">title</span>(<span class="cm-string">"closed deals - by sales rep"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">savefig</span>(<span class="cm-string">"sr_counts.png"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">show</span>()</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612810546641-E32FKLBU5NRCWPE5D45U/sr_closure.JPG" data-image-dimensions="597x684" data-image-focal-point="0.5,0.5" alt="sr_closure.JPG" data-load="false" data-image-id="60218932b8e87d68c78cb492" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612810546641-E32FKLBU5NRCWPE5D45U/sr_closure.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">Next, we look at the most effective SDRs:</p><pre class="source-code"><span class="cm-variable">sdr</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>.<span class="cm-property">loc</span>[<span class="cm-variable">funnel</span>[<span class="cm-string">'closed_deal'</span>] <span class="cm-operator">==</span> <span class="cm-variable">True</span>]
<span class="cm-variable">sdr</span> <span class="cm-operator">=</span> <span class="cm-variable">sdr</span>.<span class="cm-property">sdr_id</span>.<span class="cm-property">value_counts</span>()
<span class="cm-variable">sdr</span>[<span class="cm-variable">sdr</span>.<span class="cm-property">values</span> <span class="cm-operator">&gt;</span> <span class="cm-number">5</span>].<span class="cm-property">plot</span>(<span class="cm-variable">kind</span><span class="cm-operator">=</span><span class="cm-string">"bar"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">title</span>(<span class="cm-string">"closed deals - by sales development rep"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">savefig</span>(<span class="cm-string">"sdr_counts.png"</span>)
<span class="cm-variable">plt</span>.<span class="cm-property">show</span>()</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612810769594-BEKJZG0ZDI7LZXMBLPE8/sdr_closure.JPG" data-image-dimensions="583x684" data-image-focal-point="0.5,0.5" alt="sdr_closure.JPG" data-load="false" data-image-id="60218a1177d70f600949085f" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612810769594-BEKJZG0ZDI7LZXMBLPE8/sdr_closure.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
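<p class="">The three bar charts above repeat the same pattern: keep the closed deals, count them per category, and show only categories with more than five closures. As a rough sketch, that pattern could be wrapped in one helper. The function name closed_deal_counts, the min_count parameter, and the toy data are my own illustrative assumptions, not part of the original notebook.</p>

```python
import pandas as pd


def closed_deal_counts(df, column, min_count=5):
    """Count closed deals per value of `column`, keeping values seen more than `min_count` times."""
    closed = df.loc[df["closed_deal"]]        # rows where the deal was won
    counts = closed[column].value_counts()    # sorted from most to least frequent
    return counts[counts > min_count]


# Toy data standing in for the real funnel table (values are made up).
toy = pd.DataFrame({
    "sr_id": ["a", "a", "a", "b", "b", "c"],
    "closed_deal": [True, True, True, True, False, True],
})

top_srs = closed_deal_counts(toy, "sr_id", min_count=1)
print(top_srs)
# To chart it: top_srs.plot(kind="bar", title="closed deals - by sales rep")
```

<p class="">The same call with 'landing_page_id' or 'sdr_id' would reproduce the other two counts, which keeps the threshold in one place if it ever needs to change.</p>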
      

    
  


  


<p class="">We’ll finish off the exploratory data analysis with a cursory look at how long it takes to close a deal, broken down by various features in the dataset. Again, this part of the business analysis is not strictly pertinent to the classifier but is potentially useful knowledge for further development. The business might find it useful in the future to have a prediction of when a deal might close, which would give a better view of potential revenue.</p><pre class="source-code"><span class="cm-variable">sns</span>.<span class="cm-property">displot</span>(<span class="cm-variable">data</span><span class="cm-operator">=</span><span class="cm-variable">funnel</span>, <span class="cm-variable">x</span><span class="cm-operator">=</span><span class="cm-string">"time_to_close"</span>, <span class="cm-variable">col</span><span class="cm-operator">=</span><span class="cm-string">"origin"</span>, <span class="cm-variable">kde</span><span class="cm-operator">=</span><span class="cm-variable">True</span>,<span class="cm-variable">col_wrap</span><span class="cm-operator">=</span><span class="cm-number">2</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612811644732-HHQ319O7IP4X7WMV8UZR/time_to_close1.png" data-image-dimensions="712x1792" data-image-focal-point="0.5,0.5" alt="time_to_close1.png" data-load="false" data-image-id="60218d7c02712e1073baddd3" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612811644732-HHQ319O7IP4X7WMV8UZR/time_to_close1.png?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<pre class="source-code"><span class="cm-variable">sns</span>.<span class="cm-property">displot</span>(<span class="cm-variable">data</span><span class="cm-operator">=</span><span class="cm-variable">funnel</span>, <span class="cm-variable">x</span><span class="cm-operator">=</span><span class="cm-string">"time_to_close"</span>, <span class="cm-variable">col</span><span class="cm-operator">=</span><span class="cm-string">"business_segment"</span>, <span class="cm-variable">kde</span><span class="cm-operator">=</span><span class="cm-variable">True</span>,<span class="cm-variable">col_wrap</span><span class="cm-operator">=</span><span class="cm-number">2</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612811741212-XVO7IBAIXN6LXU6LFU4C/time_to_close2.png" data-image-dimensions="712x6112" data-image-focal-point="0.5,0.5" alt="time_to_close2.png" data-load="false" data-image-id="60218dddcaaff826bd80aeb4" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612811741212-XVO7IBAIXN6LXU6LFU4C/time_to_close2.png?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<pre class="source-code"><span class="cm-variable">sns</span>.<span class="cm-property">displot</span>(<span class="cm-variable">data</span><span class="cm-operator">=</span><span class="cm-variable">funnel</span>, <span class="cm-variable">x</span><span class="cm-operator">=</span><span class="cm-string">"time_to_close"</span>, <span class="cm-variable">col</span><span class="cm-operator">=</span><span class="cm-string">"lead_type"</span>, <span class="cm-variable">kde</span><span class="cm-operator">=</span><span class="cm-variable">True</span>,<span class="cm-variable">col_wrap</span><span class="cm-operator">=</span><span class="cm-number">2</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612811863488-UNNMXB37ICTSKO7VJTAC/time_to_close4.png" data-image-dimensions="712x1432" data-image-focal-point="0.5,0.5" alt="time_to_close4.png" data-load="false" data-image-id="60218e57ca876e151d30a427" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612811863488-UNNMXB37ICTSKO7VJTAC/time_to_close4.png?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<pre class="source-code"><span class="cm-variable">sns</span>.<span class="cm-property">displot</span>(<span class="cm-variable">data</span><span class="cm-operator">=</span><span class="cm-variable">funnel</span>, <span class="cm-variable">x</span><span class="cm-operator">=</span><span class="cm-string">"time_to_close"</span>, <span class="cm-variable">col</span><span class="cm-operator">=</span><span class="cm-string">"business_type"</span>, <span class="cm-variable">kde</span><span class="cm-operator">=</span><span class="cm-variable">True</span>,<span class="cm-variable">col_wrap</span><span class="cm-operator">=</span><span class="cm-number">2</span>)
</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612811981149-60YZXYCVQQ53TI4Z9IVO/time_to_close5.png" data-image-dimensions="712x712" data-image-focal-point="0.5,0.5" alt="time_to_close5.png" data-load="false" data-image-id="60218ecd871d884a1e0950a0" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612811981149-60YZXYCVQQ53TI4Z9IVO/time_to_close5.png?format=1000w" />
          
        
          
        

        
      
        </figure>
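<p class="">The distribution plots above are easier to compare alongside a small numeric table. The following is a minimal sketch, assuming time_to_close is numeric (in whatever unit the dataset stores it); the toy values are invented for illustration and are not results from the real data.</p>

```python
import pandas as pd

# Toy stand-in for the funnel table; the numbers are made up.
toy = pd.DataFrame({
    "origin": ["organic_search", "organic_search", "paid_search", "paid_search", "social"],
    "time_to_close": [10.0, 30.0, 5.0, None, 12.0],
})

summary = (
    toy.dropna(subset=["time_to_close"])       # open or lost leads have no time to close
       .groupby("origin")["time_to_close"]
       .agg(["count", "median", "max"])        # sample size first, then spread
       .sort_values("count", ascending=False)
)
print(summary)
```

<p class="">Putting the count next to the median makes it obvious which groups have too few closed deals for their distribution to mean much.</p>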
      

    
  


  


<p class="">Overall, the data analysis wasn’t too conclusive but gave decent exposure to some of the intricacies of the dataset. You’ll notice that in many areas the data is quite sparse, with too few samples to develop a robust model. It would be advisable, as in most cases, to acquire more data to test the model and fine-tune hyperparameters.</p><p class="">After the brief data analysis, we can further clean the data and develop the features that will be used in the prediction:</p><pre class="source-code"><span class="cm-variable">funnel_model</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel</span>.<span class="cm-property">copy</span>(<span class="cm-variable">deep</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'contact_day'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'first_contact_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">strftime</span>(<span class="cm-string">'%d'</span>)
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'contact_month'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'first_contact_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">strftime</span>(<span class="cm-string">'%m'</span>)
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'contact_year'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'first_contact_date'</span>].<span class="cm-property">dt</span>.<span class="cm-property">year</span>
<span class="cm-variable">funnel_model</span>.<span class="cm-property">drop</span>(<span class="cm-string">'time_to_close'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel_model</span>.<span class="cm-property">drop</span>(<span class="cm-string">'won_date'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel_model</span>.<span class="cm-property">drop</span>(<span class="cm-string">'first_contact_date'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel_model</span>.<span class="cm-property">drop</span>(<span class="cm-string">'mql_id'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>,<span class="cm-variable">inplace</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">funnel_model</span>.<span class="cm-property">drop_duplicates</span>()</pre><p class="">There are a few things to note with the preceding code:</p><ul data-rte-list="default"><li><p class="">Extract the contact day, month, and year because the date alone is not going to be a useful predictor</p></li><li><p class="">Drop the time to close (for the time being) as it will not be used in the initial model</p></li><li><p class="">Drop the date on which the contract was won. The date itself and its date components also will not be useful</p></li><li><p class="">Drop the first contact date as it will not be a useful predictor itself</p></li><li><p class="">Drop the unique marketing qualified lead id (mql_id) because it is not useful as a predictor</p></li><li><p class="">Drop any duplicates in the dataset to make bias less likely</p></li></ul><p class="">The following code addresses a particularly thorny problem from an architectural standpoint. Combining the closed deals and leads datasets produced a significant number of ‘na’ or ‘nan’ values. In order to properly demonstrate the use case, the ‘na’ and ‘nan’ values will need to be addressed through imputation. In this investigation, we are going to assume that if the won date is null, then the contract has been lost, which gives us a population of leads not won and those that have been won (remember the line of code above: funnel['closed_deal'] = funnel['won_date'].notnull()). There is no other identifier for a lead in progress or otherwise.</p><p class="">A simple way to determine the number of ‘na’ and ‘nan’ records is the following:</p><pre class="source-code"><span class="cm-error">#count missing values (NAs)</span>
<span class="cm-variable">missing_count</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">isna</span>().<span class="cm-property">sum</span>(),<span class="cm-variable">columns</span><span class="cm-operator">=</span>[<span class="cm-string">'Number'</span>])
<span class="cm-variable">missing_count</span>[<span class="cm-string">'Percentage'</span>] <span class="cm-operator">=</span> <span class="cm-variable">round</span>(<span class="cm-variable">missing_count</span> <span class="cm-operator">/</span> <span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>),<span class="cm-number">2</span>) <span class="cm-operator">*</span> <span class="cm-number">100</span>
<span class="cm-variable">missing_count</span></pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612825130699-4SVTOIFN5RW13Q7L28OX/nans.JPG" data-image-dimensions="467x669" data-image-focal-point="0.5,0.5" alt="nans.JPG" data-load="false" data-image-id="6021c22ab4c7a71e8269852a" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612825130699-4SVTOIFN5RW13Q7L28OX/nans.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
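<p class="">As a side sketch of the feature preparation and the missing-value count above, here is a small self-contained version on toy data. It differs from the code in this post in two deliberate ways, both my own assumptions about what is desirable: it uses .dt.day and .dt.month to get integers rather than the zero-padded strings that strftime returns, and it assigns the result of drop_duplicates, since that method returns a new frame unless inplace=True is passed. The toy rows are invented.</p>

```python
import pandas as pd

# Toy stand-in for funnel_model; values are made up.
toy = pd.DataFrame({
    "first_contact_date": pd.to_datetime(["2018-02-21", "2018-03-05", "2018-03-05"]),
    "origin": ["social", None, None],
})

# Integer date parts; strftime('%d') would give strings such as '05'.
toy["contact_day"] = toy["first_contact_date"].dt.day
toy["contact_month"] = toy["first_contact_date"].dt.month
toy["contact_year"] = toy["first_contact_date"].dt.year

# drop_duplicates returns a new frame, so the result has to be assigned.
toy = toy.drop(columns="first_contact_date").drop_duplicates()

# Missing values per column, as a count and as a percentage of rows.
missing = pd.DataFrame({
    "Number": toy.isna().sum(),
    "Percentage": (toy.isna().mean() * 100).round(1),
})
print(missing)
```

<p class="">Whether integer or string date parts are better depends on the encoding used later; with one-hot encoding either form ends up as categories.</p>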
      

    
  


  


<p class="">Given that there were so many missing feature values, imputation should be sufficient to demonstrate how to fill the gaps in the data model. Conceptually, the imputation employed was simple. For each feature, the unique non-missing values are written to a list; values are then chosen at random from that list to fill the ‘na’ or ‘nan’ entries within the dimension:</p><pre class="source-code"><span class="cm-variable">origin_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"origin"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">origin_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">origin_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'origin'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'origin'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">origin_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">sdr_id_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"sdr_id"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">sdr_id_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">sdr_id_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'sdr_id'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'sdr_id'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">sdr_id_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">sr_id_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"sr_id"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">sr_id_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">sr_id_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'sr_id'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'sr_id'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">sr_id_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">bs_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"business_segment"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">bs_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">bs_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'business_segment'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'business_segment'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">bs_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">lead_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"lead_type"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">lead_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">lead_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'lead_type'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'lead_type'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">lead_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">lbp_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"lead_behaviour_profile"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">lbp_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">lbp_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'lead_behaviour_profile'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'lead_behaviour_profile'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">lbp_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">gtin_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"has_gtin"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">gtin_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">gtin_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'has_gtin'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'has_gtin'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">gtin_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))

<span class="cm-variable">btype_list</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">"business_type"</span>].<span class="cm-property">unique</span>()
<span class="cm-variable">btype_list</span> <span class="cm-operator">=</span> [<span class="cm-variable">x</span> <span class="cm-keyword">for</span> <span class="cm-variable">x</span> <span class="cm-keyword">in</span> <span class="cm-variable">btype_list</span> <span class="cm-keyword">if</span> <span class="cm-variable">str</span>(<span class="cm-variable">x</span>) <span class="cm-operator">!=</span> <span class="cm-string">'nan'</span>]
<span class="cm-variable">funnel_model</span>[<span class="cm-string">'business_type'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'business_type'</span>].<span class="cm-property">fillna</span>(<span class="cm-variable">pd</span>.<span class="cm-property">Series</span>(<span class="cm-variable">np</span>.<span class="cm-property">random</span>.<span class="cm-property">choice</span>(<span class="cm-variable">btype_list</span>, <span class="cm-variable">size</span><span class="cm-operator">=</span><span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">index</span>))))</pre><p class="">After filling the ‘nan’ and ‘na’ values, the SDR id and SR id are concatenated into a single feature so that the model can use the pairing to predict closure:</p><pre class="source-code"><span class="cm-variable">funnel_model</span>[<span class="cm-string">'sdr_sr'</span>] <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'sdr_id'</span>] <span class="cm-operator">+</span> <span class="cm-variable">funnel_model</span>[<span class="cm-string">'sr_id'</span>]</pre><p class="">If we re-run the code that counts the ‘na’ or ‘nan’ values, there should not be any left. You’ll notice, however, that a single record is still ‘nan’, so I drop it.</p><pre class="source-code"><span class="cm-error">#count missing values (NAs)</span>
<span class="cm-variable">missing_count</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">funnel_model</span>.<span class="cm-property">isna</span>().<span class="cm-property">sum</span>(),<span class="cm-variable">columns</span><span class="cm-operator">=</span>[<span class="cm-string">'Number'</span>])
<span class="cm-variable">missing_count</span>[<span class="cm-string">'Percentage'</span>] <span class="cm-operator">=</span> <span class="cm-variable">round</span>(<span class="cm-variable">missing_count</span> <span class="cm-operator">/</span> <span class="cm-variable">len</span>(<span class="cm-variable">funnel_model</span>),<span class="cm-number">2</span>) <span class="cm-operator">*</span> <span class="cm-number">100</span>
<span class="cm-variable">missing_count</span></pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612880886673-J3NKSQDCTCJJNJUKUHLG/nans2.JPG" data-image-dimensions="390x638" data-image-focal-point="0.5,0.5" alt="nans2.JPG" data-load="false" data-image-id="60229bf62c756643b66614f1" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612880886673-J3NKSQDCTCJJNJUKUHLG/nans2.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
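<p class="">The random fill above relies on pandas aligning the fill values with the frame by index, and the drop that follows targets one hardcoded index label. The sketch below uses a made-up toy frame (the names toy, observed, and fill_values are illustrative and not part of the original notebook). It assumes the real data may not carry a default 0..n-1 index, so it builds the fill values on the frame's own index and drops rows by column instead of by label:</p><pre class="source-code">import numpy as np
import pandas as pd

# Toy frame with a non-default index, as often happens after merges or row drops
toy = pd.DataFrame(
    {"business_type": ["reseller", np.nan, "manufacturer", np.nan],
     "has_gtin": [True, False, np.nan, True]},
    index=[10, 11, 12, 13],
)

rng = np.random.default_rng(101)
observed = list(toy["business_type"].dropna().unique())

# Build the random fill values on the frame's own index so fillna can align them
fill_values = pd.Series(rng.choice(observed, size=len(toy)), index=toy.index)
toy["business_type"] = toy["business_type"].fillna(fill_values)

# Drop any row still missing has_gtin without hardcoding an index label
toy = toy.dropna(subset=["has_gtin"])
print(toy)</pre><p class="">Without the explicit index, a fill Series gets labels 0..n-1, and any row whose label falls outside that range would silently stay ‘nan’.</p>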
      

    
  


  


<pre class="source-code"><span class="cm-variable">funnel_model</span>.<span class="cm-property">loc</span>[<span class="cm-variable">funnel_model</span>[<span class="cm-string">'has_gtin'</span>].<span class="cm-property">isnull</span>()]

<span class="cm-variable">funnel_model</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>.<span class="cm-property">drop</span>(<span class="cm-variable">index</span><span class="cm-operator">=</span><span class="cm-number">7999</span>)</pre><p class="">At this point, the data model should be set properly. First, we select the features to be used in the model. The next step will be to encode the categorical data. For this implementation, one-hot encoding was used as ordinal encoding did not seem to be appropriate.</p><pre class="source-code"><span class="cm-variable">df1</span> <span class="cm-operator">=</span> <span class="cm-variable">funnel_model</span>[[<span class="cm-string">'landing_page_id'</span>, <span class="cm-string">'origin'</span>, <span class="cm-string">'sdr_id'</span>,<span class="cm-string">'sr_id'</span>,<span class="cm-string">'business_segment'</span>,
                   <span class="cm-string">'lead_type'</span>,<span class="cm-string">'lead_behaviour_profile'</span>,<span class="cm-string">'has_gtin'</span>,<span class="cm-string">'business_type'</span>,<span class="cm-string">'closed_deal'</span>,
                  <span class="cm-string">'contact_day'</span>,<span class="cm-string">'contact_month'</span>,<span class="cm-string">'contact_year'</span>,<span class="cm-string">'sdr_sr'</span>]].<span class="cm-property">copy</span>()

<span class="cm-variable">pd</span>.<span class="cm-property">get_dummies</span>(<span class="cm-variable">df1</span>)

<span class="cm-variable">pd</span>.<span class="cm-property">get_dummies</span>(<span class="cm-variable">df1</span>.<span class="cm-property">drop</span>(<span class="cm-string">'closed_deal'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>),<span class="cm-variable">drop_first</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)</pre><p class="">It was a long road to get here, but the following code moves on to model development. A Support Vector Classifier and a Decision Tree from scikit-learn will be used initially for the prediction. We will first define X and y, where ‘X’ holds the features and ‘y’ holds the values we are trying to predict:</p><pre class="source-code"><span class="cm-variable">X</span> <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">get_dummies</span>(<span class="cm-variable">df1</span>.<span class="cm-property">drop</span>(<span class="cm-string">'closed_deal'</span>,<span class="cm-variable">axis</span><span class="cm-operator">=</span><span class="cm-number">1</span>),<span class="cm-variable">drop_first</span><span class="cm-operator">=</span><span class="cm-variable">True</span>)
<span class="cm-variable">y</span> <span class="cm-operator">=</span> <span class="cm-variable">df1</span>[<span class="cm-string">'closed_deal'</span>]
</pre><p class="">After defining X and y, perform the train/test split. I have set the test size to 20% so that the training set is larger. Because the classes are imbalanced, the hope is that more closed deals will end up in the training set for the model to learn from.</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">train_test_split</span>

<span class="cm-variable">X_train</span>, <span class="cm-variable">X_test</span>, <span class="cm-variable">y_train</span>, <span class="cm-variable">y_test</span> <span class="cm-operator">=</span> <span class="cm-variable">train_test_split</span>(<span class="cm-variable">X</span>, <span class="cm-variable">y</span>, <span class="cm-variable">test_size</span><span class="cm-operator">=</span><span class="cm-number">0.2</span>, <span class="cm-variable">random_state</span><span class="cm-operator">=</span><span class="cm-number">101</span>)
</pre><p class="">Define the model and fit the model to the training data:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">tree</span> <span class="cm-keyword">import</span> <span class="cm-def">DecisionTreeClassifier</span>

<span class="cm-variable">model</span> <span class="cm-operator">=</span> <span class="cm-variable">DecisionTreeClassifier</span>()

<span class="cm-variable">model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train</span>,<span class="cm-variable">y_train</span>)
</pre><p class="">The model is insanely simple—almost comical levels of simplicity given the complex nature of the functions being performed. However, it is much easier to create baselines with simple models so that hyper-parameters could be effectively tuned. Next, we will build our predictions:</p><pre class="source-code"><span class="cm-variable">base_pred</span> <span class="cm-operator">=</span> <span class="cm-variable">model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test</span>)</pre><p class="">After the predictions have been made, it is easy enough to determine accuracy through a confusion matrix and classification report:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">metrics</span> <span class="cm-keyword">import</span> <span class="cm-def">confusion_matrix</span>,<span class="cm-def">classification_report</span>,<span class="cm-def">plot_confusion_matrix</span>

<span class="cm-variable">confusion_matrix</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">base_pred</span>)

<span class="cm-variable">array</span>([[<span class="cm-number">1357</span>,   <span class="cm-number">77</span>],
       [  <span class="cm-number">98</span>,   <span class="cm-number">68</span>]], <span class="cm-variable">dtype</span><span class="cm-operator">=</span><span class="cm-variable">int64</span>)

<span class="cm-variable">plot_confusion_matrix</span>(<span class="cm-variable">model</span>,<span class="cm-variable">X_test</span>,<span class="cm-variable">y_test</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612885517710-61DSZMLFXK9M5GLJM0LF/conf_matrix.JPG" data-image-dimensions="813x400" data-image-focal-point="0.5,0.5" alt="conf_matrix.JPG" data-load="false" data-image-id="6022ae0dca9585717636676f" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612885517710-61DSZMLFXK9M5GLJM0LF/conf_matrix.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
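<p class="">The per-class scores in the classification report that follows can be recomputed by hand from the confusion matrix above. A minimal sketch, assuming the scikit-learn layout of rows as actual classes and columns as predicted classes (False first, then True):</p><pre class="source-code">import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predictions
cm = np.array([[1357, 77],
               [98, 68]])
tn, fp, fn, tp = cm.ravel()

precision_true = tp / (tp + fp)   # share of predicted closes that actually closed
recall_true = tp / (tp + fn)      # share of actual closes the model found
accuracy = (tp + tn) / cm.sum()   # share of all leads classified correctly

print(round(precision_true, 2), round(recall_true, 2), round(accuracy, 2))</pre><p class="">These work out to roughly 0.47, 0.41, and 0.89, which shows how the overall accuracy hides the weak performance on the leads that actually close.</p>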
      

    
  


  


<pre class="source-code"><span class="cm-variable">print</span>(<span class="cm-variable">classification_report</span>(<span class="cm-variable">y_test</span>,<span class="cm-variable">base_pred</span>))

               <span class="cm-variable">precision</span>    <span class="cm-variable">recall</span>  <span class="cm-variable">f1</span><span class="cm-operator">-</span><span class="cm-variable">score</span>   <span class="cm-variable">support</span>

       <span class="cm-variable">False</span>       <span class="cm-number">0.93</span>      <span class="cm-number">0.95</span>      <span class="cm-number">0.94</span>      <span class="cm-number">1434</span>
        <span class="cm-variable">True</span>       <span class="cm-number">0.47</span>      <span class="cm-number">0.41</span>      <span class="cm-number">0.44</span>       <span class="cm-number">166</span>

    <span class="cm-variable">accuracy</span>                           <span class="cm-number">0.89</span>      <span class="cm-number">1600</span>
   <span class="cm-variable">macro</span> <span class="cm-variable">avg</span>       <span class="cm-number">0.70</span>      <span class="cm-number">0.68</span>      <span class="cm-number">0.69</span>      <span class="cm-number">1600</span>
<span class="cm-variable">weighted</span> <span class="cm-variable">avg</span>       <span class="cm-number">0.88</span>      <span class="cm-number">0.89</span>      <span class="cm-number">0.89</span>      <span class="cm-number">1600</span>
</pre><p class="">OOF! The results of this model reflect the terribly imbalanced classes in the data. To no surprise, the model is very accurate at predicting which leads won’t close and terrible at predicting the leads that will close. It would be very unwise to rely on the overall accuracy score of 0.89 (89%), given that the model is biased toward predicting that a lead will not close. All this said, we can try to see if a very basic Support Vector Classifier will be a more balanced model.</p><p class="">Feature importances of the Decision Tree:</p><pre class="source-code"><span class="cm-variable">pd</span>.<span class="cm-property">DataFrame</span>(<span class="cm-variable">index</span><span class="cm-operator">=</span><span class="cm-variable">X</span>.<span class="cm-property">columns</span>,<span class="cm-variable">data</span><span class="cm-operator">=</span><span class="cm-variable">model</span>.<span class="cm-property">feature_importances_</span>,<span class="cm-variable">columns</span><span class="cm-operator">=</span>[<span class="cm-string">'Feature Importance'</span>]).<span class="cm-property">sort_values</span>(<span class="cm-variable">by</span><span class="cm-operator">=</span>[<span class="cm-string">'Feature Importance'</span>],<span class="cm-variable">ascending</span><span class="cm-operator">=</span><span class="cm-variable">False</span>)</pre><p class="">Using Grid Search, a Support Vector Classifier was tuned to determine the best parameters for the basic model:</p><pre class="source-code"><span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">model_selection</span> <span class="cm-keyword">import</span> <span class="cm-def">GridSearchCV</span>
<span class="cm-variable">from</span> <span class="cm-variable">sklearn</span>.<span class="cm-property">svm</span> <span class="cm-keyword">import</span> <span class="cm-def">SVC</span>

<span class="cm-variable">svm</span> <span class="cm-operator">=</span> <span class="cm-variable">SVC</span>()
<span class="cm-variable">param_grid</span> <span class="cm-operator">=</span> {<span class="cm-string cm-property">'C'</span>:[<span class="cm-number">0.01</span>,<span class="cm-number">0.1</span>,<span class="cm-number">1</span>],<span class="cm-string cm-property">'kernel'</span>:[<span class="cm-string">'linear'</span>,<span class="cm-string">'rbf'</span>]}
<span class="cm-variable">grid</span> <span class="cm-operator">=</span> <span class="cm-variable">GridSearchCV</span>(<span class="cm-variable">svm</span>,<span class="cm-variable">param_grid</span>)

<span class="cm-error"># Note again we didn't split Train|Test</span>
<span class="cm-variable">grid</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X</span>,<span class="cm-variable">y</span>)</pre><p class="">After the Grid Search is complete, we can find the best parameters for the model to baseline:</p><pre class="source-code"><span class="cm-variable">grid</span>.<span class="cm-property">best_score_</span>

<span class="cm-number">0.9486120231394622</span>

<span class="cm-variable">grid</span>.<span class="cm-property">best_params_</span>

{<span class="cm-string">'C'</span>: <span class="cm-number">0.1</span>, <span class="cm-string">'kernel'</span>: <span class="cm-string">'linear'</span>}
</pre><p class="">Not a terrible score! But, as we noted above, we need to break this accuracy number down into its per-class components. We can do this by first building the simple model using the parameters found in the Grid Search:</p><pre class="source-code"><span class="cm-variable">model</span> <span class="cm-operator">=</span> <span class="cm-variable">SVC</span>(<span class="cm-variable">kernel</span><span class="cm-operator">=</span><span class="cm-string">'linear'</span>, <span class="cm-variable">C</span><span class="cm-operator">=</span><span class="cm-number">0.1</span>)

<span class="cm-variable">model</span>.<span class="cm-property">fit</span>(<span class="cm-variable">X_train</span>, <span class="cm-variable">y_train</span>)

<span class="cm-variable">y_pred</span> <span class="cm-operator">=</span> <span class="cm-variable">model</span>.<span class="cm-property">predict</span>(<span class="cm-variable">X_test</span>)</pre><p class="">After the model has been trained, we can now see the accuracy for the second model we built:</p><pre class="source-code"><span class="cm-variable">print</span>(<span class="cm-variable">classification_report</span>(<span class="cm-variable">y_test</span>, <span class="cm-variable">y_pred</span>))

                <span class="cm-variable">precision</span>    <span class="cm-variable">recall</span>  <span class="cm-variable">f1</span><span class="cm-operator">-</span><span class="cm-variable">score</span>   <span class="cm-variable">support</span>

       <span class="cm-variable">False</span>       <span class="cm-number">0.96</span>      <span class="cm-number">0.98</span>      <span class="cm-number">0.97</span>      <span class="cm-number">1434</span>
        <span class="cm-variable">True</span>       <span class="cm-number">0.83</span>      <span class="cm-number">0.63</span>      <span class="cm-number">0.72</span>       <span class="cm-number">166</span>

    <span class="cm-variable">accuracy</span>                           <span class="cm-number">0.95</span>      <span class="cm-number">1600</span>
   <span class="cm-variable">macro</span> <span class="cm-variable">avg</span>       <span class="cm-number">0.89</span>      <span class="cm-number">0.81</span>      <span class="cm-number">0.84</span>      <span class="cm-number">1600</span>
<span class="cm-variable">weighted</span> <span class="cm-variable">avg</span>       <span class="cm-number">0.94</span>      <span class="cm-number">0.95</span>      <span class="cm-number">0.95</span>      <span class="cm-number">1600</span>
</pre><p class="">Based on the classification report above, the Support Vector Classifier performs significantly better than the Decision Tree, especially at predicting the leads that will eventually close. The F1 score for True rises from 0.44 to 0.72 (precision from 0.47 to 0.83, recall from 0.41 to 0.63) while the scores for False stay high. A confusion matrix helps further contextualize accuracy:</p><pre class="source-code"><span class="cm-variable">plot_confusion_matrix</span>(<span class="cm-variable">model</span>,<span class="cm-variable">X_test</span>,<span class="cm-variable">y_test</span>)</pre>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612891705897-VWQQ5JOX6RYGS1GQPCQO/conf_matrix2.JPG" data-image-dimensions="813x385" data-image-focal-point="0.5,0.5" alt="conf_matrix2.JPG" data-load="false" data-image-id="6022c6398b52ed435c24f3c8" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612891705897-VWQQ5JOX6RYGS1GQPCQO/conf_matrix2.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
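<p class="">One way to make the comparison stricter, and to start on the weighting idea listed under future work, is to tune on the training split only and to weight the classes. The following is only a sketch on synthetic data from make_classification, not the article's dataset. The names X_demo, y_demo, and grid_demo, the stratified split, class_weight="balanced", and F1 scoring are all assumptions added for illustration:</p><pre class="source-code">from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the encoded lead features (about 10% positive class)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=101
)

# stratify keeps the class ratio the same in the training and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=101
)

# class_weight="balanced" weights errors inversely to class frequency
param_grid = {"C": [0.01, 0.1, 1], "kernel": ["linear", "rbf"]}
grid_demo = GridSearchCV(SVC(class_weight="balanced"), param_grid, scoring="f1")

# Tune on the training split only so the test split stays unseen
grid_demo.fit(X_tr, y_tr)

print(grid_demo.best_params_)
print(classification_report(y_te, grid_demo.predict(X_te)))</pre><p class="">Scoring on F1 rather than accuracy makes the search favour parameters that do well on the minority class. Note also that plot_confusion_matrix was removed in scikit-learn 1.2; on newer releases ConfusionMatrixDisplay.from_estimator is the replacement.</p>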
      

    
  


  


<h3>Conclusion:</h3><ul data-rte-list="default"><li><p class="">Initial modeling seems to indicate that the Support Vector Classifiers are better predictors than the Tree Based methods</p></li><li><p class="">Accuracy, especially for this use case, needs to be balanced across the True and False—especially if the data will continue to be imbalanced in the future</p></li><li><p class="">There are opportunities to leverage other features that were part of the original dataset, but had poor data quality. Assuming the data quality could be improved, there would be increased opportunities to improve business outcomes</p></li></ul><h3>Opportunities &amp; Future work:</h3><ul data-rte-list="default"><li><p class="">The imbalanced nature of the dataset should be addressed through oversampling, weighting, and attempting to use gradient boosted algorithms</p></li><li><p class="">Prediction of the time to close will likely be another worthwhile venture, especially when attempting to predict future sales, prioritizing lines of business, and even resource planning</p></li><li><p class="">Thorough discussion would need to occur between those that would use something like this in process and those that have designed the algorithm itself—industrialization would need to be done with care and monitored closely over time</p></li></ul><p><a href="https://www.conaxon.org/projects/predict-which-sales-leads-close">Permalink</a><p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1612629242937-PUJ0N3AJVWPPY4HOT5YK/image-asset.jpeg?format=1500w" medium="image" isDefault="true" width="1500" height="1000"><media:title type="plain">Predict which Sales Leads Close Part 1</media:title></media:content></item><item><title>Knowledge Management as a Keystone in your Data Science &amp; Analytics Strategy</title><category>Articles</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Sun, 17 Jan 2021 02:38:54 
+0000</pubDate><link>https://www.conaxon.org/projects/knowledge-management-as-a-keystone-in-your-data-science-amp-analytics-strategy</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:5ffce91061fbb82b30117574</guid><description><![CDATA[Explore how knowledge management is essential to measuring success in your 
analytics initiatives this year]]></description><content:encoded><![CDATA[<h3>Knowledge Management in Analytics</h3><p class="">In 2021, I was asked to prepare a roadmap for a company just beginning its Data Science journey. Knowledge management would be one of the keys to a healthy data science and analytics strategy for 2021 and beyond. Knowledge management (as defined for the context of this article) is the set of processes and tools necessary to capture, disseminate, and present information generated throughout the organization, whether that be lessons learned, best practices, locations of data, project management information, tickets, or a whole host of other artifacts.</p><p class="">Documentation, more broadly knowledge management, is not a sexy topic. But the unbridled frustration that occurs when analysts, data scientists, and machine learning engineers spend hours looking for data in random databases and obscure tables can’t be overstated. It’s just infuriating. Most Data Scientists might acknowledge the importance of developing a robust knowledge management solution but never talk specifically about how they might deploy such a project within their organisation. Typically, there are a few reasons why most companies get knowledge management wrong with regard to analytics:</p><ul data-rte-list="default"><li><p class="">Return on investment is not immediately apparent</p></li><li><p class="">It is easy to borrow against future resources and the consequences might be perceived to be low</p></li><li><p class="">Documentation is boring </p></li><li><p class="">Difficult to maintain over time</p></li></ul><p class="">The losses can be huge over time. Consider a single Data Scientist who makes $125,000 per year, roughly $67.00 USD per hour.  
Without a robust knowledge management solution, your analytics organization could be spending dozens of hours per week looking for data strewn about the business, struggling to generate queries, searching for data dictionaries, trying to figure out the transformations applied to a dataset, etc. It is easy to see that tens of thousands of dollars can be lost per year simply by not giving people the tools to do their work efficiently.</p><p class="">Gaps in documentation (or a complete lack of it) are an exponential problem: by the time you notice there is an issue, the gradient has exploded! Therefore, companies that start off on the right foot will have an easier time maintaining and reaping the benefits of knowledge management. All that said, starting now is the next best step. </p><h3>Where do you start?</h3><p class="">It is probably easiest, and most useful, to begin with an entity relationship (ER) diagram. These artifacts are likely the most important pieces of documentation that can be made available to data professionals. ER diagrams can come in all sorts of shapes, sizes, and complexities. However, the nature of these artifacts remains the same: these are the dictionaries that can help determine the location of data and how that data relates to other data objects/sources. There are tons of templates out there to model your work on, but I like to use draw.io. The software is free, no license required, and simple to use, much like Visio or other flowcharting software. </p><p class="">I like to treat ER Diagrams like catalogues. They should be structured in such a way that allows the user to pose a question or search with a subject matter in mind. For example, a data professional in Marketing might want to look for data related to ‘marketing’. Therefore, maybe your ‘data catalogue’ starts simply like this:</p>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1610765024814-YKUNM26H4KXELY413975/blog_ER_Inner_layer.JPG" data-image-dimensions="1174x659" data-image-focal-point="0.5,0.5" alt="blog_ER_Inner_layer.JPG" data-load="false" data-image-id="600252e034e5f821271e990f" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1610765024814-YKUNM26H4KXELY413975/blog_ER_Inner_layer.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">Each will have different structures, names, themes, etc., but overall I find this the easiest way to help someone find data related to a concept. One might even draw a parallel to a graph structure. Next, move to a ‘source’ or maybe even a ‘location’ of data.</p>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1610766062712-KW9JRNDAPLVU5NED345J/blog_ER_2nd_layer.JPG" data-image-dimensions="530x438" data-image-focal-point="0.5,0.5" alt="blog_ER_2nd_layer.JPG" data-load="false" data-image-id="600256ee652b4632bc8fd016" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1610766062712-KW9JRNDAPLVU5NED345J/blog_ER_2nd_layer.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">The second layer becomes extremely important because it is a catalyst for understanding exactly where data of this type is stored and accessed. It is important to be somewhat verbose in this layer of the graph: the location of the data in question should be quite clear. Finally, though it is not strictly necessary, I like to expand the graphs to the tables where the data exists.</p>








  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
              
            "
        >
          
        
        

        
          
            
          
            <img class="thumb-image" data-image="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1610766531665-GHKOXYW1DAPZ79SB92LA/blog_ER_3rd_layer.JPG" data-image-dimensions="1065x699" data-image-focal-point="0.5,0.5" alt="blog_ER_3rd_layer.JPG" data-load="false" data-image-id="600258c3655a78118b8bfaea" data-type="image" src="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1610766531665-GHKOXYW1DAPZ79SB92LA/blog_ER_3rd_layer.JPG?format=1000w" />
          
        
          
        

        
      
        </figure>
      

    
  


  


<p class="">The outer parts of the graph are where the complexity can become cumbersome (but it is worth it). This is the basic structure I tend to follow when working on a project such as an ER Diagram. The version I like to use is not textbook quality, but it is a framework I have adapted over time while serving in many different roles and companies.</p><h3>A Data Dictionary is Nice: But now what?</h3><p class="">After creating your data dictionary tool, some might be done! There will be situations where there is no need to continue. However, in some instances the journey might continue on to other knowledge management tools:</p><ul data-rte-list="default"><li><p class="">Develop a knowledge center in a tool like SharePoint or Confluence: find a way to consolidate all of the artifacts related to analytics on a single platform (think, ‘one-stop shopping’)</p></li><li><p class="">Start a series of training sessions or podcasts that encourage data literacy throughout your company. Generate and disseminate knowledge in ways that enable you, as a data professional, to be more effective</p></li><li><p class="">Find time to communicate current projects, status, and requests for feedback to the rest of your organization </p></li></ul><h3>When does it all End?</h3><p class="">Well... the job isn’t ever done! But there is a point where maintenance is not as terrible. Largely, the end game will be determined according to each unique situation. Take steps to spread the work out amongst a few different team members, if possible. Another idea might be to set a rotation where a day or two per month is devoted solely to collecting knowledge, documenting that knowledge, and writing a brief summary that lets other teams know there have been updates.</p><h3>What is the value in the end?</h3><p class="">The easy part about writing this article is that there is little to be debated. 
Organizations that collect, store, disseminate, and maintain their knowledge can have a competitive edge in creating a sustainable business. Because of the growth of data science, business intelligence, and analytics within modern companies, it only makes more sense to better organize information generated by a burgeoning profit center within contemporary institutions.</p><p class="">Specifically, in the context of the analytics operations there are some key value propositions:</p><ul data-rte-list="default"><li><p class="">Efficiency in development of queries, models, data models, algorithms, can help reduce go-to-market time—thereby potentially increasing return on your investment</p></li><li><p class="">Your analysts, scientists, and engineers will be less likely to be frustrated with finding important data</p></li><li><p class="">Better baseline future projects/initiatives by having a quick reference on what went wrong and right on past developments</p></li></ul>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1610767964259-FNVDK3KVL6DL9W30IMZ2/image-asset.jpeg?format=1500w" medium="image" isDefault="true" width="1500" height="1125"><media:title type="plain">Knowledge Management as a Keystone in your Data Science &amp; Analytics Strategy</media:title></media:content></item><item><title>Decision Intelligence: Data is an enabler to better decision making</title><category>LinkedIn Livestream</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Thu, 31 Dec 2020 00:09:20 +0000</pubDate><link>https://www.linkedin.com/posts/charleselwood_strategy-business-decision-ugcPost-6749320773132976128-7tpF</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:5fecfbc818d44937a97de8e6</guid><description><![CDATA[Discussion surrounding Decision Intelligence in your organization and 
managing the fear that can come along with using data to drive decision 
making]]></description><content:encoded><![CDATA[<p class="">Charles Elwood (SolisMatica), Andrew Hoekstra (Pointe Vector), and I break down how to get data into the hands of those within the organization that make decisions everyday and why enabling decision intelligence will be important to unlocking sales, efficiencies, and innovation. The team also tackles dealing with the fear (yes, fear) that accompanies the use of data that might reflect a poor image of performance or an unpopular truth.</p><iframe allow="autoplay; fullscreen" scrolling="no" data-image-dimensions="854x480" allowfullscreen="true" src="//cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FCJKfi6TFcK4%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DCJKfi6TFcK4&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FCJKfi6TFcK4%2Fhqdefault.jpg&amp;key=61d05c9d54e8455ea7a9677c366be814&amp;type=text%2Fhtml&amp;schema=youtube&amp;wmode=opaque" width="854" data-embed="true" frameborder="0" title="YouTube embed" class="embedly-embed" height="480"></iframe><p><a href="https://www.conaxon.org/projects/decision-intelligence-data-is-an-enabler-to-better-decision-making">Permalink</a><p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1609371313564-XQD58KAWJRPFM5COWGU5/image-asset.jpeg?format=1500w" medium="image" isDefault="true" width="1500" height="1200"><media:title type="plain">Decision Intelligence: Data is an enabler to better decision making</media:title></media:content></item><item><title>Data Preparation &amp; Our Top Tips</title><category>LinkedIn Livestream</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Wed, 09 Dec 2020 13:19:40 +0000</pubDate><link>https://www.linkedin.com/video/live/urn:li:ugcPost:6737379918000353280/</link><guid 
isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:5fd0c9795c296e7a687d32c4</guid><description><![CDATA[Tyler, Charles, and Andrew talk about their go-to strategies for how best 
to wrangle large, unruly data sets and set your data science project up 
for success.]]></description><content:encoded><![CDATA[<p class="">Conaxon was invited back to speak with SolisMatica and Pointe Vector to discuss our top tips when it comes to preparing your data for a visualization or machine learning project. This was an exciting and (at times) light-hearted approach to diving into the most important step in working with data. </p><iframe scrolling="no" data-image-dimensions="854x480" allowfullscreen="" src="//www.youtube.com/embed/OcVhHtrFFcg?feature=youtu.be&amp;wmode=opaque&amp;enablejsapi=1" width="854" data-embed="true" frameborder="0" height="480">
</iframe><p><a href="https://www.conaxon.org/projects/data-preparation-amp-our-top-tips">Permalink</a><p>]]></content:encoded><media:content type="image/jpeg" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1607519707579-2M0TS8TG8F30SOU0GA61/data_prep.JPG?format=1500w" medium="image" isDefault="true" width="700" height="745"><media:title type="plain">Data Preparation &amp; Our Top Tips</media:title></media:content></item><item><title>So you want a career in Data Analytics--Here's how we think you can do it FAST!</title><category>LinkedIn Livestream</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Fri, 20 Nov 2020 18:47:54 +0000</pubDate><link>https://www.linkedin.com/feed/update/urn:li:activity:6724673711355568129/</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:5fb80d0877c80e175411921a</guid><description><![CDATA[Charles Elwood, SolisMatica, has me back again with the usual crew to talk 
careers in analytics and how we think you should approach achieving your 
goals.]]></description><content:encoded><![CDATA[<p class="">In this livestream, Pointe Vector, SolisMatica, and Conaxon team up to talk about what it takes to get a job in the realm of data analytics. Largely, our thoughts could apply to any career you want to get into, but we tend to focus on key strategies that have worked for us specifically in data analytics. </p><p data-rte-preserve-empty="true" class=""></p><iframe scrolling="no" data-image-dimensions="854x480" allowfullscreen="" src="//www.youtube.com/embed/NHjdpADnDNI?wmode=opaque&amp;enablejsapi=1" width="854" data-embed="true" frameborder="0" height="480">
</iframe><p><a href="https://www.conaxon.org/projects/so-you-want-a-career-in-data-analytics-heres-how-we-think-you-can-do-it-fast">Permalink</a><p>]]></content:encoded><media:content type="image/png" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1605897744296-BTIAEHD6WQX1T9LAPLYX/blog_post_2.PNG?format=1500w" medium="image" isDefault="true" width="835" height="468"><media:title type="plain">So you want a career in Data Analytics--Here's how we think you can do it FAST!</media:title></media:content></item><item><title>Data Analytics in the Automotive Industry</title><category>LinkedIn Livestream</category><dc:creator>Tyler Betthauser</dc:creator><pubDate>Fri, 20 Nov 2020 18:21:00 +0000</pubDate><link>https://www.linkedin.com/feed/update/urn:li:activity:6689139009626640384/</link><guid isPermaLink="false">5d9c9bf956b0ea2534905eff:5d9c9bf956b0ea2534905f4f:5fb804cd17feb45b1850a194</guid><description><![CDATA[Charles Elwood, from SolisMatica, guides a group of impassioned auto 
industry data junkies through a discussion about where data analytics is 
going for the auto industry and where it has been.]]></description><content:encoded><![CDATA[<p class="">Much of my early career has been spent in the automotive industry. Thanks to Charles Elwood from SolisMatica, I have been able to share my journey.</p><p class="">There is so much data out there that can be used to develop key insights into customers, product use, quality, and so much more. I spend some time with industry experts from SolisMatica, Pointe Vector, and We Predict Inc talking about the use of data analytics within the automotive industry: past, present, and future. Check out the link to my page and listen in on the action!</p><iframe scrolling="no" data-image-dimensions="854x480" allowfullscreen="" src="//www.youtube.com/embed/XawAjsnQqWE?wmode=opaque&amp;enablejsapi=1" width="854" data-embed="true" frameborder="0" height="480">
</iframe><p><a href="https://www.conaxon.org/projects/data-analytics-in-the-automotive-industry">Permalink</a><p>]]></content:encoded><media:content type="image/png" url="https://images.squarespace-cdn.com/content/v1/5d9c9bf956b0ea2534905eff/1605896750805-VZOHA1GWQOWQ6ZT606RL/blog_post_1.PNG?format=1500w" medium="image" isDefault="true" width="834" height="469"><media:title type="plain">Data Analytics in the Automotive Industry</media:title></media:content></item></channel></rss>