Thoughts on Analytics - Conaxon

How to: Parse Android Logs for Analytics and Machine Learning Applications

Tyler Betthauser — Sun, 29 Aug 2021 00:53:22 +0000

Introduction: What are Logs?

Building Android-based apps, or any software for that matter, eventually means working out why a bug is occurring. Bugs are a natural part of software development. Logs are a key tool for understanding the state of your software at the time an issue happens. Think of logs as a ledger of what is happening while the code is running. Engineers can print almost anything to the logs that might help them understand problems that pop up in the future.

Given that logs are often structured, contain a ton of useful data, are easy to acquire, and are key to development, software logs are ripe for sophisticated analysis and maybe even machine learning. There are lots of tools for log analytics, such as Scalyr, Logz.io, Sematext, GrayLog, Nagios, and many others (https://opensource.com/article/19/4/log-analysis-tools). In many cases, an open-source, pre-built tool will work in a pinch and be pretty reliable when a mission-critical bug plagues the backlog. However, it can be useful to have a way of creating your own customized solution.

Android LogCat Logs:

The structure of the Android logs is as follows:

The main files that can be analyzed are the radio, main, event, and system logs. Each log file contains different characteristics about the system at any given time.

Each message in the log consists of the following elements:

A tag indicating the part of the system or application that the message came from
A timestamp (at what time this message came)
The message log level (or priority of the event represented by the message)
The log message itself (a detailed description of the error, exception, or information)

There are a few different log types:

Application log -

Utilize the android.util.Log class methods to write messages of different priority to the log file
Java classes declare their tag statically as a string and can be many layers deep

System log -

Utilize the android.util.Slog class
Many frameworks use the system logs to separate certain messages from a potentially messy application log

Event log -

Event log messages are created using the android.util.EventLog class
Log entries consist of binary tags and they are followed by binary parameters
The message tag codes are stored on the system at: /system/etc/event-log-tags

Radio log

Used for radio and phone (modem) related information
Log entries consist of a binary tag code and a message with network info

Android Log Structure:

tv_sec tv_nsec priority pid tid tag messageLen Message

tag: log tag
tv_sec & tv_nsec: the timestamp of the log messages
- In the logs we are going to parse the date and timestamp (down to the milliseconds)
pid: process Id
tid: thread id
Priority value is one of the following character values:
- V: Verbose (lowest priority)*
- D: Debug*
- I: Info*
- W: Warning*
- E: Error*
- F: Fatal*
- S: Silent (highest priority, on which nothing is ever printed)
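To make the field layout concrete, here is a minimal sketch that pulls the fields out of one text log line with a regular expression. It assumes the exported logs use the "date time pid tid priority tag: message" text layout that the parsing code later in this post relies on; the pattern name, the function name, and the sample line are made up for illustration.

```python
import re

# Assumed layout: "MM-DD HH:MM:SS.mmm  PID  TID P Tag: message"
LOGCAT_LINE = re.compile(
    r"^(?P<date>\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<pid>\d+)\s+(?P<tid>\d+)\s+"
    r"(?P<priority>[VDIWEFS])\s+"
    r"(?P<tag>.*?)\s*: ?(?P<message>.*)$"
)

def parse_logcat_line(line):
    """Return a dict of fields, or None if the line does not match the assumed layout."""
    match = LOGCAT_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None

# Hypothetical sample line, invented for this example
sample = "08-29 00:53:22.123  1234  5678 I ActivityManager: Start proc com.example.app"
fields = parse_logcat_line(sample)
print(fields["priority"], fields["tag"], fields["message"])
```

Lines that do not fit the layout, such as section headers, come back as None and can be skipped or counted separately.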

Code for Parsing:

The parsing of the files is fairly straightforward—especially because the text files are delimited by simple whitespace.

import pandas as pd
import numpy as np
import seaborn as sns
import re
import os, zipfile
import gzip
import shutil
import datetime
import matplotlib.pyplot as plt

After importing the key libraries, check the working directory and assign it to a variable. This allows the script to be placed in the same directory as the log files:

# define the current working directory as a variable for extracting all the log files that ar
cwd = os.getcwd()
# define the search path for the rest of the script to reference
search_path = os.getcwd()
#print
print(cwd)

The cwd should be within the folder where the log files are located. We’ll define a function to be used later that will programmatically level out the arrays. Then, we get to work decompressing the log files so everything ends up as a text file:

# Function to make the array lengths the same later
def pad_dict_list(dict_list, padel):
    lmax = 0
    for lname in dict_list.keys():
        lmax = max(lmax, len(dict_list[lname]))
    for lname in dict_list.keys():
        ll = len(dict_list[lname])
        if  ll < lmax:
            dict_list[lname] += [padel] * (lmax - ll)
    return dict_list

file_type = ".gz"
for fname in os.listdir(path=search_path):
    if fname.endswith(file_type):
        with gzip.open(fname,'rb') as f_in:
            with open(fname+'.log','wb') as f_out:
                shutil.copyfileobj(f_in,f_out)
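To see what the padding helper does, here is a small sketch on toy data. The function is a compact rewrite of the same idea, and the column values are invented for illustration.

```python
def pad_dict_list(dict_list, padel):
    """Pad every list in the dict to the length of the longest one."""
    lmax = max(len(values) for values in dict_list.values())
    for values in dict_list.values():
        values.extend([padel] * (lmax - len(values)))
    return dict_list

# Toy columns of unequal length, invented for this example
columns = {"date": ["08-29", "08-29", "08-29"], "app": ["Tag1:"]}
padded = pad_dict_list(columns, "x")
print(padded)
# {'date': ['08-29', '08-29', '08-29'], 'app': ['Tag1:', 'x', 'x']}
```

Note that the filler always lands at the end of the shorter lists. If a short line occurred in the middle of a file, the padded values will not sit on the row that was actually short, so padding is best treated as a quick way to get a dataframe built rather than a repair of the data.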

The next lines do the following:

Collect the names of all the main log files into a list
Strip the still-compressed .gz files out of that list
Loop through the list
Read and parse each file
Append each parsed field to the appropriate list

mainLogs = []        
keyword = 'main'
for fname in os.listdir(cwd):
    if keyword in fname:
        mainLogs.append(fname)  
        
mainLogs = [item for item in mainLogs if not item.endswith('.gz')]
    
date = []
time = []
processID = []
threadID = []
priority = []
app = []
tagsText = []
readLine = []

for main in mainLogs:
    with open(main,encoding='utf8',errors='surrogateescape',newline='\n') as logs:
        try:
            for line in logs:
                lines = line.split()
                #for debugging
                readLine.append(lines)
                date.append(lines[0])
                time.append(lines[1])
                processID.append(lines[2])
                threadID.append(lines[3])
                priority.append(lines[4])
                app.append(lines[5])
                tagsText.append(lines[6:])
        except IndexError:
             pass

After we have written our parsed files to the lists we need to combine the messages and tags together since we split by whitespace. This next little piece of code will recombine tags and texts to a human readable string:

tagsTextComb = []
for innerlist in tagsText:
    tagsTextComb.append(' '.join(innerlist)+" ")
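An alternative sketch, not the approach used in this post: limiting the number of splits keeps the message as a single string, so nothing needs to be joined back together, and checking the field count per line keeps every column the same length. The function name and sample lines are assumptions for illustration, and a tag that itself contains spaces would still be split incorrectly.

```python
def split_log_line(line):
    """Split one line into date, time, pid, tid, priority, tag and message."""
    parts = line.strip().split(maxsplit=6)
    if len(parts) < 6:
        return None  # header or malformed line
    if len(parts) == 6:
        parts.append("")  # tag with an empty message
    return parts

# Hypothetical lines, invented for this example
lines = [
    "--------- beginning of main\n",
    "08-29 00:53:22.123  1234  5678 I ActivityManager: Start proc  com.example.app\n",
]
rows = [parts for parts in map(split_log_line, lines) if parts is not None]
print(rows[0][5], "|", rows[0][6])
```

Because the message is never split, its internal spacing is preserved exactly as logged.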

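The next lines of code check the length of each list. For a dictionary of lists to be transformed into a pandas dataframe, each of the lists must be the same length. A line with fewer than six fields raises an IndexError partway through the appends, so the later lists can end up slightly shorter than the earlier ones.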

print("length of Date"+' '+str(len(date)))
print("length of Time"+' '+str(len(time)))
print("length of processID"+' '+str(len(processID)))
print("length of threadID"+' '+str(len(threadID)))
print("length of priority"+' '+str(len(priority)))
print("length of app"+' '+str(len(app)))
print("length of tagsText"+' '+str(len(tagsText)))
print("length of tagsTextComb"+' '+str(len(tagsTextComb)))

length of Date 3829775
length of Time 3829775
length of processID 3829775
length of threadID 3829775
length of priority 3829775
length of app 3829770
length of tagsText 3829770
length of tagsTextComb 3829770

The following code finalizes the processing of the main log:

Combine the lists into a dictionary
Call the function that pads the lists and evens them out
Create the dataframe for the main log

mainDict = {'date': date, 'time': time,'processID':processID,'threadID':threadID,'priority':priority,'app':app,'tagsText':tagsTextComb}

pad_dict_list(mainDict,'x')

dfMain = pd.DataFrame(mainDict)
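The date and time columns are still plain strings at this point. A minimal sketch of turning one pair into a real timestamp with millisecond precision follows. It assumes the date field is month-day with no year, so the year has to be supplied; `ASSUMED_YEAR` and `to_timestamp` are hypothetical names.

```python
from datetime import datetime

ASSUMED_YEAR = 2021  # assumption: the "MM-DD" date field carries no year

def to_timestamp(date_str, time_str, year=ASSUMED_YEAR):
    """Combine a 'MM-DD' date and 'HH:MM:SS.mmm' time into a datetime."""
    return datetime.strptime(f"{year}-{date_str} {time_str}", "%Y-%m-%d %H:%M:%S.%f")

ts = to_timestamp("08-29", "00:53:22.123")
print(ts.isoformat(timespec="milliseconds"))  # 2021-08-29T00:53:22.123
```

Logs that span a New Year would need the year adjusted per file, which this sketch does not attempt.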

For the remainder of this post, we will process the rest of the log files, combine them, and get them ready for a bit of analysis:

crashLogs = []        
keyword = 'crash'
for fname in os.listdir(cwd):
    if keyword in fname:
        crashLogs.append(fname)
        
crashLogs = [item for item in crashLogs if item.endswith('.log')]

crashDate = []
crashTime = []
crashProcessID = []
crashThreadID = []
crashPriority = []
crashApp = []
crashTagsText = []
crashReadLine = []

for crash in crashLogs:
    with open(crash,encoding='utf8',errors='surrogateescape',newline='\n') as logs:
        next(logs)
        try:
            for line in logs:
                lines = line.split()
                #for debugging
                crashReadLine.append(lines)
                crashDate.append(lines[0])
                crashTime.append(lines[1])
                crashProcessID.append(lines[2])
                crashThreadID.append(lines[3])
                crashPriority.append(lines[4])
                crashApp.append(lines[5])
                crashTagsText.append(lines[6:])
        except IndexError:
             pass

crashTagsTextComb = []
for innerlist in crashTagsText:
    crashTagsTextComb.append(' '.join(innerlist)+" ")

crashDict = {'date':crashDate,'time':crashTime,'processID':crashProcessID,'threadID':crashThreadID,'priority':crashPriority,'app':crashApp,'tagsText':crashTagsTextComb}
pad_dict_list(crashDict,'x')
dfCrash = pd.DataFrame(crashDict)

eventsLogs = []
keyword = 'event'
for fname in os.listdir(cwd):
    if keyword in fname:
        eventsLogs.append(fname)

eventsLogs = [item for item in eventsLogs if not item.endswith('.gz')]

date = []
time = []
processID = []
threadID = []
priority = []
app = []
tagsText = []
readLine = []

for event in eventsLogs:
    with open(event,encoding='utf8',errors='surrogateescape',newline='\n') as logs:
        next(logs)
        try:
            for line in logs:
                lines = line.split()
                #for debugging
                readLine.append(lines)
                date.append(lines[0])
                time.append(lines[1])
                processID.append(lines[2])
                threadID.append(lines[3])
                priority.append(lines[4])
                app.append(lines[5])
                tagsText.append(lines[6:])
        except IndexError:
             pass
             
tagsTextComb = []
for innerlist in tagsText:
    tagsTextComb.append(' '.join(innerlist)+" ")

eventsDict = {'date':date,'time':time,'processID':processID,'threadID':threadID,'priority':priority,'app':app,'tagsText':tagsTextComb}
pad_dict_list(eventsDict,'x')
dfEvents = pd.DataFrame(eventsDict)

sysLogs = []
keyword = 'system'
for fname in os.listdir(cwd):
    if keyword in fname:
        sysLogs.append(fname)

sysLogs = [item for item in sysLogs if not item.endswith('.gz')]       

date = []
time = []
processID = []
threadID = []
priority = []
app = []
tagsText = []
readLine = []

for sys in sysLogs:
    with open(sys,encoding='utf8',errors='surrogateescape',newline='\n') as logs:
        try:
            for line in logs:
                lines = line.split()
                #for debugging
                readLine.append(lines)
                date.append(lines[0])
                time.append(lines[1])
                processID.append(lines[2])
                threadID.append(lines[3])
                priority.append(lines[4])
                app.append(lines[5])
                tagsText.append(lines[6:])
        except IndexError:
             pass
             
tagsTextComb = []
for innerlist in tagsText:
    tagsTextComb.append(' '.join(innerlist)+" ")

sysDicts = {'date':date,'time':time,'processID':processID,'threadID':threadID,'priority':priority,'app':app,'tagsText':tagsTextComb}
pad_dict_list(sysDicts,'x')
dfSys = pd.DataFrame(sysDicts)

radioLogs = []
keyword = 'radio'
for fname in os.listdir(cwd):
    if keyword in fname:
        radioLogs.append(fname)

radioLogs = [item for item in radioLogs if not item.endswith('.gz')]

date = []
time = []
processID = []
threadID = []
priority = []
app = []
tagsText = []
readLine = []

for radio in radioLogs:
    with open(radio,encoding='utf8',errors='surrogateescape',newline='\n') as logs:
        try:
            for line in logs:
                lines = line.split()
                #for debugging
                readLine.append(lines)
                date.append(lines[0])
                time.append(lines[1])
                processID.append(lines[2])
                threadID.append(lines[3])
                priority.append(lines[4])
                app.append(lines[5])
                tagsText.append(lines[6:])
        except IndexError:
             pass
             
tagsTextComb = []
for innerlist in tagsText:
    tagsTextComb.append(' '.join(innerlist)+" ")

radioDicts = {'date':date,'time':time,'processID':processID,'threadID':threadID,'priority':priority,'app':app,'tagsText':tagsTextComb}
pad_dict_list(radioDicts,'x')
dfRadio = pd.DataFrame(radioDicts)

frames = [dfRadio, dfSys, dfMain, dfCrash, dfEvents]
df = pd.concat(frames)
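The five blocks above differ only in the keyword, so they can be folded into one helper. The following is a sketch under the same assumptions as the code above (whitespace-delimited lines, six leading fields); `find_logs` and `parse_logs` are hypothetical names, and unlike the original it skips short lines instead of stopping at the first one.

```python
import os

def find_logs(directory, keyword):
    """Decompressed log files in `directory` whose name contains `keyword`."""
    return sorted(
        name for name in os.listdir(directory)
        if keyword in name and not name.endswith(".gz")
    )

def parse_logs(directory, keyword):
    """Parse every matching file into row dicts, tagging each row with its source."""
    columns = ["date", "time", "processID", "threadID", "priority", "app"]
    rows = []
    for name in find_logs(directory, keyword):
        path = os.path.join(directory, name)
        with open(path, encoding="utf8", errors="surrogateescape") as logs:
            for line in logs:
                parts = line.split()
                if len(parts) < len(columns):
                    continue  # skip headers and malformed lines
                row = dict(zip(columns, parts))
                row["tagsText"] = " ".join(parts[len(columns):])
                row["source"] = keyword
                rows.append(row)
    return rows
```

Calling `parse_logs(os.getcwd(), keyword)` for each of 'main', 'crash', 'event', 'system', and 'radio' and passing the combined list of rows to `pd.DataFrame` would give one table with a `source` column, without any padding step.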

This code should help you get started! In a follow up piece, we’ll go over some basic analytics, cleaning, and applications.

Keep Analytics, Machine Learning, and Artificial Intelligence to Simple Use-Cases with Collective Vantage

Tyler Betthauser — Tue, 18 May 2021 00:27:49 +0000

Keep it Simple, Stupid (KISS):

We’ve talked to businesses and economic development organizations about digitization, feelings about data, typical challenges with implementation of analytics, and the future adoption of machine learning / artificial intelligence. Previously, these struggles were discussed:

Incentives to digitalize early (or at all) in a small or micro business are quite small--especially if there is little realized return on investment due to an inability to derive insights from the data collected.
Data is expensive
Data is hard to collect and synthesize correctly
There isn't enough data
Data is not timely or is difficult to keep timely

We set up Collective Vantage to combat these challenges with an easy, compelling, and affordable technology that spreads the load across networks of businesses that are onboarded. This post details a project of a couple of hours that demonstrates just how effective a simple machine learning model can be for a micro/small business looking to understand how its future sales might look.

The Dataset:

For this project, we are using a retail dataset from Kaggle: https://www.kaggle.com/manjeetsingh/retaildataset . Since retail organizations are some of the most plentiful of the small/micro businesses, at 2.6 million as of 2020, it makes sense to work on something related to retail.

What’s in this data?

We are given historical sales data for 45 stores located in different regions. The company runs several promotional markdown events throughout the year in their stores. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. Store data and macro-economic data are also provided.

What kind of Features are Included?

Contains additional data related to the store, department, and regional activity for the given dates.

Store Number
Date
Temperature
Fuel_Price
MarkDown1-5 - % markdown amounts
CPI (Consumer Price Index)
Unemployment Rate
IsHoliday - True/False indicator of Holiday

What does the Sales Data Look Like?

Historical sales data, which covers 2010-02-05 to 2012-11-01. Within this tab you will find the following fields:

Store Number
Department Number
Date
Sales Number

Yep, you can Collect this Data Yourself (and maybe already do)!

A theme when talking with small and micro business mentors is that most do not have reliable methods in place to collect operational and financial data. If you are reading this and head up a retail establishment, we hope to convey that tools like QuickBooks, Salesforce, e-commerce software, and many others can hold all of this very basic data. It is also quite easy to extract the data out of these systems for analysis. Even Excel can be a reliable tool when first starting out.

A key advantage to Collective Vantage is that the technology is designed around making data collection and aggregation easier across various tools used by businesses. It is often a daunting task to attempt this key step.

Onto Some Code!

Import the libraries we need

import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
import datetime
%matplotlib inline
from sklearn.ensemble import GradientBoostingRegressor,AdaBoostRegressor,RandomForestRegressor

Read in the data

store = pd.read_csv("store.csv")
feature = pd.read_csv("features.csv")
sales = pd.read_csv("sales.csv")

Get our bearings on what the structure of the store data looks like. We can see there is the store Id, Type, and Size. Given that this is anonymized data, the values are somewhat nonsensical. However, the main concept is that these high-level data points can still be useful in prediction tasks. Plus, these can be very easy for a business to collect.

store.describe().transpose()

The features table holds good information about the store, markdowns, holidays, and macro-economic data. Again, all data that can easily be acquired, stored and used for analytics. The main thing to notice is that we have to clean up some of the blank fields.

feature.describe().transpose()

Finally, a cursory look at the sales data

Next step is to combine all the tables together to a single view

store_feat = store.merge(right = feature, on = 'Store')
df = store_feat.merge(right = sales, on = ['Store', 'Date', 'IsHoliday'])
df.sample(10)

As mentioned previously, the first thing to tackle is making sure there are no blank records. In this project, if the week did not have a markdown, then the record is blank instead of 0. Since there is likely information in the weeks with and without markdown we will simply fill the blanks with 0’s.

df.isna().sum()
df['MarkDown1'] = df['MarkDown1'].fillna(0)
df['MarkDown2'] = df['MarkDown2'].fillna(0)
df['MarkDown3'] = df['MarkDown3'].fillna(0)
df['MarkDown4'] = df['MarkDown4'].fillna(0)
df['MarkDown5'] = df['MarkDown5'].fillna(0)

Next some features will be created to derive more information from the dimensions that already exist:

# Dates in this dataset are day/month/year strings
def day_of_year(date_str):
    date = datetime.datetime.strptime(date_str, '%d/%m/%Y')
    return date.timetuple().tm_yday

def month(date_str):
    date = datetime.datetime.strptime(date_str, '%d/%m/%Y')
    return date.month

def year(date_str):
    date = datetime.datetime.strptime(date_str, '%d/%m/%Y')
    return date.year

def woy(date_str):
    # ISO week number of the year
    date = datetime.datetime.strptime(date_str, '%d/%m/%Y')
    return date.isocalendar()[1]

df['DayOfYear'] = df['Date'].map(day_of_year)
df['MonthOfYear'] = df['Date'].map(month)
df['Year'] = df['Date'].map(year)
df['DayOfYearCos'] = np.cos(df['DayOfYear'])
df['DayOfYearSin'] = np.sin(df['DayOfYear'])
# Pass the format explicitly so ambiguous dates are not read month-first
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df["WeekofYear"] = df['Date'].dt.isocalendar().week.astype(int)

df['IsHoliday'] = df['IsHoliday'].astype('category')
df['Dept'] = df['Dept'].astype('category')
df['Store'] = df['Store'].astype('category')

Some key things to note on the features created with the code above:

Deconstruct the date to get the year, month, day of the year, and week of the year, giving the model more opportunity to pick up on trends and, hopefully, more predictive power
Calculate the sine and cosine of the day of the year, as these features help fit the cyclical nature of the retail data
- https://medium.com/swlh/time-series-forecasting-with-a-twist-27350e97a2cb
- https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca
- https://towardsdatascience.com/taking-seasonality-into-consideration-for-time-series-analysis-4e1f4fbb768f
Convert the IsHoliday, Dept, and Store dimensions to categories so they can be encoded using pd.get_dummies. Department and Store are stored as integers in the dataset, but those numbers are just labels for distinct categories, so we do not want the algorithm to read an order or magnitude into them
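The code above passes the raw day number to np.sin and np.cos. A common variant of cyclical encoding, shown here as a sketch and not as what produced the results reported later, first scales the value so that one full period maps to one full turn of the circle. The function names and the period of 365 are assumptions for illustration.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g. day of year) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b, period=365):
    """Straight-line distance between two encoded values."""
    (sa, ca), (sb, cb) = cyclical_encode(a, period), cyclical_encode(b, period)
    return math.hypot(sa - sb, ca - cb)

# Day 365 and day 1 are neighbours on the calendar and on the circle;
# day 1 and day 183 sit on nearly opposite sides.
print(circle_distance(365, 1), circle_distance(1, 183))
```

With this scaling, the end of December and the start of January get nearly identical feature values, which is the property the linked articles describe.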

After cleaning the data and creating some additional features, visualizing the data highlights some key things to keep in mind later. First, a correlation plot is generated to assess how the variables are correlated with each other (not to be confused with causation).

df_temp = df.copy(deep=True)
df_temp['tot_MarkDown'] = df_temp['MarkDown1'] + df_temp['MarkDown2'] +df_temp['MarkDown3'] +df_temp['MarkDown4'] + df_temp['MarkDown5']
df_temp.drop(['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5','Year'], inplace = True, axis = 1)
fig, ax = plt.subplots(figsize=(20,12))
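# Correlation is only defined for numeric columns, so select those explicitly
sns.heatmap(df_temp.select_dtypes('number').corr(),annot=True)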
# df_temp.head

There are not a ton of highly correlated features in this dataset. MonthOfYear and DayOfYear are going to be somewhat correlated.

The next chart looks at the timeseries data to establish an understanding of how the trends relate:

df[['Date', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 
    'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']].plot(x='Date', subplots=True, figsize=(20,15))
plt.show()

It is easy to notice that there really aren’t a ton of variables that trend along the sales cycle. A lack of observable trend is not necessarily a problem, but it certainly makes the model more abstract and potentially less interpretable. What is great about machine learning is that obscure connections between various inputs can be found and exploited to produce awesome insights, a level of inference that human intuition just can’t reach without extensive time and effort.

Next, visualize the weekly sales numbers to get a closer look at how sales move over time:

# Select the sales column before summing so non-numeric columns are never aggregated
df_time = df.groupby('Date')['Weekly_Sales'].sum().reset_index()
fig, ax = plt.subplots(figsize=(20,12))
ax.plot('Date', 'Weekly_Sales', data=df_time)

A closer look reveals more of the cyclical nature of the sales over time. Sales seem to dip just after the New Year, but peak around Christmas, Thanksgiving, Memorial Day, and the 4th of July. So, pretty typical sales behavior for a retail establishment. Funnily enough, the trend also looks like a WAVE!

The seasonality can be further visualized by looking at the sales split by month:

df_seas = df.groupby(df.Date.dt.month)['Weekly_Sales'].sum().reset_index()
plt.figure(figsize=(10, 5))
sns.barplot(x=df_seas.Date,y=df_seas.Weekly_Sales)

The peaks in April and December are interesting. The dips in January and November are also interesting, given the large peaks in the sales time series chart.

Finally, the sales by store type will be visualized:

df_store_type = df.groupby('Type')['Weekly_Sales'].sum().reset_index()
fig, ax = plt.subplots(figsize=(20,12))
ax.bar('Type', 'Weekly_Sales', data=df_store_type)

It is interesting to note that Store Type A has a significant advantage over Types B and C in terms of predictive capability, because sales are not distributed evenly across the store types. The model we will develop will include sales from all the store types. In the future, it might be useful to develop a model specific to each store type.

Data preparation for modeling is next:

model = df.set_index(['Date', 'Store', 'Dept']).sort_index()
model_data = model.reset_index()
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
model_data[['Temperature','Fuel_Price','MarkDown1','MarkDown2',
            'MarkDown3','MarkDown4','MarkDown5','CPI','Unemployment',
            'Size']] = mms.fit_transform(model_data[['Temperature','Fuel_Price','MarkDown1','MarkDown2',
                                                     'MarkDown3','MarkDown4','MarkDown5','CPI',
                                                     'Unemployment','Size']])

model_data = pd.get_dummies(model_data,drop_first=True)
final_model = model_data.set_index('Date')

The block of code does a few key things:

Sort the table by the Date, Store and Department
Scale the numeric features down to between 0 and 1. This step is necessary in order to bring everything to the same scale
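As a quick illustration of what min-max scaling does to one column, here is a plain-Python sketch of the formula (x - min) / (max - min). The temperature values are made up, and mapping a constant column to 0 is a choice made for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

temperatures = [30.0, 45.0, 60.0, 90.0]  # made-up values
print(min_max_scale(temperatures))  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always becomes 0 and the largest becomes 1, so a column measured in degrees and one measured in square feet end up on the same footing.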

Splitting the data to a training and prediction set occurs next—in addition to defining the features and what we are actually trying to predict. Training will be used to evaluate the basic model and prediction will be used to test how good our model actually is on unseen observations:

training_model = final_model.loc[:'2012-01-01'].reset_index()
pred = final_model.loc['2012-01-01':].reset_index()
X_model_train = training_model.drop(columns=['Weekly_Sales', 'Year', 'DayOfYear','Date'])
y_model_train = training_model['Weekly_Sales']
X_pred = pred.drop(columns=['Weekly_Sales', 'Year', 'DayOfYear','Date'])
y_pred = pred['Weekly_Sales']

For this project, we are going to split the data a bit differently. Both training and prediction will be split into their own train and test sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_model_train, y_model_train, test_size=0.10, random_state=0)

Splitting the data allows us to then build the basic model that will be the benchmark. Gradient Boosting Regressors are one of my go-to algorithms from Sklearn because of the resilience to overfitting, tunability, and flexibility.

from sklearn.ensemble import GradientBoostingRegressor

gbr_regressor = GradientBoostingRegressor()

gbr_regressor = gbr_regressor.fit(X_train, y_train)
gbr_regressor.score(X_test, y_test)

future_pred = gbr_regressor.predict(X_pred)
gbr_regressor.score(X_pred, y_pred)

Training score = 0.697 (69.7%)
Test score = 0.743 (74.3%)
The base model mean squared error (MSE) on test set: $166,493,765.3
The base model mean absolute error (MAE) on test set: $8201.4
The base model root mean squared error (RMSE) on test set: $12,903.2

When looking at the scores for the Gradient Boosting Regressor, it is important to note that the score is actually an R^2 value. R^2 is, generally, a metric that provides insight into how well the model fits the data. There is lots of room for interpretation, and domain experience will dictate whether or not the fit is good enough. In this project, the score is not terrible but can probably be improved. An area of concern is the mean squared error, mean absolute error, and root mean squared error. A very high mean squared error indicates that some predictions are far off, which can be problematic in future predictions. A mean absolute error of $8201.4 seems acceptable, but would need to be evaluated by the stakeholders in the retail shop. RMSE is a less intuitive metric than MAE. However, because the errors are squared before they are averaged, RMSE weights the extreme errors more heavily than the more ‘normal’ prediction errors. There is definitely some room to improve.
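For readers who want to see how these four numbers relate, here is a small plain-Python sketch that computes them from scratch. The actual and predicted values are toy numbers, not taken from the retail dataset, and `regression_metrics` is a hypothetical helper name.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE and R^2 for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot  # 1 - residual sum of squares / total sum of squares
    return {"MSE": mse, "MAE": mae, "RMSE": math.sqrt(mse), "R2": r2}

actual = [100.0, 200.0, 300.0, 400.0]     # toy values
predicted = [110.0, 190.0, 320.0, 380.0]  # toy values
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as the target (dollars here), while MSE is in squared units, which is why its raw value looks so large next to the other two.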

Sklearn offers GridSearchCV and RandomizedSearchCV to test lots of parameter combinations. GridSearchCV tries every combination and is very slow. RandomizedSearchCV samples only a fixed number of combinations, so it returns results much faster:

from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV

parameters = {'learning_rate':[0.05, 0.1, 0.5, 1], 
              'min_samples_split':[2,5,10], 
              'max_depth':[2,3,5],
             'n_estimators':[100,150,250]}

gbr_regressor = GradientBoostingRegressor()
cv_test= KFold(n_splits=5)
clf = RandomizedSearchCV(gbr_regressor, parameters,cv=cv_test,n_jobs=-1)
clf.fit(X_train, y_train)

{'n_estimators': 250, 'min_samples_split': 10, 'max_depth': 5, 'learning_rate': 0.5}
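For a sense of scale, the size of the search space above can be counted directly. This sketch assumes the 5 folds used above and RandomizedSearchCV's default of 10 sampled candidates (`n_iter`), since the call above does not set it.

```python
from itertools import product

parameters = {'learning_rate': [0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10],
              'max_depth': [2, 3, 5],
              'n_estimators': [100, 150, 250]}

n_folds = 5
n_iter = 10  # assumed: the default number of candidates RandomizedSearchCV samples

grid_candidates = len(list(product(*parameters.values())))
grid_fits = grid_candidates * n_folds
random_fits = n_iter * n_folds
print(grid_candidates, grid_fits, random_fits)  # 108 540 50
```

An exhaustive grid search would fit the model 540 times, against 50 for the randomized search, at the cost of possibly missing the best combination.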

After about 30 minutes, the search returns the best parameters. The model can be re-run with the new parameters to get a new score:

gbr_regressor_tuned = GradientBoostingRegressor(n_estimators = 250, 
                                                min_samples_split = 10, 
                                                max_depth = 5, 
                                                learning_rate = 0.5)

X_train_pred, X_test_pred, y_train_pred, y_test_pred = train_test_split(X_pred, y_pred, test_size=0.10, random_state=0)

gbr_regressor_tuned = gbr_regressor_tuned.fit(X_train_pred, y_train_pred)
gbr_regressor_tuned.score(X_test_pred, y_test_pred)

future_pred = gbr_regressor_tuned.predict(X_test_pred)
gbr_regressor_tuned.score(X_test_pred, y_test_pred)

Tuned Prediction Set Score 96.7%
The tuned model mean squared error (MSE) on prediction set: $16,116,078.3
The tuned model mean absolute error (MAE) on prediction set: $2,379.7
The tuned model root mean squared error (RMSE) on prediction set: $4,011.6

Note here that our R^2 jumped by over 20 percentage points. Depending on the use case, that type of increase can indicate overfitting of the model. However, MSE, MAE, and RMSE all declined precipitously. The hyper-parameter tuning seems to have helped significantly with the predictive power of the model.

Cool, but why does this matter?

The ability, at least generally, to predict sales from week to week is hugely important in a tighter margin retail space—where the stakes are high:

Improve the ability to staff more efficiently
Purchase material in a more efficient way
Reduce the level of risk when planning financial and capital investments

Sorry, I just don’t have enough data to do this kind of Modeling!

For many businesses, there just won’t be enough data to build a model. Unfortunately, a vast majority of small/micro businesses simply will not have enough data. So what are these businesses supposed to do? What if you have a new business trying to forecast potential sales within a business community? Collective Vantage wants to eliminate the need to worry about these hurdles. Even if you only have a small amount of data, there are probably businesses similar to yours that already have enough to build a forecast.

Even if I had data to Forecast, I don’t know how to code or Productionize anything

Collective Vantage condenses the technical know-how to a manageable level. Conaxon brings the tech and the users bring the domain experience. It’s as simple as asking a question like: “What might my sales look like next week, month, year?”

Be Wary of Second Hand Data: Here's How Collective Vantage Addresses that Fear

Tyler Betthauser — Tue, 11 May 2021 17:38:57 +0000

I have huge respect for Cassie Kozyrkov, Chief Decision Scientist at Google. Her succinct, but content rich articles are quite helpful to conceptualize complex topics. Her most recent post on LinkedIn can be found here: https://bit.ly/3vSD2y0

This most recent post certainly did not disappoint! You should definitely check out the post.

All summed up, the video details the dangers of relying too heavily on data collected, aggregated, and transformed by someone other than the analyst or data scientist doing the analysis or development of an algorithm. This tip is vastly underrated. There have been so many instances where I have been simply given data and assumed the content was the ‘gold source’ level transactions that can be trusted. Sometimes there would be options to collect my own data and other times that option was not available. Each data scientist and analyst will need to conduct a cost benefit analysis to identify the best course of action that gets the most accurate result.

Cassie’s post got me thinking about how Conaxon & Collective Vantage address this big assumption in the technology and culture we are building:

Collective Vantage is being built because of the ambiguity around where data is coming from and how that data is collected. How can you trust conclusions from analysis built on data that can’t be trusted? One of our key challenges is creating technologies that standardize data collection across a wide range of businesses that have different levels of digitalization, reducing the chances of poor-quality data making its way into users’ and customers’ hands
Collective Vantage doesn’t just standardize data collection, but makes data collection easy through custom integrations with popular platforms. “Set it and forget it” as it were!
Trust through Expertise. Conaxon has experts devoted to understanding how data is collected and generated by users and acts as a digitalization partner. The closer Conaxon can work with its data contributors, the better we can maintain the highest level of data quality possible

Check out Collective Vantage (a product of Conaxon) and sign up to be one of a few companies to pilot the project: https://collective-vantage.crd.co

Key Struggles with Analytics, Decision Intelligence, and Machine Learning Micro-Business Communities

Tyler Betthauser — Fri, 07 May 2021 03:33:52 +0000

A Big Market has a ton of Untapped Potential:

Conaxon has spent a lot of time interviewing small business owners and representatives from economic development organizations, chambers of commerce, and small-business development centers. Out of these conversations grew a passion for finding a way to make machine learning (ml), artificial intelligence (ai), analytics, and decision intelligence (DI) more practical for the micro-businesses out there. By focusing on how to apply these technologies, Conaxon can help take the fear and risk out of key decision making for the small companies that could use the help.

Our mission is to optimize decision making processes for small and micro businesses using a platform that aggregates shared data from contributors and recommends data to users that might help answer key business questions.

Nearly 99% of businesses in America are considered small—topping out at about 30 million. A further subset of the small businesses are called micro-businesses and they make up nearly 75% of the small businesses. We think a vast majority of those micro businesses are underserved due to a focus on firms that can deliver huge contracts.

What we’ve Found in our Research so far:

Here are the key struggles and how our (pre-launch) product Collective Vantage attempts to address some of those:

Cost

For most micro-businesses, budgets are tight! Return on investment is of huge importance when making purchases—especially when it comes to technology. Given how hard conveying the value of analytics is in terms of financial return for a large organization, the uphill climb is even steeper for a small business. The constituent components for a production-level analytics, ai, and ml application are still quite expensive to build, maintain and grow. A data engineer in most areas commands northward of $70,000. Many Data Scientists make northward of $80,000 to $90,000. Business intelligence professionals and analysts can make between $65,000 and $80,000. If a consultant is being considered, they can charge several hundred per hour. Marketing databases that aid in customer segmentation can also cost many thousands of dollars—and are often out of date quickly. It doesn’t take long to realize your typical small business will struggle to afford tools, human capital, and data.

With all these costs in mind, Conaxon realized that if the decision making process could be optimized, a lot of good could be done for many businesses. Essentially, automating the process of research, data aggregation, cleaning, enrichment, and basic presentation would save considerable time and create the economies of scale required to quash the cost. Conaxon believes Collective Vantage has the technology to achieve these efficiencies and can deliver an extremely competitive product relative to other players in the space.

There Just Isn’t Enough Data

In all of our interviews, it was very obvious that small & micro businesses struggle to collect and store enough structured data to derive insights. Most companies, logically, are focused on operations and delighting customers. Data isn’t really at the top of most small businesses’ to-do lists.

This feedback prompted a thought! Why not aggregate data across a network of similar businesses to create a large dataset that can be anonymized, cleaned, enriched and more useful? This type of model provides some unique advantages:

Users / contributors get a truer sense of the larger context of a market, customer segment, or any other subject matter that gets aggregated. The whole is greater than the sum of its constituent parts
Since the goal is to aggregate completely across a network, a user / contributor can ‘look across’ to other products and services being offered to find opportunities for innovation within their own local markets. Diversity in analytics is a key advantage we want to provide to contributors and users
Data quality is more likely to be less of a concern because there is an incentive to not poison the well everyone is drinking from in terms of what gets contributed
Contributors do not need a ton of data to get started. In fact, we want to encourage large networks of businesses to start small and continue to grow their capabilities while still being able to adopt analytics, ai, ml, and DI in the short-term. Costs can remain low with this approach as well.

Data can take a While to Collect

Interviews with our target customers seem to indicate that, in most cases, there would be a pretty long lead time before a dataset is sizeable enough for analytics. If a retail establishment wanted to do any sort of sophisticated forecasting, it would take years to collect a sufficient number of observations. Consider a small antique shop that wants to do customer segmentation. Unless an antique shop brings in hundreds of customers in a short amount of time, the database will become outdated quickly and it will be impossible to detect drift.

Collective Vantage addresses the timeliness issue by taking small sets from many sources and continually supporting contributors in expanding their collection capabilities.

Complexity

We have been learning that many small and micro businesses simply can’t yet tackle these complex topics on their own. Many micro businesses are an owner-operator with a few employees that do not specialize in analytics, let alone ai or ml.

Our solution is focused on quickly tackling the decision making process by being the analyst for the customer, but much faster and more cheaply than hiring someone or performing the work themselves. The simplicity comes from the customer/user inputting a question they want answered in order to make a decision. Once the question has been asked, Collective Vantage handles the rest. The platform returns a set of recommended data and analysis that can help answer the query in an informed way.

When cash is tight, ROI is king

Many businesses we talked to expressed how tough it is to just pay the bills. Insufficient cashflow is the third leading reason entrepreneurs close their doors. For those contributing data to Collective Vantage, we want to show that data truly is an asset, each entity owns their data, and each entity can derive monetary value from their data assets. To address the concern over ROI, Collective Vantage will provide the opportunity to earn crypto or cash when a contributor’s data is sold (with permission, of course). Because we believe that each contributor owns their data and is contributing to a valuable dataset that outside entities will want access to, it makes sense to redistribute those revenues back to contributors.

Join the Team

Conaxon is looking to find 100 businesses to partner with on our launch. Please visit: https://collective-vantage.crd.co to sign up and learn more about Collective Vantage.

Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 2)

Tyler Betthauser — Wed, 17 Mar 2021 00:27:23 +0000

Last time, we looked at the process for setting up a project in Google Cloud, enabling the API, and utilizing the API to get data that we can analyze. In this article, we are going to do a bit of data cleaning, analysis, NLP/unsupervised sentiment analysis, and data visualization in PowerBI.

Where we left off….

Last time, we created a Pandas DataFrame that houses the commentary from a YouTube Channel:

data_threads={'comment':comments_pop,'comment_id':comment_id_pop,'reply_count':reply_count_pop,'like_count':like_count_pop,'channel_id':channel_id_pop,'video_id':video_id_pop}
threads=pd.DataFrame(data_threads)
threads.head()

After creating this table, I removed the duplicates in the event that we actually DO have any duplicates. This is more of a best practice than an actual need. I did have an issue in previous versions of the script where duplicate comments had been generated. This will also ensure that when we get to calculating metrics and counting there will be no risk of artificially inflating values.

threads.drop_duplicates(inplace=True)

Next, we merge the high-level statistics with the comments:

result = pd.merge(threads, df, how="inner", on=["video_id"])

Cleaning the Comment Text:

Before applying any sort of sentiment analysis, or analysis in general, we absolutely should clean the comments. We would not be able to scale an analysis on hundreds of thousands of comments without some sort of cleaning. Let us start with removing tags, for example:

def remove_tags(string):
    result = re.sub('<.*?>','',string)
    return result

result['comment']=result['comment'].apply(lambda cw : remove_tags(cw))

There are lots of things we can do with emojis and emoticons. They convey sentiment via a pictogram. Unfortunately, a lexicon cannot interpret a picture. The emojis will need to be converted to a phrase. We can do that in this way:

from emot.emo_unicode import UNICODE_EMO, EMOTICONS

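def convert_emojis(text):
    # Replace every emoji found in the text with its underscore-joined description
    for emot in UNICODE_EMO:
        text = text.replace(emot, "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()))
    # Return only after the loop has checked every emoji
    return text

def convert_emoticons(text):
    # Replace every emoticon pattern with its underscore-joined description
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    # Return only after the loop has checked every emoticon
    return text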

result['comment'] = result['comment'].apply(convert_emoticons)
result['comment'] = result['comment'].apply(convert_emojis)

URLs cannot be interpreted as much of anything. We will remove them:

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

result['comment'] = result['comment'].apply(remove_urls)

HTML is also another piece that should be cleaned up:

from bs4 import BeautifulSoup

def html(text):
    return BeautifulSoup(text, "lxml").text

result['comment'] = result['comment'].apply(html)
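
The individual cleaning steps above can also be chained into one helper so that every comment goes through the same sequence. The following is a minimal sketch rather than the code used for this post: clean_comment is a hypothetical name, and the standard library's html.unescape stands in for BeautifulSoup so the example runs without extra packages.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def clean_comment(text):
    """Strip tags and URLs, decode HTML entities, and tidy whitespace."""
    text = TAG_PATTERN.sub('', text)   # drop tags such as <b> or <br>
    text = URL_PATTERN.sub('', text)   # drop links
    text = html.unescape(text)         # turn entities such as &amp; into &
    return re.sub(r'\s+', ' ', text).strip()

print(clean_comment('<b>Great</b> video! See https://example.com &amp; enjoy'))
```

With the DataFrame from this post, the helper could be applied with result['comment'].apply(clean_comment).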

The next few lines were generated for future studies in Natural Language Processing (NLP), but not necessarily used here. However, they are useful functions to reference back to later if you happen to be on this journey as well:

Remove Punctuation
Tokenize
Remove Stop Words
Lemmatize
Generate the number of words in a comment
Generate the number of sentences in a comment

import string
string.punctuation
def remove_punctuation(text):
    no_punct=[words for words in text if words not in string.punctuation]
    words_wo_punct=''.join(no_punct)
    return words_wo_punct
result['comment_no_punc']=result['comment'].apply(lambda x: remove_punctuation(x))

def tokenize(text):
    split=re.split(r"\W+",text)
    return split
result['comment_no_punc_tokens']=result['comment_no_punc'].apply(lambda x: tokenize(x.lower()))
result.head(1)

# Importing stopwords from the nltk library (requires nltk.download('stopwords') once)
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
# Function to remove the stopwords from a list of tokens
# (named remove_stopwords so it does not shadow the imported stopwords module)
def remove_stopwords(tokens):
    return " ".join([word for word in tokens if word and word not in STOPWORDS])

result['title_wo_punct_split_wo_stopwords'] = result['comment_no_punc_tokens'].apply(remove_stopwords)

import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV} 

def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

result['title_wo_punct_split_wo_stopwords_lemma'] = result['title_wo_punct_split_wo_stopwords'].apply(lemmatize_words)

result['num_words'] = result['comment'].apply(lambda x: len(x.split()))
result['num_sentences'] = result['comment'].apply(lambda x: len(re.split( '~ ...' ,'~'.join(x.split('.')))))

Implement NLTK VADER Lexicon:

One thing to note here is that the optimal solution for sentiment analysis is actually LABELING the data manually BEFORE attempting sentiment analysis. By doing this, you ensure that the labeling is unique to your own use case, which will be explored a bit later. The code to run the comments against the NLTK VADER lexicon can be found below:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
nltk.downloader.download('vader_lexicon')

sid = SIA()

sentiment = []

for comment in result['comment']:
    sentiment.append(sid.polarity_scores(comment) )

result['sentiment'] = sentiment

This piece of the code will take a while, depending on the number of comments. In this block, we simply loop through the comments, write the results from SentimentIntensityAnalyzer to the empty sentiment list, and then add that list as another column in the result/comments table. The sentiment analyzer returns a dictionary for each comment. We can split this dictionary into separate columns by executing the following code:

result = result.drop(columns='sentiment').assign(**pd.DataFrame(result.sentiment.values.tolist()))
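
To make the column split easier to follow, here is a small stand-alone sketch of the same idea. The comments and scores are invented for illustration; only the dictionary keys (neg, neu, pos, compound) match what the analyzer returns. It uses join instead of assign, which is just one alternative way of writing it.

```python
import pandas as pd

# Made-up scores in the same shape that polarity_scores() returns
demo = pd.DataFrame({
    'comment': ['love it', 'not for me'],
    'sentiment': [
        {'neg': 0.0, 'neu': 0.2, 'pos': 0.8, 'compound': 0.64},
        {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.3},
    ],
})

# One column per dictionary key, aligned to the original rows by index
scores = pd.DataFrame(demo['sentiment'].tolist(), index=demo.index)
expanded = demo.drop(columns='sentiment').join(scores)
print(expanded.columns.tolist())
```

Passing index=demo.index keeps the scores lined up with their comments even if the DataFrame index is not a clean 0..n-1 range.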

Finally, we write the file to a csv so we can connect PowerBI. We are going to now create a simple visualization that would allow a user to track the sentiment on their YouTube content.

Visualizing the YouTube Data Collected:

Since the goal was to measure engagement on the channel, the more basic stats are shown at the top. Comments, Likes, Dislikes, and Views measure a general level of engagement with the topic of the video. One could probably even create a weighted average calculation with the average sentiment, thereby creating a single engagement score. The bar chart simply shows the top 10 videos. Naturally, it would make sense to focus on the videos that were viewed most so as to analyze why those videos may have been more successful. Comments are listed down below along with the predicted sentiment values. NOTE: the compound score is a kind of ‘blended’ sentiment score. This Stack Overflow thread seems to explain the mathematics of the scoring well enough: https://stackoverflow.com/questions/40325980/how-is-the-vader-compound-polarity-score-calculated-in-python-nltk. Finally, there is a dual-axis chart that shows the average compound score over time relative to the views. There is actually little correlation between the two, but I thought that juxtaposing these values would be a good method for an analyst to notice large swings in the sentiment over time. Large contrasts in sentiment would be key areas to investigate why the content was either well received or not.
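
For reference, the compound score is the sum of the word-level valence scores squashed into the range -1 to 1. The sketch below mirrors the normalization described in the linked thread, with the constant alpha set to 15; normalize_compound is a name chosen for this illustration, and the input totals are arbitrary.

```python
import math

def normalize_compound(score_sum, alpha=15):
    """Map an unbounded sum of valence scores into the range [-1, 1]."""
    norm = score_sum / math.sqrt(score_sum * score_sum + alpha)
    # Clip as a safeguard so the result never leaves [-1, 1]
    return max(-1.0, min(1.0, norm))

for total in (-4.0, 0.0, 1.5, 4.0):
    print(total, round(normalize_compound(total), 4))
```

The practical takeaway is that the compound score grows quickly for the first few sentiment-bearing words and then flattens out, so very long comments do not dominate the average.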

Considerations:

The sentiment analyzer struggles with context, for example with things like ‘bad ass’. In most contexts, this colloquialism is actually a ‘positive’ for some content even though the words technically might be considered ‘negative’. The limitation of the sentiment analyzer is that other people without domain expertise in the subject matter are categorizing the words as negative, neutral, or positive. On aggregate, I think this methodology is reliable and the benefits outweigh the risks of mis-labeling
Take steps to do human labeling before attempting to do a thorough sentiment analysis. The methods we employ here are somewhat of a ‘quick and dirty’ method of measuring sentiment

Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 1)

Tyler Betthauser — Sun, 14 Mar 2021 03:02:29 +0000

I haven’t spent much time managing marketing teams, content developers, creating campaigns, or any of those projects. But, from the outside looking in, I can imagine that there must be an immense amount of work and intuition required to come up with good marketing content. Some might say that the author should just ‘know their audience’ or ‘just create what you want to create’. In general, I agree. However, companies like YouTube have changed the game somewhat in recent years. Audiences have never been larger, more diverse, more targeted, and accessible. It would be nearly impossible to grow a marketing strategy through intuition alone and maintain success. Creators and Authors should have tools that give them an opportunity to keep up with a very fickle audience. With a little data, a few lines of code, and some visualizations these creatives just might have a chance to be ahead of the curve….faster.

In this multi-part series, we are going to spin up a simple project to demonstrate a neat use case. Consider for a moment that you are a marketing leader or content creator responsible for generating campaigns, marketing videos, podcasts, etc. We’ll also assume that while this leader has a lot of experience, they are also quite pragmatic and aware there are some opportunities to start off on the right foot. We can try to understand the landscape for the company’s products and content before investing dollars into a project. Seeing as this company is heavily B2C, YouTube is a keystone in delivering good content to large audiences quickly.

In this series, we will cover the following:

Setting up a project in Google so as to activate the v3 data API for YouTube
Get familiar with the APIs we will need to use and what the APIs provide
Make calls to the appropriate APIs and store the results
Clean the comment data to be processed further by the VADER Lexicon from NLTK
Create a simple PowerBI dashboard to visualize/analyze some of the results
Discuss some future improvements that can be made

Setting up your project with Google:

To begin, you’ll have to set up a project with Google so as to activate the API, get your API Keys, configure OAuth, etc.

If you don’t have an account with Google/Google Cloud, sign up for a Google account
Sign into Google Cloud
Create the project and name according to your needs
Once the project has been created, search for ‘youtube’ in the search bar at the top of the Google Cloud Workspace
One of the top results will be: YouTube Data API v3
Select the API to move forward
Enable the API for your project and move onto the next screen
At the top right, click the button to create credentials
Choose the YouTube Data API v3 from the list under the question: “Which API are you using?”
For the next question: “Where will you be calling the API from?” use the answer: “Other UI (e.g. Windows, CLI tool)”
For the next question: “What data will you be accessing?” use the answer: “Public Data”
1. Unless, of course, you know the app will need personal information. For this use case, we will not need to use personal data. Plus, there are extra levels of scrutiny involved with accessing personal data
After proceeding to the next page, you will be presented with an API Key. Make sure to save this key.
Next, click on the ‘CREATE CREDENTIALS’ button. You’ll want to create an OAuth Client ID
Click on ‘Configure Consent Screen’
I chose an External User Type and selected Create
Fill out the form and continue
Add the first three scopes. If you are only going after the comments, you won’t need to add most of these APIs. You will not need to add sensitive or restricted scopes
Add users as necessary
I published my app in the following screens because it seemed as though there were issues when running code if the project was not published. We are only testing anyways so there is little risk of huge impacts
Under the Credentials tab, create an OAuth client ID
Name the application type as a ‘desktop app’ and Name the client any way you want
Save your client ID and Client Secret
Download the .json file from the ‘OAuth 2.0 Client IDs’ and save the file in the same directory that you will be developing within
Optional: Add a service account

It takes a bit of time to set up the API and OAuth, but it is worth it in the end. There are tons of other walkthroughs on YouTube if you get stuck.

Starting to Build the Script:

Import libraries and define some variables/functions that enable the OAuth to operate:

import os
import numpy as np
import re

CLIENT_SECRETS_FILE = "client_secret_2.json"

SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

import pickle
import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    #  Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
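            # run_console() was removed from newer google-auth-oauthlib releases;
            # run_local_server() opens a browser and listens on a free local port
            credentials = flow.run_local_server(port=0)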
 
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)
 
    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)
 
if __name__ == '__main__':
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()

A key part of this block of code is to save off the credentials for the authenticated service. You do not want to have to re-authenticate EVERY single time you run the script. This block will fix this problem.

api_key = ''
youtube=build('youtube','v3',developerKey=api_key)

Define your API Key here in these lines of code. This builds a ‘key’ as it were to then call the different parts of the API like comments, search function, etc.

Using the Search Function to get the Channels we will analyze:

You can use the search function to get a set of results from the YouTube search function. You should familiarize yourself with the documentation: https://developers.google.com/youtube/v3/getting-started

We will not be exploring the API functions in depth here in this article. Here is an example of using the search method of the API:

snippets = youtube.search().list(part='id,snippet',type='channel',q='t.rex arms').execute()

channelId = snippets['items'][0]['snippet']['channelId']
print(channelId)
>> UCU-ljC8EvKZFhJ-pct_5rMQ

I cut a bit of a corner here. I knew that the channel Id I wanted existed at the 0 index. This Channel Id will be used to kick off the rest of the script. The concept is as follows:

Search YouTube for a Channel(s) (the Seed) >> Extract the Stats and the Playlist Id from channels.list() >> Get the list of videos and their Ids from the playlistItems() >> Use the channel/video ids to get the comments from each video. The next lines of code will demonstrate the results that the YouTube API will return:

stats = youtube.channels().list(part='statistics',id=channelId).execute()
stats['items']

>> [{'kind': 'youtube#channel',
  'etag': '6p3MzT5MtiAPsl3LjZUa1Jrfp78',
  'id': 'UCU-ljC8EvKZFhJ-pct_5rMQ',
  'statistics': {'viewCount': '103419822',
   'subscriberCount': '975000',
   'hiddenSubscriberCount': False,
   'videoCount': '145'}}]

content = youtube.channels().list(id = channelId, part='contentDetails').execute()
content['items']

>> [{'kind': 'youtube#channel',
  'etag': 'NHEVnfNtoeJIhQaZFf68M1xiH9c',
  'id': 'UCU-ljC8EvKZFhJ-pct_5rMQ',
  'contentDetails': {'relatedPlaylists': {'likes': '',
    'favorites': '',
    'uploads': 'UUU-ljC8EvKZFhJ-pct_5rMQ'}}}]

uploadId = content['items'][0]['contentDetails']['relatedPlaylists']['uploads']
uploadId

>> 'UUU-ljC8EvKZFhJ-pct_5rMQ'

After getting the uploads playlist, we should be able to go get the videos from the playlist. If there was more than one playlist, you could simply write the playlist ids to an empty list and loop through all of them to get the videos. Next, we get the videos from the playlist:

allVideos = []
nextPage_token = None
# Keep requesting pages of 50 videos until there is no next page token
while True:
    res = youtube.playlistItems().list(playlistId=uploadId,maxResults=50,part='snippet',pageToken=nextPage_token).execute()
    allVideos += res['items']
    nextPage_token = res.get('nextPageToken')
    if nextPage_token is None:
        break

video_ids=[]
channelId = []
for i in range(0,143):
    video_ids.append(allVideos[i]['snippet']['resourceId']['videoId'])
    channelId.append(allVideos[i]['snippet']['channelId'])

stats = []
for i in range(0,len(video_ids),40):
    res = (youtube).videos().list(id=','.join(video_ids[i:i+40]),part='statistics').execute()
    stats += res['items']

A while loop grabs any and all videos. Depending on your own use case, there might be a need to stop after so many calls. Remember, the default quota is 10,000 units per day, and different calls cost different numbers of units (a search costs far more than a list call). The two other blocks simply append data to a list for post processing later. I would probably not hard code a range in production level code since playlists will have different numbers of videos.
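
To get a feel for how quickly a script like this uses quota, the cost can be tallied before running it. The daily quota is counted in units rather than raw calls. The sketch below assumes the per-call costs listed in the YouTube Data API quota documentation as I understand them (100 units for a search, 1 unit for each list call used here), so check the current documentation before relying on it. estimate_quota and the example counts are made up for illustration.

```python
import math

# Assumed unit costs per request; verify against the current quota documentation
UNIT_COST = {
    'search.list': 100,
    'channels.list': 1,
    'playlistItems.list': 1,
    'videos.list': 1,
    'commentThreads.list': 1,
}

def estimate_quota(n_videos, n_comment_threads, n_searches=1):
    """Rough quota estimate, in units, for one run of the script."""
    calls = {
        'search.list': n_searches,
        'channels.list': 2,                                         # statistics + contentDetails
        'playlistItems.list': math.ceil(n_videos / 50),             # 50 videos per page
        'videos.list': math.ceil(n_videos / 40),                    # batches of 40 ids
        'commentThreads.list': math.ceil(n_comment_threads / 100),  # 100 threads per page
    }
    return sum(UNIT_COST[name] * count for name, count in calls.items())

print(estimate_quota(n_videos=145, n_comment_threads=20000))
```

Under these assumptions the single search is the most expensive request in the run, which is one reason to save the channel id once it is known rather than searching for it every time.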

Next, we deconstruct the results of the other lists to separate ‘columns’ to be used in a dataframe/table:

import pandas as pd
data={'title':title,'video_id':videoid,'video_description':video_description,'publishedDate':publishedDate,'likes':liked,'dislikes':disliked,'views':views,'comment_count':comment}
df=pd.DataFrame(data)
df.head()
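
The lists passed to the dictionary above (title, videoid, and so on) are not shown being built. One way to do it, sketched here under the assumption that allVideos holds the playlistItems results and stats holds the videos().list(part='statistics') results, is to index the statistics by video id and walk the playlist items. The function name build_columns is hypothetical, the demo inputs are invented, and .get() is used because some counts (dislikes, for example) may be absent from a response.

```python
def build_columns(all_videos, stats):
    """Flatten playlist items and video statistics into column lists."""
    # Index the statistics by video id so each video can look up its counts
    stats_by_id = {item['id']: item.get('statistics', {}) for item in stats}
    columns = {name: [] for name in (
        'title', 'video_id', 'video_description', 'publishedDate',
        'likes', 'dislikes', 'views', 'comment_count')}
    for video in all_videos:
        snippet = video['snippet']
        video_id = snippet['resourceId']['videoId']
        counts = stats_by_id.get(video_id, {})
        columns['title'].append(snippet.get('title'))
        columns['video_id'].append(video_id)
        columns['video_description'].append(snippet.get('description'))
        columns['publishedDate'].append(snippet.get('publishedAt'))
        # Counts may be missing from a response, so default to None
        columns['likes'].append(counts.get('likeCount'))
        columns['dislikes'].append(counts.get('dislikeCount'))
        columns['views'].append(counts.get('viewCount'))
        columns['comment_count'].append(counts.get('commentCount'))
    return columns

# Hand-made inputs in the shape the API responses take (values are invented)
demo_videos = [{'snippet': {'title': 'Demo', 'description': 'A test video',
                            'publishedAt': '2021-01-01T00:00:00Z',
                            'resourceId': {'videoId': 'abc123'}}}]
demo_stats = [{'id': 'abc123',
               'statistics': {'viewCount': '10', 'likeCount': '3', 'commentCount': '1'}}]

data = build_columns(demo_videos, demo_stats)
print(data)
```

The returned dictionary can be passed straight to pd.DataFrame(data). Building every column in one loop also guarantees the lists have the same length, which the DataFrame constructor requires.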

We go after the comments with the following lines of code:

channelId = list(set(channelId))
allComments = []
video_id_pop = []
channel_id_pop = []
video_title_pop = []
video_desc_pop = []
comments_pop = []
comment_id_pop = []
reply_count_pop = []
like_count_pop = []

for channel in channelId:
    res=youtube.commentThreads().list(allThreadsRelatedToChannelId=channel,
                                      part='id,snippet',
                                      maxResults=100).execute()

    try:
        nextPageToken = res['nextPageToken']

    except KeyError:
        nextPageToken = None

    except TypeError:
        nextPageToken = None
    
    comments_temp = []
    comment_id_temp = []
    reply_count_temp = []
    like_count_temp = []
    channel_id_temp = []
    video_id_temp = []

    for item in res['items']:
        allComments.append(item)
        comments_temp.append(item['snippet']['topLevelComment']['snippet']['textDisplay'])
        comment_id_temp.append(item['snippet']['topLevelComment']['id'])
        reply_count_temp.append(item['snippet']['totalReplyCount'])
        like_count_temp.append(item['snippet']['topLevelComment']['snippet']['likeCount'])
        channel_id_temp.append(item['snippet']['channelId'])
        video_id_temp.append(item['snippet']['videoId'])

    comments_pop.extend(comments_temp)
    comment_id_pop.extend(comment_id_temp)
    reply_count_pop.extend(reply_count_temp)
    like_count_pop.extend(like_count_temp)
    channel_id_pop.extend(channel_id_temp)
    video_id_pop.extend(video_id_temp)
    
    while (nextPageToken):
        try:
            res=youtube.commentThreads().list(allThreadsRelatedToChannelId=channel,
                                      part='id,snippet',
                                      maxResults=100,pageToken=nextPageToken).execute()
            
            comments_temp = []
            comment_id_temp = []
            reply_count_temp = []
            like_count_temp = []
            channel_id_temp = []
            video_id_temp = []

            for item in res['items']:
                allComments.append(item)
                comments_temp.append(item['snippet']['topLevelComment']['snippet']['textDisplay'])
                comment_id_temp.append(item['snippet']['topLevelComment']['id'])
                reply_count_temp.append(item['snippet']['totalReplyCount'])
                like_count_temp.append(item['snippet']['topLevelComment']['snippet']['likeCount'])
                channel_id_temp.append(item['snippet']['channelId'])
                video_id_temp.append(item['snippet']['videoId'])

            comments_pop.extend(comments_temp)
            comment_id_pop.extend(comment_id_temp)
            reply_count_pop.extend(reply_count_temp)
            like_count_pop.extend(like_count_temp)
            channel_id_pop.extend(channel_id_temp)
            video_id_pop.extend(video_id_temp)
            
            nextPageToken = res['nextPageToken']
            
        except KeyError:
            break

data_threads={'comment':comments_pop,'comment_id':comment_id_pop,'reply_count':reply_count_pop,'like_count':like_count_pop,'channel_id':channel_id_pop,'video_id':video_id_pop}
threads=pd.DataFrame(data_threads)
threads.head()

The code above looks complicated, but there isn’t too much to the functions. YouTube has a function to grab all the threads that relate to a channel id. Because the results are paginated, you will need to incorporate the nextPageToken and loop through the pages until complete. In some applications, you may want to cut the calls off early—especially if the channel has a ton of engagement.
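
The pagination logic can be separated from the parsing so it only has to be written once. Below is a minimal sketch of that idea: iter_pages and its max_pages cut-off are hypothetical names, and a small dictionary of fake pages stands in for the API so the example runs without credentials.

```python
def iter_pages(fetch_page, max_pages=None):
    """Yield result pages until nextPageToken runs out or max_pages is reached."""
    token = None
    pages = 0
    while max_pages is None or pages < max_pages:
        res = fetch_page(token)
        yield res
        pages += 1
        token = res.get('nextPageToken')
        if not token:
            break

# Stand-in for the API: three fake pages keyed by page token
FAKE_PAGES = {
    None: {'items': ['a', 'b'], 'nextPageToken': 'p2'},
    'p2': {'items': ['c'], 'nextPageToken': 'p3'},
    'p3': {'items': ['d']},
}

def fake_fetch(token):
    return FAKE_PAGES[token]

all_items = [item for page in iter_pages(fake_fetch) for item in page['items']]
first_two = [item for page in iter_pages(fake_fetch, max_pages=2) for item in page['items']]
print(all_items, first_two)
```

With the real client, fetch_page would wrap the commentThreads call shown above and pass the token through as pageToken. The max_pages argument is one way to cut the calls off early for channels with a lot of engagement.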

We’ll cover data cleaning, feature engineering, use of NLTK/VADER for sentiment analysis, and a simple dashboard in PowerBI in the next article!

Sales Forecasting: Predict Your Sales Cycle Using Machine Learning

Tyler Betthauser — Sat, 27 Feb 2021 23:38:25 +0000

Business Context:

There is no shortage of methods to forecast sales. To demonstrate one of those methods, we look back at the O-List data from Kaggle. The forecasting methodology that will be focused on is analyzing the sales cycle to predict how long a sales lead might take to close. So, not only will we be able to predict if a lead will close, but also how long it might take to close the deal.

The benefits of sales forecasting are pretty straightforward:

Improved financial planning
More precise work-load balance at each level of the organization
Better insights into velocity or growth

It needs to be said that this type of sales forecasting might not work for every business model and data model, for that matter.

In this post, the following will be covered:

Feature Engineering
Data Quality Improvements
Testing various models

Libraries & Reading in the Data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import date
from datetime import datetime
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from tensorflow import keras

closed_deals = pd.read_csv('olist_closed_deals_dataset.csv')
olist_leads = pd.read_csv('olist_marketing_qualified_leads_dataset.csv')

Next, we merge the qualified leads and the closed deals datasets to build the funnel:

funnel = pd.merge(olist_leads,closed_deals,how='left',on='mql_id')

Data Cleaning and Feature Engineering

In the next section, we focus on some initial cleaning and feature engineering on the dataset. One thing to note with this investigation: not many leads actually closed. So, we built some simulated data to increase the data that a model could be trained on.

funnel['won_date'] = funnel['won_date'].astype('datetime64[ns]')
funnel['first_contact_date'] = funnel['first_contact_date'].astype('datetime64[ns]')

#drop dimensions that likely will not be that reliable in collecting within the business process
funnel.drop(['declared_monthly_revenue','declared_product_catalog_size','average_stock','has_company','seller_id','mql_id'],axis=1,inplace=True)

An unfortunate deficiency of the Olist data is that there is no reliable source of revenue data per lead. While we can still complete the task, the case study would be closer to the real world if there were more samples with richer context. Next, a copy of the dataframe is created and some time-based features are added:

funnel_model = funnel.copy(deep=True)

funnel_model['contact_day'] = funnel_model['first_contact_date'].dt.strftime('%d')
funnel_model['contact_month'] = funnel_model['first_contact_date'].dt.strftime('%m')
funnel_model['contact_year'] = funnel_model['first_contact_date'].dt.year

Intuitively, the contact date information should help predict how long a deal takes to close. These features will be especially important if there is any seasonality present. Since we only have a year's worth of data in the set, it would be tough to make a judgement on seasonality. The most important part of the data prep is addressing the NAs that exist:

#count missing values (NAs)
missing_count = pd.DataFrame(funnel_model.isna().sum(),columns=['Number'])
missing_count['Percentage'] = round(missing_count / len(funnel_model),2) * 100
missing_count

The next block of code addresses the NAs in the dimensions above. Quite simply, a list of unique values is pulled from each dimension, and those values are then randomly assigned wherever the record is NA. This methodology was used to try to preserve as much of the distributions native to the data as possible, while also providing more data to train on.

origin_list = funnel_model["origin"].unique()
origin_list = [x for x in origin_list if str(x) != 'nan']
funnel_model['origin'] = funnel_model['origin'].fillna(pd.Series(np.random.choice(origin_list, size=len(funnel_model.index))))

sdr_id_list = funnel_model["sdr_id"].unique()
sdr_id_list = [x for x in sdr_id_list if str(x) != 'nan']
funnel_model['sdr_id'] = funnel_model['sdr_id'].fillna(pd.Series(np.random.choice(sdr_id_list, size=len(funnel_model.index))))

sr_id_list = funnel_model["sr_id"].unique()
sr_id_list = [x for x in sr_id_list if str(x) != 'nan']
funnel_model['sr_id'] = funnel_model['sr_id'].fillna(pd.Series(np.random.choice(sr_id_list, size=len(funnel_model.index))))

bs_list = funnel_model["business_segment"].unique()
bs_list = [x for x in bs_list if str(x) != 'nan']
funnel_model['business_segment'] = funnel_model['business_segment'].fillna(pd.Series(np.random.choice(bs_list, size=len(funnel_model.index))))

lead_list = funnel_model["lead_type"].unique()
lead_list = [x for x in lead_list if str(x) != 'nan']
funnel_model['lead_type'] = funnel_model['lead_type'].fillna(pd.Series(np.random.choice(lead_list, size=len(funnel_model.index))))

lbp_list = funnel_model["lead_behaviour_profile"].unique()
lbp_list = [x for x in lbp_list if str(x) != 'nan']
funnel_model['lead_behaviour_profile'] = funnel_model['lead_behaviour_profile'].fillna(pd.Series(np.random.choice(lbp_list, size=len(funnel_model.index))))

gtin_list = funnel_model["has_gtin"].unique()
gtin_list = [x for x in gtin_list if str(x) != 'nan']
funnel_model['has_gtin'] = funnel_model['has_gtin'].fillna(pd.Series(np.random.choice(gtin_list, size=len(funnel_model.index))))

btype_list = funnel_model["business_type"].unique()
btype_list = [x for x in btype_list if str(x) != 'nan']
funnel_model['business_type'] = funnel_model['business_type'].fillna(pd.Series(np.random.choice(btype_list, size=len(funnel_model.index))))

dt_list = funnel_model["won_date"].unique()
funnel_model['won_date'] = funnel_model['won_date'].fillna(pd.Series(np.random.choice(dt_list, size=len(funnel_model.index))))
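The repeated pattern above could be wrapped in a small helper. This is a sketch, not the code used to produce the results in this post, and the function name fill_na_random is hypothetical. It also differs in one way worth flagging: it samples from the observed (non-null) values rather than from the list of unique values, so common categories are drawn more often.

```python
import numpy as np
import pandas as pd

def fill_na_random(series, seed=None):
    """Fill NAs by sampling, with replacement, from the observed values."""
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()
    filled = series.copy()
    mask = filled.isna()
    # Draw one replacement value for each missing entry
    filled[mask] = rng.choice(observed, size=int(mask.sum()))
    return filled

# Invented example column with two missing entries
toy = pd.Series(["organic", "paid", None, "organic", None])
toy_filled = fill_na_random(toy, seed=0)
print(toy_filled)
```

Applied to the real data, this would look like funnel_model['origin'] = fill_na_random(funnel_model['origin']), repeated for each column.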

The next block of code finishes off the feature engineering and data quality improvements:

funnel_model['won_day'] = funnel_model['won_date'].dt.strftime('%d')
funnel_model['won_month'] = funnel_model['won_date'].dt.strftime('%m')
funnel_model['won_year'] = funnel_model['won_date'].dt.year

funnel_model['time_to_close'] = funnel_model['won_date'] - funnel_model['first_contact_date']
funnel_model['time_to_close'] = funnel_model['time_to_close'].dt.days

#we can create a combo feature now of sdr and sr
funnel_model['sdr_sr'] = funnel_model['sdr_id'] + funnel_model['sr_id']

This code should fill our NA fields with data that is representative of some reality, which should be sufficient to demonstrate the use case effectively. A few more lines of code clean up the dataset:

#drop any rows where there are already suspect data like when there is a negative close time
indexNames = funnel_model[ (funnel_model['time_to_close'] < 0)].index
funnel_model.drop(indexNames , inplace=True)

funnel_model = funnel_model.dropna(subset=['won_date'])

funnel_model = funnel_model.drop(['won_date','first_contact_date'],axis=1)

Model Development, Train/Test/Split, & Defining X,y

df = funnel_model.copy(deep=True)

#define the X, y variables
X = pd.get_dummies(df.drop('time_to_close',axis=1),drop_first=True)
y = df['time_to_close']

#always split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

from sklearn.preprocessing import OneHotEncoder

# prepare input data: fit the encoder on the training split only,
# then use it to transform both splits
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder(handle_unknown='ignore')
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc

X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

In the code above, we one-hot encode the categorical variables. This is necessary because the algorithms we are going to use require numeric inputs, and one-hot encoding turns each category into its own binary column. I have found that it is best to split the data and THEN one-hot encode ‘X’.
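One reason to fit the encoder on the training split only: a category that shows up only in the test split is then handled by handle_unknown='ignore' and encoded as all zeros instead of raising an error. A small sketch with made-up category values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Invented single-column data; "email" never appears in the training split
train = np.array([["paid"], ["organic"], ["paid"]])
test = np.array([["organic"], ["email"]])

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)                              # learn the categories from train only
test_enc = ohe.transform(test).toarray()    # dense copy just for printing

print(ohe.categories_)
print(test_enc)
```

The row for the unseen category comes back with no column switched on, so the model simply sees "none of the known categories".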

Support Vector Regression is Up First

First, support vector regression is going to be used to predict the sales cycle time. We had good accuracy with this algorithm in our classification exercise.

from sklearn.svm import SVR,LinearSVR
base_model = SVR()
base_model.fit(X_train_enc,y_train)
base_preds = base_model.predict(X_test_enc)
from sklearn.metrics import mean_absolute_error,mean_squared_error
mean_absolute_error(y_test,base_preds)
>> 73.15358191010293
np.sqrt(mean_squared_error(y_test,base_preds))
>> 93.54240378007681
y_test.mean()
>> 112.10120240480961

Some interesting things to note here in this code:

The error metrics are on the same scale as the target variable. So, in this case the SVR model is off by about 73 days on average. Not ALL the predictions are off by 73 days, but on average the predictions can be inaccurate by 73 days, and this is not good!
Pay attention to the mean of y_test though: 112 days. The dataset itself has quite a bit of variation in the time to close. The fact that our average error is quite a few days less than the y_test mean is actually a positive sign.
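To judge whether an error of this size is good or bad, it helps to compare it against a naive baseline that always predicts the training mean. The sketch below uses invented numbers (not the Olist results) and plain Python so the arithmetic is easy to follow:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_train_demo = [10, 40, 100, 250]     # invented days-to-close
y_test_demo = [20, 90, 200]
model_preds = [35, 80, 150]           # invented model output

baseline = sum(y_train_demo) / len(y_train_demo)   # always predict the training mean
baseline_preds = [baseline] * len(y_test_demo)

print("model   :", mae(y_test_demo, model_preds), rmse(y_test_demo, model_preds))
print("baseline:", mae(y_test_demo, baseline_preds), rmse(y_test_demo, baseline_preds))
```

If the model cannot beat the mean-only baseline on both metrics, the features are not adding much.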

Given the positivity with this model, we can try to tune the hyper-parameters so as to improve accuracy:

param_grid = {'C':[0.001,0.01,0.1,0.5,1],
             'kernel':['linear','rbf','poly'],
              'gamma':['scale','auto'],
              'degree':[2,3,4],
              'epsilon':[0,0.01,0.1,0.5,1,2]}

from sklearn.model_selection import GridSearchCV
svr = SVR()
grid = GridSearchCV(svr,param_grid=param_grid,cv=3, n_jobs=-1)
grid.fit(X_train_enc,y_train)
>> SVR(C=1, degree=2, epsilon=2, kernel='linear')
grid.best_params_
>> {'C': 1, 'degree': 2, 'epsilon': 2, 'gamma': 'scale', 'kernel': 'linear'}
print(grid.best_score_)
>> 0.9000862164349251
# predict with the grid's refit best estimator (the bare svr object above is never fitted)
preds = grid.predict(X_test_enc)

from sklearn.metrics import mean_absolute_error,mean_squared_error
MAE = mean_absolute_error(y_test,preds)
MSE = mean_squared_error(y_test,preds)
RMSE = np.sqrt(MSE)
print(MAE)
>> 9.492129595872589
print(MSE)
>> 853.6142931876532
print(RMSE)
>> 29.21667833939466

Note here:

GridSearchCV is very slow given the number of features in the dataset (there are over 1,500). We should probably be using some sort of dimensionality reduction to reduce the training time and make the process more efficient. SVR with GridSearchCV takes several hours to run in this investigation, which may not be tenable for some applications
GridSearchCV should be used smartly. The more values that get added to the parameter grid, the greater the training time
Setting n_jobs to -1 helps training time by using all of your machine's available cores
Large outliers in the data make it difficult to create accurate predictions, hence the terrible mean squared error.
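The cost of the grid is easy to estimate before running it. The sketch below just counts the combinations in the param_grid used above; the "+1" assumes GridSearchCV's default refit on the full training set:

```python
import math

param_grid = {'C': [0.001, 0.01, 0.1, 0.5, 1],
              'kernel': ['linear', 'rbf', 'poly'],
              'gamma': ['scale', 'auto'],
              'degree': [2, 3, 4],
              'epsilon': [0, 0.01, 0.1, 0.5, 1, 2]}
cv = 3

# Every combination of values is tried once per cross-validation fold
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv + 1   # +1 for the final refit of the best candidate
print(n_candidates, n_fits)      # 540 1621
```

Many of those candidates are redundant, since degree only affects the 'poly' kernel and gamma is ignored by 'linear'. GridSearchCV also accepts a list of dictionaries as param_grid, which is one way to give each kernel its own smaller grid.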

Try a Simple Linear Regression:

After the long training of the Support Vector Regression and less than wonderful performance, maybe a simple linear regression will be more effective:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_enc,y_train)
# We only pass in test features
# The model predicts its own y hat
# We can then compare these results to the true y test label value
test_predictions = model.predict(X_test_enc)

MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 0.057113830282917256
print(MSE)
>> 3.253882131975636
print(RMSE)
>> 1.803852026075209

The Linear Regression model seems to perform the best so far. An RMSE of about 1.8 days (an MSE of about 3.3 days squared) is pretty accurate and serviceable for a sales forecast. Still, I am curious whether we can further fine-tune the results via Ridge Regression.

Ridge Regression Model Training:

from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train_enc,y_train)
test_predictions = ridge_model.predict(X_test_enc)

MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 7.6568575927885645
print(MSE)
>> 147.21235108694535
print(RMSE)
>> 12.133109703903008

The Ridge model is not as performant! Using this model as a baseline, we can use RidgeCV to see whether it is possible to improve the results. The RidgeCV model can be set up as such:

from sklearn.linear_model import RidgeCV

# Choosing a scoring: https://scikit-learn.org/stable/modules/model_evaluation.html
# Negative RMSE so all metrics follow convention "Higher is better"

# See all options: sklearn.metrics.get_scorer_names() (sklearn.metrics.SCORERS.keys() in older releases)
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0),scoring='neg_root_mean_squared_error')

# The more alpha options you pass, the longer this will take.
# Fortunately our data set is still pretty small
ridge_cv_model.fit(X_train_enc,y_train)

ridge_cv_model.alpha_
>> 0.1

test_predictions = ridge_cv_model.predict(X_test_enc)

MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 0.22657748641706255
print(MSE)
>> 3.4295154423291936
print(RMSE)
>> 1.8518950948499198

The improved RidgeCV model performs similarly to the more basic Simple Linear Regression.

Attempt a LassoCV

Next, we try to understand how a LassoCV might do in terms of accuracy:

from sklearn.linear_model import LassoCV
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
lasso_cv_model = LassoCV(eps=0.1,n_alphas=100,cv=5)

lasso_cv_model.fit(X_train_enc,y_train)

lasso_cv_model.alpha_
>> 3.0222656025281607

test_predictions = lasso_cv_model.predict(X_test_enc)
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 47.70224782216077
print(MSE)
>> 3692.4097816866115
print(RMSE)
>> 60.76520206241901

Obviously, the model above is untenable and not a worthy candidate for something like a sales forecast—especially compared to the other models we have tried.

Elastic Net Models:

Elastic Net models attempt to keep the benefits of both the Lasso and Ridge models. A few lines of code will tell us what the performance might be:

from sklearn.linear_model import ElasticNetCV
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7,.9, .95, .99, 1],tol=0.01)
elastic_model.fit(X_train_enc,y_train)
test_predictions = elastic_model.predict(X_test_enc)
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
>> 4.714205461929477
print(MSE)
>> 66.5452147436821
print(RMSE)
>> 8.157525037882635
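For reference, the l1_ratio values passed to ElasticNetCV control how the L1 (Lasso) and L2 (Ridge) penalties are blended. The sketch below writes out that penalty term in plain Python, following the form given in the scikit-learn documentation; the function name and the coefficient values are made up for illustration:

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2"""
    l1 = sum(abs(w) for w in coefs)
    l2_squared = sum(w * w for w in coefs)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2_squared

coefs = [3.0, -4.0]   # invented coefficients
lasso_like = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=1.0)   # pure L1 penalty
ridge_like = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.0)   # pure L2 penalty
mixed = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5)        # half of each
print(lasso_like, ridge_like, mixed)
```

With l1_ratio=1 the model behaves like Lasso, and as l1_ratio moves toward 0 it behaves more like Ridge.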

Conclusions:

ElasticNet is a pretty happy middle ground between Ridge and Lasso, but still doesn’t perform nearly as well as Linear Regression or RidgeCV
Model training time was vastly quicker on Linear Regression and RidgeCV—this might be an important consideration in a production implementation

Predict which Sales Leads Close Part 2

Tyler Betthauser — Sat, 20 Feb 2021 19:46:00 +0000

Introduction

Last time we left off on this project, the technologies we chose weren’t particularly great at predicting one of the classes: Closed Lead versus Open Lead. In this article, a few different methods are employed to overcome the challenges of imbalanced classes, encoding all of the categorical variables, and hyper-parameter tuning.

Gradient-Boosted / Ensemble Algorithms Might Help

Firstly, we will import the GradientBoostingClassifier from sklearn. Catboost and XGBoost were considered for this investigation, but these algorithms are a little harder and more complicated to implement. Sklearn seemed to be more familiar and easier to understand and tune. This is not to say that Catboost and XGBoost are not good solutions—in literature, they are great!

from sklearn.ensemble import GradientBoostingClassifier

Since the feature engineering and data cleaning have already been completed, we can create a dataframe that includes all of the features we want to include in the model:

df5 = funnel_model[['landing_page_id', 'origin', 'sdr_id','sr_id','business_segment',
                   'lead_type','lead_behaviour_profile','has_gtin','business_type',
                  'contact_day','contact_month','contact_year','sdr_sr','closed_deal']].copy()

After creating the dataframe as the backbone of the model, we are going to employ the first potential fix for the imbalanced classes. The option we chose to use is called over-sampling. Over-sampling is when you duplicate records from the minority class. Because the algorithm treats each record as a unique instance, duplicated records aren’t a problem and help us synthetically enhance our dataset. Since the minority class is a closed lead, we will over-sample this class:

closed_dup = df5['closed_deal'] == True
df_try = df5[closed_dup]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df5 = pd.concat([df5] + [df_try]*3,ignore_index=True)
df5.shape

This small piece of code adds three extra copies of each closed-deal record. Something to note: over/under sampling should not be the first line of defence. There are a myriad of other process-based fixes that should be used first, such as:

Finding more data to build predictions
Investigate if bias is being introduced to the data collection in some way. Correct or reduce the level of bias in data collection
Gain some domain knowledge on the who, what, where, why, and how of the operations that build this data—make corrections to the prediction methodology where appropriate
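One caveat with duplicating rows before the train/test split: copies of the same record can land on both sides of the split, which tends to make test scores look better than they really are. A hedged alternative, shown here on a toy frame rather than the Olist data, is to split first and then over-sample only the training rows. The helper name oversample_minority is hypothetical:

```python
import pandas as pd

def oversample_minority(train_df, target, minority_value, extra_copies=3):
    """Append extra copies of the minority-class rows to a training frame."""
    minority = train_df[train_df[target] == minority_value]
    return pd.concat([train_df] + [minority] * extra_copies, ignore_index=True)

# Invented training split with one closed deal out of four leads
toy_train = pd.DataFrame({"origin": ["paid", "organic", "email", "paid"],
                          "closed_deal": [False, False, False, True]})

balanced = oversample_minority(toy_train, "closed_deal", True)
print(balanced["closed_deal"].value_counts())
```

The test split is left untouched, so it still reflects the class balance the model would see in practice.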

Next, functions will be prepared to one-hot encode the features and label encode the labels:

# prepare input data
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder(handle_unknown='ignore')
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc
 
# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

Notice what is done inside the encoding functions: the encoders are fit on the training set only, and then used to transform both the training and test sets. Next, define the X and y:

X = df5.drop('closed_deal',axis=1)
y = df5['closed_deal']

X is everything but the target value. y is the target value. As always, perform your train/test split. You will notice that the ‘stratify’ parameter has been added to the split. The ‘stratify’ parameter is a great tool for imbalanced datasets because it keeps the class proportions of y the same in the train and test data. It is best to ensure that one set does not monopolize the minority target:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101, stratify = y)
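A quick way to convince yourself of what stratify does is to count the classes on each side of the split. This is a small sketch on invented data with a 20% minority class:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented data: 4 closed leads (True) and 16 open leads (False)
X_demo = [[i] for i in range(20)]
y_demo = [True] * 4 + [False] * 16

_, _, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                    random_state=101, stratify=y_demo)

# Both splits keep the same 20% share of the minority class
print(Counter(y_tr), Counter(y_te))
```

Without stratify, a small test split could easily end up with no closed leads at all.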

Sklearn has useful one-hot and label encoding functions. After splitting the data, we can actually use our data preparation functions:

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

The very basic model can be instantiated and fitted to X_train_enc and y_train_enc:

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()

model.fit(X_train_enc, y_train_enc)

y_pred = model.predict(X_test_enc)

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_enc, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc

>> 0.8874727398312781
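Note that roc_curve above receives hard class predictions, so the curve has only a single operating point. ROC AUC is usually computed from scores or probabilities instead (for example model.predict_proba(X_test_enc)[:, 1]). The pure-Python sketch below, on invented numbers, shows how the two can differ. pairwise_auc is a hypothetical helper that computes AUC as the share of positive/negative pairs ranked correctly, counting ties as half:

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
probs = [0.10, 0.30, 0.40, 0.45]                  # invented probabilities
labels = [1 if p >= 0.5 else 0 for p in probs]    # hard predictions at 0.5

auc_from_probs = pairwise_auc(y_true, probs)      # positives all rank above negatives
auc_from_labels = pairwise_auc(y_true, labels)    # every pair is tied after thresholding
print(auc_from_probs, auc_from_labels)
```

The ranking information is lost once the probabilities are thresholded, so the two numbers answer slightly different questions.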

The basic model does OK, but not as well as the even more basic Support Vector Classifier. With a baseline established, hyper-parameter tuning can be completed using GridSearchCV:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

# A sample parameter grid

parameters = {
    "loss":["deviance"],  # this option is named "log_loss" in scikit-learn 1.1 and later
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.5],
    "min_samples_split": np.linspace(0.1, 1.0, 5),
    "min_samples_leaf": np.linspace(0.1, 0.5, 5,endpoint=True),
    "min_weight_fraction_leaf": np.linspace(0.1, 1.0, 10),  # note: scikit-learn only accepts values up to 0.5 here
    "max_depth":[3,5,8],
    "max_features":["log2","sqrt"],
    "criterion": ["friedman_mse"],
    "subsample":[0.8, 0.9, 0.95, 1.0],
    "n_estimators":[10]
    }
#no scoring argument is passed, so GridSearchCV uses the classifier's default score (accuracy)
grid = GridSearchCV(GradientBoostingClassifier(verbose=2), parameters, cv=3, n_jobs=-1)

grid.fit(X_train_enc,y_train_enc)

After the GridSearchCV the best parameters and scores can be collected:

print(grid.best_score_)
print(grid.best_params_)

>> 0.7847310912445011
{'criterion': 'friedman_mse', 'learning_rate': 0.5, 'loss': 'deviance', 'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 0.1, 'min_samples_split': 0.55, 'min_weight_fraction_leaf': 0.1, 'n_estimators': 10, 'subsample': 0.9}

Strangely enough, the outputs were worse than the baseline. If I am honest, I have no idea exactly why this is occurring. An initial hypothesis is that we trained the grid on the training set and not the whole set. Or, the grid doesn’t include all of the defaults of the GradientBoostingClassifier. Somewhat discouraged, I moved on to a Tensorflow Sequential Model (Artificial Neural Network), which I thought was more fun to play with anyway.
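One possible contributor, offered as a guess rather than a diagnosis: grid.best_score_ is the mean cross-validated score under GridSearchCV's default scoring, which for a classifier is accuracy, while the baseline number is a ROC AUC. The two are not directly comparable. Passing scoring='roc_auc' puts both on the same metric. A small sketch on synthetic data (make_classification stands in for the encoded Olist features, and the tiny grid is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the Olist features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid={"learning_rate": [0.1, 0.5], "max_depth": [2, 3]},
    scoring="roc_auc",   # score candidates with ROC AUC instead of accuracy
    cv=3,
)
demo_grid.fit(X_demo, y_demo)
print(demo_grid.best_params_, demo_grid.best_score_)
```

Keeping the default values in the grid (for example the default learning_rate and n_estimators) also guarantees that the search can at least match the untuned model under cross-validation.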

Artificial Neural Network Application—Much Better

Since we already defined how we were going to prep the data for the model, the code will not be restated below. Development of the model will come first:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2

early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)

model = Sequential()

# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

model.add(Dense(units=1350,activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(units=676,activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dropout(0.25))

model.add(Dense(units=338,activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dropout(0.125))

model.add(Dense(units=1,activation='sigmoid'))

# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')

A few notes here:

We are going to use 4 layers here since the model is pretty complex and has lots of features (over 1,300, thanks to one-hot encoding)
- The first layer will have 1,350 nodes; generally, this can be set to the number of features or columns. A dropout layer has been added, which randomly switches off a share of the nodes during training to limit overfitting
- Two hidden layers have been added. I have halved the nodes and added a regularizer to manage overfitting. The dropout rate is halved as well
- Relu activation is used in the first layers because of a general consensus that relu is quite flexible. If we wanted to tune these values we could later
- In the final hidden layer all the values, except for regularization, are halved again to continue to simplify the model
- Finally, the output layer will be a single sigmoid node because our problem is a binary classification problem

We will be adding early stopping to ensure we do not overfit
Our loss function is binary_crossentropy, which is appropriate since this is a binary classification problem
The optimizer is adam, which is highly flexible and generally works quite well
This initial model was set up somewhat arbitrarily and should be tuned if it is to be used in some sort of production application

Next, we can fit the model:

model.fit(x=X_train_enc, 
          y=y_train_enc, 
          epochs=1000,
          validation_data=(X_test_enc, y_test_enc), verbose=1,
          callbacks=[early_stop]
          )

1000 epochs is going to be overkill, but the early stopping will ensure we never get close to 1000 epochs.

Epoch 1/1000
305/305 [==============================] - 8s 26ms/step - loss: 0.5662 - val_loss: 0.2128
Epoch 2/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.1535 - val_loss: 0.1206
Epoch 3/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0809 - val_loss: 0.1024
Epoch 4/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0564 - val_loss: 0.1522
Epoch 5/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0412 - val_loss: 0.0698
Epoch 6/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0316 - val_loss: 0.0833
Epoch 7/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0317 - val_loss: 0.0997
Epoch 8/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0246 - val_loss: 0.1217
Epoch 9/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0187 - val_loss: 0.0735
Epoch 10/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0160 - val_loss: 0.1116
Epoch 11/1000
305/305 [==============================] - 8s 25ms/step - loss: 0.0166 - val_loss: 0.0773
Epoch 12/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0307 - val_loss: 0.1051
Epoch 13/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0242 - val_loss: 0.1281
Epoch 14/1000
305/305 [==============================] - 7s 24ms/step - loss: 0.0201 - val_loss: 0.1222
Epoch 15/1000
305/305 [==============================] - 7s 23ms/step - loss: 0.0155 - val_loss: 0.0852
Epoch 00015: early stopping
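The stopping point follows from patience=10: the best val_loss (0.0698) came at epoch 5, and ten epochs then passed without beating it. The sketch below is a simplified re-implementation of that rule in plain Python, not Keras' actual code (it ignores options such as min_delta), applied to the val_loss values from the log above:

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: remember it and reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# val_loss values copied from the training log
val_losses = [0.2128, 0.1206, 0.1024, 0.1522, 0.0698, 0.0833, 0.0997, 0.1217,
              0.0735, 0.1116, 0.0773, 0.1051, 0.1281, 0.1222, 0.0852]
stop_epoch = early_stop_epoch(val_losses, patience=10)
print(stop_epoch)  # 15
```

A smaller patience would have stopped training sooner, at the cost of possibly missing a later improvement.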

Learning in 15 epochs is pretty good! Next, we can show the losses:

model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

I am pretty happy with the losses in the chart. Our scale is quite small even though the validation loss is not perfect. It is quite ‘chunky’. I believe this behavior occurs when adding dropout layers. With additional hyper-parameter tuning, the gap between training and validation loss could be further reduced.

Finally, we can get the predictions and determine performance:

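from sklearn.metrics import classification_report

# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output instead
predictions = (model.predict(X_test_enc) > 0.5).astype("int32")

# https://en.wikipedia.org/wiki/Precision_and_recall
print(classification_report(y_test,predictions))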

              precision    recall  f1-score   support

       False       1.00      0.97      0.98      1432
        True       0.96      1.00      0.98      1008

    accuracy                           0.98      2440
   macro avg       0.98      0.98      0.98      2440
weighted avg       0.98      0.98      0.98      2440

Cool! The performance is quite good and without a ton of time spent training. By far, the neural network model gives a bigger bang for the buck.

Next time, we will fine-tune the parameters for the neural network.

A Vast Majority of AI & ML Projects Fail, but they Don't have to

Tyler Betthauser — Wed, 17 Feb 2021 01:48:38 +0000

I attended a virtual conference where a pretty interesting statistic was shared: 75% of AI & ML projects fail or never realize their benefits. Gartner, in 2017 and 2019, seems to echo this sentiment: 80% of analytics insights will not deliver business outcomes through 2022 and 80% of AI projects will “remain alchemy, run by wizards” through 2020. Given the hype and number of businesses investing in data, the risk of negative ROI is alarmingly high. For small consultancies like Conaxon, this is not good news given that our goal is to create opportunities that allow small-to-midsize businesses to cash in on the benefits of AI and ML.

Here are the top 5 ways to drastically improve the success of AI and ML initiatives:

Talk to Stakeholders and Include them in Decision Making:
- A general lack of understanding surrounding AI, ML, Business Intelligence, and Data Science can make well-intentioned projects dead on arrival. Naturally, we are uncertain about new technologies, change, and being left behind. These are all valid concerns! But business and data science leaders need to spend time educating, socializing, and strategizing about how data literacy gets woven into the company culture. If your employees are in constant fear that AI and ML are going to replace them, then it will be incredibly difficult to allow for integration. AI and ML are not going to automate away everything. AI and ML are tools to be used in symbiosis with people. These are tools to make human functions more precise and efficient.
Start with Decision Intelligence:
- Do not get caught up in the shiny gem that is data. It is so easy to overdo it early in the game. Start simple with AI and ML. Applied AI and ML are not yet advanced enough to easily interpret chaos. You need to collaborate with the various business functions and decide which decisions could be better by having a piece (or pieces) of information—the more repeatable, the better. AI and ML work best when the thing you are trying to make more efficient is repeatable and a pattern can be taught/identified. If the project does not meet those two very basic criteria, then your risk for failure increases fairly exponentially.
Keep it Simple:
- Don’t try and boil the ocean. Data can be overwhelming as well as liberating. Stay focused on a few initiatives that truly help make your team’s life easier. Putting dozens of dashboards with multitudes of charts and KPIs in front of executives isn’t effective.
Spend a majority of the time on defining/measuring the problem:
- If you start off your journey with ML and AI with a poorly defined and un-measurable problem, then failure is imminent. Aimless, or poorly aimed, AI and ML development will result in the output being vastly different from what your stakeholders need. At the end of the project, your shiny new data product needs to be a tool that people use and integrate with, like a sword and an arm. Swords were an extension of the warrior's arm. AI and ML products need to be integrated in the same way. As mentioned above, engage with the end-users early in the project. Interface with them regularly to assess . Study how they work day-to-day.
Put a good team together—with a kick-ass project manager:
- The team you build around your data vision will be the keystone for success. Your decision maker should be an advocate and ambassador. They should be pragmatic. They should have solid domain experience in the space the team is operating within. You should probably find a customer champion. This person should be politically savvy, have very intimate knowledge of how the operations are performed, and be well respected by the end users. Of course, you need your data scientists, engineers, and analysts. Last but not least, spend some time and money on finding a really great project manager. A great many analytics, AI, and ML projects do not come to fruition because of project management related issues. This is not to say the project managers are all to blame! However, there is something to be said about the impact of a great project management professional on the outcome of an initiative.

Predict which Sales Leads Close Part 1

Tyler Betthauser — Tue, 09 Feb 2021 22:06:23 +0000

Setting the Stage:

Consider for a moment that your business is doing quite well. Sales are quickly climbing, the sales funnel is quite full, and customer service is top notch. But, because you are a responsible leader and manager, the future appears somewhat hazy! Growth is wonderful, to be sure. However, growth can only scale as well as the sales team (and by extension, the rest of your operations). At some point the sales funnel will become exceedingly top heavy, and business leaders will have to decide: do we hire more team members to support the increased demand, or do we attempt to lean out somewhat so as to preserve margin, customer service, and specialization?

A tough, but highly personal choice.

I am willing to bet that a great many businesses would choose the option to lean out, maintain margins, and continue to develop productive salespeople. There are a few pillars critical to projects where efficiency is the desired output, but maybe none more critical than tools. Having a diverse toolbox is essential. Data analytics and machine learning are quickly becoming essential tools in the toolbox.

Proposed Solution:

All that said, I propose that machine learning could be used to predict which sales leads might close; therefore, allowing salespeople some insights into which leads should be prioritized first in the funnel. Secondarily, this algorithm could be used as a tool for identifying sales opportunities NOT being closed that may be critical now or in the future.

Business Context:

In order to demonstrate the capability of machine learning to address the aforementioned use case, we searched for a public dataset to perform tests. The team landed on a Kaggle dataset posted by a company called Olist, the largest department store in Brazilian marketplaces (link: https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv). This is a marketing funnel dataset from sellers that filled in a form requesting to sell their products on the Olist Store. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist’s supply chain partners.

The sales process is as follows:

Sign-up at a landing page
Sales development Representative (SDR) contacts lead, collects some information and schedules an additional consultancy
Consultancy is made by a Sales Representative (SR). The SR may close the deal or not
Lead becomes a seller and starts building their catalog on Olist
The products are published on Olist marketplaces and ready to sell!

The Dataset:

The dataset has information related to 8,000 Marketing Qualified Leads (MQLs) that requested a contact. These MQLs were randomly sampled from a larger set of MQLs.

source: https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv

The algorithm will use the data from the qualified leads dataset and the closed deals dataset. A future project might be demand/sales forecasting using the sellers dataset and order items dataset.

Jumping into the Data:

When testing, I like to use Jupyter Lab. I find it to be supremely easy to work with and lends itself to iteration, agility, and ease of use. First, we will import the libraries we will be using:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

Now, we are going to read in the data for the analysis:

closed_deals = pd.read_csv('olist_closed_deals_dataset.csv')
olist_leads = pd.read_csv('olist_marketing_qualified_leads_dataset.csv')

Next, we are going to combine the qualified leads and the closed deals to create a single dataset for generating predictions. The documentation from Kaggle was really great so there is no mystery as to how the join needs to be performed.

funnel = pd.merge(olist_leads,closed_deals,how='left',on='mql_id')
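To make the effect of the left join concrete, here is a minimal sketch on toy data. The rows are invented for illustration, and the `validate="one_to_one"` check assumes each `mql_id` appears at most once in each file, which is worth verifying rather than something the Kaggle documentation is quoted as saying here.

```python
import pandas as pd

# Toy stand-ins for the two Olist files; only the columns needed for the join.
leads = pd.DataFrame({
    "mql_id": ["a", "b", "c"],
    "first_contact_date": ["2018-01-05", "2018-02-10", "2018-03-15"],
})
deals = pd.DataFrame({
    "mql_id": ["b"],
    "won_date": ["2018-03-01"],
})

# validate raises MergeError if mql_id is duplicated on either side;
# indicator adds a _merge column showing which leads found a matching deal.
toy_funnel = pd.merge(leads, deals, how="left", on="mql_id",
                      validate="one_to_one", indicator=True)

print(toy_funnel[["mql_id", "won_date", "_merge"]])
```

Leads without a matching deal keep their row and get a missing `won_date`, which is what the closed_deal label later relies on.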

The next lines of code add some potentially useful features and remove others that won’t be useful in the prediction model. The code is pretty simple and self-explanatory. Initially, time-to-close was thought to be useful in the prediction, but at the time of writing the report the feature was not used in the model. Time-to-close would likely be more important as a business analysis task than as a prediction task. The code is left in this report for reference anyway:

funnel['won_date'] = funnel['won_date'].astype('datetime64[ns]')
funnel['first_contact_date'] = funnel['first_contact_date'].astype('datetime64[ns]')
funnel['time_to_close'] = funnel['won_date'] - funnel['first_contact_date']
funnel['time_to_close'] = funnel['time_to_close'].dt.days

funnel.drop('declared_monthly_revenue',axis=1,inplace=True)
funnel.drop('declared_product_catalog_size',axis=1,inplace=True)
funnel.drop('average_stock',axis=1,inplace=True)
funnel.drop('has_company',axis=1,inplace=True)
funnel.drop('seller_id',axis=1,inplace=True)
indexNames = funnel[ (funnel['time_to_close'] < 0)].index
funnel.drop(indexNames , inplace=True)

It is important to note that many of these dimensions were dropped because there was very little data to begin with. There are instances of imputation later in the project that could have been applied to these dropped dimensions; however, there was too little data to reliably impute from as a baseline.

The next line of code defines what we are going to end up trying to predict—a binary TRUE or FALSE classification:

funnel['closed_deal'] = funnel['won_date'].notnull()

This particular project did not attempt to understand time to close, but could easily be revisited at a later time.

Each project typically starts with some basic exploratory data analysis. I want to have a decent understanding of the spread in time-to-close, which SRs and SDRs are closing most often, and which features might have the most importance in a prediction. Let’s start with a basic understanding of which landing pages seem to close the most deals:

pg_id = funnel.loc[funnel['closed_deal'] == True]
pg_id = pg_id.landing_page_id.value_counts()
pg_id[pg_id.values > 5].plot(kind="bar")
plt.title("Landing pages count - closed deals")
plt.savefig("landing_page_counts.png")
plt.show()

Next, it might be interesting to see who the most effective SRs are:

sr = funnel.loc[funnel['closed_deal'] == True]
sr = sr.sr_id.value_counts()
sr[sr.values > 5].plot(kind="bar")
plt.title("closed deals - by sales rep")
plt.savefig("closed_deals_by_sr.png")
plt.show()

Next, we look at the most effective SDRs:

sdr = funnel.loc[funnel['closed_deal'] == True]
sdr = sdr.sdr_id.value_counts()
sdr[sdr.values > 5].plot(kind="bar")
plt.title("closed deals - by sales development rep")
plt.savefig("closed_deals_by_sdr.png")
plt.show()

We’ll finish off the exploratory data analysis with a cursory understanding of how long it takes to close a deal based on various features in the dataset. Again, this part of the business analysis is not strictly pertinent but potentially useful knowledge for further development. The business might find it useful in the future to have a prediction of when a deal might close—thereby allowing some ability to better understand potential revenue.

sns.displot(data=funnel, x="time_to_close", col="origin", kde=True,col_wrap=2)

sns.displot(data=funnel, x="time_to_close", col="business_segment", kde=True,col_wrap=2)

sns.displot(data=funnel, x="time_to_close", col="lead_type", kde=True,col_wrap=2)

sns.displot(data=funnel, x="time_to_close", col="business_type", kde=True,col_wrap=2)

Overall, the data analysis wasn’t too conclusive but gave decent exposure to some of the intricacies of the dataset. You’ll notice that in many areas the data is quite sparse, with too few samples to develop a robust model. It would be advisable, as in most instances, to acquire more data to test and fine-tune hyperparameters.

After the brief data analysis, we can begin to further clean and develop the features going to be used in the prediction:

funnel_model = funnel.copy(deep=True)
funnel_model['contact_day'] = funnel_model['first_contact_date'].dt.strftime('%d')
funnel_model['contact_month'] = funnel_model['first_contact_date'].dt.strftime('%m')
funnel_model['contact_year'] = funnel_model['first_contact_date'].dt.year
funnel_model.drop('time_to_close',axis=1,inplace=True)
funnel_model.drop('won_date',axis=1,inplace=True)
funnel_model.drop('first_contact_date',axis=1,inplace=True)
funnel_model.drop('mql_id',axis=1,inplace=True)
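# Note: drop_duplicates() returns a new DataFrame; as written the result is discarded,
# so funnel_model itself is unchanged by this line.
funnel_model.drop_duplicates()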

There are a few things to note with the preceding code:

Extract the contact day, month, and year because the date alone is not going to be a useful predictor
Drop the time to close (for the time being) as it will not be used in the initial model
Drop the date on which the contract was won. The date itself and its date components also will not be useful
Drop the first contact date as it will not be a useful predictor itself
Drop the unique qualified id because it is not useful
Drop any duplicates in the dataset so we ensure that bias is less likely

The following code addresses a particularly thorny problem from an architectural standpoint. Combining the closed deals and leads datasets produced a significant number of ‘na’ or ‘nan’ values. In order to properly demonstrate the use case, the ‘na’ and ‘nan’ values need to be addressed through imputation. In this investigation, we are going to assume that if the won date is null, then the contract has been lost, which gives us a population of leads not won and a population of leads won (remember the line of code above: funnel['closed_deal'] = funnel['won_date'].notnull()). There is no other identifier for a lead in progress or otherwise.

A simple line of code to determine the number of ‘na’ and ‘nan’ records is the following:

#count missing values (NAs)
missing_count = pd.DataFrame(funnel_model.isna().sum(),columns=['Number'])
missing_count['Percentage'] = round(missing_count / len(funnel_model),2) * 100
missing_count

Given that there were so many missing feature values, imputation should be sufficient to demonstrate how to fill the gaps in the data model. Conceptually, the imputation employed was simple. The unique values of a feature are written to a list, with ‘nan’ filtered out. Then, another line of code randomly chooses values from that list to fill the ‘na’ or ‘nan’ within the dimension:

origin_list = funnel_model["origin"].unique()
origin_list = [x for x in origin_list if str(x) != 'nan']
funnel_model['origin'] = funnel_model['origin'].fillna(pd.Series(np.random.choice(origin_list, size=len(funnel_model.index))))

sdr_id_list = funnel_model["sdr_id"].unique()
sdr_id_list = [x for x in sdr_id_list if str(x) != 'nan']
funnel_model['sdr_id'] = funnel_model['sdr_id'].fillna(pd.Series(np.random.choice(sdr_id_list, size=len(funnel_model.index))))

sr_id_list = funnel_model["sr_id"].unique()
sr_id_list = [x for x in sr_id_list if str(x) != 'nan']
funnel_model['sr_id'] = funnel_model['sr_id'].fillna(pd.Series(np.random.choice(sr_id_list, size=len(funnel_model.index))))

bs_list = funnel_model["business_segment"].unique()
bs_list = [x for x in bs_list if str(x) != 'nan']
funnel_model['business_segment'] = funnel_model['business_segment'].fillna(pd.Series(np.random.choice(bs_list, size=len(funnel_model.index))))

lead_list = funnel_model["lead_type"].unique()
lead_list = [x for x in lead_list if str(x) != 'nan']
funnel_model['lead_type'] = funnel_model['lead_type'].fillna(pd.Series(np.random.choice(lead_list, size=len(funnel_model.index))))

lbp_list = funnel_model["lead_behaviour_profile"].unique()
lbp_list = [x for x in lbp_list if str(x) != 'nan']
funnel_model['lead_behaviour_profile'] = funnel_model['lead_behaviour_profile'].fillna(pd.Series(np.random.choice(lbp_list, size=len(funnel_model.index))))

gtin_list = funnel_model["has_gtin"].unique()
gtin_list = [x for x in gtin_list if str(x) != 'nan']
funnel_model['has_gtin'] = funnel_model['has_gtin'].fillna(pd.Series(np.random.choice(gtin_list, size=len(funnel_model.index))))

btype_list = funnel_model["business_type"].unique()
btype_list = [x for x in btype_list if str(x) != 'nan']
funnel_model['business_type'] = funnel_model['business_type'].fillna(pd.Series(np.random.choice(btype_list, size=len(funnel_model.index))))
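One detail worth knowing about the pattern above: `fillna` with a Series aligns on index labels, and `pd.Series(np.random.choice(...))` gets a fresh 0..n-1 index. Because rows were dropped from the funnel earlier, the highest index label can be larger than n-1, so a row at the end may be left unfilled. That is a likely explanation for a leftover ‘nan’, though I have not confirmed it against the original notebook. A minimal sketch of the same random imputation that assigns by mask instead (`fill_random` is a hypothetical helper name, not part of the original code):

```python
import numpy as np
import pandas as pd

def fill_random(df, column, rng):
    """Fill NaN in df[column] with values drawn at random from the observed ones."""
    observed = df[column].dropna().unique().tolist()
    missing = df[column].isna()
    # Draw one value per missing row and assign by boolean mask,
    # so the result does not depend on the DataFrame's index labels.
    df.loc[missing, column] = rng.choice(observed, size=int(missing.sum()))
    return df

rng = np.random.default_rng(42)

# Toy frame with a gapped index, similar to a frame that has had rows dropped.
toy = pd.DataFrame(
    {"origin": ["organic", None, "paid", None, None]},
    index=[0, 1, 2, 5, 9],
)
toy = fill_random(toy, "origin", rng)
print(toy)
```

Like the original, this draws uniformly from the distinct values rather than in proportion to how often each one occurs.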

After filling the ‘nan’ and ‘na’ values, the SDR id and SR id are combined to build a feature that uses the pairing to predict closure:

funnel_model['sdr_sr'] = funnel_model['sdr_id'] + funnel_model['sr_id']

If we re-run the code for counting the ‘na’ or ‘nan’ values, there should not be any left. You’ll notice there is a single record still left ‘nan’, so I drop it.

#count missing values (NAs)
missing_count = pd.DataFrame(funnel_model.isna().sum(),columns=['Number'])
missing_count['Percentage'] = round(missing_count / len(funnel_model),2) * 100
missing_count

funnel_model.loc[funnel_model['has_gtin'].isnull()]

funnel_model = funnel_model.drop(index=7999)

At this point, the data model should be set properly. First, we select the features to be used in the model. The next step will be to encode the categorical data. For this implementation, one-hot encoding was used as ordinal encoding did not seem to be appropriate.

df1 = funnel_model[['landing_page_id', 'origin', 'sdr_id','sr_id','business_segment',
                   'lead_type','lead_behaviour_profile','has_gtin','business_type','closed_deal',
                  'contact_day','contact_month','contact_year','sdr_sr']].copy()

pd.get_dummies(df1)

pd.get_dummies(df1.drop('closed_deal',axis=1),drop_first=True)
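For readers new to one-hot encoding, this small sketch shows what the two `get_dummies` calls above do differently. The toy categories are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "origin": ["organic", "paid", "social", "paid"],
    "lead_type": ["online", "offline", "online", "online"],
})

# One indicator column per category value.
full = pd.get_dummies(toy)
# drop_first=True drops the first category of each feature, since it is
# implied whenever all of that feature's remaining columns are 0.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping the first level keeps the feature matrix smaller without losing information.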

It was a long road to get here, but the following code moves on to model development. A Support Vector Classifier and Decision Trees from scikit-learn will be used initially for the prediction. We will first define the X and y, ‘X’ being the features and ‘y’ being the values we are trying to predict:

X = pd.get_dummies(df1.drop('closed_deal',axis=1),drop_first=True)
y = df1['closed_deal']

After defining the X and y, complete the train/test split. I have set the test size to be a bit smaller and the training set to be larger. Due to the imbalanced classes, the hope is that more closed deals will end up in the training set to learn from.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
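With imbalanced classes, one option not used in this report is to pass `stratify=y` so that both splits keep the same share of closed deals instead of leaving it to chance. A minimal sketch on synthetic data (the arrays below are stand-ins, not the Olist data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows, exactly 10% of them positive.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([True] * 100 + [False] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101, stratify=y_demo
)

# With stratify, both splits keep the same share of positives.
print(y_tr.mean(), y_te.mean())
```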

Define the model and fit the model to the training data:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

model.fit(X_train,y_train)

The model is insanely simple—almost comical levels of simplicity given the complex nature of the functions being performed. However, it is much easier to create baselines with simple models so that hyper-parameters could be effectively tuned. Next, we will build our predictions:

base_pred = model.predict(X_test)

After the predictions have been made, it is easy enough to determine accuracy through a confusion matrix and classification report:

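# Note: plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions
# use ConfusionMatrixDisplay.from_estimator(model, X_test, y_test) instead.
from sklearn.metrics import confusion_matrix,classification_report,plot_confusion_matrix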

confusion_matrix(y_test,base_pred)

array([[1357,   77],
       [  98,   68]], dtype=int64)

plot_confusion_matrix(model,X_test,y_test)

print(classification_report(y_test,base_pred))

               precision    recall  f1-score   support

       False       0.93      0.95      0.94      1434
        True       0.47      0.41      0.44       166

    accuracy                           0.89      1600
   macro avg       0.70      0.68      0.69      1600
weighted avg       0.88      0.89      0.89      1600

OOF! The results of this model reflect the terribly imbalanced classes that exist. To no surprise, the model is very accurate at predicting which leads won’t close and terrible at predicting the leads that will close. It would be very unwise to rely on the overall accuracy score of 0.89 (89%) given that the model is biased toward predicting that a lead will not close. All this said, we can try to see if a very basic Support Vector Classifier will be a more balanced model.

Feature Importances:

pd.DataFrame(index=X.columns,data=model.feature_importances_,columns=['Feature Importance']).sort_values(by=['Feature Importance'],ascending=False)

Using Grid Search, an optimized Support Vector Classifier was built to determine the best parameters possible for the basic model:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

svm = SVC()
param_grid = {'C':[0.01,0.1,1],'kernel':['linear','rbf']}
grid = GridSearchCV(svm,param_grid)

# Note: the search is fit on the full dataset, not only the training split
grid.fit(X,y)

After the Grid Search is complete, we can find the best parameters for the model to baseline:

grid.best_score_

0.9486120231394622

grid.best_params_

{'C': 0.1, 'kernel': 'linear'}
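Because the search above sees every row, including those later used for testing, the chosen parameters are tuned partly on the test set. It also ranks candidates by accuracy, which the imbalance makes easy to score well on. A variation, sketched here on synthetic data rather than the Olist features, is to search on the training split only and score by F1 on the positive class. The dataset, `cv=5`, and the choice of F1 are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the one-hot encoded funnel data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101, stratify=y_demo
)

param_grid = {"C": [0.01, 0.1, 1], "kernel": ["linear", "rbf"]}

# scoring="f1" ranks candidates by F1 on the positive class rather than accuracy.
grid_demo = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
grid_demo.fit(X_tr, y_tr)  # the test rows are never seen during the search

held_out_f1 = grid_demo.score(X_te, y_te)  # F1 on the held-out rows
print(grid_demo.best_params_)
print(held_out_f1)
```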

Not a terrible score! But, as we noted above, we need to break this accuracy number down into manageable components. We can do this by first building the simple model using the parameters found in the Grid Search:

model = SVC(kernel='linear', C=0.1)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

After the model has been trained, we can now see the accuracy for the second model we built:

print(classification_report(y_test, y_pred))

                precision    recall  f1-score   support

       False       0.96      0.98      0.97      1434
        True       0.83      0.63      0.72       166

    accuracy                           0.95      1600
   macro avg       0.89      0.81      0.84      1600
weighted avg       0.94      0.95      0.95      1600

Based on the classification report above, the Support Vector Classifier performs significantly better than the tree-based method, especially at predicting the leads that will eventually close. The F1 score for True rises by nearly 0.30, from 0.44 to 0.72 (precision from 0.47 to 0.83, recall from 0.41 to 0.63), while performance on False stays high. A confusion matrix helps further contextualize accuracy:

plot_confusion_matrix(model,X_test,y_test)

Conclusion:

Initial modeling seems to indicate that the Support Vector Classifiers are better predictors than the Tree Based methods
Accuracy, especially for this use case, needs to be balanced across the True and False—especially if the data will continue to be imbalanced in the future
There are opportunities to leverage other features that were part of the original dataset, but had poor data quality. Assuming the data quality could be improved, there would be increased opportunities to improve business outcomes

Opportunities & Future work:

The imbalanced nature of the dataset should be addressed through oversampling, weighting, and attempting to use gradient boosted algorithms
Prediction of the time to close will likely be another worthwhile venture, especially when attempting to predict future sales, prioritizing lines of business, and even resource planning
Thorough discussion would need to occur between those that would use something like this in process and those that have designed the algorithm itself—industrialization would need to be done with care and monitored closely over time
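As a starting point for the weighting idea in the list above, scikit-learn's `SVC` accepts `class_weight="balanced"`, which scales the penalty for each class inversely to its frequency. The sketch below compares recall on the minority class with and without it on synthetic data. It is only an illustration of the mechanism, not a result for the Olist dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101, stratify=y_demo
)

recalls = {}
for weighting in (None, "balanced"):
    clf = SVC(kernel="linear", C=0.1, class_weight=weighting)
    clf.fit(X_tr, y_tr)
    # Recall on the positive (minority) class of the held-out rows.
    recalls[weighting] = recall_score(y_te, clf.predict(X_te))
    print(f"class_weight={weighting}: minority recall = {recalls[weighting]:.2f}")
```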

Knowledge Management as a Keystone in your Data Science & Analytics Strategy

Tyler Betthauser — Sun, 17 Jan 2021 02:38:54 +0000

Knowledge Management in Analytics

In 2021, I was asked to prepare a roadmap for a company just beginning its Data Science journey. Knowledge management would be one of the keys to a healthy data science and analytics strategy for 2021 and beyond. Knowledge management (as defined for the context of this article) is the set of processes and tools necessary to capture, disseminate, and present information generated throughout the organization, whether that be lessons learned, best practices, locations of data, project management information, tickets, or a whole host of other artifacts.

Documentation, and knowledge management more broadly, is not a sexy topic. But the unbridled frustration that occurs when analysts, data scientists, and machine learning engineers spend hours looking for data in random databases and obscure tables can’t be overstated. It’s just infuriating. Most Data Scientists might acknowledge the importance of developing a robust knowledge management solution but never talk specifically about how they might deploy such a project within their organisation. Typically, there are a few reasons why most companies get knowledge management wrong with regard to analytics:

Return on investment is not immediately apparent
It is easy to borrow against future resources and the consequences might be perceived to be low
Documentation is boring
Difficult to maintain over time

The losses can be huge over time. Consider a single Data Scientist who makes $125,000 per year, roughly $67.00 USD per hour. Without a robust knowledge management solution, your analytics organization could be spending dozens of hours per week looking for data strewn about the business, struggling to generate queries, searching for data dictionaries, trying to figure out the transformations applied to a dataset, etc. It is easy to see that tens of thousands of dollars can be lost per year simply because people lack the tools to do their work efficiently.
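To put a rough number on that, here is a back-of-the-envelope calculation. The $67 hourly figure comes from the paragraph above; the 10 hours lost per week and 48 working weeks per year are assumptions I picked for illustration, so substitute your own.

```python
def annual_search_cost(hourly_rate, hours_lost_per_week, working_weeks=48):
    """Yearly cost of time spent hunting for data instead of analysing it."""
    return hourly_rate * hours_lost_per_week * working_weeks

# $67/hour is the figure quoted above; 10 hours/week and 48 weeks are assumptions.
one_person = annual_search_cost(hourly_rate=67, hours_lost_per_week=10)
print(f"${one_person:,.0f} per year")  # $32,160 per year
```

Even for one person losing a quarter of the week, the cost lands in the tens of thousands of dollars.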

Gaps in documentation (or a complete lack of it) are an exponential problem: by the time you notice there is an issue, the gradient has exploded! Therefore, companies who start off on the right foot will have an easier time maintaining and reaping the benefits of knowledge management. All that said, starting is the next best step.

Where do you start?

It is probably easiest, and most useful, to begin with an entity-relationship (ER) diagram. These artifacts are likely the most important pieces of documentation that can be made available to data professionals. ER diagrams can come in all sorts of shapes, sizes, and complexities. However, the nature of these artifacts remains the same: these are the dictionaries that can help determine the location of data and how that data relates to other data objects/sources. There are tons of templates out there to model your work on, but I like to use draw.io. The software is free, no license required, and simple to use, much like Visio or other flowcharting software.

I like to treat ER Diagrams like catalogues. They should be structured in such a way that allows the user to pose a question or search with a subject matter in mind. For example, a data professional in Marketing might want to look for data related to ‘marketing’. Therefore, maybe your ‘data catalogue’ starts simply like this:

Each will have different structures, names, themes, etc., but overall, for ease of use, I find this the best way to help someone find data related to a concept. One might even draw a parallel to a graph structure. Next, move to a ‘source’ or maybe even a ‘location’ of data.

The second layer becomes extremely important because it is a catalyst for understanding exactly where data of this type is stored and accessed. It is important to be somewhat verbose in this layer of the graph. It should be quite clear the location of the data in question. Finally, but not necessarily so, I like to expand the graphs to the tables where the data exists.

The outer parts of the graph are where the complexity can become cumbersome (but it is worth it). This is the basic structure I tend to follow when working on a project such as an ER Diagram. The version I like to use is not textbook quality, but it is a framework I have adapted over time while serving in many different roles and companies.
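The three layers described above (topic, then source, then tables) can also be captured in a small machine-readable form alongside the diagram. This is only a sketch of one possible shape; every topic, source, and table name in it is hypothetical.

```python
# Topic -> source -> tables. Every name here is hypothetical.
catalogue = {
    "marketing": {
        "crm_database": ["campaigns", "leads"],
        "web_analytics_warehouse": ["page_views", "sessions"],
    },
    "sales": {
        "crm_database": ["opportunities", "closed_deals"],
    },
}

def find_tables(catalogue, topic):
    """Return (source, table) pairs filed under a topic, or [] if unknown."""
    sources = catalogue.get(topic, {})
    return [(source, table) for source, tables in sources.items() for table in tables]

for source, table in find_tables(catalogue, "marketing"):
    print(f"{source}.{table}")
```

A structure like this lets someone start from a subject and walk outward to where the data lives, the same way the diagram is meant to be read.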

A Data Dictionary is Nice: But now what?

After creating your data dictionary tool, some might be done! There will be situations out there where there is no need to continue. However, in some instances the journey might continue on to other knowledge management tools:

Develop a knowledge center in a tool like SharePoint or Confluence: find a way to consolidate all of the artifacts related to analytics on a single platform (think, ‘one-stop shopping’)
Start a series of training or podcasts that encourage data literacy throughout your company. Generate and disseminate knowledge in ways that enable you, as a data professional, to be more effective
Find time to communicate to the rest of your organization current projects, status, and requests for feedback

When does it all End?

Well... the job isn’t ever done! But there is a point where maintenance is not as terrible. Largely, the end game will be determined by each unique situation. Take steps to spread the work out amongst a few different team members, if possible. Another idea might be to set a rotation where a day or two per month is devoted solely to collecting knowledge, documenting that knowledge, and writing a brief summary that lets other teams know there have been updates.

What is the value in the end?

The easy part about writing this article is that there is little to be debated. Organizations that collect, store, disseminate, and maintain their knowledge can have a competitive edge in creating a sustainable business. Because of the growth of data science, business intelligence, and analytics within modern companies, it only makes more sense to better organize information generated by a burgeoning profit center within contemporary institutions.

Specifically, in the context of the analytics operations there are some key value propositions:

Efficiency in development of queries, models, data models, algorithms, can help reduce go-to-market time—thereby potentially increasing return on your investment
Your analysts, scientists, and engineers will be less likely to be frustrated with finding important data
Better baseline future projects/initiatives by having a quick reference on what went wrong and right on past developments

Decision Intelligence: Data is an enabler to better decision making

Tyler Betthauser — Thu, 31 Dec 2020 00:09:20 +0000

Charles Elwood (SolisMatica), Andrew Hoekstra (Pointe Vector), and I break down how to get data into the hands of those within the organization that make decisions everyday and why enabling decision intelligence will be important to unlocking sales, efficiencies, and innovation. The team also tackles dealing with the fear (yes, fear) that accompanies the use of data that might reflect a poor image of performance or an unpopular truth.

Data Preparation & Our Top Tips

Tyler Betthauser — Wed, 09 Dec 2020 13:19:40 +0000

Conaxon was invited back to speak with SolisMatica and Pointe Vector to discuss our top tips when it comes to preparing your data for a visualization or machine learning project. This was an exciting and (at times) light-hearted approach to diving into the most important step in working with data.

So you want a career in Data Analytics--Here's how we think you can do it FAST!

Tyler Betthauser — Fri, 20 Nov 2020 18:47:54 +0000

In this livestream Pointe Vector, SolisMatica, and Conaxon team up to talk about what it takes to get a job in the realm of data analytics. Largely, our thoughts could apply to any career you want to get into, but we tend to focus on key strategies that have worked for us specifically in data analytics.

Data Analytics in the Automotive Industry

Tyler Betthauser — Fri, 20 Nov 2020 18:21:00 +0000

Much of my early career has been spent in the automotive industry. Thanks to Charles Elwood from SolisMatica, I have been able to share my journey.

There is so much data out there that can be used to develop key insights into customers, product use, quality, and so much more. I spend some time with some industry experts from SolisMatica, Pointe Vector, and We Predict Inc talking about the use of data analytics within the automotive industry—past, present, and future. Check out the link to my page and listen in on the action!
