The post Machine learning best practices: Autotune models to avoid local minimum breakdowns appeared first on Subconscious Musings.

Hyperparameters are the algorithm options you "turn and tune" when building a learning model. They cannot be learned by the algorithm itself, so they must be assigned before training begins. A lot of manual effort in machine learning is spent finding the optimal set of hyperparameters for a model. How can you find a suitable hyperparameter set more efficiently than by trial and error?

There are several ways to solve this problem by automatically tuning, or autotuning, the parameters, including:

- Grid search: Grid search is simply an exhaustive search through a manually specified subset of the hyperparameter space. It must be guided by some performance metric, such as one measured by cross-validation on the training set or evaluation on a held-out validation set. Start from evenly spaced points, compute the objective function at each point, and select the point with the smallest value as the solution. This is not very realistic when the parameter space is huge.
- Bayesian optimization: Bayesian optimization is a methodology for the global optimization of noisy black-box functions. Applied to hyperparameter optimization, it consists of building a statistical model of the function that maps hyperparameter values to the objective evaluated on a validation set.
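As a rough sketch of the grid search idea (the objective function, hyperparameter names, and grid values below are invented for illustration, not taken from any particular library):

```python
import itertools

def grid_search(objective, grid):
    """Exhaustively evaluate every combination in `grid` and return
    the hyperparameter set with the smallest objective value."""
    best_params, best_score = None, float("inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# A made-up validation objective whose optimum happens to sit at
# learning_rate=0.1, depth=4 (purely for demonstration).
def toy_objective(p):
    return (p["learning_rate"] - 0.1) ** 2 + (p["depth"] - 4) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best, score = grid_search(toy_objective, grid)
```

The cost is the product of the grid sizes (here 3 × 3 = 9 evaluations), which is exactly why exhaustive search breaks down on large hyperparameter spaces.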

For more ways to autotune hyperparameters, read this autotune paper by my colleagues at SAS, or check out this blog post about hyperparameter tuning.

My next post will be about generalization. If there are other tips you want me to cover, or if you have tips of your own to share, leave a comment here. You can read the whole series by clicking on the image below.


The post Machine learning best practices: Put your models to work appeared first on Subconscious Musings.

It’s common to build models on historical training data and then apply the model to new data to make decisions. This process is called model deployment, or scoring. I often hear data scientists say, “It took my team weeks, or even months, to deploy our model.” Sometimes, after all your hard work, some models never get deployed.

Each model includes a lot of data preparation logic. You have to aggregate many data sources, include the model formulae, and layer it all with rules or policies. To summarize, a scoring model consists of data preparation + rules + model formulae + potentially more rules.

The data preparation logic is essential for scoring. This data wrangling phase – which includes defining all of the transformation logic to create new features – is typically handled by a data scientist.

Then, to deploy the model you have to replicate the data wrangling phase. Often, IT completes this task to integrate the model into a company’s decision support systems. Unfortunately, most organizations don’t have enough rigor and metadata to re-create the data wrangling phase for scoring. As a result, many of the backward data source dependencies for deriving the new scoring tables get lost. This is by far the biggest reason why most organizations take too long to put a model to work. How can you avoid these frustrations?

To get models into production, implement best practices for managing predictive models in a production environment, including:

- Determine the business objective.
- Access and manage the data.
- Develop the model.
- Validate the model.
- Deploy the model.
- Monitor the model.

One tip is to use tools that let you automatically capture and bind the data preparation logic, including preliminary transformations, with the model score code. The data engineer or IT staff responsible for deploying the model then has a blueprint for implementation. They do not have to piece the data engineering and algorithmic steps back together, which is a huge time savings. The data scientist should participate in running initial scoring tests prior to putting the model into production.
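A minimal sketch of this binding idea (the feature names, coefficients, and logistic score below are hypothetical, not any particular product's score code):

```python
import math

def prepare(record):
    # Data wrangling logic captured once, reused for training and scoring.
    return {
        "balance_log": math.log1p(record["balance"]),
        "is_new": 1.0 if record["tenure_months"] < 6 else 0.0,
    }

def score(features, coef, intercept):
    # Model formula: a simple logistic score over the prepared features.
    z = intercept + sum(coef[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def deployable_score(record, coef, intercept):
    # Data preparation + model formula bound together, so IT can deploy
    # the model without reverse-engineering the wrangling phase.
    return score(prepare(record), coef, intercept)
```

Because `prepare` travels with the score code, the backward data source dependencies are never lost between the data scientist and IT.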

Advanced enterprises are also creating standardized analytical data marts to foster replication and reuse of data for analytics. Data scientists and IT can then work collaboratively to harvest directly from these analytical data marts to build and deploy more models faster. You can tune this process to the point where some common modeling efforts run like a model factory. It is important that the data scientist contribute new feature logic to the data marts. You don’t want to handcuff your data scientists by not letting them pioneer new features at the detailed data source levels.

If there are other tips you want me to cover, or if you have tips of your own to share, leave a comment here. My next post will be about autotuning. If you missed any previous posts, click the image below to read the whole series.


The post Machine learning best practices: combining lots of models appeared first on Subconscious Musings.

Data scientists commonly use machine learning algorithms, such as gradient boosting and decision forests, that automatically build lots of models for you. The individual models are then combined to form a potentially stronger solution. One of the most accurate machine learning classifiers is gradient boosting trees. In my own supervised learning efforts, I almost always try each of these models as challengers.

When using a random forest, be careful not to make the trees too shallow. The goal of decision forests is to grow many large, deep trees at random (think forests, not bushes). An individual deep tree certainly tends to overfit the data and not generalize well, but a combination of them captures the nuances of the space in a generalized fashion.

Some algorithms fit better than others within specific regions or boundaries of the data. A best practice is to combine different modeling algorithms. You may also want to place more emphasis or weight on the modeling method that has the overall best classification or fit on the validation data. Sometimes two weak classifiers can do a better job than one strong classifier in specific spaces of your training data.

As you become experienced with machine learning and master more techniques, you’ll find yourself continuing to address rare event modeling problems by combining techniques.

Recently, one of my colleagues developed a model to identify unlicensed money service businesses. The event level was about 0.09%. To solve the problem, he used multiple techniques:

- First, he developed k-fold samples by randomly selecting a subsample of nonevents in each of his 200 folds, while making sure he kept all the events in each fold.
- He then built a random forest model in each fold.
- Lastly, he ensembled the 200 random forests, which ended up being the best classifier among all the models he developed.
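The steps above can be sketched as follows (the fold builder and averaging ensemble are illustrative stand-ins; the per-fold random forest training is abstracted into arbitrary callable models):

```python
import random
from statistics import mean

def balanced_folds(events, nonevents, n_folds, n_nonevents, seed=0):
    # Every fold keeps ALL rare events and draws a fresh random
    # subsample of nonevents.
    rng = random.Random(seed)
    return [events + rng.sample(nonevents, n_nonevents)
            for _ in range(n_folds)]

def ensemble_predict(models, x):
    # Average the scores of the per-fold models (here each "model"
    # is just a callable; in practice, a trained random forest).
    return mean(m(x) for m in models)
```

In the real problem each fold's forest trains independently, which is what makes the workload embarrassingly parallel across data nodes.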

This is a pretty big computational problem, so it's important to be able to build the models in parallel across several data nodes so that the models train quickly.

If there are other tips you want me to cover, or if you have tips of your own to share, leave a comment on this post.

My next post will be about model deployment, and you can click the image below to read all 10 machine learning best practices.


The post Machine learning best practices: detecting rare events appeared first on Subconscious Musings.

Machine learning commonly requires the use of highly unbalanced data. When detecting fraud or isolating manufacturing defects, for example, the target event is extremely rare – often way below 1 percent. So, even if you’re using a model that’s 99 percent accurate, it might not correctly classify these rare events.

A lot of data scientists frown when they hear the word sampling. I like to use the term focused data selection, where you construct a biased training data set by oversampling or undersampling. As a result, my training data may end up slightly more balanced, often with a 10 percent event level or more (See Figure 1). This higher ratio of events can help the machine learning algorithm learn to better isolate the event signal.

For reference, undersampling removes observations at random to downsize the majority class. Oversampling up-sizes the minority class at random to decrease the level of class disparity.
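A minimal sketch of undersampling toward a chosen event level (the function and parameter names are mine, for illustration only):

```python
import random

def undersample(majority, minority, target_event_rate=0.10, seed=0):
    # Keep every rare event; randomly downsize the majority class so
    # events make up roughly `target_event_rate` of the training set.
    rng = random.Random(seed)
    n_major = round(len(minority) * (1 - target_event_rate) / target_event_rate)
    return minority + rng.sample(majority, min(n_major, len(majority)))
```

With a 10 percent target rate, each rare event is matched by roughly nine majority-class observations, regardless of how extreme the original imbalance was.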

Another rare event modeling strategy is to use decision processing to place greater weight on correctly classifying the event.

The table above shows the cost associated with each decision outcome. In this scenario, classifying a fraudulent case as not fraudulent has an expected cost of $500. There's also a $100 cost associated with falsely classifying a non-fraudulent case as fraudulent.

Rather than developing a model based on some statistical assessment criterion, here the goal is to select the model that minimizes total cost: total cost = (false negatives × $500) + (false positives × $100). In this strategy, accurately specifying the costs of the two types of misclassification is key to the success of the approach.
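This cost-based selection can be sketched as follows (the candidate models and their error counts are hypothetical):

```python
def total_cost(false_negatives, false_positives, fn_cost=500, fp_cost=100):
    # Expected decision cost: $500 per missed fraud case,
    # $100 per false alarm.
    return false_negatives * fn_cost + false_positives * fp_cost

def pick_model(candidates):
    # candidates maps a model name to its (false negatives,
    # false positives) counts on validation data.
    return min(candidates, key=lambda name: total_cost(*candidates[name]))
```

Note that a model with more total errors can still win if its errors are the cheap kind.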

My next post will be about combining lots of models. And you can read the whole series by clicking on the image below.

If there are other tips you want me to cover, or if you have tips of your own to share, leave a comment here.


The post Machine learning best practices: the basics appeared first on Subconscious Musings.

Regardless of what you call it, I’ve spent more than 30 years building models that help global companies solve some of their most pressing problems. I’ve also had the good fortune to learn from some of the best data scientists on the planet, including Will Potts, Chief Data Scientist at Capital One, Dr. Warren Sarle, a distinguished researcher here at SAS, and Dr. William Sanders while I was at the University of Tennessee.

Through hundreds of projects and dozens of mentors over the years, I’ve caught on to some of the best practices for machine learning. I’ve narrowed those lessons down to my top ten tips. These are tips and tricks that I’ve relied on again and again over the years to develop the best models and solve difficult problems.

I’ll be sharing my tips in a series of posts over the next few weeks, starting with the first three tips here. The next tips will be longer, but these first three are short and sweet, so I've included them in one post:

*Look at your data.*

You spend 80 percent or more of your time preparing a training data set, so prior to building a model, please look at your data at the observational level. I always use PROC PRINT with OBS=20 in Base SAS^{®}, the FETCH action in SAS^{®} Viya, and the HEAD or TAIL functions in Python to see and almost touch the observations. You can quickly discern whether you have the right data in the correct form just by looking at it. It’s not uncommon to make initial mistakes when building out your training data, so this tip can save you a lot of time. Naturally, you then want to generate measures of central tendency and dispersion. To isolate key trends and anomalies, compute summary statistics for your features against your label. If the label is categorical, compute summary measures using the label as a group-by variable. If the label is interval, compute correlations. If you have categorical features, use those as your by group.

*Slice and dice your data.*

Usually, there’s some underlying substructure in your data. So I often slice my data up like a pizza – although the slices are not all the same size – and build separate models for each slice. I may use a group-by variable like REGION or VEHICLE_TYPE that already provides built-in stratification for my training data. When I have a target, I also build a shallow decision tree and then build separate models for each segment. I rarely use clustering algorithms to build segments if I have a target. I just don’t like ignoring my target.

*Remember Occam’s Razor.*

The object of Occam learning is to output a succinct representation of the training data. The rationale is that you want as simple a model as possible to make informed decisions. Many data scientists no longer believe in Occam’s Razor, since building more complex models to extract as much as you can from your data is an important technique. However, I also like to build simple, white-box models using regression and decision trees. Or I’ll use a gradient boosting model as a quick check on how well my simple models are performing. I might add first-order interactions or other basic transformations to improve the performance of my regression model. I commonly use L1 regularization to shrink down the number of effects in my model (watch for more about this in an upcoming post). Simpler models are also easier to deploy, which makes the IT and systems operations teams happy. Finally, using the simplest model possible also makes it easier to explain results to business users, who will want to understand how you’ve arrived at a conclusion before making decisions with the results.
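The group-by summary check from the first tip can be sketched in plain Python (the column names are hypothetical):

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_label(rows, feature, label):
    # Central tendency and dispersion of one feature, grouped by a
    # categorical label -- a quick way to spot trends and anomalies.
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row[feature])
    return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0)
            for k, v in groups.items()}
```

In practice you would reach for PROC MEANS with a CLASS statement or a pandas `groupby`, but the idea is the same.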

My next post will be about detecting rare events, and you can click on the image below to continue reading all ten machine learning best practices as I publish them.

If there are other tips you want me to cover, leave a comment here.


The post How to use regularization to prevent model overfitting appeared first on Subconscious Musings.

According to Wikipedia, regularization "refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. This information usually comes in the form of a penalty for complexity, such as restrictions for smoothness or bounds on the vector space norm."

Shrinkage can be thought of as "a penalty for complexity." Why? If we set some parameters of the model to exactly zero, then the model effectively shrinks to a lower dimensionality and becomes less complex. Analogously, if we use a shrinkage mechanism to zero out some of the parameters or to smooth them (so that the differences between parameters are not very large), then we are decreasing complexity by reducing dimensions or making the model smoother.

Why do we want to use a shrinkage mechanism on training data? In real life, we often encounter problems where we have little data but many features. Such problems become ill-posed, meaning no unique solution can be found.

In fact, without using shrinkage, we can find many, if not infinitely many, solutions. A model learned from such a data set will often overfit: it fits the training data perfectly, but it does not generalize well to unseen data (see Figure 1).

Before we jump into this figure, let us first explain what it shows. In Figure 1, the blue line is the true underlying model. The blue circles are noisy samples drawn from the model. We separate the samples into two groups – training samples, denoted by solid blue circles, and testing samples, represented by dotted blue circles. Our problem can be expressed as:

\begin{equation}\label{eq:LS}
\hat\beta = \mathrm{arg}\min_{\beta}\sum_{j=1}^{n}\Big(y_j-\sum_{i}\beta_i\phi_i(x_j)\Big)^2
\end{equation}

where \(\beta_i\) are the parameters we want to learn, \(x_j\) is the \(j\)th data point, \(y_j\) is the \(j\)th target value in the training data set, and \(\phi_i\) are the basis functions (features).

Assume that the red line is the regression model we learn from the training data set. It can be seen that the learned model fits the training data set perfectly, while it cannot generalize well to the data not included in the training set. There are several ways to avoid the problem of overfitting.

To remedy this problem, we could:

- Get more training examples.
- Use a simple predictor.
- Select a subsample of features.

In this blog post, we focus on the second and third ways to avoid overfitting by introducing regularization on the parameters \(\beta_i\) of the model.

Three types of regularization are often used in such a regression problem:

- *\({l}_2\)* regularization (use a simpler model)
- *\({l}_1\)* regularization (select a subsample of features)
- *\({l}_{12}\)* regularization (both)

*\({l}_2\)* regularization, which adds a penalty of the *\({l}_2\)* norm on the parameters \(\beta_i\), encourages the sum of the squares of the parameters \(\beta_i\) to be small. The original problem is transformed into ridge regression, which can be expressed as

\begin{eqnarray}\label{eq:ridge}
\hat\beta = \mathrm{arg}\min_{\beta}\sum_{j=1}^{n}\Big(y_j-\sum_{i}\beta_i\phi_i(x_j)\Big)^2+\lambda\sum_i\beta_i^2
\end{eqnarray}

where the shrinkage parameter \(\lambda\) needs to be tuned via cross-validation.

The *\({l}_2\)* regularization can be explained from a geometric perspective. As shown in Figure 2, the residual sum of squares has elliptical contours, represented by a black curve. The *\({l}_2\)* constraint is represented by the red disk. The first point where the elliptical contours hit the constraint region is the solution of ridge regression. *\({l}_2\)* regularization keeps all predictors by jointly shrinking the corresponding coefficients. It also reduces the possible solutions to the points in the intersection of the two contours. The intersection is a much smaller set than the original parameter space, so it reduces the complexity of the model and smooths it.
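As an illustrative sketch (a single coefficient, hypothetical function name, toy data), the ridge objective can be minimized by plain gradient descent:

```python
def ridge_fit_1d(xs, ys, lam, lr=0.01, steps=5000):
    # Gradient descent on  sum_j (y_j - b*x_j)^2 + lam * b^2.
    # The closed-form minimizer is b = sum(x*y) / (sum(x^2) + lam),
    # so a larger lam shrinks the coefficient toward zero.
    b = 0.0
    for _ in range(steps):
        grad = sum(-2 * x * (y - b * x) for x, y in zip(xs, ys)) + 2 * lam * b
        b -= lr * grad
    return b
```

With `lam=0` this recovers ordinary least squares; increasing `lam` smoothly shrinks the coefficient, which is the jointly-shrinking behavior described above.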

*\(l_1\)* regularization instead uses a penalty term of the *\({l}_1\)* norm, encouraging the sum of the absolute values of the parameters to be small. Problem (\ref{eq:LS}) then becomes lasso regression, which can be expressed as:

\begin{eqnarray}\label{eq:L1}
\hat\beta = \mathrm{arg}\min_{\beta}\sum_{j=1}^{n}\Big(y_j-\sum_{i}\beta_i\phi_i(x_j)\Big)^2+\lambda\sum_i|\beta_i|
\end{eqnarray}

Unlike the *\({l}_2\)* constraint, the *\({l}_1\)* constraint is represented by a red diamond, as seen in Figure 3. The diamond has corners; if the solution occurs at a corner, then it has one parameter \(\beta_i\) equal to zero.

Lasso regression uses this shrinkage mechanism to zero out some parameters \(\beta_i\) and thereby de-select the corresponding features \(\phi_i(x_j)\). Because of its strong tendency to set some parameters to zero, it is often used to select features when we know the underlying signal is sparse.
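The zeroing-out behavior comes from the soft-thresholding operator. A single-coefficient closed-form sketch (illustrative only, not production code):

```python
def soft_threshold(r, t):
    # Pull r toward zero by t and clamp at exactly zero; this clamp is
    # what makes lasso coefficients land exactly on zero.
    if r > t:
        return r - t
    if r < -t:
        return r + t
    return 0.0

def lasso_fit_1d(xs, ys, lam):
    # Closed-form single-coefficient lasso:
    # minimize  sum_j (y_j - b*x_j)^2 + lam * |b|.
    r = sum(x * y for x, y in zip(xs, ys))
    return soft_threshold(r, lam / 2) / sum(x * x for x in xs)
```

For a large enough `lam` the coefficient is exactly zero and the feature is de-selected, unlike the ridge case, which only shrinks it.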

Elastic net regularization is a tradeoff between *\({l}_2\)* and *\({l}_1\)* regularization: its penalty is a mix of the *\(l_1\)* and *\(l_2\)* norms. It can perform feature selection while still not imposing too much sparsity on the features (discarding too many of them) by placing a mixture of *\({l}_2\)* and *\({l}_1\)* regularization on the parameters \(\beta_i\), as seen in (\ref{eq:L12}).

\begin{eqnarray}\label{eq:L12}
\hat\beta = \mathrm{arg}\min_{\beta}\sum_{j=1}^{n}\Big(y_j-\sum_{i}\beta_i\phi_i(x_j)\Big)^2+\lambda_1\sum_i|\beta_i|+\lambda_2\sum_i\beta_i^2
\end{eqnarray}

The selection among these penalties depends on the problem. If the signals are truly sparse, an *\(l_1\)* or *\(l_{12}\)* penalty can select the hidden signals from noisy data, while it is almost impossible for *\(l_2\)* to fully recover sparse signals. For problems whose features are not sparse at all, I often find that *\(l_2\)* regularization outperforms *\(l_1\)* regularization. In prediction applications, ridge regression with the *\(l_2\)* penalty is more common and often recommended for many modeling problems. However, when you have many features and want to reduce the complexity of the model by de-selecting some of them, you may want to impose an *\(l_1\)* penalty or go for a more balanced approach like the elastic net.

Therefore, it is best to collect as many samples as possible. Even with a lot of samples, a simpler model with *\(l_2\)* regularization will often outperform other choices.

For more on regularization, be sure to read these posts in the SAS Community:

- Tuning Matters: An Example of LASSO Tuned via Validation Data and Cross Validation
- Tip: Top five reasons for using penalized regression for modeling your high-dimensional data


The post Why do stacked ensemble models win data science competitions? appeared first on Subconscious Musings.

Building powerful ensemble models has many parallels with building successful human teams in business, science, politics, and sports. Each team member makes a significant contribution and individual weaknesses and biases are offset by the strengths of other members.

The simplest kind of ensemble is the unweighted average of the predictions of the models that form a model library. For example, if a model library includes three models for an interval target (as shown in the following figure), the unweighted average would entail dividing the sum of the predicted values of the three candidate models by three. In an unweighted average, each model takes the same weight when an ensemble model is built.

More generally, you can think about using *weighted* averages. For example, you might believe that some of the models are better or more accurate and you want to manually assign higher weights to them. But an even better approach might be to estimate these weights more intelligently by using another layer of learning algorithm. This approach is called model stacking.

__Model stacking__ is an efficient ensemble method in which the predictions, generated by using various machine learning algorithms, are used as inputs in a second-layer learning algorithm. This second-layer algorithm is trained to optimally combine the model predictions to form a new set of predictions. For example, when linear regression is used as second-layer modeling, it estimates these weights by minimizing the least square errors. However, the second-layer modeling is not restricted to only linear models; the relationship between the predictors can be more complex, opening the door to employing other machine learning algorithms.
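A minimal sketch of second-layer least-squares stacking for a library of two models (no intercept, hypothetical data; real stacking would fit the second layer on held-out predictions to avoid leakage):

```python
def stack_weights(preds, ys):
    # Second-layer linear regression (no intercept) for a library of
    # two first-layer models: solve the 2x2 normal equations for the
    # weights that minimize the squared error of the combination.
    p1, p2 = preds
    a11 = sum(p * p for p in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(p * p for p in p2)
    b1 = sum(p * y for p, y in zip(p1, ys))
    b2 = sum(p * y for p, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def stacked_predict(weights, model_preds):
    # Combine one observation's first-layer predictions.
    return sum(w * p for w, p in zip(weights, model_preds))
```

Setting both weights to 1/2 would recover the unweighted average; letting the second layer estimate them is what makes stacking adaptive.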

Ensemble modeling and model stacking are especially popular in data science competitions, in which a sponsor posts a training set (which includes labels) and a test set (which does not include labels) and issues a global challenge to produce the best predictions of the test set for a specified performance criterion. The winning teams almost always use ensemble models instead of a single fine-tuned model. Often individual teams develop their own ensemble models in the early stages of the competition, and then join their forces in the later stages.

On the popular data science competition site Kaggle you can explore numerous winning solutions through its discussion forums to get a flavor of the state of the art. Another popular data science competition is the KDD Cup. The following figure shows the winning solution for the 2015 competition, which used a three-stage stacked modeling approach.

The figure shows that a diverse set of 64 single models were used to build the model library. These models are trained by using various machine learning algorithms. For example, the green boxes represent gradient boosting models (GBM), pink boxes represent neural network models (NN), and orange boxes represent factorization machines models (FM). You can see that there are multiple gradient boosting models in the model library; they probably vary in their use of different hyperparameter settings and/or feature sets.

At stage 1, the predictions from these 64 models are used as inputs to train 15 new models, again by using various machine learning algorithms. At stage 2 (ensemble stacking), the predictions from the 15 stage 1 models are used as inputs to train two models by using gradient boosting and linear regression. At stage 3 ensemble stacking (the final stage), the predictions of the two models from stage 2 are used as inputs in a logistic regression (LR) model to form the final ensemble.

In order to build a powerful predictive model like the one that was used to win the 2015 KDD Cup, __building a diverse set of initial models__ plays an important role! There are various ways to enhance diversity such as using:

- Different training algorithms.
- Different hyperparameter settings.
- Different feature subsets.
- Different training sets.

A simple way to enhance diversity is to train models by using different machine learning algorithms. For example, adding a factorization model to a set of tree-based models (such as random forest and gradient boosting) provides nice diversity because a factorization model is trained very differently than decision tree models are. For the same machine learning algorithm, you can enhance diversity by using different hyperparameter settings and subsets of variables. If you have many features, one efficient method is to choose subsets of the variables by simple random sampling. Choosing subsets of variables could be done in a more principled fashion, based on some computed measure of importance, though that introduces the large and difficult problem of feature selection.

In addition to using various machine learning training algorithms and hyperparameter settings, the KDD Cup solution shown above uses seven different feature sets (F1-F7) to further enhance the diversity. Another simple way to create diversity is to generate various versions of the training data. This can be done by bagging and cross validation.

Overfitting is an omnipresent concern in building predictive models, and every data scientist needs to be equipped with tools to deal with it. An overfitting model is complex enough to perfectly fit the training data, but it generalizes very poorly for a new data set. Overfitting is an especially big problem in model stacking, because so many predictors that all predict the same target are combined. Overfitting is partially caused by this collinearity between the predictors.

The most efficient techniques for training models (especially during the stacking stages) include using cross validation and some form of regularization. To learn how we used these techniques to build stacked ensemble models, see our recent SAS Global Forum paper, *"Stacked Ensemble Models for Improved Prediction Accuracy."* That paper also shows how you can generate a diverse set of models by various methods (such as forests, gradient boosted decision trees, factorization machines, and logistic regression) and then combine them with stacked ensemble techniques such as regularized regression methods, gradient boosting, and hill climbing methods.

The following image provides a simple summary of our ensemble approach. The complete model building approach is explained in detail in the paper. A computationally intense process such as this benefits greatly by running in a distributed execution environment offered in the SAS^{®} Viya platform by using SAS^{®} Visual Data Mining and Machine Learning.

Applying stacked models to real-world big data problems can produce greater prediction accuracy and robustness than do individual models. The model stacking approach is powerful and compelling enough to alter your initial data mining mindset from finding the single best model to finding a collection of really good complementary models.

Of course, this method does involve additional cost both because you need to train a large number of models and because you need to use cross validation to avoid overfitting. However, SAS Viya provides a modern environment that enables you to efficiently handle this computational expense and manage an ensemble workflow by using parallel computation in a distributed framework.

**To learn more, check out our paper, "Stacked Ensemble Models for Improved Prediction Accuracy," and read the SAS Visual Data Mining and Machine Learning documentation.**


The post Which machine learning algorithm should I use? appeared first on Subconscious Musings.

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

- The size, quality, and nature of data.
- The available computational time.
- The urgency of the task.
- What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform best before trying different algorithms. We are not advocating a one-and-done approach, but we do hope to provide some guidance on which algorithms to try first depending on some clear factors.

The **machine learning algorithm cheat sheet** helps you to choose from a variety of machine learning algorithms to find the appropriate algorithm for your specific problems. This article walks you through the process of how to use the sheet.

Since the cheat sheet is designed for beginner data scientists and analysts, we will make some simplified assumptions when talking about the algorithms.

The algorithms recommended here result from compiled feedback and tips from several data scientists and machine learning experts and developers. There are several issues on which we have not reached an agreement and for these issues we try to highlight the commonality and reconcile the difference.

Additional algorithms will be added later as our library grows to encompass a more complete set of available methods.

Read the path and algorithm labels on the chart as "If *<path label>* then use *<algorithm>*." For example:

- If you want to perform dimension reduction then use principal component analysis.
- If you need a numeric prediction quickly, use decision trees or linear regression.
- If you need a hierarchical result, use hierarchical clustering.
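The if-then reading of the chart can be sketched as a simple lookup (paths abbreviated from the examples above; the fallback echoes the advice to try several algorithms):

```python
def recommend(task):
    # "If <path label> then use <algorithm>", per the cheat sheet.
    rules = {
        "dimension reduction": "principal component analysis",
        "hierarchical result": "hierarchical clustering",
    }
    return rules.get(task, "try several algorithms and compare")
```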

Sometimes more than one branch will apply, and other times none of them will be a perfect match. It’s important to remember these paths are intended to be rule-of-thumb recommendations, so some of the recommendations are not exact. Several data scientists I talked with said that the only sure way to find the very best algorithm is to try all of them.

This section provides an overview of the most popular types of machine learning. If you’re familiar with these categories and want to move on to discussing specific algorithms, you can skip this section and go to “When to use specific algorithms” below.

Supervised learning algorithms make predictions based on a set of examples. For example, historical sales can be used to estimate the future prices. With supervised learning, you have an input variable that consists of labeled training data and a desired output variable. You use an algorithm to analyze the training data to learn the function that maps the input to the output. This inferred function maps new, unknown examples by generalizing from the training data to anticipate results in unseen situations.

- **Classification:** When the data are being used to predict a categorical variable, supervised learning is also called classification. This is the case when assigning a label or indicator, such as “dog” or “cat,” to an image. When there are only two labels, this is called binary classification. When there are more than two categories, the problem is called multi-class classification.
- **Regression:** When predicting continuous values, the problem becomes a regression problem.
- **Forecasting:** This is the process of making predictions about the future based on past and present data. It is most commonly used to analyze trends. A common example is estimating next year’s sales based on the sales of the current and previous years.

The challenge with supervised learning is that labeling data can be expensive and time consuming. If labels are limited, you can use unlabeled examples to enhance supervised learning. Because the machine is not fully supervised in this case, we say the machine is semi-supervised. With semi-supervised learning, you use unlabeled examples with a small amount of labeled data to improve the learning accuracy.
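A minimal self-training sketch can make this concrete. Everything here (the toy data, the nearest-centroid classifier, and the pseudo-labeling step) is illustrative, not a prescribed algorithm from the post:

```python
import numpy as np

# Self-training sketch: a nearest-centroid classifier, first fit on a handful
# of labeled points, then refit after pseudo-labeling the unlabeled pool.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)               # ground truth (mostly hidden)

labeled = np.array([0, 1, 100, 101])              # only 4 labeled examples
centroids = np.array([X[labeled][y[labeled] == c].mean(axis=0) for c in (0, 1)])

# Pseudo-label every point with its nearest centroid, then refit the centroids
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
pseudo = d.argmin(axis=1)
centroids = np.array([X[pseudo == c].mean(axis=0) for c in (0, 1)])

accuracy = (pseudo == y).mean()
```

Even with only four labels, the unlabeled examples pull the centroids toward the true cluster means, which is the essence of semi-supervised learning.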

When performing unsupervised learning, the machine is presented with totally unlabeled data. It is asked to discover the intrinsic patterns that underlie the data, such as a clustering structure, a low-dimensional manifold, or a sparse tree or graph.

- **Clustering:** Grouping a set of data examples so that examples in one group (or one cluster) are more similar (according to some criterion) than those in other groups. This is often used to segment the whole dataset into several groups. Analysis can then be performed within each group to help users find intrinsic patterns.
- **Dimension reduction:** Reducing the number of variables under consideration. In many applications, the raw data have very high-dimensional features, and some features are redundant or irrelevant to the task. Reducing the dimensionality helps to find the true, latent relationship.

Reinforcement learning analyzes and optimizes the behavior of an agent based on feedback from the environment. Machines try different scenarios to discover which actions yield the greatest reward, rather than being told which actions to take. Trial and error and delayed reward distinguish reinforcement learning from other techniques.
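A tiny epsilon-greedy bandit illustrates the trial-and-error idea; the payout probabilities and exploration rate below are made up for the example:

```python
import numpy as np

# Epsilon-greedy multi-armed bandit: the agent learns which action pays best
# purely from reward feedback, never being told the true payout rates.
rng = np.random.default_rng(7)
true_payout = np.array([0.2, 0.5, 0.8])           # unknown to the agent
counts = np.zeros(3)
values = np.zeros(3)                              # running reward estimate per arm

for t in range(2000):
    if rng.random() < 0.1:                        # explore a random arm
        a = rng.integers(3)
    else:                                         # exploit the current best estimate
        a = int(values.argmax())
    reward = float(rng.random() < true_payout[a])
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a] # incremental mean update

best_arm = int(values.argmax())
```

After enough trials the estimated values concentrate around the true payouts, so the agent settles on the highest-paying arm.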

When choosing an algorithm, always take these aspects into account: accuracy, training time and ease of use. Many users put accuracy first, while beginners tend to focus on algorithms they know best.

When presented with a dataset, the first thing to consider is how to obtain results, no matter what those results might look like. Beginners tend to choose algorithms that are easy to implement and can obtain results quickly. This works fine, as long as it is just the first step in the process. Once you obtain some results and become familiar with the data, you may spend more time using more sophisticated algorithms to strengthen your understanding of the data, hence further improving the results.

Even in this stage, the best algorithms might not be the methods that have achieved the highest reported accuracy, as an algorithm usually requires careful tuning and extensive training to obtain its best achievable performance.

Looking more closely at individual algorithms can help you understand what they provide and how they are used. These descriptions provide more details and give additional tips for when to use specific algorithms, in alignment with the cheat sheet.

Linear regression is an approach for modeling the relationship between a continuous dependent variable \(y\) and one or more predictors \(X\). The relationship between \(y\) and \(X\) can be modeled linearly as \(y=\beta^TX+\epsilon\). Given the training examples \(\{x_i,y_i\}_{i=1}^N\), the parameter vector \(\beta\) can be learned.
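As a hedged sketch (the toy data and coefficients are invented for illustration), the least-squares estimate of \(\beta\) takes only a few lines of NumPy:

```python
import numpy as np

# Toy data from y = 2*x1 - x2 + 3 + noise (coefficients are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + 3 + 0.01 * rng.normal(size=100)

# Append a column of ones so the intercept is part of beta
X1 = np.column_stack([X, np.ones(len(X))])

# Least-squares estimate of beta in y = X beta + eps
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
```

With low noise, `beta` recovers the generating coefficients almost exactly.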

If the dependent variable is not continuous but categorical, linear regression can be transformed into logistic regression using a logit link function. Logistic regression is a simple, fast, yet powerful classification algorithm. Here we discuss the binary case, where the dependent variable \(y\) takes only the binary values \(\{y_i\in\{-1,1\}\}_{i=1}^N\) (this can easily be extended to multi-class classification problems).

In logistic regression we use a different hypothesis class to try to predict the probability that a given example belongs to the "1" class versus the probability that it belongs to the "-1" class. Specifically, we will try to learn a function of the form \(p(y_i=1|x_i )=\sigma(\beta^T x_i )\) and \(p(y_i=-1|x_i )=1-\sigma(\beta^T x_i )\), where \(\sigma(x)=\frac{1}{1+\exp(-x)}\) is a sigmoid function. Given the training examples \(\{x_i,y_i\}_{i=1}^N\), the parameter vector \(\beta\) can be learned by maximizing the log-likelihood of \(\beta\) given the data set.
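The maximum-likelihood fit can be sketched with plain gradient ascent; the toy data, step size, and iteration count below are illustrative choices, not prescriptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data with labels in {-1, +1}
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Maximize the log-likelihood sum_i log sigma(y_i * beta^T x_i) by gradient ascent;
# the gradient is sum_i y_i x_i sigma(-y_i * beta^T x_i)
beta = np.zeros(2)
for _ in range(500):
    grad = (y * sigmoid(-y * (X @ beta))) @ X
    beta += 0.1 * grad / len(X)

pred = np.where(sigmoid(X @ beta) > 0.5, 1, -1)
accuracy = (pred == y).mean()
```

The learned `beta` aligns with the separating direction, so the predicted class probabilities match the labels.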

A support vector machine (SVM) training algorithm finds the classifier represented by the normal vector \(w\) and bias \(b\) of a hyperplane. This hyperplane (boundary) separates different classes by as wide a margin as possible. The problem can be converted into a constrained optimization problem:

\begin{equation*}
\begin{aligned}
& \underset{w}{\text{minimize}} & & \|w\| \\
& \text{subject to} & & y_i(w^T x_i - b) \geq 1, \; i = 1, \ldots, n.
\end{aligned}
\end{equation*}


When the classes are not linearly separable, a kernel trick can be used to map a non-linearly separable space into a higher dimension linearly separable space.
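The kernel idea can be illustrated with an explicit feature map instead of an implicit kernel; the ring-shaped toy data and the chosen map are assumptions made for this sketch:

```python
import numpy as np

# Two concentric rings: not linearly separable in the original 2-D space
rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, size=200)
r = np.where(np.arange(200) < 100, 1.0, 3.0)      # inner class vs. outer class
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.where(r < 2.0, -1, 1)

# Explicit feature map phi(x) = (x1, x2, x1^2 + x2^2): in the lifted space a
# plane separates the rings. A kernel SVM achieves this implicitly.
phi = np.column_stack([X, (X ** 2).sum(axis=1)])
pred = np.where(phi[:, 2] > 4.0, 1, -1)           # plane z = 4 in the lifted space
accuracy = (pred == y).mean()
```

Thresholding the lifted third coordinate separates the two rings perfectly, which no line in the original plane can do.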

When most of the independent variables are numeric, logistic regression and SVM should be the first try for classification. These models are easy to implement, their parameters are easy to tune, and their performance is also pretty good. So these models are appropriate for beginners.

Decision trees, random forests and gradient boosting are all algorithms based on decision trees. There are many variants of decision trees, but they all do the same thing: subdivide the feature space into regions with mostly the same label. Decision trees are easy to understand and implement. However, they tend to overfit the data when we exhaust the branches and go very deep with the trees. Random forests and gradient boosting are two popular ways to use tree algorithms to achieve good accuracy while overcoming the overfitting problem.
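The bagging idea behind random forests can be sketched with decision stumps; the data, the number of stumps, and the stump learner itself are simplified stand-ins for real trees:

```python
import numpy as np

def fit_stump(X, y):
    """Find the single-feature threshold split that best separates the labels."""
    best, best_acc = (0, 0.0, 1), 0.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                acc = (np.where(X[:, j] > t, sign, -sign) == y).mean()
                if acc > best_acc:
                    best_acc, best = acc, (j, t, sign)
    return best

def stump_predict(stump, X):
    j, t, sign = stump
    return np.where(X[:, j] > t, sign, -sign)

# A toy "forest" of stumps: each learner sees a bootstrap resample,
# and the ensemble takes a majority vote.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))
y = np.where(X[:, 0] > 0.2, 1, -1)

stumps = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.sum([stump_predict(s, X) for s in stumps], axis=0)
forest_pred = np.where(votes >= 0, 1, -1)
accuracy = (forest_pred == y).mean()
```

Averaging many slightly different learners trained on resampled data is what keeps the ensemble from overfitting the way a single deep tree would.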

Neural networks flourished in the mid-1980s due to their parallel and distributed processing ability. But research in this field was impeded by the ineffectiveness of the back-propagation training algorithm that is widely used to optimize the parameters of neural networks. Support vector machines (SVM) and other simpler models, which can be easily trained by solving convex optimization problems, gradually replaced neural networks in machine learning.

In recent years, new and improved training techniques such as unsupervised pre-training and layer-wise greedy training have led to a resurgence of interest in neural networks. Increasingly powerful computational capabilities, such as graphics processing units (GPUs) and massively parallel processing (MPP), have also spurred the revived adoption of neural networks. The resurgent research in neural networks has given rise to models with thousands of layers.

In other words, shallow neural networks have evolved into deep learning neural networks. Deep neural networks have been very successful for supervised learning. When used for speech and image recognition, deep learning performs as well as, or even better than, humans. Applied to unsupervised learning tasks, such as feature extraction, deep learning also extracts features from raw images or speech with much less human intervention.

A neural network consists of three parts: input layer, hidden layers and output layer. The training samples define the input and output layers. When the output layer is a categorical variable, then the neural network is a way to address classification problems. When the output layer is a continuous variable, then the network can be used to do regression. When the output layer is the same as the input layer, the network can be used to extract intrinsic features. The number of hidden layers defines the model complexity and modeling capacity.
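A forward pass through a one-hidden-layer network makes the three-part structure concrete; the layer sizes and random weights here are placeholders (a real network would learn them by back-propagation):

```python
import numpy as np

# Forward pass of a one-hidden-layer network: input -> hidden -> output
rng = np.random.default_rng(4)
n_in, n_hidden, n_out = 4, 8, 3                   # illustrative sizes

W1 = rng.normal(scale=0.5, size=(n_in, n_hidden)) # input-to-hidden weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, n_out))# hidden-to-output weights
b2 = np.zeros(n_out)

def forward(x):
    h = np.tanh(x @ W1 + b1)                      # hidden activations
    z = h @ W2 + b2
    e = np.exp(z - z.max())                       # softmax for a categorical output
    return e / e.sum()

probs = forward(rng.normal(size=n_in))
```

Swapping the softmax output for a linear output would turn the same structure into a regression network, as the paragraph above describes.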

K-means, k-modes and Gaussian mixture model (GMM) clustering aim to partition n observations into k clusters. K-means defines a hard assignment: each sample is associated with one and only one cluster. GMM, however, defines a soft assignment: each sample has a probability of being associated with each cluster. Both algorithms are simple and fast enough for clustering when the number of clusters k is given.
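The hard-versus-soft distinction can be sketched in a few lines; the toy blobs, the equal-variance Gaussian responsibilities, and the fixed iteration count are illustrative simplifications of full k-means/GMM implementations:

```python
import numpy as np

# Minimal Lloyd's k-means on two well-separated toy blobs
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
k = 2
centers = np.array([X[0], X[-1]])                 # one seed per blob (illustrative)

for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    hard = d.argmin(axis=1)                       # k-means: hard assignment
    centers = np.array([X[hard == j].mean(axis=0) for j in range(k)])

# A GMM-style soft assignment: responsibilities from equal-variance Gaussians
d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
resp = np.exp(-0.5 * d ** 2)
resp /= resp.sum(axis=1, keepdims=True)           # each row sums to 1
```

Each row of `hard` names exactly one cluster, while each row of `resp` spreads a unit of probability across all of them.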


When the number of clusters k is not given, DBSCAN (density-based spatial clustering of applications with noise) can be used instead, connecting samples through density diffusion.

Hierarchical clustering builds partitions that can be visualized using a tree structure (a dendrogram). It does not need the number of clusters as an input, and the partitions can be viewed at different levels of granularity (that is, clusters can be refined or coarsened) by cutting the tree at different levels.
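A naive single-linkage sketch shows how cutting the merge process at a chosen level yields a partition; the toy blobs and the deliberately simple O(n³) implementation are for illustration only:

```python
import numpy as np

# Naive single-linkage agglomerative clustering: start with every point as its
# own cluster and repeatedly merge the closest pair of clusters.
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.2, (10, 2)),
               rng.normal(5, 0.2, (10, 2)),
               rng.normal(10, 0.2, (10, 2))])

clusters = [[i] for i in range(len(X))]
while len(clusters) > 3:                          # "cut the dendrogram" at k = 3
    best, pair = np.inf, None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(np.linalg.norm(X[i] - X[j])   # single linkage: closest pair
                    for i in clusters[a] for j in clusters[b])
            if d < best:
                best, pair = d, (a, b)
    a, b = pair
    clusters[a].extend(clusters.pop(b))

sizes = sorted(len(c) for c in clusters)
```

Stopping the merges earlier or later produces finer or coarser partitions of the same tree.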

We generally do not want to feed a large number of features directly into a machine learning algorithm, since some features may be irrelevant or the “intrinsic” dimensionality may be smaller than the number of features. Principal component analysis (PCA), singular value decomposition (SVD) and latent Dirichlet allocation (LDA) can all be used to perform dimension reduction.

PCA is an unsupervised dimension reduction method that maps the original data space into a lower-dimensional space while preserving as much information as possible. PCA finds the subspace that best preserves the data variance, with the subspace defined by the dominant eigenvectors of the data’s covariance matrix.
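A PCA sketch via the covariance eigendecomposition (the toy low-rank data are invented for the example):

```python
import numpy as np

# Toy data with true 2-D structure embedded in 5 dimensions plus tiny noise
rng = np.random.default_rng(9)
latent = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 5))
X = latent @ A + 0.01 * rng.normal(size=(200, 5))

Xc = X - X.mean(axis=0)                           # center the data
cov = Xc.T @ Xc / (len(X) - 1)                    # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order

top2 = eigvecs[:, -2:]                            # dominant eigenvectors
Z = Xc @ top2                                     # 2-D projection of the data
explained = eigvals[-2:].sum() / eigvals.sum()    # variance preserved
```

Because the data are nearly rank 2, the two dominant components preserve almost all of the variance.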

The SVD is related to PCA in the sense that SVD of the centered data matrix (features versus samples) provides the dominant left singular vectors that define the same subspace as found by PCA. However, SVD is a more versatile technique as it can also do things that PCA may not do. For example, the SVD of a user-versus-movie matrix is able to extract the user profiles and movie profiles which can be used in a recommendation system. In addition, SVD is also widely used as a topic modeling tool, known as latent semantic analysis, in natural language processing (NLP).
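A small made-up user-versus-movie matrix illustrates the idea; the ratings are invented, and a rank-1 truncation stands in for a real recommender:

```python
import numpy as np

# Tiny user-versus-movie rating matrix (rows = users, columns = movies);
# users 1 and 2 share a taste profile, user 3 has a different one.
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)  # singular values, descending
R1 = s[0] * np.outer(U[:, 0], Vt[0])              # best rank-1 approximation
err = np.linalg.norm(R - R1) / np.linalg.norm(R)  # relative reconstruction error
```

The dominant singular vectors act as user and movie "profiles": the rank-1 term already captures the shared taste of the first two users.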

A related technique in NLP is latent Dirichlet allocation (LDA). LDA is a probabilistic topic model: it decomposes documents into topics in a similar way that a Gaussian mixture model (GMM) decomposes continuous data into Gaussian densities. Unlike the GMM, LDA models discrete data (words in documents), and it constrains the topics to be *a priori* distributed according to a Dirichlet distribution.
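The effect of the Dirichlet prior can be seen directly by sampling topic mixtures; the concentration values below are illustrative:

```python
import numpy as np

# Each draw from a Dirichlet is a point on the probability simplex, i.e. a
# document's topic mixture. Small alpha favors sparse (peaked) mixtures;
# large alpha favors even mixtures.
rng = np.random.default_rng(10)
sparse = rng.dirichlet(alpha=[0.1, 0.1, 0.1], size=5)   # peaked topic mixtures
flat = rng.dirichlet(alpha=[10.0, 10.0, 10.0], size=5)  # even topic mixtures
```

This is the prior that encourages each document to concentrate on a few topics rather than spreading evenly over all of them.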

The workflow above is easy to follow. The takeaway messages when trying to solve a new problem are:

- Define the problem. What problems do you want to solve?
- Start simple. Be familiar with the data and the baseline results.
- Then try something more complicated.

SAS Visual Data Mining and Machine Learning provides a good platform for beginners to learn machine learning and apply machine learning methods to their problems. Sign up for a free trial today!

The post Which machine learning algorithm should I use? appeared first on Subconscious Musings.

The post Model selection for spatial econometrics using PROC SPATIALREG appeared first on Subconscious Musings.

For the purpose of illustration, this post uses the same 2013 North Carolina county-level home value data that was used in the previous post. The data set is named NC_HousePrice and contains five variables: county (county name), Hvalue (median value of owner-occupied housing units), income (median household income in 2013 in inflation-adjusted dollars), bachelor (percentage of people in the county who have a bachelor’s degree or higher), and crime (rate of Crime Index offenses per 100,000 people). Before you proceed with spatial econometric analysis, you need to create a spatial weights matrix. For convenience, consider a first-order contiguity matrix W in which two counties are neighbors if they share a common border.

To fit a spatial Durbin model (SDM) to the NC_HousePrice data, you can submit the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model Hvalue=Income Crime Bachelor/type=SAR;
spatialeffects Income Crime Bachelor;
test _rho=0/all;
spatialid County;
run;
```

You supply two data sets—the primary data set and a spatial weights matrix—by using the DATA= option and the WMAT= option, respectively. The primary data set contains the dependent variable, the independent variables, and possibly the spatial ID variable. In the MODEL statement, you specify the dependent variable y and regressors x. You use the TYPE= option to specify the type of model to be fit, selecting one of the following values: SAR, SEM, SMA, SARMA, SAC, and LINEAR. For example, you specify TYPE=SAR to fit a SAR model and TYPE=LINEAR to fit a linear regression model. You use the SPATIALEFFECTS statement to specify exogenous interaction effects. In the preceding statements, you specify TYPE=SAR together with the SPATIALEFFECTS statement to fit an SDM model. The TEST statement in PROC SPATIALREG enables you to perform hypothesis testing. The SPATIALREG procedure supports three different tests: likelihood ratio (LR), Wald, and Lagrange multiplier (LM). The SPATIALID statement enables you to specify a spatial ID variable to identify observations in the two data sets that are specified in the DATA= and WMAT= options.

For the SDM model fitted to the NC_HousePrice data, the value of Akaike’s information criterion (AIC) is –144.96. The results of parameter estimation from the SDM model, shown in Table 1, suggest that three predictors—income, crime, and bachelor—are all significant at the 0.05 level. The spatial correlation coefficient ρ is estimated to be 0.31 and is significant at the 0.05 level. Table 2 shows the test results for H0: ρ = 0 from the three tests. According to the test results, you can conclude that H0 should be rejected at the 5% significance level. In other words, there is a significantly positive spatial correlation in house price.

To fit a spatial error model (SEM) to the NC_HousePrice data, you can submit the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model Hvalue=Income Crime Bachelor/type=SEM;
spatialid County;
run;
```

The results of parameter estimation from the SEM model, shown in Table 3, suggest that two out of three predictors—income and bachelor—are significant at the 0.05 level. The value of AIC for this model is –122.68. The spatial correlation coefficient λ is estimated to be 0.60 and is significant at the 0.05 level. As a result, there seems to be a significant positive correlation in the disturbance.

So far, SDM and SEM models have been fitted to the NC_HousePrice data. The SDM model is capable of accounting for both endogenous and exogenous interaction effects, whereas the SEM model can account for spatial dependence in the disturbance term. The comparison of AIC values between these two models suggests that the SDM model is better than the SEM model because it has a smaller AIC value. However, you might want to try a more complicated model—such as a spatial autoregressive combined (SAC) model—that can address endogenous interaction effects, exogenous interaction effects, and spatial dependence in the disturbance term.

You can fit an SAC model to the NC_HousePrice data by submitting the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model Hvalue=Income Crime Bachelor/type=SAC;
spatialid County;
run;
```

Table 4 shows the results of parameter estimation from the SAC model. As in the SEM model, only two predictors—income and bachelor—are significant at the 0.05 level. The value of AIC for this model is –121.43. The spatial correlation coefficient ρ is estimated to be –0.10, but it is not significant at the 0.05 level. However, the spatial correlation coefficient λ is estimated to be 0.67 and is significant at the 0.05 level.

Among the three models that have been considered for the NC_HousePrice data, the SDM model is the one with the smallest value of AIC. As a result, if you have to choose the best model among SDM, SEM, and SAC models according to AIC, the SDM model would be the winning model. Since model selection is common in most data analysis, PROC SPATIALREG facilitates model selection by enabling you to use multiple MODEL statements to fit more than one model at a time. For example, you can fit the preceding three models to NC_HousePrice data by submitting the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model Hvalue=Income Crime Bachelor/type=SAR;
spatialeffects Income Crime Bachelor;
test _rho=0/all;
model Hvalue=Income Crime Bachelor/type=SEM;
model Hvalue=Income Crime Bachelor/type=SAC;
spatialid County;
run;
```

This post introduces how to fit spatial econometric models that are available in the SPATIALREG procedure. These models include the spatial Durbin model (SDM), the spatial error model (SEM), and the spatial autoregressive combined (SAC) model. It also describes how you can use multiple MODEL statements in a single call to PROC SPATIALREG to facilitate model selection. In the next blog post, we’ll talk more about how to create spatial weights matrices for spatial econometric analysis. We will also be giving a talk, "Big Value from Big Data: SAS/ETS® Software for Spatial Econometric Modeling in the Era of Big Data," at the SAS Global Forum conference April 2-5, 2017, in Orlando. Stop by and let's talk about spatial econometrics!


The post Women in analytics - a personal perspective appeared first on Subconscious Musings.

**Getting an early start in Analytics**

How do you encourage a young girl to pursue a career that requires mathematical or scientific skills? How do you react to your child’s interest in mathematics? Do you react differently depending on whether the child is a boy or a girl? Why? Encourage his or her interest, and ensure that the school is reinforcing that message as well. From my childhood, my parents (especially my dad, who started his career as a math professor) were very proud that I was good at math – no one ever commented on the fact that I was also a girl! When I started high school and ended up being the only girl in the math class, we took it in stride and did not make a big deal of it – Radhika likes math and she is in the math class. That was it!

One way to interest young girls in math-related fields is by providing examples of real-world applications so they can understand the value. In today’s world, every field is rich with applications of mathematical modeling, so it is easier to capture students’ interest. Another avenue is to encourage your child to participate in competitions, especially ones involving teams. Girls often like working in teams, which can also be good training for their professional careers later in life.

And, of course, it is very important for the young girls to have role models of women in analytics who provide examples of successful careers in these fields.

**Personal experience**

I had a unique experience during my Master’s program in Mathematics at the Indian Institute of Technology in Delhi. It was amazing that in 1975, my class had 7 women out of a small class of 14 students! Several of us women have gone on to have very successful professional careers in highly technical fields – in fact, three of us are now in the U.S. Likewise, my PhD class in Operations Research at Cornell had five women out of a total of about a dozen students. With that kind of an experience of women in analytics, I have never noticed that there may be more men than women in my field. We are all just members of the analytics community!

**Career Growth**

Women are hesitant to talk about their strengths and proactively seek promotions. Be bolder in seeking out new opportunities. Of course, you need to be qualified for the role! Be confident of your strengths. If you have a strong foundation and are the expert in your area, your being a woman in analytics is irrelevant. People will recognize you as the expert immediately. Be bold and take a seat at the table. If you have been invited to participate, it is because you are recognized as someone who can contribute – use that opportunity to do so.

**Women in Analytics professions**

Within SAS Institute, we have several women in analytics at all levels – from individual contributors who are recognized throughout the institute as “the expert in a particular area” to senior managers who are responsible for key flagship analytical products from SAS. Several of them play important roles in leading professional organizations in addition to their responsibility at SAS. As a percentage of the workforce, we may have fewer women than men in the industry as a whole. However, there are several women in leading roles in analytics, both as leaders in major multinational companies (IBM, Ford, Verizon) and at the helm of professional organizations like the Institute for Operations Research and Management Science (INFORMS), the American Statistical Association, etc.

Especially in the last few years, women in analytics have made great strides in leading large organizations. For example, Dr. Pooja Dewan was named Chief Data Scientist at BNSF, Malene Haxholdt is VP of Enterprise Analytics at MetLife, Dr. Nipa Basu is Chief Analytics Officer at Dun & Bradstreet, and so on.

There is a geographic differential in the representation of women in technology. Some of the research from the UK indicates that women in India seem to see IT/STEM as empowering in a way that women in the UK or US do not. There is also research suggesting that while women may not be doing well in terms of increased numbers in computer science or math, the discipline of statistics is an exception: “More than 40 percent of degrees in statistics go to women, and they make up 40 percent of the statistics department faculty poised to move into tenured positions.”

**Increasing interest in analytics for women**

As analytics and data science become more ubiquitous in several industries, we are seeing an increase in the number of women in Analytics. There are a few key reasons for this increase:

- There are many application areas for analytics that are attractive to women: health care, education, non-profit work, and projects aimed at doing social good. There is ample evidence that women are more drawn to opportunities to make a social impact.
- More companies are providing flexibility that helps women get back or stay in the work force. These include benefits like flexible work schedules, more “work from home” options, family medical leave options, more options for day care, support for nursing mothers, and so on.
- Many educational opportunities are available for women to get trained in data science and analytics arenas through online master’s programs or certification programs. For women who have a STEM-related undergraduate education, this provides an easy entry into the analytics domain. There is an interesting article describing how data science is creating opportunities for women. It is exciting to see the many women leaders who participated in the recent Women in Data Science Conference where the elite panel of speakers were all women!
- Many collaborative, team competitions (for example, the DataDive at SAS being held in partnership with DataKind) are being arranged across the data science and analytics domain – such collaborative, problem solving events may be particularly appealing to women.

**Challenges and opportunities being a woman in this highly competitive field**

It is well recognized that there is a confidence issue that plagues women in fields that are dominated by men. A Harvard study presented the result: “Female computer science concentrators with eight years of programming experience report being as confident in their skills as their male peers with zero to one year of programming experience.” I can relate to that! Most women I know will speak up only if they are confident that their points are thoroughly researched and vetted. We need to encourage them to participate freely in any dialog and discussion. Women need to be reminded that they have been invited to be in the group / discussion because their opinions are valued.

At the same time, women have some inherent strengths that they bring to the table. We have an ability to understand others’ points of view, which makes it possible to have productive discussions over contentious topics. We also have a capacity to nurture, which is useful in growing a team of very talented individuals who are often brilliant but may not be used to working as part of a high-achieving team. One of the most important skills for a successful leader is the ability to see the big picture (remember the tale of the “Six Blind Men and the Elephant”?). I believe that women are more likely to understand the big picture because of their natural empathy for others’ inputs.

My advice to young women entering the field of analytics: Do not hold yourself back because you are a woman; you have earned the right to be in this area. Use your strengths to build a strong team by nurturing everyone’s talents. This is a golden age to be part of this domain!

*Image credit: photo by Mike Kline // Creative Commons attribution*

Note: I prepared these thoughts related to an interview in Analytics India Magazine: International Women's Day special.

