The post Why do stacked ensemble models win data science competitions? appeared first on Subconscious Musings.

Building powerful ensemble models has many parallels with building successful human teams in business, science, politics, and sports. Each team member makes a significant contribution and individual weaknesses and biases are offset by the strengths of other members.

The simplest kind of ensemble is the unweighted average of the predictions of the models that form a model library. For example, if a model library includes three models for an interval target (as shown in the following figure), the unweighted average would entail dividing the sum of the predicted values of the three candidate models by three. In an unweighted average, each model takes the same weight when an ensemble model is built.
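To make the arithmetic concrete, here is a minimal sketch in Python; the three models' predicted values are hypothetical:

```python
# Unweighted ensemble average for an interval (numeric) target.
def unweighted_average(*model_predictions):
    """Average the predictions of several models, observation by observation."""
    n_models = len(model_predictions)
    return [sum(preds) / n_models for preds in zip(*model_predictions)]

# Hypothetical predictions of three candidate models for four observations:
model_a = [10.0, 12.0, 9.0, 11.0]
model_b = [11.0, 14.0, 8.0, 10.0]
model_c = [12.0, 13.0, 10.0, 12.0]

ensemble = unweighted_average(model_a, model_b, model_c)
print(ensemble)  # [11.0, 13.0, 9.0, 11.0]
```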

More generally, you can think about using *weighted* averages. For example, you might believe that some of the models are better or more accurate and you want to manually assign higher weights to them. But an even better approach might be to estimate these weights more intelligently by using another layer of learning algorithm. This approach is called model stacking.

__Model stacking__ is an efficient ensemble method in which the predictions that are generated by using various machine learning algorithms are used as inputs in a second-layer learning algorithm. This second-layer algorithm is trained to optimally combine the model predictions to form a new set of predictions. For example, when linear regression is used as the second-layer model, it estimates these weights by minimizing the sum of squared errors. However, second-layer modeling is not restricted to linear models; the relationship between the predictors can be more complex, opening the door to employing other machine learning algorithms.
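As a sketch of that linear second layer, the following pure-Python example solves the least-squares normal equations for the stacking weights of two base models; the target and prediction values are hypothetical:

```python
# Second-layer stacking via least squares: learn weights for two base models.
def stack_weights(p1, p2, y):
    """Solve the 2x2 normal equations minimizing ||y - w1*p1 - w2*p2||^2."""
    a = sum(v * v for v in p1)
    b = sum(u * v for u, v in zip(p1, p2))
    c = sum(v * v for v in p2)
    d = sum(u * v for u, v in zip(p1, y))
    e = sum(u * v for u, v in zip(p2, y))
    det = a * c - b * b
    return (d * c - b * e) / det, (a * e - b * d) / det

y  = [3.0, 5.0, 7.0]   # actual target values
p1 = [2.0, 4.0, 6.0]   # base model 1 predictions
p2 = [4.0, 6.0, 8.0]   # base model 2 predictions
w1, w2 = stack_weights(p1, p2, y)
print(w1, w2)  # 0.5 0.5 -- here the target is exactly the average of the two
```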

Ensemble modeling and model stacking are especially popular in data science competitions, in which a sponsor posts a training set (which includes labels) and a test set (which does not include labels) and issues a global challenge to produce the best predictions of the test set for a specified performance criterion. The winning teams almost always use ensemble models instead of a single fine-tuned model. Often, individual teams develop their own ensemble models in the early stages of the competition, and then join forces in the later stages.

On the popular data science competition site Kaggle, you can explore numerous winning solutions through its discussion forums to get a flavor of the state of the art. Another popular data science competition is the KDD Cup. The following figure shows the winning solution for the 2015 competition, which used a three-stage stacked modeling approach.

The figure shows that a diverse set of 64 single models was used to build the model library. These models were trained by using various machine learning algorithms. For example, the green boxes represent gradient boosting models (GBM), pink boxes represent neural network models (NN), and orange boxes represent factorization machine models (FM). You can see that there are multiple gradient boosting models in the model library; they probably vary in their use of different hyperparameter settings and/or feature sets.

At stage 1, the predictions from these 64 models are used as inputs to train 15 new models, again by using various machine learning algorithms. At stage 2 (ensemble stacking), the predictions from the 15 stage 1 models are used as inputs to train two models by using gradient boosting and linear regression. At stage 3 ensemble stacking (the final stage), the predictions of the two models from stage 2 are used as inputs in a logistic regression (LR) model to form the final ensemble.

In order to build a powerful predictive model like the one that won the 2015 KDD Cup, __building a diverse set of initial models__ plays an important role! There are various ways to enhance diversity, such as using:

- Different training algorithms.
- Different hyperparameter settings.
- Different feature subsets.
- Different training sets.

A simple way to enhance diversity is to train models by using different machine learning algorithms. For example, adding a factorization model to a set of tree-based models (such as random forests and gradient boosting) provides nice diversity, because a factorization model is trained very differently from decision tree models. For the same machine learning algorithm, you can enhance diversity by using different hyperparameter settings and subsets of variables. If you have many features, one efficient method is to choose subsets of the variables by simple random sampling. Choosing subsets of variables could also be done in a more principled fashion based on some computed measure of importance, although that introduces the large and difficult problem of feature selection.
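A minimal sketch of the simple-random-sampling idea, with hypothetical feature names:

```python
import random

# Draw a few random feature subsets to diversify a model library.
# The feature names below are hypothetical.
random.seed(42)
features = ["income", "age", "tenure", "region", "usage", "churn_calls"]

def random_subsets(features, n_subsets, subset_size):
    """Simple random sampling of feature subsets (no repeats within a subset)."""
    return [sorted(random.sample(features, subset_size)) for _ in range(n_subsets)]

subsets = random_subsets(features, n_subsets=3, subset_size=4)
for s in subsets:
    print(s)
```

Each subset would then feed one model in the library.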

In addition to using various machine learning training algorithms and hyperparameter settings, the KDD Cup solution shown above uses seven different feature sets (F1-F7) to further enhance the diversity. Another simple way to create diversity is to generate various versions of the training data. This can be done by bagging and cross validation.
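The bootstrap step behind bagging can be sketched as follows, using hypothetical row indices; each replicate samples rows with replacement, so it contains duplicates and omissions:

```python
import random

# Bagging: create bootstrap replicates of the training data (sampling rows
# with replacement), then train one model per replicate.
random.seed(0)
training_rows = list(range(10))  # row indices of a hypothetical training set

def bootstrap_sample(rows):
    return [random.choice(rows) for _ in rows]

bags = [bootstrap_sample(training_rows) for _ in range(3)]
for bag in bags:
    print(sorted(bag))
```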

Overfitting is an omnipresent concern in building predictive models, and every data scientist needs to be equipped with tools to deal with it. An overfit model is complex enough to fit the training data perfectly, but it generalizes very poorly to a new data set. Overfitting is an especially big problem in model stacking, because so many predictors that all predict the same target are combined. Overfitting is partially caused by this collinearity between the predictors.

The most efficient techniques for training models (especially during the stacking stages) include using cross validation and some form of regularization. To learn how we used these techniques to build stacked ensemble models, see our recent SAS Global Forum paper, *"Stacked Ensemble Models for Improved Prediction Accuracy."* That paper also shows how you can generate a diverse set of models by various methods (such as forests, gradient boosted decision trees, factorization machines, and logistic regression) and then combine them with stacked ensemble techniques such as regularized regression methods, gradient boosting, and hill climbing methods.
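To illustrate why cross validation helps in the stacking stage, the sketch below produces out-of-fold first-level predictions: each observation is predicted by a model that never saw it during training. The "model" here is a trivial mean predictor, purely for illustration:

```python
# Cross-validated (out-of-fold) predictions for a stacking layer.
def out_of_fold_predictions(y, k=3):
    n = len(y)
    preds = [None] * n
    for fold in range(k):
        holdout = range(fold, n, k)                    # every k-th observation
        train = [y[i] for i in range(n) if i % k != fold]
        fit = sum(train) / len(train)                  # "train" the mean predictor
        for i in holdout:
            preds[i] = fit                             # predict held-out rows only
    return preds

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical target values
oof = out_of_fold_predictions(y, k=3)
print(oof)
```

The second-layer model is then trained on `oof` rather than on in-sample predictions, which removes the main source of stacking overfit.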

The following image provides a simple summary of our ensemble approach. The complete model building approach is explained in detail in the paper. A computationally intense process such as this benefits greatly from running in the distributed execution environment offered in the SAS® Viya platform by using SAS® Visual Data Mining and Machine Learning.

Applying stacked models to real-world big data problems can produce greater prediction accuracy and robustness than individual models can. The model stacking approach is powerful and compelling enough to alter your initial data mining mindset from finding the single best model to finding a collection of really good complementary models.

Of course, this method does involve additional cost both because you need to train a large number of models and because you need to use cross validation to avoid overfitting. However, SAS Viya provides a modern environment that enables you to efficiently handle this computational expense and manage an ensemble workflow by using parallel computation in a distributed framework.

**To learn more, check out our paper, "Stacked Ensemble Models for Improved Prediction Accuracy," and read the SAS Visual Data Mining and Machine Learning documentation.**


The post Which machine learning algorithm should I use? appeared first on Subconscious Musings.

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

- The size, quality, and nature of data.
- The available computational time.
- The urgency of the task.
- What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform best before trying different algorithms. We are not advocating a one-and-done approach, but we do hope to provide some guidance on which algorithms to try first, depending on some clear factors.

The **machine learning algorithm cheat sheet** helps you to choose from a variety of machine learning algorithms to find the appropriate algorithm for your specific problems. This article walks you through the process of how to use the sheet.

Since the cheat sheet is designed for beginner data scientists and analysts, we will make some simplified assumptions when talking about the algorithms.

The algorithms recommended here result from compiled feedback and tips from several data scientists, machine learning experts, and developers. There are several issues on which we have not reached agreement; for those issues, we try to highlight the commonalities and reconcile the differences.

Additional algorithms will be added later as our library grows to encompass a more complete set of available methods.

Read the path and algorithm labels on the chart as "If *<path label>* then use *<algorithm>*." For example:

- If you want to perform dimension reduction then use principal component analysis.
- If you need a numeric prediction quickly, use decision trees or logistic regression.
- If you need a hierarchical result, use hierarchical clustering.

Sometimes more than one branch will apply, and other times none of them will be a perfect match. It’s important to remember these paths are intended to be rule-of-thumb recommendations, so some of the recommendations are not exact. Several data scientists I talked with said that the only sure way to find the very best algorithm is to try all of them.

This section provides an overview of the most popular types of machine learning. If you’re familiar with these categories and want to move on to discussing specific algorithms, you can skip this section and go to “When to use specific algorithms” below.

Supervised learning algorithms make predictions based on a set of examples. For example, historical sales can be used to estimate future prices. With supervised learning, you have an input variable that consists of labeled training data and a desired output variable. You use an algorithm to analyze the training data to learn the function that maps the input to the output. This inferred function maps new, unknown examples by generalizing from the training data to anticipate results in unseen situations.

- **Classification:** When the data are being used to predict a categorical variable, supervised learning is also called classification. This is the case when assigning a label or indicator, either "dog" or "cat," to an image. When there are only two labels, this is called binary classification. When there are more than two categories, the problem is called multi-class classification.
- **Regression:** When predicting continuous values, the problem becomes a regression problem.
- **Forecasting:** This is the process of making predictions about the future based on past and present data. It is most commonly used to analyze trends. A common example is estimating next year's sales based on the sales of the current year and previous years.

The challenge with supervised learning is that labeling data can be expensive and time-consuming. If labels are limited, you can use unlabeled examples to enhance supervised learning. Because the machine is not fully supervised in this case, we say the machine is semi-supervised. With semi-supervised learning, you use unlabeled examples with a small amount of labeled data to improve the learning accuracy.

When performing unsupervised learning, the machine is presented with totally unlabeled data. It is asked to discover the intrinsic patterns that underlie the data, such as a clustering structure, a low-dimensional manifold, or a sparse tree or graph.

- **Clustering:** Grouping a set of data examples so that examples in one group (or one cluster) are more similar to each other (according to some criterion) than they are to examples in other groups. This is often used to segment the whole data set into several groups. Analysis can then be performed within each group to help users find intrinsic patterns.
- **Dimension reduction:** Reducing the number of variables under consideration. In many applications, the raw data have very high-dimensional features, and some features are redundant or irrelevant to the task. Reducing the dimensionality helps to find the true, latent relationships.

Reinforcement learning analyzes and optimizes the behavior of an agent based on the feedback from the environment. Machines try different scenarios to discover which actions yield the greatest reward, rather than being told which actions to take. Trial and error and delayed reward distinguish reinforcement learning from other techniques.

When choosing an algorithm, always take these aspects into account: accuracy, training time and ease of use. Many users put the accuracy first, while beginners tend to focus on algorithms they know best.

When presented with a dataset, the first thing to consider is how to obtain results, no matter what those results might look like. Beginners tend to choose algorithms that are easy to implement and can obtain results quickly. This works fine, as long as it is just the first step in the process. Once you obtain some results and become familiar with the data, you may spend more time using more sophisticated algorithms to strengthen your understanding of the data, hence further improving the results.

Even in this stage, the best algorithms might not be the methods that have achieved the highest reported accuracy, as an algorithm usually requires careful tuning and extensive training to obtain its best achievable performance.

Looking more closely at individual algorithms can help you understand what they provide and how they are used. These descriptions provide more details and give additional tips for when to use specific algorithms, in alignment with the cheat sheet.

Linear regression is an approach for modeling the relationship between a continuous dependent variable \(y\) and one or more predictors \(X\). The relationship between \(y\) and \(X\) can be modeled linearly as \(y=\beta^T X+\epsilon\). Given the training examples \(\{x_i,y_i\}_{i=1}^N\), the parameter vector \(\beta\) can be learnt.
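For the one-predictor case, the least-squares estimate of \(\beta\) has a simple closed form, sketched here on hypothetical data:

```python
# One-predictor least squares: estimate (slope, intercept) in
# y = slope*x + intercept + eps, in closed form.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # covariance term
    sxx = sum((a - mx) ** 2 for a in x)                   # variance term
    slope = sxy / sxx
    return slope, my - slope * mx

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1
slope, intercept = fit_line(x, y)
print(slope, intercept)  # 2.0 1.0
```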

If the dependent variable is not continuous but categorical, linear regression can be transformed into logistic regression by using a logit link function. Logistic regression is a simple, fast, yet powerful classification algorithm. Here we discuss the binary case, in which the dependent variable \(y\) takes only two values, \(\{y_i\in\{-1,1\}\}_{i=1}^N\) (this can easily be extended to multi-class classification problems).

In logistic regression we use a different hypothesis class to try to predict the probability that a given example belongs to the "1" class versus the probability that it belongs to the "-1" class. Specifically, we try to learn a function of the form \(p(y_i=1|x_i )=\sigma(\beta^T x_i )\) and \(p(y_i=-1|x_i )=1-\sigma(\beta^T x_i )\), where \(\sigma(x)=\frac{1}{1+\exp(-x)}\) is a sigmoid function. Given the training examples \(\{x_i,y_i\}_{i=1}^N\), the parameter vector \(\beta\) can be learnt by maximizing the log-likelihood of \(\beta\) given the data set.
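A minimal sketch of this maximum-likelihood fit, using plain gradient ascent on a tiny hypothetical 1-D data set (not a production-grade optimizer):

```python
import math

# Logistic regression with labels in {-1, +1}, fit by gradient ascent on the
# log-likelihood sum_i log sigma(y_i * beta^T x_i). The intercept is folded
# in as a constant feature.
def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    beta = [0.0, 0.0]                      # [weight, intercept]
    for _ in range(steps):
        g = [0.0, 0.0]
        for x, y in zip(xs, ys):
            margin = y * (beta[0] * x + beta[1])
            scale = y * sigma(-margin)     # gradient of log sigma(margin)
            g[0] += scale * x
            g[1] += scale
        beta = [b + lr * gi for b, gi in zip(beta, g)]
    return beta

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]                        # negative x -> -1, positive x -> +1
beta = fit_logistic(xs, ys)
# The fitted model classifies all training points correctly:
print([1 if beta[0] * x + beta[1] > 0 else -1 for x in xs])
```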

A support vector machine (SVM) training algorithm finds the classifier that is represented by the normal vector \(w\) and bias \(b\) of a hyperplane. This hyperplane (boundary) separates different classes by as wide a margin as possible. The problem can be converted into a constrained optimization problem:

\begin{equation*}
\begin{aligned}
& \underset{w}{\text{minimize}} & & ||w|| \\
& \text{subject to} & & y_i(w^T x_i - b) \geq 1, \; i = 1, \ldots, n.
\end{aligned}
\end{equation*}


When the classes are not linearly separable, a kernel trick can be used to map a non-linearly separable space into a higher-dimensional, linearly separable space.
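The effect of such a mapping can be seen with an explicit feature map on toy 1-D data; the kernel trick avoids computing the map directly, but the geometry is the same:

```python
# The idea behind the kernel trick, shown with an explicit feature map: the
# 1-D points below are not separable by a single threshold on x, but after
# mapping x -> (x, x^2) a threshold on the x^2 coordinate separates them.
def feature_map(x):
    return (x, x * x)

inner = [-1.0, 1.0]     # class +1, surrounded on both sides by class -1
outer = [-3.0, 3.0]     # class -1

# No threshold on x can work: the outer points straddle the inner ones.
assert min(outer) < min(inner) and max(inner) < max(outer)

# In the mapped space, the second coordinate separates the classes:
separable = all(feature_map(x)[1] < 5.0 for x in inner) and \
            all(feature_map(x)[1] > 5.0 for x in outer)
print(separable)  # True
```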

When most dependent variables are numeric, logistic regression and SVM should be the first try for classification. These models are easy to implement, their parameters are easy to tune, and their performance is also pretty good. So these models are appropriate for beginners.

Decision trees, random forests, and gradient boosting are all algorithms based on decision trees. There are many variants of decision trees, but they all do the same thing: subdivide the feature space into regions with mostly the same label. Decision trees are easy to understand and implement. However, they tend to overfit the data when we exhaust the branches and go very deep with the trees. Random forests and gradient boosting are two popular ways to use tree algorithms to achieve good accuracy while overcoming the overfitting problem.
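The subdivision idea is easiest to see in a depth-1 tree (a "stump"), sketched here on toy hypothetical data:

```python
# A depth-1 decision tree ("stump"): try every split point on one feature and
# keep the one that minimizes the number of misclassified points.
def best_stump(xs, ys):
    best = (None, len(ys) + 1)
    for t in sorted(set(xs)):
        # predict class 1 for x >= t, class 0 otherwise
        errors = sum((x >= t) != y for x, y in zip(xs, ys))
        if errors < best[1]:
            best = (t, errors)
    return best

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
threshold, errors = best_stump(xs, ys)
print(threshold, errors)  # 4 0
```

Deeper trees simply repeat this search recursively inside each region.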

Neural networks flourished in the mid-1980s due to their parallel and distributed processing ability. But research in this field was impeded by the ineffectiveness of the back-propagation training algorithm that is widely used to optimize the parameters of neural networks. Support vector machines (SVM) and other simpler models, which can be easily trained by solving convex optimization problems, gradually replaced neural networks in machine learning.

In recent years, new and improved training techniques such as unsupervised pre-training and layer-wise greedy training have led to a resurgence of interest in neural networks. Increasingly powerful computational capabilities, such as graphics processing units (GPUs) and massively parallel processing (MPP), have also spurred the revived adoption of neural networks. The resurgent research in neural networks has given rise to the invention of models with thousands of layers.

In other words, shallow neural networks have evolved into deep learning neural networks. Deep neural networks have been very successful for supervised learning. When used for speech and image recognition, deep learning performs as well as, or even better than, humans. Applied to unsupervised learning tasks, such as feature extraction, deep learning also extracts features from raw images or speech with much less human intervention.

A neural network consists of three parts: input layer, hidden layers and output layer. The training samples define the input and output layers. When the output layer is a categorical variable, then the neural network is a way to address classification problems. When the output layer is a continuous variable, then the network can be used to do regression. When the output layer is the same as the input layer, the network can be used to extract intrinsic features. The number of hidden layers defines the model complexity and modeling capacity.
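A forward pass through such a network can be sketched in a few lines; the weights below are arbitrary, for illustration only:

```python
import math

# A minimal one-hidden-layer network forward pass: input layer -> hidden
# layer (sigmoid activation) -> single output unit.
def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, hidden_weights, hidden_bias, out_weights, out_bias):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_bias)]
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias

x = [1.0, 2.0]
hidden_weights = [[0.5, -0.25], [0.1, 0.3]]   # two hidden units
hidden_bias = [0.0, 0.1]
out_weights = [1.0, -1.0]
out_bias = 0.5
out = forward(x, hidden_weights, hidden_bias, out_weights, out_bias)
print(out)
```

Training would adjust the weights by back-propagation; only the forward computation is shown here.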

K-means/k-modes and Gaussian mixture model (GMM) clustering aim to partition n observations into k clusters. K-means defines a hard assignment: each sample is associated with one and only one cluster. GMM, however, defines a soft assignment: each sample has a probability of being associated with each cluster. Both algorithms are simple and fast enough for clustering when the number of clusters k is given.
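The hard-assignment loop of k-means can be sketched on 1-D data; the starting centroids and data values are hypothetical:

```python
# Bare-bones k-means on 1-D data: alternate hard assignment of each sample
# to its nearest centroid with recomputing centroids as cluster means.
def kmeans_1d(xs, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in xs:                                   # hard assignment step
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[j]  # update step
                     for j, c in enumerate(clusters)]
    return centroids

xs = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]   # two obvious groups
result = kmeans_1d(xs, centroids=[0.0, 10.0])
print(result)  # centroids converge near 1.0 and 9.0
```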


When the number of clusters k is not given, DBSCAN (density-based spatial clustering of applications with noise) can be used, connecting samples through density diffusion.

Hierarchical partitions can be visualized by using a tree structure (a dendrogram). Hierarchical clustering does not need the number of clusters as an input, and the partitions can be viewed at different levels of granularity (that is, clusters can be refined or coarsened) by using different values of K.

We generally do not want to feed a large number of features directly into a machine learning algorithm, because some features may be irrelevant or the "intrinsic" dimensionality may be smaller than the number of features. Principal component analysis (PCA), singular value decomposition (SVD), and latent Dirichlet allocation (LDA) can all be used to perform dimension reduction.

PCA is an unsupervised dimension reduction method that maps the original data space into a lower-dimensional space while preserving as much information as possible. PCA basically finds the subspace that best preserves the data variance, with the subspace defined by the dominant eigenvectors of the data's covariance matrix.
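For two-dimensional data, the dominant eigenvector of the covariance matrix has a closed form, so a first principal component can be sketched without any linear algebra library (the data points are hypothetical):

```python
import math

# PCA on 2-D data via the closed-form dominant eigenvector of the 2x2
# covariance matrix: the first principal component is the direction of
# maximum variance.
def first_principal_component(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n           # var(x)
    c = sum((p[1] - my) ** 2 for p in points) / n           # var(y)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n  # cov(x, y)
    # Largest eigenvalue of [[a, b], [b, c]] and its eigenvector (b, lam - a):
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    vx, vy = (b, lam - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # lie on y = x
pc = first_principal_component(points)
print(pc)  # approximately (0.7071, 0.7071): the y = x direction
```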

The SVD is related to PCA in the sense that SVD of the centered data matrix (features versus samples) provides the dominant left singular vectors that define the same subspace as found by PCA. However, SVD is a more versatile technique as it can also do things that PCA may not do. For example, the SVD of a user-versus-movie matrix is able to extract the user profiles and movie profiles which can be used in a recommendation system. In addition, SVD is also widely used as a topic modeling tool, known as latent semantic analysis, in natural language processing (NLP).

A related technique in NLP is latent Dirichlet allocation (LDA). LDA is a probabilistic topic model: it decomposes documents into topics in a way similar to how a Gaussian mixture model (GMM) decomposes continuous data into Gaussian densities. Unlike the GMM, LDA models discrete data (words in documents), and it constrains the topics to be *a priori* distributed according to a Dirichlet distribution.

This workflow is easy to follow. The takeaway messages when trying to solve a new problem are:

- Define the problem. What problems do you want to solve?
- Start simple. Be familiar with the data and the baseline results.
- Then try something more complicated.

SAS Visual Data Mining and Machine Learning provides a good platform for beginners to learn machine learning and apply machine learning methods to their problems.


The post Model selection for spatial econometrics using PROC SPATIALREG appeared first on Subconscious Musings.

For the purpose of illustration, this post uses the same 2013 North Carolina county-level home value data that was used in the previous post. The data set is named NC_HousePrice and contains five variables: county (county name), homeValue (median value of owner-occupied housing units), income (median household income in 2013 in inflation-adjusted dollars), bachelor (percentage of people in the county who have a bachelor’s degree or higher), and crime (rate of Crime Index offenses per 100,000 people). Before you proceed with spatial econometric analysis, you need to create a spatial weights matrix. For convenience, consider a first-order contiguity matrix W, where two counties are neighbors to each other if they share a common border.

To fit a spatial Durbin model (SDM) to the NC_HousePrice data, you can submit the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model Hvalue=Income Crime Bachelor/type=SAR;
spatialeffects Income Crime Bachelor;
test _rho=0/all;
spatialid County;
run;
```

You supply two data sets—the primary data set and a spatial weights matrix—by using the DATA= option and the WMAT= option, respectively. The primary data set contains the dependent variable, the independent variables, and possibly the spatial ID variable. In the MODEL statement, you specify the dependent variable y and regressors x. You use the TYPE= option to specify the type of model to be fit, selecting one of the following values: SAR, SEM, SMA, SARMA, SAC, and LINEAR. For example, you specify TYPE=SAR to fit a SAR model and TYPE=LINEAR to fit a linear regression model. You use the SPATIALEFFECTS statement to specify exogenous interaction effects. In the preceding statements, you specify TYPE=SAR together with the SPATIALEFFECTS statement to fit an SDM model. The TEST statement in PROC SPATIALREG enables you to perform hypothesis testing. The SPATIALREG procedure supports three different tests: likelihood ratio (LR), Wald, and Lagrange multiplier (LM). The SPATIALID statement enables you to specify a spatial ID variable to identify observations in the two data sets that are specified in the DATA= and WMAT= options.

For the SDM model fitted to the NC_HousePrice data, the value of Akaike’s information criterion (AIC) is –144.96. The results of parameter estimation from the SDM model, shown in Table 1, suggest that three predictors—income, crime, and bachelor—are all significant at the 0.05 level. The spatial correlation coefficient ρ is estimated to be 0.31 and is significant at the 0.05 level. Table 2 shows the test results for H0: ρ = 0 from the three tests. According to the test results, you can conclude that H0 should be rejected at the 5% significance level. In other words, there is a significantly positive spatial correlation in house price.

To fit a spatial error model (SEM) to the NC_HousePrice data, you can submit the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model Hvalue=Income Crime Bachelor/type=SEM;
spatialid County;
run;
```

The results of parameter estimation from the SEM model, shown in Table 3, suggest that two out of three predictors—income and bachelor—are significant at the 0.05 level. The value of AIC for this model is –122.68. The spatial correlation coefficient λ is estimated to be 0.60 and is significant at the 0.05 level. As a result, there seems to be a significant positive correlation in the disturbance.

So far, SDM and SEM models have been fitted to the NC_HousePrice data. The SDM model is capable of accounting for both endogenous and exogenous interaction effects, whereas the SEM model can account for spatial dependence in the disturbance term. The comparison of AIC values between these two models suggests that the SDM model is better than the SEM model because the SDM model has a smaller AIC value. However, you might want to try a more complicated model—such as a spatial autoregressive confused (SAC) model—that can address endogenous interaction effects, exogenous interaction effects, and spatial dependence in the disturbance term.

You can fit an SAC model to the NC_HousePrice data by submitting the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model Hvalue=Income Crime Bachelor/type=SAC;
spatialid County;
run;
```

Table 4 shows the results of parameter estimation from the SAC model. As in the SEM model, only two predictors—income and bachelor—are significant at the 0.05 level. The value of AIC for this model is –121.43. The spatial correlation coefficient ρ is estimated to be –0.10, but it is not significant at the 0.05 level. However, the spatial correlation coefficient λ is estimated to be 0.67 and is significant at the 0.05 level.

Among the three models that have been considered for the NC_HousePrice data, the SDM model is the one with the smallest value of AIC. As a result, if you have to choose the best model among SDM, SEM, and SAC models according to AIC, the SDM model would be the winning model. Since model selection is common in most data analysis, PROC SPATIALREG facilitates model selection by enabling you to use multiple MODEL statements to fit more than one model at a time. For example, you can fit the preceding three models to NC_HousePrice data by submitting the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model Hvalue=Income Crime Bachelor/type=SAR;
spatialeffects Income Crime Bachelor;
test _rho=0/all;
model Hvalue=Income Crime Bachelor/type=SEM;
model Hvalue=Income Crime Bachelor/type=SAC;
spatialid County;
run;
```
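Once all three AIC values are in hand, the selection rule itself is simple: pick the model with the smallest AIC (AIC = 2k - 2 log L, where k is the number of parameters, so smaller is better). Sketched outside SAS, in Python, using the AIC values reported in this post:

```python
# Model selection by AIC: the model with the smallest AIC wins.
# The AIC values below are the fitted values reported in this post.
def best_by_aic(models):
    """Return the name of the model with the smallest AIC."""
    return min(models, key=models.get)

aic = {"SDM": -144.96, "SEM": -122.68, "SAC": -121.43}
print(best_by_aic(aic))  # SDM
```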

This post introduces how to fit spatial econometric models that are available in the SPATIALREG procedure. These models include the spatial Durbin model (SDM), spatial error model (SEM), and spatial autoregressive confused (SAC) model. It also describes how you can use multiple MODEL statements in only one call to PROC SPATIALREG to facilitate model selection. In the next blog post, we’ll talk more about how to create spatial weights matrices for spatial econometric analysis. We will also be giving a talk, "Big Value from Big Data: SAS/ETS® Software for Spatial Econometric Modeling in the Era of Big Data," at the SAS Global Forum conference April 2-5, 2017, in Orlando. Stop by and let's talk about spatial econometrics!


The post Women in analytics - a personal perspective appeared first on Subconscious Musings.

**Getting an early start in Analytics**

How do you encourage a young girl to pursue a career that requires mathematical or scientific skills? How do you react to your child’s interest in mathematics? Do you react differently depending on whether the child is a boy or a girl? Why? Encourage him or her to pursue that interest, and ensure that the school is emphasizing that message as well. From my childhood, my parents (especially my dad, who started his career as a math professor) were very proud that I was good at math – no one ever commented on the fact that I was also a girl! When I started high school and ended up being the only girl in the math class, we took it in stride and did not make a big deal of it – Radhika likes math and she is in the math class. That was it!

One way to interest young girls in math-related fields is by providing examples of real-world applications so they can understand the value. In today’s world, every field is rich with applications of mathematical modeling, so it is easier to capture students’ interest. Another avenue is to encourage your child to participate in competitions, especially ones involving teams. Often, girls like working in teams, which can also be good training for their professional careers later in life.

And, of course, it is very important for the young girls to have role models of women in analytics who provide examples of successful careers in these fields.

**Personal experience**

I had a unique experience during my Master’s program in Mathematics at the Indian Institute of Technology in Delhi. It was amazing that in 1975, my class had 7 women out of a small class of 14 students! Several of us women have gone on to have very successful professional careers in highly technical fields – in fact, three of us are now in the U.S. Likewise, my PhD class in Operations Research at Cornell had five women out of a total of about a dozen students. With that kind of experience of women in analytics, I have never noticed that there may be more men than women in my field. We are all just members of the analytics community!

**Career Growth**

Women are often hesitant to talk about their strengths and to proactively seek promotions. Be bolder in seeking out new opportunities. Of course, you need to be qualified for the role! Be confident in your strengths. If you have a strong foundation and are the expert in your area, your being a woman in analytics is irrelevant. People will recognize you as the expert immediately. Be bold and take a seat at the table. If you have been invited to participate, it is because you are recognized as someone who can contribute – use that opportunity to do so.

**Women in Analytics professions**

Within SAS Institute, we have several women in analytics at all levels – from individual contributors who are recognized throughout the institute as “the expert in a particular area” to senior managers who are responsible for key flagship analytical products from SAS. Several of them play important roles in leading professional organizations in addition to their responsibilities at SAS. As a percentage of the workforce, we may have fewer women than men in the industry as a whole. However, there are several women in leading roles in analytics, both as leaders in major multinational companies (IBM, Ford, Verizon) and at the helm of professional organizations like the Institute for Operations Research and the Management Sciences (INFORMS), the American Statistical Association, and others.

Especially in the last few years, women in analytics have made great strides in leading large organizations. For example, Dr. Pooja Dewan was named Chief Data Scientist at BNSF, Malene Haxholdt is VP of Enterprise Analytics at MetLife, Dr. Nipa Basu is Chief Analytics Officer at Dun & Bradstreet, and so on.

There is a geographic differential in the representation of women in technology. Some research from the UK indicates that women in India seem to see IT/STEM as empowering in a way that women in the UK or US do not. There is also research suggesting that while women may not be gaining ground in numbers in computer science or math, the discipline of statistics is an exception: “More than 40 percent of degrees in statistics go to women, and they make up 40 percent of the statistics department faculty poised to move into tenured positions.”

**Increasing interest in analytics for women**

As analytics and data science become more ubiquitous in several industries, we are seeing an increase in the number of women in Analytics. There are a few key reasons for this increase:

- There are many application areas for analytics that are attractive to women: health care, education, nonprofit work, and projects aimed at doing social good. There is ample evidence that women are especially drawn to opportunities to make a social impact.
- More companies are providing flexibility that helps women get back into, or stay in, the workforce. These include benefits like flexible work schedules, more “work from home” options, family medical leave options, more options for day care, support for nursing mothers, and so on.
- Many educational opportunities are available for women to get trained in data science and analytics arenas through online master’s programs or certification programs. For women who have a STEM-related undergraduate education, this provides an easy entry into the analytics domain. There is an interesting article describing how data science is creating opportunities for women. It is exciting to see the many women leaders who participated in the recent Women in Data Science Conference where the elite panel of speakers were all women!
- Many collaborative, team competitions (for example, the DataDive at SAS being held in partnership with DataKind) are being arranged across the data science and analytics domain – such collaborative, problem solving events may be particularly appealing to women.

**Challenges and opportunities being a woman in this highly competitive field**

It is well recognized that a confidence issue plagues women in fields that are dominated by men. A Harvard study reported: “Female computer science concentrators with eight years of programming experience report being as confident in their skills as their male peers with zero to one year of programming experience.” I can relate to that! Most women I know will speak up only if they are confident that their points are thoroughly researched and vetted. We need to encourage them to participate freely in any dialog and discussion. Women need to be reminded that they were invited into the group or discussion because their opinions are valued.

At the same time, women bring some inherent strengths to the table. We have an ability to understand others’ points of view, which makes it possible to have productive discussions over contentious topics. We also have a capacity to nurture, which is useful in growing a team of very talented individuals who are often brilliant but may not be used to working as part of a high-achieving team. One of the most important skills for a successful leader is the ability to see the big picture (remember the tale of the “Six Blind Men and the Elephant”?). I believe that women are more likely to understand the big picture because of their natural empathy for others’ inputs.

My advice to young women entering the field of analytics: do not hold yourself back because you are a woman; you have earned the right to be in this area. Use your strengths to build a strong team by nurturing everyone’s talents. This is a golden age to be part of this domain!

*Image credit: photo by Mike Kline // attribution by Creative Commons*

Note: I prepared these thoughts for an interview in Analytics India Magazine’s International Women’s Day special.

The post Women in analytics - a personal perspective appeared first on Subconscious Musings.


I recently met Mrs. Claus at the INFORMS Annual Meeting, where we got to talking about the social network analysis session she’d just attended. It turns out Mrs. Claus and I are both fans of a book by Alex Pentland, Social Physics: How Social Networks Can Make Us Smarter. Apparently, years ago she foresaw the trend toward analytics and returned to school for dual PhDs in Computer Science and Statistics at Stanford. She now carries the title of Chief Data Scientist of Santa’s Workshop. Who knew? We chatted about the many ways she and her team at Santa’s Workshop use social network analysis – some commonly employed, others surprising adaptations.

Santa’s Workshop first started using social network analysis to uncover fraud. While naughty children exist (which requires predicting coal delivery, but that’s another post), the perpetrators Santa is after are adults. I bet you didn’t know that some households pretend to have young children by leaving out notes for Santa, even if they have no children at all or their children are grown and have left home. The challenge in finding fraudsters is not spotting a pattern – fraud is by definition a rare event – but making meaningful connections between disparate data activities.

Social network analysis allows investigators to look at lots of data from multiple sources at the level of a network, where they can see different people (nodes) and their relationships (ties) in the form of a graph. The connections between people may not be apparent at the transactional level but jump out when viewed graphically in network form. Just as the Los Angeles County Department of Public Social Services uses social network analysis to quickly identify potential co-conspirators in fraud rings, Santa’s Workshop uses social network analysis to find the bad guys. Mrs. Claus has uncovered several fraud rings using this type of analysis and stopped delivery of presents to those homes.
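Mrs. Claus’s tools do the heavy lifting, but the core idea behind spotting a ring is easy to sketch. Below is a hypothetical Python illustration (not SAS code; the names and ties are invented): people are nodes, relationships are edges, and each connected component of the graph is a candidate ring to investigate.

```python
from collections import defaultdict

# Invented example data: pairs of people linked by shared addresses,
# phone numbers, or notes left out for Santa.
ties = [
    ("Alice", "Bob"), ("Bob", "Carol"),  # one suspicious cluster
    ("Dave", "Eve"),                     # another cluster
    ("Frank", "Frank"),                  # an isolated individual
]

def fraud_rings(edges):
    """Group nodes into connected components via graph traversal."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, rings = set(), []
    for node in graph:
        if node in seen:
            continue
        ring, frontier = set(), [node]
        while frontier:
            n = frontier.pop()
            if n not in ring:
                ring.add(n)
                frontier.extend(graph[n] - ring)
        seen |= ring
        rings.append(ring)
    return rings

print(sorted(sorted(r) for r in fraud_rings(ties)))
# three components: a possible ring of three, a pair, and a singleton
```

At real scale the interesting part is which links to draw at all – connecting disparate data sources is what makes the rings visible in the first place.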

Telecom companies are not the only ones to worry about churn. Santa’s Workshop also has to worry about this problem, which arises when children stop believing in Santa Claus and cancel their “Santa service” prematurely. You may think this transition of beliefs is an isolated event that is just a natural function of age, but Mrs. Claus uses the SAS® Enterprise Miner™ Link Analysis node (link analysis is a popular form of social network analysis) to uncover notable connections among parents, siblings, and schools that suggest the possibility of churn. They look at cell phone call records and social media connections to understand relationships, and then use targeted interventions to offer parents tools to allow children to continue to believe.

Drawing on Pentland’s research, they have applied careful use of social network incentives to encourage older children not to tell their younger siblings or other children in the neighborhood. Pentland’s lab has found that this kind of positive social pressure is best applied to people in the target’s network rather than directly to the target. For Santa, this means offering incentives to the respected friends of an older child at risk of “explaining” Santa to younger siblings, rather than to that older child directly. Pentland’s research shows that this kind of nudge is far more effective than standard economic incentives, because it recognizes that we are social actors strongly influenced by our social ties. Mrs. Claus and team have also found that children at risk of churning held onto their beliefs longer when they received special mail messages from Santa himself.

Using social network analysis to detect fraud and churn are common use cases, but what really intrigued me was how Santa’s Workshop uses social network analysis to generate ideas for presents by improving idea flow among the elves. You may have thought that Santa is only an order taker, but consider the challenge he faces when a child asks for something that simply is infeasible (pony requests in New York City, for example). So each year he must conceive of, design, and produce items that may not have been requested but are likely to please. Product Manager Elves are responsible for finding ideas for these gifts and delivering product specifications to the Manufacturing Elves.

Research shows that the best way to stimulate idea flow is to increase both exploration and engagement. Early in the year, Product Manager Elves travel around (incognito, of course) to hunt for ideas. The Product Manager Elves most consistently successful at generating creative product ideas are those Pentland labels explorers. You know these kinds of people – they are the ones who know lots of different kinds of people, love talking ideas with them, and then share the ideas they’ve just gathered in subsequent conversations. As Pentland describes, their focus is not “the ‘best’ people or ‘best’ ideas” but “people with different views and different ideas.” They then filter the best ideas by learning which ones generate the most traction in their subsequent conversations with others.

The other key to idea flow is engagement, which happens when new ideas are shared within teams. The best ideas the Product Manager Elves discover go nowhere if they aren’t adopted and championed by other Elves. So, drawing upon an example in Pentland’s book about improving idea flow at a call center, Mrs. Claus scheduled a common lunch hour, so everyone breaks at the same time to eat. Previously lunch was staggered to avoid bringing down the line, but they’ve learned that when all those Elves from different departments circulate at lunch, they share ideas. The ideas that stick are the ones whole teams get excited about: the teams start contributing to design specs and begin to see the ideas as belonging to Santa’s Workshop, not just to the Product Manager Elf who discovered them initially. Plus, this seems to have helped Elf retention, because everyone feels part of the entire process.

What does Mrs. Claus have on her 2017 data science horizon? She’s been exploring the use of the sociometric badges that Pentland first employed in his research. Commonly known as sociometers, these are small electronic wearable devices that collect data on people’s interactions (face-to-face time, conversation, gestures, physical proximity, etc.). Pentland’s devices are the size of thick badges, but Santa’s Workshop has developed tiny ones they can surreptitiously place on toys to track similar behavior, with the added element of serving as gateways so they can analyze the data in real time. She hopes to make tweaks to gift-giving in 2017 as Santa travels around the world, drawing upon the initial reactions of children to new gifts.

I was glad to meet Mrs. Claus, another fan of Pentland’s book, Social Physics, because it is full of interesting ideas. I’m clearly an explorer, because when I read a book like this I want to discuss it with other people, hear their reactions, and learn new things. Later chapters talk about how the concepts of social physics can lead to smarter cities and even smarter societies. To ensure that the kind of data collected is used ethically, Pentland even proposes a New Deal on Data. I’m encouraged by research like this that can be applied as part of the growing #data4good movement. So lots of good stuff here! If any of you have read this book, please chime in and let me know what caught your attention.

*Mrs. Claus image credit: photo by Public Information Office // attribution by Creative Commons*

*Santa's Workshop image credit: photo by Loozrboy // attribution by Creative Commons*

The post How Santa’s Workshop uses social network analysis appeared first on Subconscious Musings.


Two weeks ago I heard two very interesting talks on data4good at the INFORMS Annual Meeting, where 5,000+ people focused on operations research gathered. The first was on “Challenges and Lessons Learned from Influencing Policy Change in Organ Transplantation.” As you can see from the photo I took, this session combined quite a distinguished group of operations research (OR) academics and transplant surgeons, all of whom want to make an impact. For many reasons, a pure market-based approach is not the best way to allocate organs for transplant, so the process is governed by policy makers, who have divided the country into regions. All parties agree that the current regional system results in disparities in access and is broken, but policy makers have been unable to settle on a solution.

Because this situation is a classic market-matching problem, it has drawn the attention of the operations research field.* Over the years the academics on the panel had proposed a variety of mathematical solutions. But the most elegant mathematical model for a real-world problem adds little value if it is not implemented. So why haven’t they solved the problem? Part of it is as simple, and as frustrating, as problem definition. Are they trying to help those who are the sickest or those most likely to succeed with a transplant? Is it fair that some regions have shorter waiting lists because more organs are available due to increased deaths? Agreeing on the problem definition is tough, and as the clinicians explained, they “argue a lot, because life matters.”

The other session that triggered my thinking was a tutorial on healthcare analytics by Joris van de Klundert of the Erasmus University Institute of Health Policy and Management, who issued a challenge to the OR professionals in attendance. Healthcare analytics is a critical area for the world’s population, but one where his fellow researchers in OR are making only a modest contribution. A big part of the problem is too much emphasis on research and too little on results that actually improve healthcare. His literature review of articles on healthcare analytics at various stages of the analytical life cycle highlighted this fact: the vast majority of research is in model building, with fewer and fewer articles published as you move along the cycle to solution development, then model implementation, and finally evaluation and monitoring. There was a lively discussion among the audience about the challenges, which include the difficulty academics face getting involved in practice, the conflict between the simple models most often needed and those that will result in publication, and the risk tenure-track faculty face doing work that may not result in the right kind of publications.

After listening to these data4good talks, I propose these tips to ensure your applications of analytics have a real impact:

- **Take time to listen to your “customer.”** Even in the social sector, you still have “customers” – the people or groups for whom you are trying to solve the problem. The transplant surgeons emphasized that it takes a lot of time to build relationships between clinicians and what they called “engineers,” in part because of the big gap that can exist between what the two groups value. Be sure to explain your results, to increase the credibility of your model. As the San Bernardino County Department of Behavioral Health found, discussing their analysis with many of their partners in care helped them align on the goals they shared.
- **Build models that match the problem as well as the solution.** This means ensuring that you have defined the problem correctly, which, as the thorny organ transplant discussion shows, may be far more than half the battle. It also relates back to listening to your “customer.” SAS is working with DataKind and the Boston Public Schools to optimize their bus routes, and as we do so we have to periodically check whether the initial models we propose would make sense in practice. People who know math must talk to people who know buses to know if the model will work.
- **Focus on putting models into practice.** Modeling the problem is important, but as van de Klundert’s literature review shows, the OR community has plenty of success in this area. The challenge is working closely enough (see the first two tips) with your “customer” to find a path to implementation. After all, as one of the academics said, “our endless models don’t necessarily provide the details practitioners want.” So find the details they do want, put them into your model, and work with them to put that model into practice. As Jake Porway, founder of DataKind, blogged: “we cannot make change with technology or data alone.”
- **Remember Occam’s razor: the simplest solution is often the best.** For all your interest in trying out the latest non-negative matrix factorization model, a simple logistic regression is often hard to beat. And it will be far more interpretable to most non-analytics professionals.

Today’s #GivingTuesday celebrates giving of all forms, and the social sector could benefit so much from the talent of data scientists (in fact, what sector wouldn’t benefit?). But as Jake Porway likes to say, “you can’t just hack your way to social change.” You must consider impact from the start for your data4good efforts to succeed. After all, who wants to give their time and talent without it making a difference?

* Alvin Roth shared the Nobel Prize for his work in this area, "for the theory of stable allocations and the practice of market design," and while he is a professor of economics at Stanford his PhD is in operations research.

The post 4 tips to ensure your data4good efforts have an impact appeared first on Subconscious Musings.


Once a TV is calibrated, it is ready to enjoy. The visual data – the broadcast information – can be observed, processed, and understood in real time. When it comes to data analytics, however, with raw data in the form of numbers, text, images, etc., gathered from sensors and online transactions, ‘seeing’ the information contained within is not so easy, especially as the source grows rapidly. Machine learning is a form of self-calibration of predictive models given training data. These modeling algorithms are commonly used to find hidden value in big data. Facilitating effective decision making requires the transformation of relevant data into high-quality descriptive and predictive models. The transformation presents several challenges, however. As an example, take a neural network (Figure 1). A set of outputs is predicted by transforming a set of inputs through a series of hidden layers defined by activation functions linked with weights. *How do we determine the activation functions and the weights that yield the best model configuration?* This is a complex optimization problem.

The goal in this model training optimization problem is to find the weights that minimize the error in model predictions, given the training data, validation data, specified model configuration (number of hidden layers, number of neurons in each hidden layer), and regularization levels designed to reduce overfitting to the training data. One recently popular approach to solving for the weights in this optimization problem is a *stochastic gradient descent* (SGD) algorithm. The performance of this algorithm, as with all optimization algorithms, depends on a number of control parameters, for which no single set of default values is best for all problems. SGD parameters include, among others, a *learning rate* controlling the step size for selecting new weights, a *momentum* parameter to avoid slow oscillations, a *mini-batch* size for sampling a subset of observations in a distributed environment, and *adaptive decay rate* and *annealing rate* parameters to adjust the learning rate for each weight over time. See the related blog post ‘Optimization for machine learning and monster trucks’ for more on the benefits and challenges of SGD for machine learning.
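To make the roles of the learning rate and momentum concrete, here is a toy Python sketch – not SAS’s implementation, and with the data, rates, and batch size all invented for illustration – of mini-batch SGD with momentum fitting a single weight:

```python
import random

# Invented toy problem: fit w so that w*x approximates y = 3*x.
random.seed(1)
data = [(x / 10, 3 * x / 10) for x in range(1, 11)]

def sgd(lr=0.1, momentum=0.9, batch_size=5, epochs=200):
    w, velocity = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)                 # new mini-batches each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of mean((w*x - y)^2) with respect to w on this batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            velocity = momentum * velocity - lr * grad  # momentum smooths steps
            w += velocity                    # step along the velocity
    return w

print(round(sgd(), 4))  # converges to the true slope, 3.0
```

Shrinking the learning rate slows convergence; raising it (or the momentum) too far makes the weight overshoot and oscillate – the same trade-offs described above, in miniature.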

The best values of the control parameters must be chosen very carefully. For example, the momentum parameter dictates how the algorithm behaves in the ravines where solutions lie – whether it oscillates slowly, jumping back and forth across the ravine, or dives in quickly. But if momentum is too high, the algorithm could jump past the solution (Figure 2). The best values for these parameters also vary for different data sets, just as the ideal adjustments for an HDTV depend on the characteristics of its environment. These options, which must be chosen before model training begins, dictate not only the performance of the training process but, more importantly, the quality of the resulting model – again like the tuning parameters of a modern HDTV controlling the picture quality. Because these parameters are external to the training process – they are not the model parameters (the weights in the neural network) being optimized during training – they are often called ‘*hyperparameters*’. Settings for these hyperparameters can significantly influence the resulting accuracy of the predictive models, and there are no clear defaults that work well for different data sets.

In addition to the optimization options already discussed for the SGD algorithm, the machine learning algorithms themselves have many hyperparameters. Following the neural net example, the number of hidden layers, the number of neurons in each hidden layer, the distribution used for the initial weights, etc., are all hyperparameters specified up front for model training that govern the quality of the resulting model.

The traditional approach to finding the ideal values for hyperparameters – to tuning a model to a given data set – has been a manual effort. However, even with expertise in machine learning algorithms and their parameters, the best settings of these parameters change with different data and are difficult to predict based on previous experience. To explore alternative configurations, typically a grid search or parameter sweep is performed. But a grid search is often too coarse: because expense grows exponentially with the number of parameters and the number of discrete levels of each, a grid search will often fail to identify an improved model configuration. More recently, random search has been recommended. For the same number of samples, a random search covers the space better, but it can still miss good hyperparameter values and combinations, depending on the size and uniformity of the sample. A better approach is a random Latin hypercube sample. In this case, samples are exactly uniform across each hyperparameter, but random in combinations. This approach is more likely to find good values of each hyperparameter, which can then be used to identify good combinations (Figure 3).
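A random Latin hypercube sample is straightforward to sketch. The following hypothetical Python snippet (the two hyperparameters and their ranges are invented for illustration) stratifies each hyperparameter into equal bins, draws one value per bin, and shuffles the bins independently, so each parameter is covered uniformly while combinations stay random:

```python
import random

random.seed(7)

def latin_hypercube(ranges, n_samples):
    """One draw per equal-width stratum for each parameter, with strata
    shuffled independently per parameter: uniform coverage of each
    hyperparameter, random pairings between them."""
    columns = []
    for lo, hi in ranges:
        width = (hi - lo) / n_samples
        # one uniform draw inside each of the n_samples strata
        col = [lo + (i + random.random()) * width for i in range(n_samples)]
        random.shuffle(col)
        columns.append(col)
    return list(zip(*columns))

# invented ranges for a learning rate and a momentum parameter
samples = latin_hypercube([(0.001, 0.1), (0.5, 0.99)], n_samples=5)
for lr, mom in samples:
    print(f"learning_rate={lr:.4f}  momentum={mom:.3f}")
```

Compare with a plain random sample of the same size, which can leave whole strata of a parameter untouched; here every stratum of every parameter is guaranteed exactly one sample.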

True hyperparameter optimization, however, should allow searching *between* these discrete samples to find good *combinations* of hyperparameter values, because a discrete sample is unlikely to land on even a local accuracy peak or error valley in the hyperparameter space. And because machine learning training and scoring algorithms are a complex black box to the tuning algorithm, they create a challenging class of optimization problems:

- Machine learning algorithms typically include not only continuous, but also categorical and integer variables. These variables can lead to very discrete changes in the objective.
- In some cases, the space is discontinuous where the objective blows up.
- The space can also be very noisy and non-deterministic. This can happen when distributed data is moved around due to unexpected rebalancing.
- Objective evaluations can fail due to grid node failure, which can derail a search strategy.
- Often the space contains many flat regions – many configurations give very similar models.

An additional challenge is the unpredictable computation expense of training and validating predictive models with changing hyperparameter values. Adding hidden layers and neurons to a neural network can significantly increase the training and validation time, resulting in a wide range of potential objective expense. A very flexible and efficient search strategy is needed.

SAS Local Search Optimization, part of the SAS/OR® offering, is a hybrid derivative-free optimization strategy that operates in a parallel/distributed environment to overcome the challenges and expense of hyperparameter optimization. It comprises an extendable suite of search methods driven by a hybrid solver manager that controls concurrent execution of the search methods. Objective evaluations (different model configurations, in this case) are distributed across multiple evaluation worker nodes in a grid implementation and coordinated in a feedback loop that supplies data from all concurrently running search methods. The strengths of this approach include handling of continuous, integer, and categorical variables; handling of nonsmooth, discontinuous spaces; and ease of parallelizing the search strategy. Multi-level parallelism is critical for hyperparameter tuning. For very large data sets, distributed training is necessary; even with distributed training, the expense of training severely restricts the number of configurations that can be evaluated when tuning sequentially. For small data sets, cross validation is typically recommended for model validation, a process that further increases the tuning expense. Parallel training (distributed data and/or parallel cross-validation folds) and parallel tuning can be managed – very carefully – in a parallel/threaded/distributed environment. This is rarely discussed in the literature or implemented in practice; typically either ‘data parallel’ or ‘model parallel’ (parallel tuning) is exercised, but not both.
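The SAS implementation is far more sophisticated, but the basic shape of ‘model parallel’ tuning – evaluating many candidate configurations concurrently while tolerating failed evaluations, as the grid-node-failure challenge above requires – can be sketched in a few lines of Python. Everything here (the configurations and the surrogate error function) is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Invented candidate configurations for a small neural net.
candidates = [
    {"hidden_layers": 1, "neurons": 8},
    {"hidden_layers": 2, "neurons": 16},
    {"hidden_layers": 3, "neurons": 0},   # invalid on purpose
]

def validation_error(config):
    """Stand-in for training and scoring a model with this configuration."""
    if config["neurons"] <= 0:
        raise ValueError("invalid configuration")
    # invented surrogate: deeper/wider scores better here
    return 1.0 / (config["hidden_layers"] * config["neurons"])

results = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(validation_error, c): c for c in candidates}
    for fut in as_completed(futures):
        config = futures[fut]
        try:
            results.append((fut.result(), config))
        except ValueError:
            pass  # a failed evaluation is dropped, not allowed to derail the search
best_error, best_config = min(results)
print(best_error, best_config)
```

In a real feedback-loop strategy, completed evaluations would also steer which configurations are submitted next, rather than the candidate list being fixed up front.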

Optimization for hyperparameter tuning typically leads very quickly to a several-percent reduction in model error over default settings of these parameters. More advanced and extensive optimization, facilitated through parallel tuning to explore more configurations, can lead to further improvement, further refining parameter values. The neural net example discussed here is not the only machine learning algorithm that can benefit from tuning: the *depth* and *number of bins* of a decision tree, the *number of trees* and *number of variables to split on* in a random forest or gradient boosted trees, the *kernel parameters* and *regularization* in SVM, and many more can all benefit from tuning. The more parameters that are tuned – the larger the dimensions of the hyperparameter space – the more difficult a manual tuning process becomes and the coarser a grid search becomes. An automated, parallelized search strategy can also benefit novice machine learning users.

Machine learning hyperparameter optimization is the topic of a talk to be presented by Funda Günes and me at The Machine Learning Conference (MLconf) in Atlanta on September 23. The talk, titled “Local Search Optimization for Hyperparameter Tuning,” includes more details on the approach, parallel training and tuning, and tuning results.

*image credit: photo by kelly // attribution by creative commons*

The post Local Search Optimization for HyperParameter Tuning appeared first on Subconscious Musings.


Who says machine learning can't be fun? A crew of us from SAS went to San Francisco for the recent KDD conference, which bills itself as "a premier interdisciplinary conference, [which] brings together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data." We brought these buttons with us, and they were a huge hit!

But we weren't at KDD just to have fun, of course. We came to learn and share, in our booth and in many other ways. Simran Bagga came to talk about all things text analytics, and she was nice enough to pitch in and help me set up the booth. Naturally, her favorite button was "I'm Feeling Unstructured Today." She gave two extended demos in the booth: "Combining Structured and Unstructured Data for Predictive Modeling Using SAS® Text Miner" and "Topic Identification and Document Categorizing Using SAS® Contextual Analysis."

Wayne Thompson served as a senior editor on the Review Board, which means he oversaw a group of volunteers who had the hard task of reviewing and making selections from the many excellent papers submitted for the Applied Data Science track. He was also a panelist in a "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data." His favorite button was "Talk Data to Me," which he did during his panel, "Internet of Things, Industrial Internet, and Instrumented Environments: the Furious Need for Standards." He also gave an extended demo in the SAS booth on "Machine Learning on the Go."

"Can Tools Effectively Unleash the Power of Big Data?" Udo Sglavo thinks so, and he said as much in this panel he was part of in the Applied Data Science Invited Talks track. As someone who has been involved in data mining for many years, Udo's favorite button was "I Support Vector Machines." This button was popular, because it was also Wei Xiao's favorite. He was busy attending many sessions, but he did give his own extended demo in the booth on "A Probabilistic Machine Learning Approach to Detect Industrial Plant Fault."

Susan Haller, who leads teams responsible for data mining and machine learning at SAS, had a different favorite button: "How Random are Your Forests?” The favorite of Ray Wright, on Susan's team, was "You Can Engineer My Features." Ray is interested in automation, too, which was the subject of his extended demo: "Modeling Automation With SAS® Enterprise Miner™ and SAS® Factory Miner." But Ray also focused on basketball, presenting a poster in the Large Scale Sports Analytics Workshop on "Shot Recommender System for NBA Coaches," which he co-authored with Ilknur Kaynar Kabul and Jorge Silva. Jorge didn't attend the conference, but Ilknur did, and her favorite button was "I'm Having a Cold Start Today." However, Ilknur was not having a cold start when she presented her extended demo: "Auto-Tuning Your Decision Tree, Random Forest and Neural Networks Models." Another member of Susan's team, Patrick Hall, spent a lot of time in the SAS booth, where he was great at answering all kinds of questions. He couldn't decide on a favorite button, though, because it was a tie between "I’m Feeling Unstructured Today” and “I Support Vector Machines.” Patrick answered a lot of questions on options for integrating open source software with SAS, and this was the topic of his extended demo: "Options for Open Source Integration in SAS® Enterprise Miner™." Also on Susan's team, Taiping He liked "I'm Feeling Unstructured Today," which may be a surprise, because his extended demo was "Distributed Support Vector Machines in SAS® Viya™ System." Guess who develops our SVM procedure in SAS Enterprise Miner?

KDD has a nice balance of practitioners and academics in attendance, so we were glad to interact with both groups. We met many students and professors in the booth, and Scott MacConnell was on hand from our Academic Outreach and Collaborations group to talk about all the great free resources SAS has to offer academics. Scott's favorite button was "I Am Feeling Unstructured Today."

We made time for fun, too, and one night many of us ate dinner together at a restaurant called The Stinking Rose, which calls itself "A garlic restaurant." They had fun murals on the wall showing garlic in all kinds of ways you never even dreamed of! I had the Forty Clove Garlic Chicken, and even though I didn't eat anywhere near the number of cloves provided, I do hope my choice didn't depress traffic in the booth. The food was delicious! And my favorite button? "My Networks Run Deep."

The post Machine learning fun at KDD appeared first on Subconscious Musings.

The post The Internet of medical things and of intern things appeared first on Subconscious Musings.

This past summer I used data from cell phones attached at the waist to predict the activity of the owner, which is an exciting application of the internet of medical things. There are a number of immediate applications of this research: contextualizing electrocardiogram signals, improving exercise analysis, and assisting in the care of the elderly. As an intern, my first assignment was simply to replicate the results from an existing activity recognition paper, using SAS/IML® to extract features from a time series and SAS® Enterprise Miner™ to produce an accurate model. As I mentioned earlier, I started my summer knowing how to program in a few languages, including SAS, but I didn't know what a time series of data was, or how to program in IML, and I knew absolutely nothing about how to use a neural network model.

My first obstacle for my summer project on the internet of medical things was figuring out how I learn best. With SAS Enterprise Miner, I began by going through the documentation to get a feel for the different nodes and their settings and options. This was helpful up to a point, but I discovered that I learned best when I tested different options and examined the results. I found this to be true in other parts of my research: when I spent time plotting the time series data with different graph types, styles, and filters, I came to understand my data at a deeper level. When extracting features from a time series, it is important to extract intuitive and meaningful features that capture a characteristic of the time series that would be evident if you looked at it in its entirety. This is almost impossible without spending some time examining the data. I think this is a common trend in our new age of data science and analytics: it's not about what you think the data should say, but about what the data are actually saying.
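As a toy illustration of what such intuitive window features might look like, the sketch below summarizes each sliding window by its level (mean), spread (variance), and oscillation (zero crossings). The specific feature choices, window width, and trace are my own assumptions for illustration, not the features used in the actual project:

```python
from statistics import mean, variance

def window_features(series, width, step):
    """Slide a window over the series and emit simple, interpretable features.

    Each feature reflects something you could point to on a plot of the raw
    signal: its level (mean), its spread (variance), and how much it
    oscillates (zero crossings around the window mean).
    """
    features = []
    for start in range(0, len(series) - width + 1, step):
        w = series[start:start + width]
        m = mean(w)
        centered = [x - m for x in w]
        # Count sign changes of the mean-centered signal.
        crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
        features.append({"mean": m,
                         "variance": variance(w),
                         "zero_crossings": crossings})
    return features

# Toy accelerometer-like trace: a quiet stretch followed by vigorous motion.
trace = [0.0, 0.1, 0.0, 0.1, 1.0, -1.0, 1.0, -1.0]
rows = window_features(trace, width=4, step=4)
```

Here the variance feature cleanly separates the quiet first window from the vigorous second one, which is exactly the kind of characteristic you can verify by eye against a plot of the raw trace.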

Throughout this project I observed some characteristics of the internet of medical things, but I also learned what I call the Internet of Intern Things.

**1. Read, read, read, and read some more.**

On my second day at SAS, my manager and another team member met me for lunch and we discussed some possible projects for my summer experience. I admitted that I didn't have a particular research interest, but I was willing to try anything. My team member suggested a project and later sent me a folder of recent publications related to the topic. I dutifully saved the folder to my computer and printed the PDFs that seemed the most intuitive and easiest to understand. I read the documents, highlighted what I thought was important, and launched into the project. That was my first mistake. For example, I began searching for a filtering mechanism, because that's what the project required, but I didn't understand my data and its form well enough to explain why I needed a filtering mechanism. Days later my mentor asked me some questions about the data, just to be sure we were on the same page, and I realized I wasn't sure about my answers. Not only was I embarrassed, but I was worried that the time I had spent on my project was wasted. Of course, as an intern, no time spent failing is wasted, because failures are learning experiences, but I was nonetheless disappointed. Fortunately for me, my mentor was very understanding. From this experience I learned that taking time to contextualize your data at the beginning of a project is not only helpful, but necessary. Moreover, it is worth prioritizing the reading of research papers, along with papers and articles suggested by colleagues. Several times throughout the summer, my mentors, who have much more experience than I do, recommended well-known papers or recent articles relevant to my field of study and interests. I learned that it is valuable to take time each week, if not each day, to read a short paper or a few articles to stay up to date and informed. 
As an intern, reading helped me to understand the “buzz words” of my field, like “data science” and “machine learning,” and gave me talking points when I met with my colleagues for lunch. I know, that sounds a little over the top. Like really, “talking points” for lunch? But as an intern it is important to set yourself apart, and being well-read helps.

**2. Collaboration is not only helpful, but imperative.**

If I were to summarize my summer experience at SAS into one word, it would be “collaboration.” Collaboration was crucial to my summer project, and to navigating such a large organization as an intern. After giving my first presentation of my preliminary work on my summer project, several other interns contacted me and shared their projects, and we found overlaps. While I was working on modeling human activity using feature engineering with a goal of classifying healthy or unhealthy heartbeats, others were working on motif discovery and motif comparison.

These projects logically overlap in our ultimate motivation: classification of health signals. My project focused on extracting information from a time series, while others were reviewing the actual pattern of a time series in a pictorial sense. After realizing this overlap, we began to compare notes and share helpful resources for visualization. In my final intern presentation, I actually used a visualization application shared by a fellow intern. Our collaboration not only benefited our summer projects, but also reflected the spirit of modern tech companies, which prize teamwork and shared effort. Moreover, it points to the central theme of the internet of things: everything is connected in some way, and so should be used in tandem for the most efficient and accurate results.

**3. Prepare and ask questions.**

You know those professors who on the first day of class say, "You can never ask enough questions! There are no dumb questions"? I won't use this time to reiterate the very important habit of asking questions, but will instead add my own flavor: don't just ask questions, prepare questions. What do I mean? Exhaust your own resources before you ask for help, but don't take too long. Continually ask yourself what is confusing before asking for help. Read. Did I say that again? I can't stress it enough. Don't get me wrong, I spent countless hours in my teammates' offices this summer asking some dumb questions, and also asking some questions that took us both a week to answer. But I believe that when you come prepared with specific questions and clearly identified sources of confusion, teammates are more willing to help. During the summer I also had the opportunity to email the individuals who published the data I used for my project. In writing that email, it was very important for me to be sure of what I knew before I asked questions. It goes back to the internet of things: What do I know? What do I want to know? What resources of information can I use to learn what I want to know?

My intern experience this summer has impacted my research focus, education plans, and career path. Another amazing opportunity that grew out of my summer experience is presenting a student e-poster at the 2016 SAS Analytics Experience conference in Las Vegas. Besides being able to present my research, I am also very excited about this opportunity because I will be able to hear a talk given by Jake Porway, founder and executive director of DataKind, an organization committed to using “data for good”, along with many other interesting talks, sessions, and demos.

Having an experience at SAS (my own personal Internet of Intern Things) in the middle of my college career was perfect timing. I realized that knowing mathematics, statistics, and computer science are very important, but recognizing the overlaps and interconnectedness of these disciplines is crucial, just as in the internet of things, and as I have found, in the internet of medical things.

The post Time series machine learning techniques in healthcare appeared first on Subconscious Musings.

I am currently a graduate student intern in machine learning at SAS and also a research assistant at the Center for Advanced Self-Powered Systems of Integrated Sensors and Technologies (ASSIST) at North Carolina State University. The ASSIST Center is a National Science Foundation-sponsored Nanosystems Engineering Research Center (NERC), which means it develops and employs nanotechnology-enabled energy harvesting and storage, ultra-low power electronics, and sensors to create innovative, body-powered, wearable health monitoring systems. SAS is one of the industry partners for the ASSIST Center, and the insights on real-time data analysis from SAS have proven to be very helpful for our research. Our motivation can be explained through a simple example: suppose an individual has a pre-existing condition like asthma, where surroundings and activities could trigger an attack. In such cases, predicting respiration rate in advance could be beneficial. For example, if that person is biking, a predicted respiration rate could help them decide whether to bike for another 20 minutes or cut the ride short to stay within healthy levels. The goal is to be able to notify people about these parameters by identifying the right activities, which then become an index for predicting the physiological parameters. In my research, I address the problem of identifying activities by creating hierarchical models to learn robust parameters, which is one application of time series machine learning techniques. In the near future we will be able to use these models to predict respiration rate and heart rate.

There have been numerous studies that make use of supervised learning for activity recognition, using motion capture data and inertial measurements obtained from inertial measurement units (IMUs). An IMU is a device that measures and reports the body's linear and angular motion, and one widely available example is a smartphone. Most of these studies make use of techniques such as feature extraction, clustering, and machine learning approaches for classification. Feature extraction techniques range from using statistical moments of the data (e.g., mean, variance, kurtosis) to bag-of-words representations of poses and their temporal differences. Machine learning methods used include support vector machines (SVMs), neural networks, and probabilistic graphical models (e.g., hidden Markov models and conditional random fields). There are also approaches using semi-supervised techniques, and even unsupervised techniques that rely on clustering with user-defined similarity metrics to identify single activities. However, most of these approaches work only at a fixed scale. That is, they do not capture hierarchies in the activities, which are required to explain complex dependencies between activities. For example, a person's arm swinging can be part of a simple activity, such as walking, or a complex activity, such as dancing. A two-level hierarchy has been captured through the computation of so-called motifs that compose activities. Higher-level hierarchies may also be essential but have not been carefully studied. The aim of this research is to capture these dependencies using a computationally efficient framework that provides a robust characterization of the existing hierarchical structures.
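As a rough sketch of the statistical-moment features mentioned above, the following computes mean, variance, and excess kurtosis for one sensor window in plain Python. The toy signals and the interpretation comments are illustrative assumptions, not drawn from any particular cited study; a real pipeline would typically use a library such as SciPy:

```python
def moment_features(window):
    """Mean, population variance, and excess kurtosis of one sensor window.

    Kurtosis helps distinguish spiky signals (heavy tails, e.g. foot-strike
    impacts) from smooth oscillations, which is why it appears alongside
    mean and variance in activity-recognition feature sets.
    """
    n = len(window)
    mu = sum(window) / n
    m2 = sum((x - mu) ** 2 for x in window) / n   # second central moment
    m4 = sum((x - mu) ** 4 for x in window) / n   # fourth central moment
    kurt = m4 / (m2 ** 2) - 3.0 if m2 > 0 else 0.0  # excess kurtosis
    return mu, m2, kurt

# Two made-up windows: a steady oscillation vs. a mostly flat trace with one spike.
smooth = [0.0, 1.0, 0.0, -1.0] * 8
spiky = [0.0] * 31 + [4.0]
```

A steady oscillation has negative excess kurtosis (flat-topped distribution), while a single large spike drives it strongly positive, so the feature separates the two windows even when their variances are similar.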

Topological tools for high-dimensional data analysis have gained popularity in recent years. These techniques often focus on tracking the homology of a space: a group structure that carries information about the space's connectivity and number of holes. Techniques such as persistent homology have been used for the analysis of point cloud data, quantifying the stability of the extracted features in a computationally efficient way via the use of stability theorems. These techniques have been applied in a variety of settings, including the study of protein shapes, image analysis, and speech pattern analysis. For this research project we use topological data analysis to find robust parameters and build hierarchical graphical representations to classify activities.

Our approach builds a hierarchical representation of the data streams by comparing segments of data over various window sizes. A graphical model is extracted by first clustering the segments over a fixed window size τ and then connecting clusters with sufficient overlap across τ values. The structure of the hierarchical graphical model depends on a clustering parameter ε. We propose a new methodology for selecting robust graphical structures from this data via the use of an aggregate version of the persistence diagram. We also provide a methodology for selecting parameter values for this representation based on inference performance and power consumption considerations.
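The two steps just described — clustering fixed-width segments at each window size τ, then connecting clusters that share enough time indices across τ values — can be sketched roughly as follows. The greedy mean-based clustering rule, the overlap threshold, and the toy stream are all simplifying assumptions for illustration, not the authors' actual implementation:

```python
def segments(n, tau):
    """Non-overlapping segments of width tau over a stream of length n."""
    return [range(s, s + tau) for s in range(0, n - tau + 1, tau)]

def cluster_by_mean(stream, segs, eps):
    """Greedy 1-D clustering: segments whose means lie within eps of a
    cluster representative are merged into that cluster."""
    clusters = []  # list of (representative_mean, set_of_time_indices)
    for seg in segs:
        m = sum(stream[i] for i in seg) / len(seg)
        for k, (rep, idx) in enumerate(clusters):
            if abs(m - rep) <= eps:
                clusters[k] = (rep, idx | set(seg))
                break
        else:
            clusters.append((m, set(seg)))
    return [idx for _, idx in clusters]

def link_levels(coarse, fine, min_overlap=0.5):
    """Connect a coarse cluster to a fine cluster when enough of the fine
    cluster's time indices fall inside the coarse cluster."""
    edges = []
    for ci, c in enumerate(coarse):
        for fi, f in enumerate(fine):
            if len(c & f) / len(f) >= min_overlap:
                edges.append((ci, fi))
    return edges

# Toy stream: a low-activity stretch followed by a high-activity stretch.
stream = [0.1, 0.2, 0.1, 0.2, 2.0, 2.1, 2.0, 2.1]
fine = cluster_by_mean(stream, segments(len(stream), 2), eps=0.5)
coarse = cluster_by_mean(stream, segments(len(stream), 4), eps=0.5)
edges = link_levels(coarse, fine)
```

Varying `eps` changes how many clusters appear at each level, which is the sense in which the structure of the hierarchical graphical model depends on the clustering parameter ε.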

With our approach, we are able to report the prediction accuracy for each of the activities in our dataset (walking, bicycling, sitting, golfing, and waving). We also show how persistence diagrams can help reduce computation time and choose stable models for our hierarchical representations. Future work will involve testing this method on other datasets and comparing it with other existing algorithms.

I personally am really excited about the advantages that wearable technologies provide! They are changing individuals' lifestyles at a personalized level. Coming from a biomedical background, I always wanted to work closely with wearable devices and understand how they could help us achieve better, healthier living. Being able to apply time series machine learning techniques from my current studies in electrical engineering to health care wearables leverages my biomedical experience in exciting new ways!

I’ll be presenting this work as an e-poster during the SAS Analytics Experience conference in Las Vegas, September 12-14, 2016, so look for me there if you'd like to learn more!

*Editor’s note: Namita was one of six winners of the e-poster competition offered at the conference, which meant she won a free trip to the event, so be sure to check out her work! This past summer Namita was also a SAS Summer Fellow in Machine Learning, a highly selective program SAS offers for PhD students each year.*
