The post Model selection for spatial econometrics using PROC SPATIALREG appeared first on Subconscious Musings.

For the purpose of illustration, this post uses the same 2013 North Carolina county-level home value data that was used in the previous post. The data set is named NC_HousePrice and contains five variables: county (county name), homeValue (median value of owner-occupied housing units), income (median household income in 2013 in inflation-adjusted dollars), bachelor (percentage of people in the county who have a bachelor’s degree or higher), and crime (rate of Crime Index offenses per 100,000 people). Before you proceed with spatial econometric analysis, you need to create a spatial weights matrix. For convenience, consider a first-order contiguity matrix W where two counties are neighbors to each other if they share a common border.
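Although this post builds W in SAS, the idea of a first-order contiguity matrix is easy to sketch in a few lines of Python. The handful of county names and borders below are only an illustrative subset, not the full 100-county matrix:

```python
# Sketch of a first-order contiguity matrix: W[i][j] = 1 when counties
# i and j share a border. Only a few counties are shown for illustration.
neighbors = {
    "Wake":     ["Durham", "Johnston"],
    "Durham":   ["Wake", "Orange"],
    "Orange":   ["Durham"],
    "Johnston": ["Wake"],
}

counties = sorted(neighbors)
index = {name: i for i, name in enumerate(counties)}

n = len(counties)
W = [[0] * n for _ in range(n)]
for county, adjacent in neighbors.items():
    for other in adjacent:
        W[index[county]][index[other]] = 1  # symmetric because borders are mutual
```

Because sharing a border is mutual, the resulting matrix is symmetric with a zero diagonal, which is exactly the structure the WMAT= data set must encode.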

To fit a spatial Durbin model (SDM) to the NC_HousePrice data, you can submit the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model homeValue=Income Crime Bachelor/type=SAR;
spatialeffects Income Crime Bachelor;
test _rho=0/all;
spatialid County;
run;
```

You supply two data sets—the primary data set and a spatial weights matrix—by using the DATA= option and the WMAT= option, respectively. The primary data set contains the dependent variable, the independent variables, and possibly the spatial ID variable. In the MODEL statement, you specify the dependent variable y and regressors x. You use the TYPE= option to specify the type of model to be fit, selecting one of the following values: SAR, SEM, SMA, SARMA, SAC, and LINEAR. For example, you specify TYPE=SAR to fit a SAR model and TYPE=LINEAR to fit a linear regression model. You use the SPATIALEFFECTS statement to specify exogenous interaction effects. In the preceding statements, you specify TYPE=SAR together with the SPATIALEFFECTS statement to fit an SDM model. The TEST statement in PROC SPATIALREG enables you to perform hypothesis testing. The SPATIALREG procedure supports three different tests: likelihood ratio (LR), Wald, and Lagrange multiplier (LM). The SPATIALID statement enables you to specify a spatial ID variable to identify observations in the two data sets that are specified in the DATA= and WMAT= options.

For the SDM model fitted to the NC_HousePrice data, the value of Akaike’s information criterion (AIC) is –144.96. The results of parameter estimation from the SDM model, shown in Table 1, suggest that three predictors—income, crime, and bachelor—are all significant at the 0.05 level. The spatial correlation coefficient ρ is estimated to be 0.31 and is significant at the 0.05 level. Table 2 shows the test results for H0: ρ = 0 from the three tests. According to the test results, you can conclude that H0 should be rejected at the 5% significance level. In other words, there is a significantly positive spatial correlation in house price.
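For intuition, the likelihood ratio test that the TEST statement performs can be sketched in Python. The two log-likelihood values below are hypothetical, since the post reports only the AIC and the test conclusion, not the log likelihoods themselves:

```python
import math

# Sketch of the likelihood ratio (LR) test for H0: rho = 0. PROC SPATIALREG
# computes this for you via the TEST statement; the log likelihoods here
# are invented purely for illustration.
ll_unrestricted = -62.0   # hypothetical log likelihood of the SDM (rho free)
ll_restricted   = -68.5   # hypothetical log likelihood with rho fixed at 0

lr_stat = 2 * (ll_unrestricted - ll_restricted)   # chi-square with 1 df under H0
p_value = math.erfc(math.sqrt(lr_stat / 2))       # chi-square(1) upper-tail probability

reject_h0 = p_value < 0.05
```

The Wald and LM tests target the same hypothesis but are built from the unrestricted and restricted fits, respectively; with a large enough sample all three usually agree, as they do here.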

To fit a spatial error model (SEM) to the NC_HousePrice data, you can submit the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model homeValue=Income Crime Bachelor/type=SEM;
spatialid County;
run;
```

The results of parameter estimation from the SEM model, shown in Table 3, suggest that two out of three predictors—income and bachelor—are significant at the 0.05 level. The value of AIC for this model is –122.68. The spatial correlation coefficient λ is estimated to be 0.60 and is significant at the 0.05 level. As a result, there seems to be a significant positive correlation in the disturbance.

So far, SDM and SEM models have been fitted to the NC_HousePrice data. The SDM model is capable of accounting for both endogenous and exogenous interaction effects, whereas the SEM model can account for spatial dependence in the disturbance term. A comparison of AIC values between these two models suggests that the SDM model is better than the SEM model because it has the smaller AIC value. However, you might want to try a more complicated model—such as a spatial autoregressive confused (SAC) model—that can address endogenous interaction effects, exogenous interaction effects, and spatial dependence in the disturbance term.

You can fit an SAC model to the NC_HousePrice data by submitting the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model homeValue=Income Crime Bachelor/type=SAC;
spatialid County;
run;
```

Table 4 shows the results of parameter estimation from the SAC model. As in the SEM model, only two predictors—income and bachelor—are significant at the 0.05 level. The value of AIC for this model is –121.43. The spatial correlation coefficient ρ is estimated to be –0.10, but it is not significant at the 0.05 level. However, the spatial correlation coefficient λ is estimated to be 0.67 and is significant at the 0.05 level.
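Taken together, the three reported AIC values can be compared directly. As a minimal Python sketch (the AIC values are those reported above):

```python
# AIC-based model selection over the three fitted models; smaller AIC
# indicates a better trade-off between goodness of fit and complexity.
aic = {"SDM": -144.96, "SEM": -122.68, "SAC": -121.43}

best_model = min(aic, key=aic.get)   # model with the smallest AIC
```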

Among the three models that have been considered for the NC_HousePrice data, the SDM model is the one with the smallest value of AIC. As a result, if you have to choose the best model among the SDM, SEM, and SAC models according to AIC, the SDM model would be the winner. Because model selection is a common task in data analysis, PROC SPATIALREG facilitates it by enabling you to use multiple MODEL statements to fit more than one model at a time. For example, you can fit the preceding three models to the NC_HousePrice data by submitting the following statements:

```
proc spatialreg data=NC_HousePrice Wmat=W;
model homeValue=Income Crime Bachelor/type=SAR;
spatialeffects Income Crime Bachelor;
test _rho=0/all;
model homeValue=Income Crime Bachelor/type=SEM;
model homeValue=Income Crime Bachelor/type=SAC;
spatialid County;
run;
```

This post introduces how to fit spatial econometric models that are available in the SPATIALREG procedure. These models include the spatial Durbin model (SDM), spatial error model (SEM), and spatial autoregressive confused (SAC) model. It also describes how you can use multiple MODEL statements in only one call to PROC SPATIALREG to facilitate model selection. In the next blog post, we’ll talk more about how to create spatial weights matrices for spatial econometric analysis. We will also be giving a talk, "Big Value from Big Data: SAS/ETS® Software for Spatial Econometric Modeling in the Era of Big Data," at the SAS Global Forum conference April 2-5, 2017, in Orlando. Stop by and let's talk about spatial econometrics!


The post Women in analytics - a personal perspective appeared first on Subconscious Musings.

**Getting an early start in Analytics**

How do you encourage a young girl to pursue a career that requires mathematical or scientific skills? How do you react to your child’s interest in mathematics? Do you react differently depending on whether the child is a boy or a girl? Why? Encourage him or her to pursue that interest, and ensure that the school is emphasizing that message as well. From my childhood, my parents (especially my dad, who started his career as a math professor) were very proud that I was good at math – no one ever commented on the fact that I was also a girl! When I started high school and ended up being the only girl in the math class, we took it in stride and did not make a big deal of it – Radhika likes math and she is in the math class. That was it!

One way to interest young girls in math-related fields is by providing examples of real-world applications so they can understand the value. In today’s world, every field is rich with applications of mathematical modeling, so it is easier to capture students’ interest. Another avenue is to encourage your child to participate in competitions, especially ones involving teams. Often, girls like working in teams, which can also be good training for their professional careers later in life.

And, of course, it is very important for the young girls to have role models of women in analytics who provide examples of successful careers in these fields.

**Personal experience**

I had a unique experience during my Master’s program in Mathematics at the Indian Institute of Technology in Delhi. It was amazing that in the year 1975, my class had 7 women out of the small class of 14 students! Several of us women have gone on to have very successful professional careers in highly technical fields – in fact three of us are now in the U.S. Likewise, my PhD class in Operations Research at Cornell had five women out of a total of about a dozen students. With that kind of an experience of women in analytics, I have never noticed that there may be more men than women in my field. We are all just members of the analytics community!

**Career Growth**

Women are hesitant to talk about their strengths and proactively seek promotions. Be bolder in seeking out new opportunities. Of course, you need to be qualified for the role! Be confident of your strengths. If you have a strong foundation and are the expert in your area, your being a woman in analytics is irrelevant. People will recognize you as the expert immediately. Be bold and take a seat at the table. If you have been invited to participate, it is because you are recognized as someone who can contribute – use that opportunity to do so.

**Women in Analytics professions**

Within SAS Institute, we have several women in analytics at all levels – from individual contributors who are recognized throughout the institute as “the expert in a particular area” to senior managers who are responsible for key flagship analytical products from SAS. Several of them play important roles in leading professional organizations in addition to their responsibility at SAS. As a percentage of the workforce, we may have fewer women than men in the industry as a whole. However, there are several women in leading roles in analytics, both as leaders in major multinational companies (IBM, Ford, Verizon) and at the helm of professional organizations like the Institute for Operations Research and Management Science (INFORMS), the American Statistical Association, etc.

Especially in the last few years, women in analytics have made great strides in leading large organizations. For example, Dr. Pooja Dewan was named Chief Data Scientist at BNSF, Malene Haxholdt is VP of Enterprise Analytics at MetLife, Dr. Nipa Basu is Chief Analytics Officer at Dun & Bradstreet, and so on.

There is a geographic differential in the representation of women in technology. Some of the research from the UK indicates that women in India seem to see IT/STEM as empowering in a way that women in the UK or US do not. There is also research suggesting that while women may not be doing well in terms of increased numbers in computer science or math, the discipline of statistics is an exception: “More than 40 percent of degrees in statistics go to women, and they make up 40 percent of the statistics department faculty poised to move into tenured positions.”

**Increasing interest in analytics for women**

As analytics and data science become more ubiquitous in several industries, we are seeing an increase in the number of women in Analytics. There are a few key reasons for this increase:

- There are many application areas for analytics that are attractive to women: health care, education, non-profit work, and projects aimed at doing social good. There is ample evidence that women are more drawn to opportunities to make a social impact.
- More companies are providing flexibility that helps women get back or stay in the work force. These include benefits like flexible work schedules, more “work from home” options, family medical leave options, more options for day care, support for nursing mothers, and so on.
- Many educational opportunities are available for women to get trained in data science and analytics arenas through online master’s programs or certification programs. For women who have a STEM-related undergraduate education, this provides an easy entry into the analytics domain. There is an interesting article describing how data science is creating opportunities for women. It is exciting to see the many women leaders who participated in the recent Women in Data Science Conference where the elite panel of speakers were all women!
- Many collaborative, team competitions (for example, the DataDive at SAS being held in partnership with DataKind) are being arranged across the data science and analytics domain – such collaborative, problem solving events may be particularly appealing to women.

**Challenges and opportunities being a woman in this highly competitive field**

It is well recognized that there is a confidence issue that plagues women in fields that are dominated by men. A Harvard study presented the result: “Female computer science concentrators with eight years of programming experience report being as confident in their skills as their male peers with zero to one year of programming experience.” I can relate to that! Most women I know will speak up only if they are confident that their points are thoroughly researched and vetted. We need to encourage them to participate freely in any dialog and discussion. Women need to be reminded that they have been invited to be in the group / discussion because their opinions are valued.

At the same time, women have some inherent strengths that they bring to the table. We have an ability to understand others’ points of view, which makes it possible to have productive discussions over contentious topics. We also have a capacity to nurture, which is useful in growing a team of very talented individuals who are often brilliant but may not be used to working as part of a high-achieving team. One of the most important skills for a successful leader is the ability to see the big picture (remember the tale of the “Six Blind Men and the Elephant”?). I believe that women are more likely to understand the big picture because of their natural empathy for others’ inputs.

My advice to young women entering the field of analytics is this: do not hold yourself back because you are a woman; you have earned the right to be in this area; use your strengths to build a strong team by nurturing everyone’s talents. This is a golden age to be part of this domain!

*Image credit: photo by Mike Kline // attribution by creative commons*

Note: I prepared these thoughts related to an interview in Analytics India Magazine: International Women's Day special.


The post How Santa’s Workshop uses social network analysis appeared first on Subconscious Musings.

I recently met Mrs. Claus at the INFORMS Annual Meeting, where we got to talking about the social network analysis session she’d just attended. It turns out Mrs. Claus and I are both fans of a book by Alex Pentland, Social Physics: How Social Networks Can Make Us Smarter. Apparently years ago she had foreseen the trend toward analytics and returned to school for dual PhDs in Computer Science and Statistics at Stanford. She now carries the title of Chief Data Scientist of Santa’s Workshop. Who knew? We chatted about the many ways she and her team at Santa’s Workshop use social network analysis, some of which were commonly employed but others were surprising adaptations.

Santa’s Workshop first started using social network analysis to uncover fraud. While naughty children exist (which requires predicting coal delivery, but that’s another post), the perpetrators Santa is after are adults. I bet you didn’t know that some households pretend to have young children by leaving out notes for Santa, even if they have no children at all or their children are grown and have left home. The challenge with finding fraudsters is not in seeing a pattern, because fraud is by definition a rare event, but in making meaningful connections between disparate data activities that may help spot fraud.

Social network analysis allows investigators to look at lots of data from multiple sources at the level of a network, where they can see different people (nodes) and their relationships (ties) in the form of a graph. The connections between people may not exist at the transactional level but jump out when viewed graphically in network form. Just like how the Los Angeles County Department of Public Social Services uses social network analysis to quickly identify potential co-conspirators in fraud rings, Santa’s Workshop uses social network analysis to find the bad guys. Mrs. Claus has uncovered several fraud rings using this approach and stopped delivery of presents to those homes.
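The post doesn't describe how Santa's team actually implements this, but the core idea (treat shared attributes as ties, then flag connected clusters for closer review) can be sketched with a toy graph in Python. Every household name and link below is invented:

```python
from collections import defaultdict

# Toy sketch: households are nodes, and shared attributes (an address,
# a phone number) become ties. Connected clusters of households become
# candidate fraud rings for an investigator to review.
ties = [("house_A", "house_B"), ("house_B", "house_C"), ("house_D", "house_E")]

graph = defaultdict(set)
for a, b in ties:
    graph[a].add(b)
    graph[b].add(a)

def connected_component(start):
    """Collect every node reachable from `start` by following ties."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node] - seen)
    return seen

ring = connected_component("house_A")   # cluster containing house_A
```

Real systems layer scoring and investigation workflow on top of this, but connected clusters of otherwise unrelated records are where the network view pays off.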

Telecom companies are not the only ones to worry about churn. Santa’s Workshop also has to worry about this problem, which arises when children stop believing in Santa Claus and cancel their “Santa service” prematurely. You may think this transition of beliefs is an isolated event that is just a natural function of age, but Mrs. Claus uses the SAS® Enterprise Miner™ Link Analysis node (link analysis is a popular form of social network analysis) to uncover notable connections among parents, siblings, and schools that suggest the possibility of churn. They look at cell phone call records and social media connections to understand relationships, and then use targeted interventions to offer parents tools to allow children to continue to believe.

Drawing on Pentland’s research they have applied careful use of social network incentives to encourage older children not to tell their younger siblings or other children in the neighborhood. Pentland’s lab has found that this kind of positive social pressure is best applied to people in the target’s network rather than directly to the target. For Santa this means offering incentives to the older, respected friends of children at risk of “explaining” Santa to their younger siblings rather than to the older siblings themselves. Pentland’s research shows that this kind of nudge is far more effective than standard economic incentives, because it recognizes that we are social actors strongly influenced by our social ties. Mrs. Claus and team have also found that children at risk of churning held onto their beliefs longer when they received special mail messages from Santa himself.

Using social network analysis to detect fraud and churn are common use cases, but what really intrigued me was how Santa’s Workshop uses social network analysis to generate ideas for presents by improving idea flow among the elves. You may have thought that Santa is only an order taker, but consider the challenge he faces when a child asks for something that simply is infeasible (pony requests in New York City, for example). So each year he must conceive of, design, and produce items that may not have been requested but are likely to please. Product Manager Elves are responsible for finding ideas for these gifts and delivering product specifications to the Manufacturing Elves.

Research shows that the best way to stimulate idea flow is to increase both exploration and engagement. Early in the year, Product Manager Elves travel around (incognito, of course) to hunt for ideas. The Product Manager Elves most consistently successful at generating creative product ideas are those Pentland labels explorers. You know this kind of person – they are the ones who know lots of different kinds of people, love talking over ideas with them, and then share the ideas they’ve just gathered in subsequent conversations. As Pentland describes, their focus is not “the ‘best’ people or ‘best’ ideas … but people with different views and different ideas.” They then filter the best ideas by learning which ones generate the most traction in their subsequent conversations with others.

The other key to idea flow is engagement, which is when new ideas are shared among teams. The best ideas the Product Manager Elves discover go nowhere if they aren’t adopted and championed by other Elves. So, drawing upon an example in Pentland’s book about improved idea flow at a call center, Mrs. Claus scheduled a common lunch hour, so everyone breaks at the same time to eat. Previously they didn’t want to bring down the line, so lunches were staggered, but they’ve learned that when all those Elves from different departments circulate at lunch, they share ideas. The ideas that stick are the ones whole teams get excited about; the teams start contributing to design specs and come to see the ideas as belonging to Santa’s Workshop, not just to the Product Manager Elf who discovered them initially. Plus, this seems to have helped with Elf retention, because all of them feel part of the entire process.

What does Mrs. Claus have on her 2017 data science horizon? She’s been exploring the use of sociometric badges that Pentland first employed in his research. Commonly known as sociometers, these are small electronic wearable devices that collect data on people’s interactions (face-to-face time, conversation, gestures, physical proximity, etc.). Pentland’s devices are the size of thick badges, but Santa’s Workshop has developed tiny ones they can surreptitiously place on toys to track similar behavior, with the added element of serving as gateways so they can analyze the data in real time. She hopes to make tweaks to gift-giving in 2017 as Santa travels around the world, drawing upon the initial reactions of children to new gifts.

I was glad to meet Mrs. Claus, another fan of Pentland’s book, Social Physics, because it is full of interesting ideas. I’m clearly an explorer, because when I read a book like this I want to discuss it with other people, hear their reactions, and learn new things. Later chapters talk about how the concepts of social physics can lead to smarter cities and even smarter societies. To ensure that the kind of data collected is used ethically, Pentland even proposes a New Deal on Data. I'm encouraged by research like this that can be applied as part of the growing #data4good movement. So lots of good stuff here! If any of you have read this book, please chime in and let me know what caught your attention.

*Mrs. Claus image credit: photo by Public Information Office // attribution by creative commons*

*Santa's Workshop image credit: photo by Loozrboy // attribution by creative commons*


The post 4 tips to ensure your data4good efforts have an impact appeared first on Subconscious Musings.

Two weeks ago I heard two very interesting talks on data4good at the INFORMS Annual Meeting, where 5,000+ people focused on operations research gathered together. The first was on “Challenges and Lessons Learned from Influencing Policy Change in Organ Transplantation.” As you can see from the photo I took, this session combined quite a distinguished group of operations research (OR) academics and transplant surgeons who both want to make an impact. For many reasons, a pure market-based approach is not the best way to allocate organs for transplant, so the process is governed by policy makers, who have divided the country into regions. All parties agree that the current regional system results in disparities in access and is broken, but policy makers have been unable to settle on a solution.

Because this situation is a classic market-matching problem, it has drawn the attention of the operations research field.* Over the years the academics on the panel had proposed a variety of mathematical solutions. But the most elegant mathematical model for a real-world problem adds little value if it is not implemented. So why haven’t they solved the problem? Part of it is as simple and frustrating as problem definition. Are they trying to help those who are the sickest or those most likely to succeed with a transplant? Is it fair that some regions have shorter waiting lists because more organs are available due to increased deaths? Agreeing on the problem definition is tough, and as the clinicians explained, they “argue a lot, because life matters.”

The other session that triggered my thinking was a tutorial on healthcare analytics by Joris van de Klundert of the Erasmus University Institute of Health Policy and Management, who gave a challenge to the OR professionals in attendance. Healthcare analytics is a critical area for the world’s population but one where his fellow researchers in OR are making only a modest contribution. A big part of the problem is too much emphasis on research and too little on results that actually improve healthcare. His literature review of articles on healthcare analytics at various stages of the analytical life cycle highlighted this fact. The vast majority of research is in model building, with fewer and fewer articles published as you move along the cycle to solution development, then model implementation, and finally evaluation and monitoring. There was a lively discussion among the audience of the challenges, which include difficulty as academics getting involved in practice, the conflict between the simple models most often needed and those that will result in publication, and the risk tenure-track faculty face doing work that may not result in the right kind of publications.

After listening to these data4good talks, I propose these tips to ensure your application of analytics has a real impact:

1. **Take time to listen to your “customer.”** Even in the social sector realm, you still have “customers,” who are the people or groups for whom you are trying to solve the problem. The transplant surgeons emphasized that it takes a lot of time to build relationships between clinicians and what they called “engineers,” in part because of the big gap that can exist between what these two groups value. Be sure to explain your results, to increase the credibility of your model. As the San Bernardino County Department of Behavioral Health found, discussing their analysis with many of their partners in care helped them align on the goals they shared.
2. **Build models that match the problem as well as the solution.** This means ensuring that you have defined the problem correctly, which, as I indicated with the thorny organ transplant discussion, may be far more than half the battle. This also relates back to listening to your “customer.” SAS is working with DataKind and the Boston Public Schools to optimize their bus routes, and as we do so we have to periodically check in to see if the initial models we propose would make sense in practice. People who know math must talk to people who know buses to know if the model will work.
3. **Focus on putting models into practice.** Modeling the problem is important, but as van de Klundert’s literature review shows, the OR community has plenty of success in this area. The challenge is working closely enough (see items 1 and 2 above) with your “customer” to find a path to implementation. After all, as one of the academics said, “our endless models don’t necessarily provide the details practitioners want.” So find the details they do want, put them into your model, and work with them to put that model into practice. As Jake Porway, founder of DataKind, blogged: “we cannot make change with technology or data alone.”
4. **Remember Occam’s razor: the simplest solution is often the best.** For all your interest in trying out the latest non-negative matrix factorization model, a simple logistic regression is often hard to beat. And it will be far more interpretable to most non-analytics professionals.

Today’s #GivingTuesday celebrates giving of all forms, and the social sector could benefit so much from the talent of data scientists (in fact, what sector wouldn’t benefit?). But as Jake Porway likes to say, “you can’t just hack your way to social change.” You must consider impact from the start for your data4good efforts to succeed. After all, who wants to give their time and talent without it making a difference?

* Alvin Roth shared the Nobel Prize for his work in this area, "for the theory of stable allocations and the practice of market design," and while he is a professor of economics at Stanford his PhD is in operations research.


The post Local Search Optimization for HyperParameter Tuning appeared first on Subconscious Musings.

Once a TV is calibrated, it is ready to enjoy. The visual data, the broadcasted information, can be observed, processed, and understood in real time. When it comes to data analytics, however, with raw data in the form of numbers, text, images, etc., gathered from sensors and online transactions, ‘seeing’ the information contained within, as the source grows rapidly, is not so easy. Machine learning is a form of self-calibration of predictive models given training data. These modeling algorithms are commonly used to find hidden value in big data. Facilitating effective decision making requires the transformation of relevant data to high-quality descriptive and predictive models. The transformation presents several challenges, however. As an example, take a neural network (Figure 1). A set of outputs are predicted by transforming a set of inputs through a series of hidden layers defined by activation functions linked with weights. *How do we determine the activation functions and the weights that produce the best model configuration?* This is a complex optimization problem.

The goal in this model training optimization problem is to find the weights that will minimize the error in model predictions given the training data, validation data, specified model configuration (number of hidden layers, number of neurons in each hidden layer) and regularization levels designed to reduce overfitting to training data. One recently popular approach to solving for the weights in this optimization problem is through use of a *stochastic gradient descent* (SGD) algorithm. The performance of this algorithm, as with all optimization algorithms, depends on a number of control parameters, for which no single set of default values is best for all problems. SGD parameters include, among others, a *learning rate* controlling the step size for selecting new weights, a *momentum* parameter to avoid slow oscillations, a *mini-batch* size for sampling a subset of observations in a distributed environment, and *adaptive decay rate* and *annealing rate* to adjust the learning rate for each weight and time. See the related blog post ‘Optimization for machine learning and monster trucks’ for more on the benefits and challenges of SGD for machine learning.
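To make the learning rate and momentum parameters concrete, here is a minimal Python sketch of an SGD-with-momentum update on a one-dimensional quadratic loss. The loss function and parameter values are illustrative only, not drawn from any SAS procedure:

```python
# Minimal sketch of the SGD-with-momentum update rule, applied to the
# toy loss f(w) = (w - 3)^2. In real training, w is a vector of network
# weights and the gradient comes from a mini-batch of observations.
def grad(w):
    return 2.0 * (w - 3.0)          # derivative of (w - 3)^2

learning_rate, momentum = 0.1, 0.9  # illustrative hyperparameter values
w, velocity = 0.0, 0.0
for _ in range(200):
    # momentum accumulates past gradients so the iterate keeps moving in a
    # consistent direction instead of oscillating from side to side
    velocity = momentum * velocity - learning_rate * grad(w)
    w += velocity
```

With these settings the iterate overshoots the minimum at w = 3 and spirals in; raise the momentum much further and the oscillations decay far more slowly, which is exactly the tuning sensitivity described above.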

The best values of the control parameters must be chosen very carefully. For example, the momentum parameter dictates whether the algorithm oscillates slowly in the ravines where solutions lie, jumping back and forth across a ravine, or dives into them quickly. But if momentum is too high, it could jump past the solution (Figure 2). The best values for these parameters also vary for different data sets, just like the ideal adjustments for an HDTV depend on the characteristics of its environment. These options that must be chosen before model training begins dictate not only the performance of the training process, but more importantly, the quality of the resulting model – again like the tuning parameters of a modern HDTV controlling the picture quality. Because these parameters are external to the training process – not the model parameters (weights in the neural network) being optimized during training – they are often called *hyperparameters*. Settings for these hyperparameters can significantly influence the resulting accuracy of the predictive models, and there are no clear defaults that work well for different data sets.

In addition to the optimization options already discussed for the SGD algorithm, the machine learning algorithms themselves have many hyperparameters. Following the neural net example, the number of hidden layers, the number of neurons in each hidden layer, the distribution used for the initial weights, etc., are all hyperparameters specified up front for model training that govern the quality of the resulting model.

The approach to finding the ideal values for hyperparameters (tuning a model to a given data set) has traditionally been a manual effort. Even with expertise in machine learning algorithms and their parameters, however, the best settings change with different data and are difficult to predict from previous experience. To explore alternative configurations, a grid search or parameter sweep is typically performed. But a grid search is often too coarse: because its expense grows exponentially with the number of parameters and the number of discrete levels of each, a grid search will often fail to identify an improved model configuration. More recently, random search has been recommended. For the same number of samples, a random search samples the space better, but it can still miss good hyperparameter values and combinations, depending on the size and uniformity of the sample. A better approach is a random Latin hypercube sample. In this case, samples are exactly uniform across each hyperparameter but random in combinations. This approach is more likely to find good values of each hyperparameter, which can then be used to identify good combinations (Figure 3).
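As an illustrative sketch (this is not the sampling routine used by any particular SAS tool, and the data set and variable names are hypothetical), a random Latin hypercube sample over two hyperparameters can be generated in a DATA step by independently shuffling the strata for each hyperparameter and then drawing one random point inside each assigned stratum:

```
/* Hypothetical sketch: a 10-point random Latin hypercube sample
   over two hyperparameters, each scaled to (0, 1) */
data lhs;
   call streaminit(2016);
   array p1[10]; array p2[10];
   /* start with strata 1..10 for each hyperparameter */
   do i = 1 to 10;
      p1[i] = i; p2[i] = i;
   end;
   /* Fisher-Yates shuffle of each stratum ordering */
   do i = 10 to 2 by -1;
      j = ceil(i * rand('uniform'));
      t = p1[i]; p1[i] = p1[j]; p1[j] = t;
      j = ceil(i * rand('uniform'));
      t = p2[i]; p2[i] = p2[j]; p2[j] = t;
   end;
   /* one random point inside each assigned stratum */
   do i = 1 to 10;
      learningRate = (p1[i] - rand('uniform')) / 10;
      momentum     = (p2[i] - rand('uniform')) / 10;
      output;
   end;
   keep learningRate momentum;
run;
```

Each hyperparameter is sampled exactly once per stratum, so marginal coverage is uniform, while the combinations remain random – the property that distinguishes a Latin hypercube sample from a plain random search.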

True hyperparameter optimization, however, should allow searching between these discrete samples to find good *combinations* of hyperparameter values, because a discrete sample is unlikely to land on even a local accuracy peak or error valley in the hyperparameter space. And because model training and scoring are a complex black box to the tuning algorithm, machine learning algorithms create a challenging class of optimization problems:

- Machine learning algorithms typically include not only continuous, but also categorical and integer variables. These variables can lead to very discrete changes in the objective.
- In some cases, the space is discontinuous where the objective blows up.
- The space can also be very noisy and non-deterministic. This can happen when distributed data is moved around due to unexpected rebalancing.
- Objective evaluations can fail due to grid node failure, which can derail a search strategy.
- Often the space contains many flat regions – many configurations give very similar models.

An additional challenge is the unpredictable computation expense of training and validating predictive models with changing hyperparameter values. Adding hidden layers and neurons to a neural network can significantly increase the training and validation time, resulting in a wide range of potential objective expense. A very flexible and efficient search strategy is needed.

SAS Local Search Optimization, part of the SAS/OR® offering, is a hybrid derivative-free optimization strategy that operates in a parallel/distributed environment to overcome the challenges and expense of hyperparameter optimization. It comprises an extendable suite of search methods driven by a hybrid solver manager that controls the concurrent execution of the search methods. Objective evaluations (different model configurations in this case) are distributed across multiple evaluation worker nodes in a grid implementation and coordinated in a feedback loop that supplies data from all concurrently running search methods. The strengths of this approach include handling of continuous, integer, and categorical variables; handling of nonsmooth, discontinuous spaces; and ease of parallelizing the search strategy. Multi-level parallelism is critical for hyperparameter tuning. For very large data sets, distributed training is necessary, and even with distributed training the expense severely restricts the number of configurations that can be evaluated when tuning sequentially. For small data sets, cross validation is typically recommended for model validation, a process that also increases the tuning expense. Parallel training (distributed data and/or parallel cross-validation folds) and parallel tuning can be managed – very carefully – in a parallel/threaded/distributed environment. This is rarely discussed in the literature or implemented in practice; typically either ‘data parallel’ training or ‘model parallel’ (parallel tuning) is exercised.

Optimization for hyperparameter tuning can typically lead quite quickly to a several-percent reduction in model error over default parameter settings. More advanced and extensive optimization, facilitated through parallel tuning to explore more configurations, can refine the parameter values and improve accuracy further. The neural net example discussed here is not the only machine learning algorithm that can benefit from tuning: the *depth* and *number of bins* of a decision tree, the *number of trees* and *number of variables to split on* in a random forest or gradient boosted trees, the *kernel parameters* and *regularization* in SVM, and many more can all benefit from tuning. The more parameters that are tuned, the larger the dimension of the hyperparameter space, the more difficult a manual tuning process becomes, and the coarser a grid search becomes. An automated, parallelized search strategy can also benefit novice machine learning users.

Machine learning hyperparameter optimization is the topic of a talk that Funda Günes and I will present at The Machine Learning Conference (MLconf) in Atlanta on September 23. The talk, "Local Search Optimization for Hyperparameter Tuning," includes more details on the approach, parallel training and tuning, and tuning results.

*image credit: photo by kelly // attribution by creative commons*

The post Local Search Optimization for HyperParameter Tuning appeared first on Subconscious Musings.

Who says machine learning can't be fun? A crew of us from SAS went to San Francisco for the recent KDD conference, which bills itself as "a premier interdisciplinary conference, [which] brings together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data." We brought these buttons with us, and they were a huge hit!

But we weren't at KDD just to have fun, of course. We came to learn and share, in our booth and in many other ways. Simran Bagga came to talk about all things text analytics, and she was nice enough to pitch in and help me set up the booth. Naturally, her favorite button was "I'm Feeling Unstructured Today." She gave two extended demos in the booth: "Combining Structured and Unstructured Data for Predictive Modeling Using SAS® Text Miner" and "Topic Identification and Document Categorizing Using SAS® Contextual Analysis."

Wayne Thompson served as a senior editor on the Review Board, which means he oversaw a group of volunteers who had the hard task of reviewing and making selections from the many excellent papers submitted for the Applied Data Science track. He was also a panelist in a "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data." His favorite button was "Talk Data to Me," which he did during his panel, "Internet of Things, Industrial Internet, and Instrumented Environments: the Furious Need for Standards." He also gave an extended demo in the SAS booth on "Machine Learning on the Go."

"Can Tools Effectively Unleash the Power of Big Data?" Udo Sglavo thinks so, and he said as much in this panel he was part of in the Applied Data Science Invited Talks track. As someone who has been involved in data mining for many years, Udo's favorite button was "I Support Vector Machines." This button was popular, because it was also Wei Xiao's favorite. He was busy attending many sessions, but he did give his own extended demo in the booth on "A Probabilistic Machine Learning Approach to Detect Industrial Plant Fault."

Susan Haller, who leads teams responsible for data mining and machine learning at SAS, had a different favorite button: "How Random are Your Forests?" The favorite of Ray Wright, on Susan's team, was "You Can Engineer My Features." Ray is interested in automation, too, which was the subject of his extended demo: "Modeling Automation With SAS® Enterprise Miner™ and SAS® Factory Miner." But Ray also focused on basketball, presenting a poster in the Large Scale Sports Analytics Workshop on "Shot Recommender System for NBA Coaches," which he co-authored with Ilknur Kaynar Kabul and Jorge Silva. Jorge didn't attend the conference, but Ilknur did, and her favorite button was "I'm Having a Cold Start Today." However, Ilknur was not having a cold start when she presented her extended demo: "Auto-Tuning Your Decision Tree, Random Forest and Neural Networks Models." Another member of Susan's team, Patrick Hall, spent a lot of time in the SAS booth, where he was great at answering all kinds of questions. However, he couldn't decide on his favorite button, because it was a tie between "I'm Feeling Unstructured Today" and "I Support Vector Machines." Patrick answered a lot of questions on options for integrating open source software with SAS, and this was the topic of his extended demo: "Options for Open Source Integration in SAS® Enterprise Miner™." Taiping He, also on Susan's team, liked "I'm Feeling Unstructured Today," which may be a surprise, because his extended demo was "Distributed Support Vector Machines in SAS® Viya™ System." Guess who develops our SVM procedure in SAS Enterprise Miner?

KDD has a nice balance of practitioners and academics in attendance, so we were glad to interact with both groups. We met many students and professors in the booth, and Scott MacConnell was on hand from our Academic Outreach and Collaborations group to talk about all the great free resources SAS has to offer academics. Scott's favorite button was "I'm Feeling Unstructured Today."

We made time for fun, too, and one night many of us ate dinner together at a restaurant called The Stinking Rose, which calls itself "A garlic restaurant." They had fun murals on the wall showing garlic in all kinds of ways you never even dreamed of! I had the Forty Clove Garlic Chicken, and even though I didn't eat anywhere near the number of cloves they provided, I do hope my choice didn't depress traffic in the booth. The food was delicious! And my favorite button? "My Networks Run Deep."

The post Machine learning fun at KDD appeared first on Subconscious Musings.

This past summer I used data from cell phones attached at the waist to predict the activity of the owner, which is an exciting application of the internet of medical things. There are a number of immediate applications of this research: contextualizing electrocardiogram signals, improving exercise analysis, and assisting in the care of the elderly. As an intern, my first assignment was simply to replicate the results from an existing activity recognition paper, using SAS/IML® to extract features from a time series and SAS® Enterprise Miner™ to produce an accurate model. As I mentioned earlier, I started my summer knowing how to program in a few languages, including SAS, but I didn’t know what a time series of data was, or how to program in IML, and I knew absolutely nothing about how to use a neural network model.

My first obstacle for my summer project on the internet of medical things was figuring out how I learn best. With respect to SAS Enterprise Miner, at first I spent time going through the documentation to get a feel for the different nodes and their respective settings and options. This was helpful to a point, but what I discovered was that I learned best when I tested different options and examined results. I found this to be true in other parts of my research; when I spent time plotting the time series data, using different graph types, styles, and filters, I was able to understand my data at a deeper level. When extracting features from a time series, it is important to extract intuitive and meaningful features that capture a characteristic of the time series that would be evident if you looked at it in its entirety. This is almost impossible without spending some time examining the data. I think this is a common trend in our new age of data science and analytics; it’s not about what you think the data should say, but about what the data are actually saying.

Throughout this project I observed some characteristics of the internet of medical things, but I also learned what I call the Internet of Intern Things.

**1. Read, read, read, and read some more.**

On my second day at SAS, my manager and another team member met me for lunch and we discussed some possible projects for my summer experience. I admitted that I didn’t have a particular research interest, but I was willing to try anything. My team member suggested a project and later sent me a folder of recent publications related to the topic. I dutifully saved the folder to my computer and printed the PDFs that seemed the most intuitive and easiest to understand. I read the documents, highlighted what I thought was important, and launched into the project. That was my first mistake. For example, I began searching for a filtering mechanism, because that’s what the project required, but I didn’t intuitively understand my data and its form enough to be able to explain why I needed a filtering mechanism. Days later my mentor asked me some questions about the data, just to be sure we were on the same page, and I realized I wasn’t sure about my answers. Not only was I embarrassed, but I was worried that the time I had spent on my project was wasted. Of course as an intern, no time spent failing is wasted, because failures are learning experiences, but I was nonetheless disappointed. Fortunately for me, my mentor was very understanding. From this experience I learned that taking time to contextualize your data at the beginning of a project is not only helpful, but necessary. Moreover, prioritizing reading papers for research, and papers or articles suggested by colleagues is helpful. Several times throughout the summer, my mentors, with much more experience than me, recommended well-known papers or recent articles that were relevant to my field of study and interests. I learned that it is valuable to take time each week, if not each day, to read a short paper or several articles to stay up to date and informed. 
As an intern, reading helped me to understand the “buzz words” of my field, like “data science” and “machine learning,” and gave me talking points when I met with my colleagues for lunch. I know, that sounds a little over the top. Like really, “talking points” for lunch? But as an intern it is important to set yourself apart, and being well-read helps.

**2. Collaboration is not only helpful, but imperative.**

If I were to summarize my summer experience at SAS into one word, it would be “collaboration.” Collaboration was crucial to my summer project, and to navigating such a large organization as an intern. After giving my first presentation of my preliminary work on my summer project, several other interns contacted me and shared their projects, and we found overlaps. While I was working on modeling human activity using feature engineering with a goal of classifying healthy or unhealthy heartbeats, others were working on motif discovery and motif comparison.

These projects logically overlap in our ultimate motivation: classification of health signals. My project was focused on extracting information from a time series, while others were reviewing the actual pattern of a time series in a pictorial sense. After realizing this overlap, we began to compare notes and share helpful resources for visualization. In my final intern presentation, I actually used a visualization application shared by a fellow intern. Our collaboration not only benefited our summer projects; it also reflected the collaborative atmosphere of modern tech companies, which value teamwork and shared effort. Moreover, it points to the central theme of the internet of things: everything is connected in some way, and thus should be used in tandem for the most efficient and accurate results.

**3. Prepare and ask questions.**

You know those professors who on the first day of class say “You can never ask enough questions! There are no dumb questions.”? I won’t use this time to reiterate the very important action of asking questions, but instead will add my own flavor. Don’t just ask questions, prepare questions. What do I mean? Exhaust your own resources before you ask for help, but don’t take too long. Continually ask yourself what is confusing before asking for help. Read. Did I say that again? I can’t stress it enough. Don’t get me wrong, I spent countless hours in my teammates’ offices this summer asking some dumb questions, and also asking some questions that took us both a week to answer. But, I believe when you come prepared with specific questions and sources of confusion that teammates are more willing to help and answer questions. During the summer I also had the opportunity to email individuals who published the data I used for my project. In writing that email, it was very important for me to be sure of what I knew before I asked questions. It goes back to the internet of things: What do I know? What do I want to know? What resources of information can I use to learn what I want to know?

My intern experience this summer has impacted my research focus, education plans, and career path. Another amazing opportunity that grew out of my summer experience is presenting a student e-poster at the 2016 SAS Analytics Experience conference in Las Vegas. Besides being able to present my research, I am also very excited about this opportunity because I will be able to hear a talk given by Jake Porway, founder and executive director of DataKind, an organization committed to using “data for good”, along with many other interesting talks, sessions, and demos.

Having an experience at SAS (my own personal Internet of Intern Things) in the middle of my college career was perfect timing. I realized that knowing mathematics, statistics, and computer science are very important, but recognizing the overlaps and interconnectedness of these disciplines is crucial, just as in the internet of things, and as I have found, in the internet of medical things.

The post The Internet of medical things and of intern things appeared first on Subconscious Musings.

I am currently a graduate student intern in machine learning at SAS and also a research assistant at the Center for Advanced Self-Powered Systems of Integrated Sensors and Technologies (ASSIST) at North Carolina State University. The ASSIST Center is a National Science Foundation-sponsored Nanosystems Engineering Research Center (NERC), which means it develops and employs nanotechnology-enabled energy harvesting and storage, ultra-low power electronics, and sensors to create innovative, body-powered, and wearable health monitoring systems. SAS is one of the industry partners for the ASSIST Center, and the insights on real-time data analysis from SAS have proven to be really helpful for our research. Our motivation behind this research can be explained through a simple example: suppose an individual has a pre-existing condition like asthma, where the surroundings and their activities could trigger an attack. In such cases, predicting respiration rate in advance could be beneficial. For example, if s/he is biking, his/her predicted respiration rate could help him/her decide whether to bike for another 20 minutes or cut the ride short to stay within healthy levels. The goal is to be able to notify people about these parameters by identifying the right activities, which then become an index to predict the physiological parameters. In my research, I address the problem of identifying activities by creating hierarchical models to learn robust parameters, which is one application of time series machine learning techniques. In the near future we will be able to use these models to then predict respiration rate and heart rate.

There have been numerous studies that make use of supervised learning for activity recognition, using motion capture data and inertial measurements obtained from inertial measurement units (IMU). An IMU is a device that measures and reports linear and angular motion from the body, and one widely-available example is a smartphone. Most of these studies make use of techniques such as feature extraction, clustering, and machine learning approaches for classification. Feature extraction techniques range from using statistical moments of the data (e.g., mean, variance, kurtosis) to bag-of-words representations of poses and their temporal differences. Machine learning methods used include support vector machines (SVM), neural networks, and probabilistic graphical models (e.g., hidden Markov models and conditional random fields). There are also approaches using semi-supervised techniques, and even unsupervised techniques that rely on clustering with user-defined similarity metrics to identify single activities. However, most of these approaches work only at a fixed scale. That is, they do not capture hierarchies in the activities, which are required to explain complex dependencies between activities. For example, a person’s arm swinging can be part of a simple activity, such as walking, or a complex activity, such as dancing. A two-level hierarchy has been captured through the computation of so-called motifs that compose activities. Higher-level hierarchies may also be essential but have not been carefully studied. The aim of this research is to capture these dependencies using a computationally efficient framework that will provide a robust characterization of the existing hierarchical structures.
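As a minimal sketch of the moment-based feature extraction mentioned above (the module and variable names are hypothetical, and the KURTOSIS function requires SAS/IML 13.1 or later), consecutive fixed-length windows of a sensor series can be summarized in SAS/IML as follows:

```
proc iml;
   /* summarize consecutive windows of length w of a column vector x
      by three statistical moments: mean, standard deviation, kurtosis */
   start windowFeatures(x, w);
      n = floor(nrow(x) / w);           /* number of complete windows */
      f = j(n, 3, .);
      do i = 1 to n;
         seg = x[ ((i-1)*w + 1) : (i*w) ];
         f[i, 1] = mean(seg);
         f[i, 2] = std(seg);
         f[i, 3] = kurtosis(seg);
      end;
      return(f);
   finish;
   /* hypothetical usage, given a raw accelerometer series accel:
      feat = windowFeatures(accel, 50);  */
quit;
```

Each row of the returned matrix is a feature vector for one window, which could then be passed to a classifier such as the ones listed above.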

Topological tools for high-dimensional data analysis have gained popularity in recent years. These techniques often focus on tracking the homology of a space, which is a group structure that carries information about its connectivity and number of holes. Techniques such as persistent homology have been used for the analysis of point cloud data, quantifying the stability of the extracted features in a computationally efficient way via the use of stability theorems. These techniques have been used in a variety of applications, including the study of shapes in proteins, image analysis, and speech pattern analysis. For this research project we use topological data analysis to find robust parameters and build hierarchical graphical representations to classify activities.

Our approach builds a hierarchical representation of the data streams by comparing segments of data over various window sizes. A graphical model is extracted by first clustering the segments over a fixed window size τ and then connecting clusters with sufficient overlap across τ values. The structure of the hierarchical graphical model depends on a clustering parameter ε. We propose a new methodology for selecting robust graphical structures from this data via the use of an aggregate version of the persistence diagram. We also provide a methodology for selecting parameter values for this representation based on inference performance and power consumption considerations.

From our approach we are able to report the prediction accuracy for each of the activities in our dataset (walking, bicycling, sitting, golfing and waving). We also show how persistence diagrams can help reduce computation time and help choose stable models for our hierarchical representations. Some of the future work will involve testing this method on other datasets and comparing it with other existing algorithms.

I personally am really excited about the advantages the wearable technologies provide us! They are changing the lifestyle of individuals at a personalized level. Coming from a biomedical background, I always wanted to work closely with the wearable devices and understand how they could benefit us to achieve a better and healthy living. Being able to apply time series machine learning techniques from my current studies in electrical engineering to health care wearables leverages my biomedical experience in exciting new ways!

I’ll be presenting this work as an e-poster during the SAS Analytics Experience Conference in Vegas September 12-14, 2016, so look for me if you’re there to learn more!

*Editor’s note: Namita was one of six winners of the e-poster competition offered at the Conference, which meant she won a free trip to the event, so be sure to check out her work! This past summer Namita was also a SAS Summer Fellow in Machine Learning, which is a highly selective competitive program SAS offers for PhD students each year. *

The post Time series machine learning techniques in healthcare appeared first on Subconscious Musings.

While preparing for my trip I was reminded of a paper I once read in *Chance* magazine (Croson, Fishman and Pope 2008) that concluded that poker, like golf, is a game of skill rather than luck. The paper was published in 2008 during the heyday of televised poker, when it seemed that ESPN aired poker tournaments and little else. The paper especially struck me because it quoted one of my favorite movies:

*"Why do you think the same five guys make it to the final table of the World Series of Poker every year? What are they, the luckiest guys in Las Vegas?"* – Mike McDermott (played by Matt Damon in *Rounders*)

Upon rereading the paper I realized the datasets the authors gathered followed a design for panel data.

Panel data occur when a set of individuals, or *panel*, are each measured on several occasions. Panel data are ubiquitous in all fields, because they allow each individual to act as their own control group. That allows you to focus on identifying causal relationships between response and regressor, knowing that you can control for all factors specific to the individual, both measured and unmeasured.

With regard to Croson et al. (2008), the *individuals* were poker players whose results were recorded over multiple poker tournaments. The authors gathered two panel datasets, one for poker players and one for professional golfers. They surmised that if the associations you see for poker mimic those for golf, then you should conclude that poker, like golf, is a game of skill. After all, one would never theorize that Tiger Woods has won 14 major championships based purely on good karma.

Focusing on the data for poker, the authors gathered tournament results on 899 poker players. Because poker tournaments vary in the number of entries, only results in the top 18 were considered, and that number was chosen because it corresponds to the final two tables of 9 players each. The response was the final rank (1 through 18, lower being better) and the regression variables were three measures of previous performance. One such measure was *experience*, a variable indicating whether the player had a previous top 18 finish.

Among other similar analyses, the authors fit a least-squares regression of rank on experience:

\(Rank_{ij} = \beta_{0} + \beta_{1} Experience_{ij} + \epsilon_{ij}\)

where *i* represents the player and *j* the player’s ordered top-18 finish. From the analysis they found a statistically significant negative association between current rank and previous success. Because lower ranks are better, they concluded that good previous performance was associated with good present performance. Furthermore, the magnitude of the association was analogous to that in the parallel analysis they performed for golf. They concluded that because you can predict current results based on previous performance, in the same way you can with golf, poker must be a skill game.

The authors used simple least squares regression, with the only adjustment for the panel design being that they calculated "cluster-robust" standard errors that control for intra-player correlation. They did not directly consider whether there were any player effects in the regression.

After obtaining the data, I used PROC PANEL in SAS/ETS to explore this issue. I considered three different estimation strategies applied to the previous regression. PROC PANEL compactly summarized the results as follows:
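A call along the following lines produces those side-by-side estimates (this is a sketch; the data set and variable names are hypothetical, and the data must contain the cross-section and time ID variables):

```
proc panel data=poker;
   /* cross-section ID = player, time ID = ordered top-18 finish */
   id player finish;
   /* request pooled OLS, between-groups, and one-way fixed-effects
      estimates of the rank-on-experience regression */
   model rank = experience / pooled btwng fixone;
run;
```

The POOLED option gives the ordinary least squares fit, BTWNG the between-groups estimator based on player-level means, and FIXONE the one-way fixed-effects (within) estimator.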

The **OLS Regression** column precisely reproduces the analysis of Croson et al. (2008) and shows a significant negative association between current rank and previous experience. The **Within Effects** column is from a fixed-effects estimation that utilizes only within-player comparisons. You can interpret that coefficient (0.39) as the effect of experience for a given player. Conversely, the **Between Effects** column is from a regression using only player-level means, that is, the estimator uses only between-player comparisons. Because the estimator of the within effect for experience is not significant and that for the between effect is strongly significant, you can conclude the data exhibit substantial latent player effects. That is not surprising, because measures of player ability (technical, psychological or mystical) weren’t included in the model.

The augmented analysis does nothing to invalidate the Croson et al. (2008) conclusion that poker involves more skill than luck. However, to believe that premise you must begin with the untested (yet reasonable) assumption that luck is something that, even if it plays a factor in one tournament, cannot be maintained over a career. You must rely on common sense, and not the data at hand, to rule out luck as a latent (and mystical) player ability. With that question settled, the data go on to indicate that luck is not even a factor for single tournaments, each of which can be thought of as a long-run realization of hundreds of poker hands.

The PROC PANEL output merely furthers the point that some poker players (like their golfing counterparts) are just better at their craft than others.

*Then again, maybe they really are the luckiest guys in Vegas.*

If you are curious to know more about panel data, what’s available in SAS, and how it may be applied, you can catch my theater presentation (that’s just a fancy way to say "talk"), "Modeling Panel Data: Choosing the Correct Strategy," at the SAS Analytics Experience conference September 12-14 in Vegas. I'll be speaking on Wednesday, September 14, 1:15 PM - 2:00 PM. You will not catch me at the poker tables, however. My poker game stinks.

References:

Croson, R., P. Fishman, and D. G. Pope. 2008. "Poker Superstars: Skill or Luck? Similarities between golf (thought to be a game of skill) and poker." *Chance* 21(4): 25-28.

SAS Institute, The PANEL Procedure, SAS/ETS(R) 14.1 documentation

The post Is Poker a Skill Game? A Panel Data Analysis appeared first on Subconscious Musings.

The post Spatial econometric modeling using PROC SPATIALREG appeared first on Subconscious Musings.

In this post, we discuss how you can use the SPATIALREG procedure to analyze 2013 home value data in North Carolina at the county level. The five variables in the data set are county (county name), homeValue (median value of owner-occupied housing units), income (median household income in 2013 in inflation-adjusted dollars), bachelor (percentage of people in the county with a bachelor’s degree or higher), and crime (rate of Crime Index offenses per 100,000 people). The data for home values, income, and bachelor’s degree percentages in each county were obtained from the website of the United States Census Bureau and computed using the 2009–2013 American Community Survey five-year estimates. Data for crime were retrieved from the website of the North Carolina Department of Public Safety. For numerical stability and ease of interpretation, all five variables are log-transformed during data cleansing. We use this data set to demonstrate the modeling capabilities of the SPATIALREG procedure and to understand the impact of household income, crime rate, and educational attainment on home values.

As a preliminary data analysis, we first show a map of North Carolina that depicts county-level home values in Figure 1. It is easy to see that home values tend to be clustered together: higher values are found in the coastal, urban, and mountain areas of North Carolina, and lower values in rural areas. Home values of neighboring counties resemble each other more closely than home values of counties that are far apart.

From a modeling perspective, the findings from Figure 1 suggest that the data might contain spatial dependence, which needs to be accounted for in the analysis. In particular, an endogenous interaction effect might exist in the data; that is, home values tend to be spatially correlated with each other. PROC SPATIALREG enables you to analyze the data by using a variety of spatial econometric models.

To lay the groundwork for discussion, you can start the analysis with a linear regression. For this model, the value of Akaike’s information criterion (AIC) is –106.12. The results of parameter estimation from a linear regression model, shown in Table 1, suggest that three predictors—income, crime, and bachelor—are all significant at the 0.01 level. Moreover, crime exerts a negative impact on home values, indicating that high crime rates reduce home values. On the other hand, both income and bachelor have positive impacts on home values.
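As a sketch of this baseline fit, you can run the non-spatial model in PROC SPATIALREG itself by specifying TYPE=LINEAR. The variable names below follow the NC_HousePrice data set described above, and it is assumed here that the WMAT= option can be omitted when no spatial terms are in the model:

```
/* Baseline: ordinary linear regression, no spatial terms */
proc spatialreg data=NC_HousePrice;
   model homeValue=income crime bachelor/type=LINEAR;
run;
```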

Figure 2 provides the plot of predicted homeValue from the linear regression model. Although a comparison of Figure 1 and Figure 2 might suggest that the predictions from the linear regression model capture the general pattern in the observed data, you need to be careful about some underlying assumptions of linear regression. A critical one is that the values of the dependent variable are independent of each other, which is not likely for the data at hand. In fact, both Moran’s I test and Geary’s C test indicate spatial autocorrelation in homeValue at the 0.01 significance level. Consequently, if you ignore the spatial dependence in the data by fitting a linear regression model, you run the risk of false inference.
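As a sketch of how such diagnostics might be computed in SAS, PROC VARIOGRAM can report autocorrelation statistics such as Moran’s I and Geary’s c through the AUTOCORRELATION option of its COMPUTE statement. The coordinate variables cx and cy below are hypothetical county-centroid coordinates that are not part of the NC_HousePrice data set as described:

```
/* Sketch: spatial autocorrelation diagnostics for homeValue.   */
/* cx and cy are assumed county-centroid coordinate variables.  */
proc variogram data=NC_HousePrice plots=none;
   compute novariogram autocorrelation;
   coordinates xc=cx yc=cy;
   var homeValue;
run;
```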

Because of the spatial dependence in homeValue, a good candidate model to consider might be a spatial autoregressive (SAR) model for its ability to accommodate the endogenous interaction effect. You can use PROC SPATIALREG to fit a SAR model to the data. Before you proceed with model fitting, you need to provide a spatial weights matrix. Generally speaking, a spatial weights matrix summarizes the spatial neighborhood structure; entries in the matrix represent how much influence one unit exerts over another.

The specification of the spatial weights matrix is of vital importance in spatial econometric modeling. There are many different ways of specifying such a matrix, and results can be sensitive to the choice. Without delving into the nitty-gritty of that choice, you can simply define two counties to be neighbors of each other if they share a common border. After creating the spatial weights matrix, you can feed it into PROC SPATIALREG and run a SAR model. Table 2 presents the results of parameter estimation from a SAR model.
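Following the PROC SPATIALREG syntax shown earlier, a minimal SAR fit might look like the following, assuming the first-order contiguity matrix has already been stored in a data set named W:

```
/* SAR model: endogenous interaction effect via TYPE=SAR */
proc spatialreg data=NC_HousePrice Wmat=W;
   model homeValue=income crime bachelor/type=SAR;
   spatialid county;  /* matches observations in DATA= and WMAT= */
run;
```

Unlike the SDM example shown earlier, no SPATIALEFFECTS statement is used here, so only the endogenous interaction effect (the spatial lag of homeValue) enters the model.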

For this model, the value of AIC is –110.79. The regression coefficients that correspond to income, crime, and bachelor are all significantly different from 0 at the 0.01 level of significance. Both income and bachelor exhibit a significantly positive short-run direct impact on home values. In contrast, crime shows a significantly negative short-run direct impact on home values. In addition, the spatial autoregressive coefficient ρ is significantly different from zero at the 0.01 level, suggesting that there is a significantly positive spatial dependence in home values.

Figure 3 shows the predicted values for homeValue from the SAR model. Comparing Figures 1 and 3 suggests that the fitted home values capture the trend in the data reasonably well.
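One way to obtain such predicted values for mapping is with an OUTPUT statement. In this sketch, the data set name sarPred and the variable name predValue are placeholders, and the PRED= option is assumed here to request the model predictions:

```
/* Sketch: save SAR predictions for mapping, as in Figure 3 */
proc spatialreg data=NC_HousePrice Wmat=W;
   model homeValue=income crime bachelor/type=SAR;
   spatialid county;
   output out=sarPred pred=predValue;
run;
```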

In this post, we introduced the SPATIALREG procedure, fit a SAR model, and compared predicted values from the SAR model to those from linear regression. Even though the SAR model presented an improvement over the linear model in terms of AIC, many other models are available in the SPATIALREG procedure that might provide even more desirable results and more accurate predictions. These models include the spatial Durbin model (SDM), spatial error model (SEM), spatial Durbin error model (SDEM), spatial autoregressive combined (SAC) model, spatial autoregressive moving average (SARMA) model, spatial moving average (SMA) model, and so on. In the next post, we will discuss their features and show you how to select the most suitable model for the home value data set. We will also be giving a talk, "Location, Location, Location! SAS/ETS® Software for Spatial Econometric Modeling," at the SAS Analytics Experience conference September 12-14, 2016 in Las Vegas, so stop by and let's talk spatial!
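As a sketch of how such candidate models could be screened, you can refit the same specification with different TYPE= values and compare the reported AIC values. The macro name fitType below is a hypothetical convenience wrapper, not part of PROC SPATIALREG:

```
/* Sketch: refit with several TYPE= values and compare AIC */
%macro fitType(type);
   proc spatialreg data=NC_HousePrice Wmat=W;
      model homeValue=income crime bachelor/type=&type;
      spatialid county;
   run;
%mend;

%fitType(SEM)
%fitType(SMA)
%fitType(SARMA)
%fitType(SAC)
```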

*This post was co-written with Jan Chvosta.*

The post Spatial econometric modeling using PROC SPATIALREG appeared first on Subconscious Musings.
