(This article was first published on ** DataScience+**, and kindly contributed to R-bloggers)

Bootstrap is one of the most famous resampling techniques and is very useful for getting confidence intervals in situations where classical approaches (t- or z-tests) would fail.

Instead of writing down equations, let’s see directly how one may perform a bootstrap. Below we show a simple example using the heights of 100 people.

```
set.seed(20151101)
height <- rnorm(100, 175, 6)
```

Now we will resample with replacement 1000 times and compute the median:

```
t0 <- median(height)
t  <- sapply(1:1000, function(x) median(sample(x=height, size=100, replace=TRUE)))
hist(t)
abline(v=t0, col="orange", lwd=3)
```

And that’s it, this is the essence of bootstrap: resampling the observed data with replacement and computing the statistic of interest (here the median) many times on the resampled data. This gives a distribution of the statistic of interest, which can then be used to compute, for example, confidence intervals.
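For example (this step is implied rather than shown above), a simple percentile interval takes the 2.5% and 97.5% quantiles of the bootstrapped medians:

```r
# Re-run the simulation above so this block is self-contained
set.seed(20151101)
height <- rnorm(100, 175, 6)
t0 <- median(height)
t  <- sapply(1:1000, function(x) median(sample(x=height, size=100, replace=TRUE)))

# Percentile 95% confidence interval: the 2.5% and 97.5% quantiles
# of the bootstrapped medians
quantile(t, c(0.025, 0.975))
```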

Bootstrap is used to enable inference on the statistic of interest when the true distribution of this statistic is unknown. For example, in a linear model the parameters of interest have a known distribution, from which standard errors and formal tests can be derived. For other statistics (the median, differences between two models, …), if the analyst does not want to spend time writing down equations, bootstrapping is a great way to get standard errors and confidence intervals from the bootstrapped distribution.

There are some situations where bootstrap will fail: (i) when the statistic of interest lies at the edge of the parameter space (like a minimum or maximum), the bootstrapped distribution does not converge (as the number of bootstrap samples increases to infinity) to the true distribution of the statistic; (ii) when the sample size is small, bootstrapping will not increase the power of statistical tests. If you sampled too few data points to detect an effect of interest, bootstrap will not magically solve your problem; even worse, the bootstrap approach may then perform less well than others.

How many bootstrap samples should one draw? As many as possible is the answer. Note that in this post I used low numbers to speed up the computations on my “old” computer.

The boot library in R makes it very convenient to compute confidence intervals from bootstrap samples. Non-parametric bootstrap with boot:

```
library(boot)
b1 <- boot(data=height, statistic=function(x,i) median(x[i]), R=1000)
boot.ci(b1)
```

```
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = b1)

Intervals : 
Level      Normal              Basic         
95%   (174.1, 176.2 )   (174.2, 176.4 )  

Level     Percentile            BCa          
95%   (173.9, 176.1 )   (173.8, 176.1 )  
Calculations and Intervals on Original Scale
```

Parametric bootstrap with boot:

```
x <- runif(100, -2, 2)
y <- rnorm(100, 1 + 2*x, 1)
dat <- data.frame(x=x, y=y)
```

A simple example with a linear model:

```
m <- lm(y ~ x)
```

We are interested in getting the confidence intervals for the coefficients of the model:

```
foo <- function(out){
  m <- lm(y ~ x, out)
  coef(m)
}
```

The function rgen generates a new response vector from the model:

```
rgen <- function(dat, mle){
  out <- dat
  out$y <- unlist(simulate(mle))
  return(out)
}
```

Generate 1000 bootstrap samples:

```
b2 <- boot(dat, foo, R=1000, sim="parametric", ran.gen=rgen, mle=m)
```

Compute the confidence interval for the first coefficient (the intercept):

```
boot.ci(b2, type="perc", index=1)
```

```
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = b2, type = "perc", index = 1)

Intervals : 
Level     Percentile     
95%   ( 0.8056,  1.1983 )  
Calculations and Intervals on Original Scale
```

And for the second coefficient (the slope):

```
boot.ci(b2, type="perc", index=2)
```

```
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = b2, type = "perc", index = 2)

Intervals : 
Level     Percentile     
95%   ( 1.871,  2.217 )  
Calculations and Intervals on Original Scale
```

In the non-parametric case, boot expects the function returning the statistic of interest to take two arguments: the first is the object from which to compute the statistic, and the second is a vector of indices (i), frequencies (f) or weights (w) defining the bootstrap sample. In our example, boot generates a series of indices (named i) with replacement, and these are used to subset the original height vector. In the parametric case, the function returning the statistic(s) of interest only needs one argument: the original dataset. We then supply another function (`ran.gen`) describing how to generate the new data; it needs to return an object of the same form as the original dataset. This random data generating function takes two arguments: the first is the original dataset and the second contains the maximum likelihood estimates for the parameters of interest, basically a model object. The new dataset generated by the `ran.gen` function is then passed to the statistic function to compute the bootstrapped value of the statistic of interest. It is then straightforward to get confidence intervals for the statistic using `boot.ci`.
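As an aside (this variant is not in the original post), the same median bootstrap can be written with frequencies instead of indices by setting `stype="f"`; boot then passes a vector saying how many times each observation appears in the resample:

```r
library(boot)

set.seed(20151101)
height <- rnorm(100, 175, 6)

# Frequency-based statistic: rebuild the resample by repeating each
# observation according to its bootstrap frequency
b_freq <- boot(data = height,
               statistic = function(x, f) median(rep(x, f)),
               stype = "f", R = 1000)
boot.ci(b_freq, type = "perc")
```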

Mixed-effect models are rather complex, and the distributions or numbers of degrees of freedom of various outputs from them (like parameters, …) are not known analytically. This is why the authors of the lme4 package recommend using bootstrap to get confidence intervals around the model parameters and the predicted values, but also to get p-values from likelihood ratio tests.

A simple random intercept model:

```
library(lme4)
dat <- data.frame(x=runif(100, -2, 2), ind=gl(n=10, k=10))
dat$y <- 1 + 2*dat$x + rnorm(10, 0, 1.2)[dat$ind] + rnorm(100, 0, 0.5)
m <- lmer(y ~ x + (1|ind), dat)
```

Get the bootstrapped confidence interval for the first model parameter (the intercept):

```
b_par <- bootMer(x=m, FUN=fixef, nsim=200)
boot.ci(b_par, type="perc", index=1)
```

```
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 200 bootstrap replicates

CALL : 
boot.ci(boot.out = b_par, type = "perc", index = 1)

Intervals : 
Level     Percentile     
95%   ( 0.393,  1.824 )  
Calculations and Intervals on Original Scale
Some percentile intervals may be unstable
```

And for the slope:

```
boot.ci(b_par, type="perc", index=2)
```

```
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 200 bootstrap replicates

CALL : 
boot.ci(boot.out = b_par, type = "perc", index = 2)

Intervals : 
Level     Percentile     
95%   ( 1.857,  2.041 )  
Calculations and Intervals on Original Scale
Some percentile intervals may be unstable
```

Or alternatively:

```
confint(m, parm=c(3,4), method="boot", nsim=200, boot.type="perc")
```

```
                2.5 %   97.5 %
(Intercept) 0.4337793 1.819856
x           1.8611089 2.058930
```

Get the bootstrapped confidence intervals around the regression curves:

```
new_dat <- data.frame(x=seq(-2, 2, length=20))
mm <- model.matrix(~x, data=new_dat)
predFun <- function(.) mm %*% fixef(.)
bb <- bootMer(m, FUN=predFun, nsim=200)  # do this 200 times
```

As we did this 200 times, the 95% CI will be bordered by the 5th and 195th ordered values:

```
bb_se <- apply(bb$t, 2, function(x) x[order(x)][c(5, 195)])
new_dat$LC <- bb_se[1, ]
new_dat$UC <- bb_se[2, ]
new_dat$pred <- predict(m, newdata=new_dat, re.form=~0)
```

Plot the results:

```
plot(y ~ x, dat, pch=16)
lines(pred ~ x, new_dat, lwd=2, col="orange")
lines(LC ~ x, new_dat, lty=2, col="orange")
lines(UC ~ x, new_dat, lty=2, col="orange")
```

Finally get bootstrapped p-values from the likelihood ratio test between two models.

```
library(pbkrtest)
m_0 <- update(m, . ~ . - x)
PBmodcomp(m, m_0, nsim=200)
```

```
Parametric bootstrap test; time: 14.35 sec; samples: 200 extremes: 0;
large : y ~ x + (1 | ind)
small : y ~ (1 | ind)
         stat df   p.value    
LRT    275.19  1 < 2.2e-16 ***
PBtest 275.19     0.004975 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Drawing confidence intervals around the regression curves is tricky because the estimated random effects come with their own uncertainty (have a look at `dotplot(ranef(m, condVar=TRUE))` to see it). Bootstrapping is an efficient way to take these uncertainties into account, since the random deviates are re-computed for each draw. Finally, p-values for the effect of a fixed-effect term can be obtained using a parametric bootstrap approach, as described here and implemented in the function `PBmodcomp` from the pbkrtest package. In the output of PBmodcomp, the bootstrapped p-value is in the PBtest line; the LRT line reports the standard p-value assuming a chi-square distribution for the LRT statistic.

With computers getting ever faster, bootstrap enables us to get reliable confidence interval estimates (provided the original sample size is large enough) without relying on hypothetical distributions. Bootstrap can also be extended into a Bayesian framework; see for example this nice post.

If you have any comments or suggestion feel free to post a comment below.


(This article was first published on ** Cartesian Faith » R**, and kindly contributed to R-bloggers)

This week we explore two different themes related to data. The first is how to value big data. The second looks at approaches for quantifying individual experience within the context of gender discrimination.

It’s generally taken for granted that data is a strategic asset. Every salesperson knows this, which is why they covet their Rolodex. It’s one thing to know that data is valuable, and it’s another to quantify that value and put it on your balance sheet. Mark van Rijmenam of Datafloq says 20% of large UK companies put big data on the balance sheet as an intangible asset. It turns out that this trend is fairly common throughout the EU as this report by the CEBR demonstrates. That’s generally not the case in the United States, where GAAP “prohibit[s] companies from treating data as an asset or counting money spent collecting and analyzing the data as investments instead of costs.” (WSJ) That said, the US “Bureau of Economic Analysis announced … the decision to recognise expenditures … on research and development as fixed investments in the national accounts from July 2013 onwards.” (CEBR)

Despite accounting’s conservative approach to data, it’s clear that data will become a valid accounting asset. The question is how to value it. Doug Laney of Gartner provides a framework for approaching this problem. The CEBR report also gives suggestions on how to go about assigning value. While it makes sense to treat data as an asset, the question in my mind is: if the data is effectively utilized to drive revenue growth, does the data retain its value on the balance sheet, or is that a form of double counting?

Either way, this trend will likely nudge business strategies further toward open methods and closed data. This is what we see in finance, where many trading methodologies are public and the real key is high quality data. This approach is echoed by Lukas Biewald, when discussing the rationale around why Google open sourced TensorFlow:

A company’s intellectual property and its competitive advantages are moving from their proprietary technology and algorithms to their proprietary data. As data becomes a more and more critical asset and algorithms less and less important, expect lots of companies to open source more and more of their algorithms.

*Read about my thoughts on TensorFlow.*

Diversity has been a hot (button) topic in the tech world for a while now. A recent piece by Jessica Valenti discusses how everyday slights to women are just as bad as overt sexism. What caught my attention is that “there’s no real way to prove that it’s (conscious or unconscious) discrimination” despite women knowing “through experience exactly what’s happening.” Surely there’s a survey methodology that can be used to quantify this sort of experience. The question is whether the experience is considered too subjective to quantify. There are clear dangers to presenting qualitative questions in a quantitative style, so it’s worth steering clear of these issues. My hunch is that in aggregate, a Likert scale approach would be sufficient to at least raise awareness and show in aggregate that the phenomenon is real. And if you’re working in R, be sure to leverage the survey package for analyzing the results.
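As a rough sketch of what that could look like (the question wording, response coding and data here are entirely invented for illustration), the survey package can compute design-based means and confidence intervals for Likert responses:

```r
library(survey)

# Hypothetical survey: 200 respondents rate "I experience everyday slights
# at work" on a 1-5 Likert scale (data simulated for illustration)
set.seed(42)
dat <- data.frame(slights = sample(1:5, 200, replace = TRUE,
                                   prob = c(.1, .15, .2, .3, .25)))

# Simple random sample design, then a mean with a design-based standard error
des <- svydesign(ids = ~1, data = dat)
svymean(~slights, des)
confint(svymean(~slights, des))
```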

*If you have ideas on this, sound off in the comments.*

When I was growing up the world was abuzz at the prospect of cold fusion. Unfortunately those claims were invalidated, resulting in a cold winter for fusion research. Much has changed, and now there is not one, but three fusion reactors vying for funding and glory. It’s exciting to think that clean and abundant energy could truly be in our future.


(This article was first published on ** Xi'an's Og » R**, and kindly contributed to R-bloggers)

**O**ne student of mine coded by mistake an independent Metropolis-Hastings algorithm with too small a variance in the proposal when compared with the target variance. Here is the R code of this implementation:

```
# target is N(0,1)
# proposal is N(0,.01)
T <- 1e5
prop <- x <- rnorm(T, sd=.01)
ratop <- dnorm(prop, log=TRUE) - dnorm(prop, sd=.01, log=TRUE)
ratav <- ratop[1]
logu <- ratop - log(runif(T))
for (t in 2:T){
  if (logu[t] > ratav){
    x[t] <- prop[t]; ratav <- ratop[t]
  } else {
    x[t] <- x[t-1]
  }
}
```

It produces outputs of the following shape

which is quite amazing because of the small variance. The reason for the lengthy freezes of the chain is the occurrence with positive probability of realisations from the proposal with very small proposal density values, as they induce very small Metropolis-Hastings acceptance probabilities and are almost “impossible” to leave. This is due to the lack of control of the target, which is flat over the domain of the proposal for all practical purposes. Obviously, in such a setting, the outcome is unrelated to the N(0,1) target!

It is also unrelated to the normal proposal, in that switching to a t distribution with 3 degrees of freedom produces a similar outcome:

It is only when using a Cauchy proposal that the pattern vanishes:
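As a sketch of that swap (this variant is not shown in the original post), the student’s code with the normal proposal replaced by its Cauchy counterpart; since the Cauchy tails dominate the N(0,1) target, the importance ratio is bounded and the chain mixes properly:

```r
# Same independent Metropolis-Hastings scheme, but with a Cauchy proposal
set.seed(1)
T <- 1e5
prop <- x <- rcauchy(T)
# log target density minus log proposal density
ratop <- dnorm(prop, log=TRUE) - dcauchy(prop, log=TRUE)
ratav <- ratop[1]
logu  <- ratop - log(runif(T))
for (t in 2:T){
  if (logu[t] > ratav){
    x[t] <- prop[t]; ratav <- ratop[t]
  } else {
    x[t] <- x[t-1]
  }
}
# The chain now actually targets N(0,1)
mean(x); sd(x)
```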



Statistics.com is an online learning website with 100+ courses in statistics, analytics, data mining, text mining, forecasting, social network analysis, spatial analysis, etc.

They have kindly agreed to offer R-Bloggers readers a reduced rate of $399 for any of their 23 courses in R, Python, SQL or SAS. These are high-impact courses, each 4-weeks long (normally costing up to $589). They feature hands-on exercises and projects and the opportunity to receive answers online from leading experts like Paul Murrell (member of the R core development team), Chris Brunsdon (co-developer of the GISTools package), Ben Baumer (former statistician for the NY Mets baseball team), and others. These instructors will answer all your questions (via a private discussion forum) over a 4-week period.

You may use the code “R-Blogger15” when registering. You can register for any R, Python, Hadoop, SQL or SAS course starting on any date, but you must use this code and register BEFORE December 11, 2015. Here is a list of the courses:

1) Using R as a statistical package

- **R for Statistical Analysis**
- **Modeling in R**
- **Visualization in R using ggplot2**
- **Graphs in R**
- **Logistic Regression**
- **Bayesian Statistics in R**

2) Learning how to program and build skills in R

- **R Programming Intro 1**
- **R Programming Intro 2**
- **R Programming – Intermediate** (one year of daily R use required before taking this course)
- **R Programming – Advanced** (two years of daily R use required before taking this course)

3) Specific domains or applications

- **Applied Predictive Analytics**
- **Big Data Computing with Hadoop**
- **Big Data Ingestion via API’s**
- **Hadoop: Hive, Sqoop and Spark**
- **Python – Social Data Mining**
- **Introduction to SAS Programming for Analytics**
- **SQL and R – Introduction to Database Queries**
- **Biostatistics in R with Clinical Trial Applications**
- **Data Mining in R**
- **Mapping in R**
- **Spatial Analysis Using R**
- **Microarray Analysis Using R**
- **Survey Analysis Using R**


(This article was first published on ** Engaging Market Research**, and kindly contributed to R-bloggers)

We have been talking about design thinking in marketing since Tim Brown’s Harvard Business Review article in 2008. It might be easy for the data scientist to dismiss the approach as merely a type of brainstorming for new products or services. Yet, design issues do arise in data visualization where we are concerned with communicating our findings. However, my interest is model selection: Should the analyst select one statistical model over another because the user might find it more helpful in planning interventions or designing new products and services?

For example, the marketing manager who wants to retain current customers seeks guidance from customer satisfaction questionnaires filled with performance ratings and intentions to recommend or purchase again. Motivated by the desire to keep it simple, common practice tends to focus attention on only the most important “causes” of customer retention. As I noted in my first post, Network Visualization of Key Driver Analysis, a more complete picture can be revealed by a correlation graph displaying all the interconnections among all the ratings. The edges or links are colored green or red so that we know if the relationship is positive or negative. The thickness of a path indicates the strength of the correlation. But correlations measure total effects, both those that are direct and those obtained through associations with other ratings.

The designer of intervention strategies aimed at preventing churn could acquire additional insights from the partial correlation graph depicting the effects between all pairs of ratings controlling for all the other ratings in the model. While the correlation map reveals total effects, the partial correlation map removes all but the direct effects. The graph below was created using the R code from my first post to simulate a data set that mimics what is often found when airline passengers complete satisfaction surveys. Once the data were generated, the procedures outlined in my post Undirected Graphs When the Causality is Mutual were followed. The R code is listed at the end of this discussion.

We can pick any node, such as the one labeled “Satisfaction” in the middle of the right-hand side of the figure. A simple way of interpreting this graph is to think of Satisfaction as the dependent variable and the lines radiating from Satisfaction as the weights obtained from the regression of this node on the other 14 ratings. Clearly, overall satisfaction serves as an inclusive summary measure with so many pathways from so many other nodes. Each of the four customer service ratings (below Satisfaction and in light pink) adds its own unique contribution with the greatest impact indicated by the thickest green edge from the Service node. Moreover, Easy Reservation and Ticket Price plus Clean Aircraft with room for people and baggage make incremental improvements in Satisfaction.

The same process can be repeated for any node. Instead of a driver analysis that narrows our thinking to a single dependent variable and its highest regression weights, the partial correlation map opens us to the possibilities. If the goal was customer retention, then the focus would be on the Fly Again node. Recommend seems to have the strongest link to Fly Again. Can the airline induce repeat purchase by encouraging recommendation? What if frequent flyer miles were offered when others entered your name as a recommender? Such a proposal may not be practical in its current form, but the graph supports this type of design thinking.

Because there are no direct paths from the four service nodes to Fly Again, a driver analysis would miss the indirect connection through Satisfaction. And what of this link between Courtesy and Easy Reservation? Do customers infer a “friendly” personality trait that links their perceptions of the way they are treated when they buy a ticket and when they board the plane? Design thinkers would entertain such a possibility and test the hypothesis. Such “cascaded inferences” fill the graph for those willing to look. Perhaps many small and less costly improvements might combine to have a greater impact than concentrating on a single aspect? Encouraging passengers to check their bags would create more overhead storage without reconfiguring the airplane. Let the design thinking begin!

The discussion ends with the identification of “the most important” in a driver analysis. The network, on the other hand, invites creative thought. Isn’t this the point of data science? What can we learn from the data? The answer is a good deal more than can be revealed by the largest coefficient in a single regression equation.

```
# Calculates sparse partial correlation matrix
library(qgraph)
sparse_matrix <- EBICglasso(cor(ratings), n=1000)
round(sparse_matrix, 2)

# Plots results
gr <- list(1:4, 5:8, 9:12, 13:15)
node_color <- c("lightgoldenrod", "lightgreen", "lightpink", "cyan")
qgraph(sparse_matrix, fade=FALSE, layout="spring", groups=gr,
       color=node_color, labels=names(ratings), label.scale=FALSE,
       label.cex=1, node.width=.5, edge.width=.25, minimum=.05)
```



(This article was first published on ** FishyOperations**, and kindly contributed to R-bloggers)

Building an accurate machine learning (ML) model is a feat on its own. But once you’re there, you still need to find a way to make the model accessible to users. If you want to create a GUI for it, the obvious approach is to go for shiny. However, often you don’t want a direct GUI for a ML model; when you want to integrate the logic you’ve created into an existing (or new) application, things become a bit more difficult.

Let’s say you’ve created a robust ML model in R and explain the model to your in-house IT department: it is (currently) definitely not a given that they can easily integrate it, be it because the technology used is unfamiliar to them or because they simply don’t have in-depth knowledge of ML.

There are a lot of ways to go about this. One way (and the focus of this post) is to build a sort of “black-box” model which is accessible through a web-based API. The advantage is that a web call can very easily be made from (almost) any programming language, making integration of the ML model quite easy.

Below an example ML model is made accessible by using Jug (disclaimer: I’m the author of Jug :).

We start by creating a very simple model based on the `mtcars` dataset and the `randomForest` package (don’t interpret this as the way to correctly train a model). This model is then saved to the file system. The model tries to predict `mpg` based on the `disp`, `wt` and `hp` variables.
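The model-fitting code itself did not survive in this copy of the post; a minimal sketch consistent with the summary shown below (the `set.seed` call and the `mpg_fit.rds` file name are my assumptions) could be:

```r
library(randomForest)

# Fit a random forest predicting mpg from disp, wt and hp
# (mtcars ships with base R)
set.seed(1)
mpg_fit <- randomForest(mpg ~ disp + wt + hp, data = mtcars)
mpg_fit

# Persist the model so the API process can load it later
# ("mpg_fit.rds" is an assumed file name)
saveRDS(mpg_fit, "mpg_fit.rds")
```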

The summary of the resulting model:

```
Call:
randomForest(formula = mpg ~ disp + wt + hp, data = mtcars)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 5.421786
% Var explained: 84.59
```

For setting up the API we use Jug.
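The Jug code block is also missing from this copy. Based on the description that follows, a sketch using Jug’s `jug()`/`post()`/`decorate()`/`serve_it()` verbs might look as follows (the model is refit inline here instead of being loaded from disk, so the sketch is self-contained, and exact helper names such as `simple_error_handler()` may differ between Jug versions):

```r
library(jug)
library(randomForest)

# In the post the model is loaded from disk; refit inline here for brevity
mpg_fit <- randomForest(mpg ~ disp + wt + hp, data = mtcars)

# Return a prediction based on the posted disp, wt and hp parameters
predict_mpg <- function(disp, wt, hp) {
  predict(mpg_fit, newdata = data.frame(disp = as.numeric(disp),
                                        wt   = as.numeric(wt),
                                        hp   = as.numeric(hp)))
}

app <- jug() %>%
  post("/mpg_api", decorate(predict_mpg)) %>%
  simple_error_handler()

# Launch the API (this blocks the R session while serving):
# app %>% serve_it()
```

Running `serve_it()` produces the startup message shown below.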

```
Serving the jug at http://127.0.0.1:8080
```

So, what does this code do? `mpg_fit` holds the fitted ML model. The `predict_mpg` function uses this model and returns a predicted value based on the received `disp`, `wt` and `hp` parameters. The part starting with `jug() %>%` configures and launches the API. If it receives a `post` request at the `/mpg_api` path (with the requested parameters), it returns the result of `decorate(predict_mpg)` to the post call.

The `decorate` function is basically a convenience function which maps the supplied parameters (form-data, url-encoded, …) to the function you supply it. The error handler is simply there to avoid the server crashing when it gets called with an unbound path or erroneous data. The `serve_it()` call launches the API instance. The result is that we now have a live web-based API. We can post data to it and get back a predicted value. Below is an example post call using `curl`:

```
curl -s --data "disp=160&wt=2620&hp=110" http://127.0.0.1:8080/mpg_api
>> 18.9758366666667
```

This R app can then be launched on a server, which in turn can receive post calls from any type of application that can make web calls.

Also have a look at the base httpuv package (which is the basis for Jug) and the more applied prairie (previously known as `dull`) and plumber packages.


(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

by Michael Helbraun

The software business includes travel, and that means hotels. The news that Marriott was acquiring Starwood was of particular interest to me – especially since more than 75% of my 95 nights so far this year on the road have been spent with one of those two companies.

While other folks can evaluate if the deal makes sense financially, I was just curious how this might affect a business traveler. Looking at the news, there are those who are optimistic and plenty who are concerned. Granted, many of the details on how the loyalty programs will be combined won’t be known for some time, but what we do know is where each company maintains properties.

With 4200+ Marriott and 1700+ Starwood properties I was curious where there might be overlap, and how well the deal would help Marriott to grow in new markets. Luckily R can help in this regard.

The first thing to do is to put together a data set. It would have been nice if the companies had cleaned spreadsheets available publicly, but as is normally the case we end up spending a good portion of time gathering and preparing data. In this case that meant scraping and formatting the data from SPG and Marriott into a spreadsheet with all their property locations. While I won’t go into data cleaning here, for a one-time effort on just a few thousand rows of data this was pretty straightforward to do in Excel.

After I had all locations for all properties it was time to bring that data into R to start the analysis. First I was curious where each firm had the most properties – simple to do with a cross tab. NYC seems a logical top 5, but Houston and Atlanta, interesting:

**Top 10 Marriott Locations**

So far so good, but to actually put these on a map it’s much easier if the data has latitude and longitude. The `geocode()` function from the ggmap package can look these up:

```
marGeocoded <- cbind(locations, geocode(locations))
save(marGeocoded, file="D:/Datasets/marGeocoded.RData")
load("D:/Datasets/marGeocoded.RData")

locations <- hotToGeo
hotGeocoded <- cbind(locations, geocode(locations))
save(hotGeocoded, file="D:/Datasets/hotGeocoded.RData")
load("D:/Datasets/hotGeocoded.RData")
```

Once the lat/long coordinates are merged back into our data set there are a number of ways to plot the results. I’m a fan of the globe plots within Bryan Lewis’s excellent *rthreejs* package. This allows you to stretch a 2D image over a globe which you can then plot on top of and interact with. Here I’ve plotted all the Marriott properties in orange and the Starwood properties in yellow:

After this it seemed like there was the most overlap in the US and Europe. To create a static plot ggmap is very quick:

```
# Europe map with ggmap
eurPlot <- qmap(location = "Europe", zoom = 4, legend = "bottomright",
                maptype = "terrain", color = "bw", darken = 0.01)
eurPlot <- eurPlot + geom_point(data = combGeocoded,
                                aes(y = lat, x = lon, colour = firm,
                                    size = Counts, alpha = .2))
(eurPlot <- eurPlot + scale_size_continuous(range = c(3,10)))
```

If we want to create something within an interactive zoom the *leaflet* package is another useful one. It leverages Open Street Map and allows you to pan and zoom:
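For instance, a minimal leaflet sketch (the tiny `combGeocoded` data frame below is invented for illustration; the real one comes from the geocoding step above):

```r
library(leaflet)

# Toy stand-in for the geocoded hotel data (coordinates invented)
combGeocoded <- data.frame(
  lon  = c(-73.99, -73.97, 2.35, -0.13),
  lat  = c(40.75, 40.76, 48.86, 51.51),
  firm = c("Marriott", "Starwood", "Marriott", "Starwood")
)

# Color markers by firm, over Open Street Map tiles
pal <- colorFactor(c("orange", "yellow"), domain = combGeocoded$firm)

map <- leaflet(combGeocoded) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat, color = ~pal(firm), radius = 6)
map
```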

Aggregating and deriving value from low-value info is a great use of R, and this sort of analysis is fun as it gives some additional perspective on a current event. If you would like to play around with this, a copy of the script (Merger analysis) and the relevant data files (HotGeocoded and MarGeocoded) are available for download – let us know what you find in the comments.


**tl;dr**: Black Friday $15 Udemy Deal – (until Friday 27th)

For the next 36 hours, Udemy is offering readers of R-bloggers access to its global online learning marketplace with a special **$15 (up to 98% off) deal on over 17,000 of their courses** (including R-Programming and Python courses).

After the 19th, the deal schedule (per course) is as follows:

~~11/18 – $10~~

~~ 11/19 – $10~~

~~11/20 – $11~~

~~11/21 – $11~~

~~11/22 – $12~~

~~ 11/23 – $12~~

~~ 11/24 – $14~~

~~ 11/25 – $14~~

11/26 – $15

11/27 – $15

**Click here to browse ALL (including non-R) courses**

**Advanced R courses:**

- Multivariate Data Visualization with R
- Applied Multivariate Analysis with R
- Graphs in R – Data Visualization with R Programming Language
- The Comprehensive Programming in R Course
- More Data Mining with R
- Text Mining, Scraping and Sentiment Analysis with R
- R Programming for Simulation and Monte Carlo Methods
- Programming Statistical Applications in R
- Core skill for data science: learn dplyr package in R
- Extra Fundamentals of R
- Linear Mixed-Effects Models with R

**R courses for “data science”:**

- Statistics and Data Science in R from Beginner to Advanced!
- Data Science A-Z™: Real-Life Data Science Exercises Included
- Machine Learning and Statistical Modeling with R Examples
- Applied Data Science with R
- Practical Data Science: Reducing High Dimensional Data in R
- Introduction To Data Science
- Data Science Career Guide – Career Development in Analytics
- Statistics in R – The R Language for Statistical Analysis

**General R courses:**

- Essential Fundamentals of R
- An Introduction to Data Visualization in R using ggplot
- Statistics in R – The R Language for Statistical Analysis

- Learn R Programming from Scratch
- Programming With R. Learn How To Program In R – Beginners

- Introduction to R
- R: Learn to Program in R & Use R for Effective Data Analysis
- R Level 1 – Data Analytics with R
- Data Analysis with R
- Learn R for Business Analytics from Basics !

**From Udemy:**

We live in a new world where learning is not limited to the classroom or a book, but now on-demand, at your own pace, and on any device. Udemy is chock full of master courses and mini courses on everything from programming to photography, and we encourage you to take a look.

Their library of courses is quite extensive; you may also find interest in one of their other courses ranging from writing or yoga, excel (yak), communication skills, app development, web design or more — still, for $15 (up to 98% off).
