(This article was first published on **Engaging Market Research**, and kindly contributed to R-bloggers)

The genre shirt asks, “What kind of music do u listen 2?”

Microgenres exist because markets are fragmenting and marketers need names to attract emerging customer segments with increasingly specific preferences. The cost of producing and delivering music now supports a plenitude of joint pairings of recordings and customers. The coevolution of music and its audience binds together listening preferences and available alternatives.

I already know a good deal about your preferences simply by knowing that you listen to German Hip Hop or New Orleans Jazz (see the website Every Noise at Once). Those microgenres are not accidental but were named in order to broadcast the information that customers need to find what they want to buy and, at the same time, that artists require to market their goods. Matchmaking demands its own vocabulary. Over time, the language adapts and only the “fittest” categories survive.

**R Makes It Easy**

The R package NMF simplifies the analysis as I demonstrated in my post on Modeling Plenitude and Speciation. Unfortunately, the data in that post were limited to only 17 broad music categories, but the R code would have been the same had there been several hundred microgenres or several thousand songs.

The output is straightforward once you understand what nonnegative matrix factorization (NMF) is trying to accomplish. All matrix factorizations, as the name implies, attempt to identify “simpler” matrices or factors that will reproduce approximately the original data matrix when multiplied together. Simpler, in this case, means that we replace the many observed variables with a much smaller number of latent variables. The belief is that these latent variables will simultaneously account for both row cliques and column microgenres as they coevolve.

This is the matrix factorization diagram that I borrowed from Wikipedia to illustrate the process.

The original data matrix V, which holds the non-negative numbers indicating the listening intensities for every microgenre, is approximated by multiplying a customer graded-membership matrix W by a matrix H of factor loadings for the observed variables. The columns of W and the rows of H reflect the number of latent variables. It can be argued that V contains noise, so reproducing it exactly would be overfitting. Thus, nothing of “true” value is lost by replacing the original observations in V with the customer latent scores in W. As you can see, W has fewer columns, but I can always approximate V using the factor loadings in H as a “decoder” to reconstitute the observations without noise (a form of data compression).

We start our interpretation of the output with H containing the factor loadings for the observed variables. For example, we might expect to see one of the latent variables reflecting interest in classical music with sizable factor loadings for the microgenres associated with such music. One of these microgenres could serve as an “anchor” that defines this latent feature if only the most ardent classical music fans listened to this specific microgenre (e.g., Baroque). Hard rock or jazz might be anchors for other latent variables. If we have been careful to select observed variables with high imagery (e.g., is it easy to describe the hard rock listener?), we will be able to name the latent variable based only on its anchors and other high loadings.

The respondent matrix W serves a similar function for customers. Each column is the same latent variable we just described in our interpretation of H. For every row, W gives us a set of graded membership indices telling us the degree to which each latent variable contributes to that respondent’s pattern of observed scores. The opera lover, for instance, would have a high membership index for the latent variable associated with opera. We would expect to find a similar pairing between the jazz enthusiasts and a jazz-anchored latent variable, assuming that this is the structure underlying the coevolving network of customer and product.

Once again, we need to remember that nonnegative matrix factorization attempts to reproduce the observed data matrix as closely as possible given the constraints placed on it by limiting the number of latent variables and requiring that all the entries be non-negative. The constraint on the number of latent variables should seem familiar to those who have some experience with principal component or factor analysis. However, unlike principal component analysis, NMF adds the restriction that all the elements of W and H must be non-negative. The result is a series of additive effects with latent variables defined as nonnegative linear combinations of observed variables (H) and respondents represented as convex combinations of those same latent variables (W). Moreover, latent variables can only add to the reproduced intensities as they attempt to approximate the original data matrix V. Lee and Seung explain this as “learning the parts of objects.”
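As a toy sketch of the decomposition just described (hypothetical listeners and microgenres, base R only; in practice the NMF package estimates W and H from the data):

```r
# W: graded memberships (4 listeners x 2 latent variables).
# Row 1 is a "pure type" classical fan; row 4 is a "hybrid".
W <- matrix(c(1.0, 0.0,
              0.9, 0.1,
              0.0, 1.0,
              0.3, 0.7),
            nrow = 4, byrow = TRUE)

# H: factor loadings (2 latent variables x 5 microgenres).
# The first factor is anchored by baroque, the second by bebop.
H <- matrix(c(5, 4, 0, 0, 1,
              0, 1, 5, 4, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL,
              c("baroque", "opera", "bebop", "swing", "folk")))

# Reconstructed listening intensities: a sum of purely additive,
# non-negative parts, so every entry of the approximation is >= 0.
V_hat <- W %*% H
all(V_hat >= 0)  # TRUE
```

Going the other way, a call along the lines of `NMF::nmf(V, rank = 2)` would estimate W (retrieved with `basis()`) and H (retrieved with `coef()`) from an observed intensity matrix V.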

All this works quite nicely in fragmented markets where products evolve along with customer desires to generate separate communities. In such coevolving networks, the observed data matrix tends to be sparse and requires a simultaneous clustering of rows and columns. The structure of heterogeneity or individual differences is a mixture of categories and graded category memberships. Customers are free to check as many boxes as they wish and order as much as they like. As a result, respondents can be “pure types” with only one large entry in their row of W and all the other values near zero, or they can be “hybrids” with several substantial membership entries in their row.

Markets will continue to fragment with customers in control and the internet providing more and more customized products. We can see this throughout the entertainment industry (e.g., music, film, television, books, and gaming) but also in almost everything online or in retail warehouses. Increasingly, the analyst will be forced to confront the high-dimensional challenge of dealing with subspaces or local regions populated with non-overlapping customers and products. Block clustering of rows and columns is the first step when the data matrix is sparse because these variables over here are relevant only for those respondents over there. Fortunately, R provides the link to several innovative methods for dealing with such cases (see my post The Ecology of Data Matrices).

To **leave a comment** for the author, please follow the link and comment on his blog: **Engaging Market Research**.

(This article was first published on **A Statistics Blog - R**, and kindly contributed to R-bloggers)

I stand in awe of the R wizards who post their latest and greatest Shiny feats on R-Bloggers. It’s taken me a while to find my way around Shiny, but at last I feel ready to fill in a few gaps for others who may be just starting out and who aspire to write reasonably involved, user-friendly simulation apps. To make it a bit more fun I’ve written it up as an R Markdown document hosted on `shinyapps.io`: here you go.

Prerequisites are indicated in the tutorial; so are links to the source code. If you have feedback or suggestions for improvements, comment here or post an Issue on the GitHub site.


(This article was first published on **MilanoR**, and kindly contributed to R-bloggers)

I survived the Conference Dinner and caught the last bus (you should have been at useR, or read the post of day 2, to understand…), and this morning I woke up in Aalborg for the third (and last) time.

After a keynote session by Thomas Lumley on *How flexible computing expands what an individual can do*, my colleague Enrico and I attended the *Sponsor Session*. This was one of the most interesting sessions of the conference, because it showed us a lot of tools for using R in business environments. The only Diamond Sponsor was *DataRobot*, whose platform focuses on modeling and prediction. Then it was the Platinum Sponsors’ turn: *RStudio* talked about its several tools (IDE, Shiny, R packages, shinyapps.io), *Teradata* showed how it integrates R into its platform for big data, while *Revolution Analytics*, now a division of Microsoft, concentrated its presentation on Azure. Finally, the Gold Sponsors had their ten minutes each: *Alteryx* showed its data pipelining engine and a visual programming framework, *TIBCO* introduced the TIBCO Enterprise Runtime for R (TERR), *H2O* illustrated its platform for model guessing, while *HP* introduced HP Distributed R.

After the last keynote talk, by Steffen Lauritzen on *Linear estimating equations for Gaussian graphical models with symmetry*, the conference ended with a goodbye until next year in Stanford, California, USA.

Now, Enrico and I are waiting for the bus which will take us to the airport. Tonight we will arrive home, and on Monday we will resume our work at Quantide, but Chronicles from useR! doesn’t stop here… In the coming weeks, Enrico and I will publish some great new articles, including photos, the funniest moments and curiosities, in-depth analyses of R topics and so on…


(This article was first published on **Exegetic Analytics » R**, and kindly contributed to R-bloggers)

“Machine Learning with R Cookbook” by Chiu Yu-Wei is nothing more or less than it purports to be: a collection of 110 recipes for applying Data Analysis and Machine Learning techniques in R. I was asked by the publishers to review this book and found it to be an interesting and informative read. It will not help you understand how Machine Learning works (that’s not the goal!) but it will help you quickly learn how to apply Machine Learning techniques to your own problems.

The recipes are broken down into chapters which address the following topics:

- Installing R and an Introduction to R
- Exploring an Example Data Set
- Basic Statistics in R
- Regression Analysis (including Poisson and Binomial models)
- Classification (Decision Trees, Nearest Neighbour, Logistic Regression and Naïve Bayes)
- Classification (Neural Networks and SVMs)
- Model Evaluation and Comparison
- Ensemble Models
- Clustering
- Mining Associations and Sequences
- Dimensionality Reduction
- Big Data and Integration of R with Hadoop

This is a relatively exhaustive list of topics. The last chapter might better have been omitted, but it still provides a useful introduction to using R with massive data sets.

Each recipe in the book is divided into four parts entitled “Getting Ready”, “How to do it…”, “How it works…” and “See also”. This is a clever structure and an intuitive way to organise the material. In general, the “Getting Ready” part provides sufficient background material to prepare you for the task at hand. “How to do it…” presents the meat of the recipe as a step-by-step procedure. The intention of “How it works…” is to explain how and why the recipe works. In many instances the explanations are somewhat superficial or reference details which are not discussed in sufficient depth, but they are generally helpful. The “See also” part provides links to additional material or alternative ways to solve the same problem.

There are some errors in the text and sometimes the language and grammar are imperfect. However, if you want to learn more about using R for Machine Learning, this might be a useful book to have in your collection. I should note that all of these recipes could easily be constructed from online resources, but this book has the merit of assembling them all in one place.

The post Review: Machine Learning with R Cookbook appeared first on Exegetic Analytics.


(This article was first published on **4D Pie Charts » R**, and kindly contributed to R-bloggers)

“Assertion” is computer-science jargon for a run-time check on your code. In R, this typically means function argument checks (“did they pass a numeric vector rather than a character vector into your function?”) and data quality checks (“does the date-of-birth column contain values in the past?”).
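As a minimal sketch of both kinds of check, using only base R’s `stopifnot()` (the packages below provide richer variants with friendlier error messages; the function and data here are made up for illustration):

```r
# Function-argument check: fail fast on bad input.
column_mean <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0)
  mean(x)
}

# Data-quality check: dates of birth must lie in the past.
dob <- as.Date(c("1980-05-01", "1999-12-31"))
stopifnot(all(dob <= Sys.Date()))

column_mean(c(1, 2, 3))      # 2
# column_mean(letters)       # error: is.numeric(x) is not TRUE
```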

R currently has four packages for assertions: `assertive`, which is mine; `assertthat` by Hadley Wickham; `assertr` by Tony Fischetti; and `ensurer` by Stefan Bache.

Having four packages feels like too many; we’re duplicating effort, and it makes package choice too hard for users. I didn’t know about the existence of `assertr` or `ensurer` until a couple of days ago, but the useR conference has helped bring these rivals to my attention. I’ve chatted with the authors of the other three packages to see if we can streamline things a little.

Hadley said that `assertthat` isn’t a high priority for him – dplyr, ggplot2 and tidyr (among many others) are more important – so he’s not going to develop it further. Since `assertthat` is mostly a subset of `assertive` anyway, this shouldn’t be a problem. I’ll take a look at how easy it is to provide an `assertthat` API, so existing users can have a direct replacement.

Tony said that the focus of `assertr` is predominantly data checking. It only works with data frames, and has a more limited remit than `assertive`. He plans to change the backend to be built on top of `assertive`. That is, `assertr` will be an `assertive` extension that makes it easy to apply assertions to multiple columns in data frames.

Stefan has stated that he prefers to keep `ensurer` separate, since it has a different philosophical stance to `assertive`, and I agree. `ensurer` is optimised for being lightweight and elegant; `assertive` is optimised for clarity of user code and clarity of error messages (at a cost of some bulk).

So overall, we’re down from four distinct assertion packages to two groups (`assertive`/`assertr` and `ensurer`). This feels sensible. It’s the optimum number for minimizing duplication while still having some competition to spur development onwards.

`ensurer` has one feature in particular that I definitely want to include in `assertive`: you can create type-safe functions.

The question of bulk has also been playing on my mind for a while. It isn’t huge by any means – the tar.gz file for the package is 836kB – but the number of functions can make it a little difficult for new users to find their way around. A couple of years ago when I was working with a lot of customer data, I included functions for checking things like the validity of UK postcodes. These are things that I’m unlikely to use at all in my current job, so it seems superfluous to have them. That means that I’d like to make `assertive` more modular. The core things should be available in an `assertive.base` package, with specialist assertions in additional packages.

I also want to make it easier for other package developers to include their own assertions in their packages. This will require a bit of rethinking about how the existing assertion engine works, and what internal bits I need to expose.

One bit of feedback I got from the attendees at my tutorial this week was that for simulation usage (where you call the same function millions of times), assertions can slow down the code too much. So a way to turn off the assertions (but keep them there for debugging purposes) would be useful.
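One possible shape for such a switch, sketched here with a base R option (hypothetical names; not the actual `assertive` API):

```r
# A global flag controls whether checks run at all.
options(assertions.enabled = TRUE)

# Run the checks only when the flag is on; otherwise a no-op.
check_that <- function(...) {
  if (isTRUE(getOption("assertions.enabled"))) stopifnot(...)
  invisible(TRUE)
}

simulate_step <- function(x) {
  check_that(is.numeric(x), x >= 0)
  sqrt(x)
}

simulate_step(4)                     # 2, with checks on
options(assertions.enabled = FALSE)  # switch checks off for speed
total <- sum(vapply(1:10000, simulate_step, numeric(1)))
```

The checks stay in the source for debugging, but cost only a single `getOption()` lookup per call when disabled.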

The top feature request, however, was for pipe compatibility. Stefan’s `magrittr` package has rocketed in popularity (I’m a huge fan), so this definitely needs implementing. It should be a small fix, so I should have it included soon.

There are some other small fixes, like better NA handling and a better error message for `is_in_range`, that I plan to make soon.

The final (rather non-trivial) feature I want to add to assertive is support for error messages in multiple languages. The infrastructure for translations is in place (it currently supports both the languages that I know: British English and American English); I just need some people who can speak other languages to do the translations. If you are interested in translating, drop me an email or let me know in the comments.



(This article was first published on **Pingax » R**, and kindly contributed to R-bloggers)

Welcome to the blog post! It’s been a long time since I wrote the last post. I was recently searching for big data tools in R and I found the sparkR package. A few months back I heard about it when it was a separate project on GitHub. Databricks is actively working on the sparkR package, and they officially announced its integration with Apache Spark. In this post, I will discuss how to configure sparkR with RStudio in Ubuntu 12.04 and get started using it.

In order to use the sparkR package, we simply need to follow a few steps. Make sure you have already configured the latest Spark distribution on your system.

Here we go..

**Step 1:** Generate the sparkR library from the source code that comes with the latest Spark distribution (1.4.0).

Open a terminal, navigate to “spark-1.4.0/R” and run the command “./install-dev.sh” as shown below.

This will generate a lib folder under the directory “spark-1.4.0/R” as shown below.

**Step 2:** Open RStudio and load the sparkR library as shown below.

**Step 3:** Initialize the Spark context and create a sparkR data frame as shown below.

That’s it! Complete R code is shown below.

```r
# Load libraries
library("rJava")
library(SparkR, lib.loc = "Path to library")
# In my case:
# library(SparkR, lib.loc = "/home/amar/Downloads/spark-1.4.0/R/lib")

# Initialize the Spark context
sc <- sparkR.init(sparkHome = "Path to sparkHome")
# In my case:
# sc <- sparkR.init(sparkHome = "/home/amar/Downloads/spark-1.4.0")

# Initialize the sqlContext
sqlContext <- sparkRSQL.init(sc)

# Create a SparkR DataFrame from an R data frame
SparkDf <- createDataFrame(sqlContext, faithful)
head(SparkDf)
```

Enjoy using sparkR. Write your comments below if you face any difficulties.


The post SparkR with Rstudio in Ubuntu 12.04 appeared first on Pingax.


(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

The latest worldwide R user conference has just wrapped up in Aalborg, Denmark, and useR! 2015 was the best yet. A hearty round of applause to the organizers for a smoothly run, informative and fun event. To the organizers of next year's event in Stanford, California: the bar has been raised.

As I was chatting to various participants, the same refrain came up again and again: R has gone mainstream — and that's a *good* thing. It wasn't just the size of the conference (more than 650 R users from 49 nations were in attendance): the strong representation from industry (half of the participants were using R in the commercial sector) and the participation of many vendors working with R, together with their partnership in the R Consortium, gave the conference more of a "big software" feel. And yet, the conference still maintained a strong sense of community, a cutting-edge academic/research track, and a great sense of fun (especially the trip to the Robber's Den in Rold Forest, featuring axe-throwing, log-sawing and much merriment). If you weren't there in person, you can catch up on some of the presentations and escapades at the #user2015 hashtag on Twitter.

> Unlabeled axes. What a shame. #useR2015 pic.twitter.com/U2iAWCIiwJ
>
> — Karthik Ram (@_inundata) July 2, 2015

For me, I think the theme that stood out the most was the diverse and impactful real-world applications of R. In the space of just a couple of days, I saw how:

- R and Hadoop are used to make the Tribal Wars game engaging for 150M players
- Blind statisticians interact with data and visualizations via speech and Braille from R
- R is used to teach high schoolers to code, as their first programming language
- R is used to help veterinarians identify lameness in horses using 3-D accelerometers
- R is used to predict natural gas consumption in the Czech Republic
- R is used to estimate the biomass of Norwegian forests from tree measurements and airborne laser scans
- R is used to generate more green energy from wind turbines

… and much, much more in the parallel tracks that I wasn't able to attend. One thing this conference demonstrated is that R is much more than software: it's also the extensions that come from its vast developer community, and the applications it's put to by its even larger user community. In this morning's keynote, there was a quote that said it perfectly:

"R is a free software community for statistical computing and graphics" — Thomas Lumley.

Indeed. I'm proud to be part of such an amazing community which made useR! 2015 so special. Thanks to all who attended, and again: a very special thanks to the organizers for such a wonderful event.


(This article was first published on **MilanoR**, and kindly contributed to R-bloggers)

As written by my colleague Nicola in the post of day 1, yesterday evening I had some beers with some folks from the conference. After having seen the sunrise, I went to sleep at 3 AM, but this morning I was nonetheless ready for the second day of useR!

After a keynote session by Di Cook on the main interactive plotting methods used with S and R over the last twenty years, I attended the *Regression* session, where I saw two interesting developments of LMMs and GLMMs among other specific topics. Nicola, instead, went to the *Commercial Offerings* session and got in touch with several R implementations for business.

In the afternoon I went to the *Teaching 1* session, which covered tips and hints for effective lectures with R. In particular I found *Manipulation of Discrete Random Variables in R with DiscreteRV* by Eric Hare, who proposes a package to explain random variables in R in a very comprehensible way, very interesting. Nicola went to the *Visualization 1* session and found *tmap: creating thematic maps in a flexible way* by Martijn Tennekes very useful for his work.

Tonight there will be the useR Conference Dinner. The dinner will take place in Rold Forest, the second largest forest in Denmark. This morning we were not supposed to participate in this event but… a kind guy left the conference early and gave me his ticket for free. Quantide bought a ticket for my colleague as well, so during the dinner we will also be able to take part in competitive games such as log sawing, axe hurling and archery!

The last oral session I attended at this useR! was *Machine Learning 2*, where I found *Machine Learning for Internal Product Measurements* by Douglas Mason the most interesting talk. Nicola went to the *Training 2* session and was struck by Gail Potter speaking about *Web Applications Teaching Tools for Statistics Using Shiny and R*. The afternoon keynote speaker was Susan Holmes, who dealt with human microbiome data.

If we don’t miss the bus back at 11 PM, tomorrow we will still be here at useR and will share our experience with you. See you tomorrow with Chronicles from useR!
