(This article was first published on ** OpenCPU**, and kindly contributed to R-bloggers)

Following a few weeks of testing, OpenCPU 1.5 has been released. OpenCPU is a production-ready framework for embedded statistical computing with R. The system provides a neat API for remotely calling R functions over HTTP via e.g. JSON or Protocol Buffers. The OpenCPU server implementation is very stable and has been thoroughly tested. It runs on all major Linux distributions and plays nicely with the RStudio Server IDE (demo).
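To give a taste of that API, here is a minimal sketch of calling an R function on a remote OpenCPU server from within R itself, using the `httr` and `jsonlite` packages (any HTTP client works; the example targets the public demo server, so substitute your own host as needed):

```r
# POST /ocpu/library/{package}/R/{function}/json runs the function
# server-side and returns its result as JSON in the response body.
library(httr)
library(jsonlite)

resp <- POST(
  "https://public.opencpu.org/ocpu/library/stats/R/rnorm/json",
  body = list(n = 3, mean = 10),   # function arguments, sent as JSON
  encode = "json"
)
stop_for_status(resp)

result <- fromJSON(content(resp, "text"))
print(result)   # a numeric vector of 3 draws centered around 10
```

The same pattern works for any function in any installed package; only the URL changes.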

Similar to Shiny, OpenCPU has a single-user/development edition that runs within the interactive R session, and a multi-user (cloud) server for deployments on Linux. Unlike Shiny, however, the cloud server comes at no extra cost. On the contrary: you are encouraged to take advantage of the cloud server, which is much faster and includes cool features like user libraries, concurrent sessions, continuous integration, customizable security policies, etc.

The OpenCPU API itself has not changed from the 1.4 branch, but the entire underlying stack has been upgraded, hence the version bump. The server now builds on:

- R 3.2.1
- stringi 0.5-5
- jsonlite 0.9.16
- devtools 1.8.0
- RStudio 0.99 (optional)

Navigate to `/ocpu/info` on your OpenCPU server to inspect the exact versions of all packages used by the system.

In addition to an upgraded package library, this version includes many small tweaks for the deb/rpm installation packages and docker files. Redhat distributions like Fedora and CentOS are now automatically configured with the required SELinux policies.

The download page has instructions for installing the opencpu server on various distributions, either from source or using precompiled binaries. To upgrade an existing installation of opencpu on ubuntu, simply run:

```
sudo add-apt-repository ppa:opencpu/opencpu-1.5
sudo apt-get update
sudo apt-get dist-upgrade
```

Note that this will also upgrade the version of R to 3.2.1 (if you have not already done so) which might require that you reinstall some of your R packages.

For those completely new to OpenCPU, there are several resources to get started. The presentation from last year’s useR conference gives a broad overview of the system, including some basic demos. The example apps and jsfiddle scripts show how to use the opencpu.js JavaScript client. The server manual contains documentation on configuring your OpenCPU cloud server (although installation should work out of the box).

Finally, this paper from my thesis describes more generally the challenges of embedded scientific computing, and the benefits (both technical and human) of decoupling your statistical computing from your front-end or application layer.

To deploy your OpenCPU apps on the public server, simply push your R package to Github and configure the webhook in your repository. Whenever you push an update to Github, the package will be reinstalled on the server and can directly be used remotely by anyone on the internet. You can use either the full url or the `ocpu.io` shorthand url:

`https://public.opencpu.org/ocpu/github/{username}/{package}/`

`https://{username}.ocpu.io/{package}/`

These urls are fully equivalent. Simply replace `{username}` with your Github username, and `{package}` with your package name. Note that the package name must be identical to the Github repository name (as is usually the case).

One prerequisite for using OpenCPU is knowing how to create an R package. There is no way around this; packages are the natural container format for shipping and deploying code/data/manuals in R, and the OpenCPU API assumes this format. Luckily, writing R packages is super easy these days and can be done in less than 10 seconds using, for example, RStudio.

The good thing is that once you have passed this little hurdle, the full power and flexibility of R and its packaging system become available to your applications and APIs. Hadley’s latest book on writing R packages gives a nice overview of the R packaging system, and the OpenCPU API provides an easy HTTP interface to all of these features.

To **leave a comment** for the author, please follow the link and comment on his blog: ** OpenCPU**.


(This article was first published on ** » R**, and kindly contributed to R-bloggers)

Just a few hours before Greeks head to the polls to decide on the bailout agreement, and ultimately whether the country will stay in the euro, neither side holds an overwhelming advantage. Actually, the margin has become blurred over the last three days, with the “Yes” position staging a last-minute recovery. Despite this last-minute trend, the aggregate preference for the “No” is not too far behind. To frame this in terms of probabilities, that is, the probability that the “Yes” share exceeds the “No” share, I adapted a short function written a while ago to simulate from a Dirichlet distribution, and then to compute the posterior probabilities shown in the chart below. It’s really nothing, but the “Yes” outperformed the “No” in 57% of the simulations.

The polls were aggregated, and the “Don’t know” respondents were distributed according to the Yes/No proportions reported by the polls.
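The simulation itself fits in a few lines of R. The sketch below uses hypothetical aggregated counts (not the actual poll data) with a uniform Dirichlet prior over the two shares, which reduces to drawing independent gamma variates and normalising:

```r
set.seed(42)
yes_count <- 570    # hypothetical aggregated "Yes" respondents
no_count  <- 530    # hypothetical aggregated "No" respondents
n_sims    <- 100000

# Dirichlet(yes + 1, no + 1) draws via normalised gamma variates
g_yes <- rgamma(n_sims, shape = yes_count + 1)
g_no  <- rgamma(n_sims, shape = no_count + 1)
p_yes <- g_yes / (g_yes + g_no)

# Posterior probability that the "Yes" share exceeds the "No" share
mean(p_yes > 0.5)
```

With real data, the poll counts would replace the hypothetical ones, and the resulting proportion of draws with `p_yes > 0.5` is the probability quoted in the post.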


(This article was first published on ** » R**, and kindly contributed to R-bloggers)

Greeks have been quite volatile in their opinion on whether they should accept a proposal by the country’s creditors for more austerity to keep aid flowing. The polls conducted over this week look erratic, though that “belly” was likely provoked by anxiety over what comes next after Greece failed to pay the IMF back.

The data were collected on the internet, most of them assembled by http://metapolls.net


(This article was first published on ** Engaging Market Research**, and kindly contributed to R-bloggers)

The genre shirt asks, “What kind of music do u listen 2?”

Microgenres exist because markets are fragmenting and marketers need names to attract emerging customer segments with increasingly specific preferences. The cost of producing and delivering music now supports a plenitude of joint pairings of recordings and customers. The coevolution of music and its audience binds together listening preferences and available alternatives.

I already know a good deal about your preferences by simply knowing that you listen to German Hip Hop or New Orleans Jazz (see the website Every Noise at Once). Those microgenres are not accidental but were named in order to broadcast the information that customers need in order to find what they want to buy and at the same time that artists require to market their goods. Matchmaking demands its own vocabulary. Over time, the language adapts and only the “fittest” categories survive.

**R Makes It Easy**

The R package NMF simplifies the analysis as I demonstrated in my post on Modeling Plenitude and Speciation. Unfortunately, the data in that post were limited to only 17 broad music categories, but the R code would have been the same had there been several hundred microgenres or several thousand songs.

The output is straightforward once you understand what nonnegative matrix factorization (NMF) is trying to accomplish. All matrix factorizations, as the name implies, attempt to identify “simpler” matrices or factors that will reproduce approximately the original data matrix when multiplied together. Simpler, in this case, means that we replace the many observed variables with a much smaller number of latent variables. The belief is that these latent variables will simultaneously account for both row cliques and column microgenres as they coevolve.

This is the matrix factorization diagram that I borrowed from Wikipedia to illustrate the process.

The original data matrix V, holding the non-negative numbers indicating the listening intensities for every microgenre, is approximated by multiplying a customer graded-membership matrix W times a matrix of factor loadings for the observed variables H. The columns of W and the rows of H reflect the number of latent variables. It can be argued that V contains noise, so that reproducing it exactly would be overfitting. Thus, nothing of “true” value is lost by replacing the original observations in V with the customer latent scores in W. As you can see, W has fewer columns, but I can always approximate V by using the factor loadings in H as a “decoder” to reconstitute the observations without noise (a form of data compression).

We start our interpretation of the output with H, containing the factor loadings for the observed variables. For example, we might expect to see one of the latent variables reflecting interest in classical music, with sizable factor loadings for the microgenres associated with such music. One of these microgenres could serve as an “anchor” that defines this latent feature if only the most ardent classical music fans listened to this specific microgenre (e.g., Baroque). Hard rock or jazz might be anchors for other latent variables. If we have been careful to select observed variables with high imagery (e.g., is it easy to describe the hard rock listener?), we will be able to name the latent variable based only on its anchors and other high loadings.

The respondent matrix W serves a similar function for customers. Each column is the same latent variable we just described in our interpretation of H. For every row, W gives us a set of graded membership indices telling us the degree to which each latent variable contributes to that respondent’s pattern of observed scores. The opera lover, for instance, would have a high membership index for the latent variable associated with opera. We would expect to find a similar pairing between the jazz enthusiasts and a jazz-anchored latent variable, assuming that this is the structure underlying the coevolving network of customer and product.

Once again, we need to remember that nonnegative matrix factorization attempts to reproduce the observed data matrix as closely as possible given the constraints placed on it by limiting the number of latent variables and requiring that all the entries be non-negative. The constraint on the number of latent variables should seem familiar to those who have some experience with principal component or factor analysis. However, unlike principal component analysis, NMF adds the restriction that all the elements of W and H must be non-negative. The result is a series of additive effects with latent variables defined as nonnegative linear combinations of observed variables (H) and respondents represented as convex combinations of those same latent variables (W). Moreover, latent variables can only add to the reproduced intensities as they attempt to approximate the original data matrix V. Lee and Seung explain this as “learning the parts of objects.”
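The “parts-based” flavour that Lee and Seung describe follows directly from these constraints. A base-R sketch of their multiplicative update rules makes the mechanics concrete (the data here are simulated; the actual analysis in the post uses the CRAN NMF package):

```r
set.seed(1)
n <- 50; m <- 17; k <- 3              # listeners, genres, latent variables
V <- matrix(runif(n * m), n, m)       # nonnegative "listening intensities"

W <- matrix(runif(n * k), n, k)       # graded memberships, to be learned
H <- matrix(runif(k * m), k, m)       # factor loadings, to be learned

for (iter in 1:200) {
  # Multiplicative updates: each step multiplies by a ratio of
  # nonnegative terms, so W and H stay nonnegative throughout.
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}

rel_err <- norm(V - W %*% H, "F") / norm(V, "F")
```

Because the factors can only add intensity, each latent variable ends up describing a “part” of the listening pattern rather than a bipolar contrast, which is what makes the anchors interpretable.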

All this works quite nicely in fragmented markets where products evolve along with customers desires to generate separate communities. In such coevolving networks, the observed data matrix tends to be sparse and requires a simultaneous clustering of rows and columns. The structure of heterogeneity or individual differences is a mixture of categories and graded category memberships. Customers are free to check as many boxes as they wish and order as much as they like. As a result, respondents can be “pure types” with only one large entry in their row of W and all the other values near zero, or they can be “hybrids” with several substantial membership entries in their row.

Markets will continue to fragment, with customers in control and the internet providing more and more customized products. We can see this throughout the entertainment industry (e.g., music, film, television, books, and gaming) but also in almost everything online or in retail warehouses. Increasingly, the analyst will be forced to confront the high-dimensional challenge of dealing with subspaces or local regions populated with non-overlapping customers and products. Block clustering of rows and columns is the first step when the data matrix is sparse, because these variables over here are relevant only for those respondents over there. Fortunately, R provides the link to several innovative methods for dealing with such cases (see my post The Ecology of Data Matrices).


(This article was first published on ** A Statistics Blog - R**, and kindly contributed to R-bloggers)

I stand in awe of the R wizards who post their latest and greatest Shiny feats on R-Bloggers. It’s taken me a while to find my way around Shiny, but at last I feel ready to fill in a few gaps for others who may be just starting out and who aspire to write reasonably involved, user-friendly simulation apps. To make it a bit more fun, I’ve written it up as an R Markdown document hosted on `shinyapps.io`: here you go.

Prerequisites are indicated in the tutorial; so are links to the source code. If you have feedback or suggestions for improvements, comment here or post an Issue on the GitHub site.


(This article was first published on ** MilanoR**, and kindly contributed to R-bloggers)

I survived the Conference Dinner and caught the last bus (you should have participated in useR or read the post of day 2 to understand…), and this morning I woke up in Aalborg for the third (and last) time.

After a keynote session by Thomas Lumley on *How flexible computing expands what an individual can do*, my colleague Enrico and I attended the *Sponsor Session*. This was one of the most interesting sessions of the conference, because it showed us a lot of tools for using R in business environments. The only Diamond Sponsor was *DataRobot*, whose platform focuses on modeling and prediction. Then it was the Platinum Sponsors’ turn: *RStudio* talked about its several tools (IDE, Shiny, R packages, shinyapps.io), *Teradata* showed how it integrates R into its platform for big data, while *Revolution Analytics*, now a division of Microsoft, concentrated its presentation on Azure. Finally, the Gold Sponsors had their ten minutes each: *Alteryx* showed its data pipelining engine and a visual programming framework, *TIBCO* introduced the TIBCO Enterprise Runtime for R (TERR), *H2O* illustrated its machine learning platform, while *HP* introduced HP Distributed R.

After the last keynote talk by Steffen Lauritzen on *Linear estimating equations for Gaussian graphical models with symmetry*, the conference ended with a goodbye until next year in Stanford, California, USA.

Now, Enrico and I are waiting for the bus which will take us to the airport. Tonight we will arrive home, and on Monday we will resume our work at Quantide, but Chronicles from useR! doesn’t stop here… In the coming weeks, Enrico and I will publish some great new articles, including photos, the funniest moments and curiosities, in-depth analyses of R topics, and so on…


(This article was first published on ** Exegetic Analytics » R**, and kindly contributed to R-bloggers)

“Machine Learning with R Cookbook” by Chiu Yu-Wei is nothing more or less than it purports to be: a collection of 110 recipes for applying Data Analysis and Machine Learning techniques in R. I was asked by the publishers to review this book and found it to be an interesting and informative read. It will not help you understand how Machine Learning works (that’s not the goal!), but it will help you quickly learn how to apply Machine Learning techniques to your own problems.

The recipes are broken down into chapters which address the following topics:

- Installing R and an Introduction to R
- Exploring an Example Data Set
- Basic Statistics in R
- Regression Analysis (including Poisson and Binomial models)
- Classification (Decision Trees, Nearest Neighbour, Logistic Regression and Naïve Bayes)
- Classification (Neural Networks and SVMs)
- Model Evaluation and Comparison
- Ensemble Models
- Clustering
- Mining Associations and Sequences
- Dimensionality Reduction
- Big Data and Integration of R with Hadoop

This is a relatively exhaustive list of topics. The last chapter might have better been omitted, but still provides a useful introduction to the use of R with massive data sets.

Each recipe in the book is divided into four parts entitled “Getting Ready”, “How to do it…”, “How it works…” and “See also”. This is a clever structure and an intuitive way to organise the material. In general, the “Getting Ready” part provides sufficient background material to prepare you for the task at hand. “How to do it…” presents the meat of the recipe as a step-by-step procedure. The intention with “How it works…” is to explain how and why the recipe works. In many instances the explanations are somewhat superficial or reference details which are not discussed in sufficient depth, but they are generally helpful. The “See also” part provides links to additional material or alternative ways to solve the same problem.

There are some errors in the text and sometimes the language and grammar are imperfect. However, if you want to learn more about using R for Machine Learning, this might be a useful book to have in your collection. I should note that all of these recipes could easily be constructed from online resources, but this book has the merit of assembling them all in one place.

The post Review: Machine Learning with R Cookbook appeared first on Exegetic Analytics.


(This article was first published on ** 4D Pie Charts » R**, and kindly contributed to R-bloggers)

“Assertion” is computer-science jargon for a run-time check on your code. In R, this typically means function argument checks (“did they pass a numeric vector rather than a character vector into your function?”), and data quality checks (“does the date-of-birth column contain values in the past?”).
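To make that concrete, here is a minimal base-R sketch of both kinds of check using `stopifnot()` (the packages discussed below wrap the same idea with richer, friendlier error messages; the function name is invented for illustration):

```r
# An argument check plus a data quality check, using base R's stopifnot()
mean_age_years <- function(dob) {
  stopifnot(inherits(dob, "Date"))    # argument check: must be a Date vector
  stopifnot(all(dob <= Sys.Date()))   # data quality check: no future births
  mean(as.numeric(Sys.Date() - dob)) / 365.25
}

mean_age_years(as.Date(c("1980-05-01", "1995-11-23")))
```

Passing a character vector, or a date in the future, stops execution with an error instead of silently producing nonsense downstream.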

R currently has four packages for assertions: `assertive`, which is mine; `assertthat` by Hadley Wickham; `assertr` by Tony Fischetti; and `ensurer` by Stefan Bache.

Having four packages feels like too many; we’re duplicating effort, and it makes package choice too hard for users. I didn’t know about the existence of `assertr` or `ensurer` until a couple of days ago, but the useR conference has helped bring these rivals to my attention. I’ve chatted with the authors of the other three packages to see if we can streamline things a little.

Hadley said that `assertthat` isn’t a high priority for him – dplyr, ggplot2 and tidyr (among many others) are more important – so he’s not going to develop it further. Since `assertthat` is mostly a subset of `assertive` anyway, this shouldn’t be a problem. I’ll take a look at how easy it is to provide an `assertthat` API, so existing users can have a direct replacement.

Tony said that the focus of `assertr` is predominantly data checking. It only works with data frames, and has a more limited remit than `assertive`. He plans to change the backend to be built on top of `assertive`. That is, `assertr` will be an `assertive` extension that makes it easy to apply assertions to multiple columns in data frames.

Stefan has stated that he prefers to keep `ensurer` separate, since it has a different philosophical stance to `assertive`, and I agree. `ensurer` is optimised for being lightweight and elegant; `assertive` is optimised for clarity of user code and clarity of error messages (at a cost of some bulk).

So overall, we’re down from four distinct assertion packages to two groups (`assertive`/`assertr` and `ensurer`). This feels sensible. It’s the optimum number for minimizing duplication while still having some competition to spur development onwards.

`ensurer` has one feature in particular that I definitely want to include in `assertive`: you can create type-safe functions.

The question of bulk has also been playing on my mind for a while. The package isn’t huge by any means – the tar.gz file is 836kB – but the number of functions can make it a little difficult for new users to find their way around. A couple of years ago, when I was working with a lot of customer data, I included functions for checking things like the validity of UK postcodes. These are things that I’m unlikely to use at all in my current job, so it seems superfluous to have them. That means that I’d like to make `assertive` more modular. The core things should be available in an `assertive.base` package, with specialist assertions in additional packages.

I also want to make it easier for other package developers to include their own assertions in their packages. This will require a bit of rethinking about how the existing assertion engine works, and what internal bits I need to expose.

One bit of feedback I got from the attendees at my tutorial this week was that for simulation usage (where you call the same function millions of times), assertions can slow down the code too much. So a way to turn off the assertions (but keep them there for debugging purposes) would be useful.
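One hypothetical way to do this (not `assertive`'s actual API, just a sketch of the idea) is to gate every check behind a global option, so that simulations can switch the checks off while debugging runs keep them:

```r
# Assertions that can be disabled globally via options()
options(assertions.enabled = TRUE)

check <- function(condition) {
  # When the option is FALSE, skip the check entirely (near-zero overhead)
  if (isTRUE(getOption("assertions.enabled"))) stopifnot(condition)
  invisible(TRUE)
}

simulate_once <- function(x) {
  check(is.numeric(x))   # skipped when assertions are disabled
  sum(x^2)
}

options(assertions.enabled = FALSE)   # e.g. inside a million-iteration loop
simulate_once(rnorm(10))
```

The check itself still costs a function call; a compile-time or byte-code approach could eliminate even that, but an option is the simplest starting point.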

The top feature request, however, was pipe compatibility. Stefan’s `magrittr` package has rocketed in popularity (I’m a huge fan), so this definitely needs implementing. It should be a small fix, so I should have it included soon.

There are some other small fixes, like better NA handling and a better error message for `is_in_range`, that I plan to make soon.

The final (rather non-trivial) feature I want to add to `assertive` is support for error messages in multiple languages. The infrastructure for translations is in place (it currently supports both the languages that I know: British English and American English); I just need some people who speak other languages to do the translations. If you are interested in translating, drop me an email or let me know in the comments.

