(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

Another minor maintenance release of the RcppCNPy package arrived on CRAN this evening.

RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.

There is only a small code change: a path is now checked before an attempt to save. Thanks to Wush for the suggestion. I also added a short new vignette showing how reticulate can be used for NumPy data.

## Changes in version 0.2.9 (2018-03-22)

- The `npySave` function has a new option to check the path in the given filename.
- A new vignette was added showing how the `reticulate` package can be used instead.
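As a minimal sketch of the approach the new vignette covers (the filename here is illustrative, not taken from the vignette), NumPy data can be read through `reticulate` like this:

```r
# Sketch: reading a NumPy file via reticulate (filename is illustrative)
library(reticulate)

np <- import("numpy")          # requires a Python installation with NumPy
mat <- np$load("fmat.npy")     # comes back as an R matrix/array
dim(mat)
```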

CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on their blog: **Thinking inside the box**.


(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

During a discussion with some other members of the R Consortium, the question came up: who maintains the most packages on CRAN? DataCamp maintains a list of most active maintainers by downloads, but in this case we were interested in the total number of packages by maintainer. Fortunately, this is pretty easy to figure out thanks to the CRAN repository tools now included in R, and a little dplyr (see the code below) gives the answer quickly[*].

And the answer? The most prolific maintainer is Scott Chamberlain from ROpenSci, who is currently the maintainer of 77 packages. Here's a list of the top 20:

|    | Maint              |  n |
|---:|--------------------|---:|
|  1 | Scott Chamberlain  | 77 |
|  2 | Dirk Eddelbuettel  | 53 |
|  3 | Jeroen Ooms        | 40 |
|  4 | Hadley Wickham     | 39 |
|  5 | Gábor Csárdi       | 37 |
|  6 | ORPHANED           | 37 |
|  7 | Thomas J. Leeper   | 29 |
|  8 | Bob Rudis          | 28 |
|  9 | Henrik Bengtsson   | 28 |
| 10 | Kurt Hornik        | 28 |
| 11 | Oliver Keyes       | 28 |
| 12 | Martin Maechler    | 27 |
| 13 | Richard Cotton     | 27 |
| 14 | Robin K. S. Hankin | 24 |
| 15 | Simon Urbanek      | 24 |
| 16 | Kirill Müller      | 23 |
| 17 | Torsten Hothorn    | 23 |
| 18 | Achim Zeileis      | 22 |
| 19 | Paul Gilbert       | 21 |
| 20 | Yihui Xie          | 21 |

(That list of orphaned packages with no current maintainer includes XML, d3heatmap, and flexclust, to name just 3 of the 37.) Here's the R code used to calculate the top 20:
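The original code embedded in the post is not reproduced here; as a rough sketch of the same idea, one could combine `tools::CRAN_package_db()` with a little `dplyr` (the exact name-normalising regular expression the author mentions is not reconstructed here, only a crude stand-in):

```r
# Sketch: count CRAN packages per maintainer (not the author's original code)
library(dplyr)

pkgs <- tools::CRAN_package_db()   # needs network access to a CRAN mirror
pkgs %>%
  transmute(Package,
            Maint = trimws(gsub("<.*>", "", Maintainer)),  # drop the email part
            Maint = gsub('"', "", Maint)) %>%              # crude de-quoting
  distinct(Package, Maint) %>%
  count(Maint, sort = TRUE) %>%
  head(20)
```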

[*] Well, it would have been quick, until I noticed that some maintainers had two forms of their name in the database, one with surrounding quotes and one without. It seemed like it was going to be trivial to fix with a regular expression, but it took me longer than I hoped to come up with the final regexp on line 4 above, which is now barely distinguishable from line noise. As usual, there is an xkcd for this situation:


(This article was first published on **R – Modern Forecasting**, and kindly contributed to R-bloggers)

One of the basic things students are taught in statistics classes is that the comparison of models using information criteria can only be done when the models have the same response variable. This means, for example, that when you take \(\log(y_t)\) and calculate AIC, this value is not comparable with the AIC from a model of \(y_t\). The reason is that the scales of the variables are different. But there is a way to make the criteria in these two cases comparable: both variables need to be transformed to the original scale, and we need to understand what the distributions of these variables are in that scale. In our example, if we assume that \(\log(y_t) \sim \mathcal{N}(0, \sigma^2_{l}) \) (where \(\sigma^2_{l}\) is the variance of the residuals of the model in logarithms), then the exponent of this variable will have a log-normal distribution:

\begin{equation}

y_t \sim \text{log}\mathcal{N}(0, \sigma^2_{l})

\end{equation}

Just as a reminder, all the information criteria rely on the log-likelihood. For example, here’s the formula of AIC:

\begin{equation} \label{eq:AIC}

AIC = 2k -2\ell ,

\end{equation}

where \(k\) is the number of all the estimated parameters and \(\ell\) is the value of the log-likelihood.

If we use the likelihood of the log-normal distribution instead of the likelihood of the normal one in \eqref{eq:AIC} for the variable \(y_t\), then the information criteria become comparable. In order to understand what needs to be done for this transformation, we need to compare the density functions of the two distributions: normal and log-normal. Here's the normal for the variable \(\log y_t\):

\begin{equation} \label{eq:normal}

f(y_t | \theta, \sigma^2_{l}) = \frac{1}{\sqrt{2 \pi \sigma^2_{l}}} e ^{-\frac{\left(\log y_t -\log \mu_{t} \right)^2}{2 \sigma^2_{l}}}

\end{equation}

and here’s the log-normal for the variable \(y_t = \exp(\log(y_t))\) (the multiplicative model in the original scale):

\begin{equation} \label{eq:log-normal}

f(y_t | \theta, \sigma^2_{l}) = \frac{1}{y_t} \frac{1}{\sqrt{2 \pi \sigma^2_{l}}} e ^{-\frac{\left(\log y_t -\log \mu_{t} \right)^2}{2 \sigma^2_{l}}} ,

\end{equation}

where \(\theta\) is the vector of parameters of our model. The main difference between the two distributions is in \(\frac{1}{y_t}\). If we derive the log-likelihood based on \eqref{eq:log-normal}, here is what we get:

\begin{equation} \label{eq:loglikelihoodlognormal}

\ell(\theta, \sigma^2_{l} | Y) = -\frac{1}{2} \left( T \log \left( 2 \pi \sigma^2_{l} \right) +\sum_{t=1}^T \frac{\left(\log y_t -\log \mu_{t} \right)^2}{\sigma^2_{l}} \right) - \sum_{t=1}^T \log y_t ,

\end{equation}

where \(Y\) is the vector of all the actual values in the sample. When we extract likelihood of the model in logarithms, we calculate only the first part of \eqref{eq:loglikelihoodlognormal}, before the “\(-\sum_{t=1}^T \log y_t \)”, which corresponds to the normal distribution. So, in order to produce the likelihood of the model with the variable in the original scale, we need to subtract the sum of logarithms of the response variable from the extracted likelihood.

The function `AIC()` in R, applied to the model in logarithms, will extract the value based on that first part of \eqref{eq:loglikelihoodlognormal}. As a result, in order to fix this and get an AIC on the same scale as the variable \(y_t\), we need to take the remaining part into account, modifying equation \eqref{eq:AIC}:

\begin{equation} \label{eq:AICNew}

AIC^{\prime} = 2k -2\ell + 2 \sum_{t=1}^T \log y_t = AIC + 2 \sum_{t=1}^T \log y_t .

\end{equation}

Let’s see an example in R. We will use the `longley` data from the `datasets` package. First we construct additive and multiplicative models:

```r
modelAdditive <- lm(GNP ~ Employed, data = longley)
modelMultiplicative <- lm(log(GNP) ~ Employed, data = longley)
```

And extract the respective AICs:

```r
AIC(modelAdditive)
# 142.7824
AIC(modelMultiplicative)
# -44.5661
```

As we see, the values are not comparable. Now let’s modify the second AIC:

```r
AIC(modelMultiplicative) + 2 * sum(log(longley$GNP))
# 145.118
```

So, now the values are comparable, and we can conclude that the first model (additive) is better than the second one in terms of AIC.
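As a quick sanity check of the adjustment, one can recompute the corrected AIC directly from the extracted log-likelihood and verify it matches the formula above (a sketch, using the same `longley` model):

```r
# Verify: subtracting sum(log(y)) from the log-likelihood reproduces
# the corrected AIC' = AIC + 2*sum(log(y))
fit <- lm(log(GNP) ~ Employed, data = longley)
ll <- logLik(fit)                 # likelihood of the model in logarithms
k <- attr(ll, "df")               # number of estimated parameters
llLogNormal <- as.numeric(ll) - sum(log(longley$GNP))
aicCorrected <- 2 * k - 2 * llLogNormal
all.equal(aicCorrected, AIC(fit) + 2 * sum(log(longley$GNP)))
# TRUE
```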

A similar technique can be used for other transformed response variables (square root, Box-Cox transformation), but the respective distributions of the variables would need to be derived, which is not always a simple task.


(This article was first published on **R – The R Trader**, and kindly contributed to R-bloggers)

I recently came across the excellent article “Machine learning at central banks”, which I decided to use as the basis for a new cheat sheet called **Machine Learning Modelling in R**. The cheat sheet can be downloaded from the RStudio cheat sheets repository.

As the R ecosystem is now far too rich to present all available packages and functions, this cheat sheet is by no means exhaustive. It’s rather a guide to what I consider to be the most useful tools available in R for modelling. It also contains a standard modelling workflow, which represents my view on how to approach a generic modelling exercise. Any comments welcome!


(This article was first published on **R – Xi'an's Og**, and kindly contributed to R-bloggers)

While in Brussels last week I noticed an interesting question on X validated, which I considered in the train back home and then some more over the weekend. This is a question about spacings, namely: how long on average does it take to cover an interval of length L when drawing unit intervals at random (with a torus handling of the endpoints)? Which immediately reminded me of Wilfrid Kendall's (Warwick) famous gif animation of coupling from the past via leaves covering a square region, from the top (forward) and from the bottom (backward)…

The problem is rather easily expressed in terms of uniform spacings, more specifically on the maximum spacing being less than 1 (or 1/L, depending on the parameterisation). Except for the additional constraint at the boundary, which is not independent of the other spacings. Replacing this extra event with an independent spacing, there exists a direct formula for the expected stopping time, which can be checked rather easily by simulation. But the exact case appears to add a few more steps to the draws, 3/2 apparently. The following graph displays the regression of the Monte Carlo number of steps, over 10⁴ replicas, against the exact values:
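The stopping time is easy to simulate; here is a minimal sketch (the discretisation of the circle into cells is an approximation I introduce for illustration, not part of the original analysis):

```r
# Sketch: number of random unit intervals needed to cover a circle of
# circumference L (torus handling of endpoints), via a discretised circle
coverTime <- function(L, cells = 1000) {
  grid <- seq(0, L, length.out = cells + 1)[-(cells + 1)]
  covered <- rep(FALSE, cells)
  n <- 0
  while (!all(covered)) {
    n <- n + 1
    x <- runif(1, 0, L)                    # left endpoint of the unit interval
    covered <- covered | (((grid - x) %% L) < 1)
  }
  n
}

set.seed(1)
mean(replicate(100, coverTime(3)))         # Monte Carlo expected stopping time
```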


(This article was first published on **R – Win-Vector Blog**, and kindly contributed to R-bloggers)

There are a number of easy ways to avoid illegible code nesting problems in `R`. In this R tip we will expand upon the above statement with a simple example.

At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.

```r
head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")])
#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25
```

One popular way to break up nesting is to use `magrittr`’s “`%>%`” in combination with `dplyr` transform verbs, as we show below.

```r
library("dplyr")

mtcars %>%
  filter(cyl == 8) %>%
  select(mpg, cyl, wt) %>%
  head
#    mpg cyl   wt
# 1 18.7   8 3.44
# 2 14.3   8 3.57
# 3 16.4   8 4.07
# 4 17.3   8 3.73
# 5 15.2   8 3.78
# 6 10.4   8 5.25
```

Note: the above code lost (without warning) the row names that are part of `mtcars`. We also pass over the details of how pipe notation works. It is sufficient to say the notational convention is: each stage is approximately treated as an altered function call with a new inserted first argument set to the value of the pipeline up to the current point.

Many `R` users *already* routinely avoid nested notation problems through a convention I call “name re-use.” Such code looks like the following.

```r
result <- mtcars
result <- filter(result, cyl == 8)
result <- select(result, mpg, cyl, wt)
head(result)
```

The above convention is enough to get around all problems of nesting. It also has the great advantage that it is step-debuggable.

I like a variation I call “dot intermediates”, which looks like the code below (notice we are switching back from `dplyr` verbs to base `R` operators).

```r
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .
head(result)
#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25
```

The dot intermediate convention is very succinct, and we can use it with base `R` transforms to get a correct (and performant) result. Like all conventions: it is just a matter of teaching, learning, and repetition to make this seem natural, familiar and legible.

```r
library("dplyr")
library("microbenchmark")
library("ggplot2")

timings <- microbenchmark(
  base = {
    . <- mtcars
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcars %>%
      filter(cyl == 8) %>%
      select(mpg, cyl, wt) %>%
      nrow
  })
print(timings)
## Unit: microseconds
##   expr      min       lq      mean   median       uq       max neval
##   base  122.948  136.948  167.2253  159.688  179.924   349.328   100
##  dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770   100
autoplot(timings)
```

Durations for related tasks, smaller is better.

*Contrary* to what many repeat, base `R` is often faster than the `dplyr` alternative. In this case the base `R` is 15 times faster (possibly due to `magrittr` overhead and the small size of this example). We also see, with some care, base `R` can be quite legible. `dplyr` is a useful tool and convention; however, it is not the *only* allowed tool or *only* allowed convention.


(This article was first published on **bnosac :: open analytical helpers**, and kindly contributed to R-bloggers)

Last week we updated the cronR R package and released it to CRAN, allowing you to schedule any R code at whichever time point you like. The package was updated in order to comply with stricter CRAN policies regarding writing to folders. Along the way, the RStudio add-in of the package was also updated. It now looks as shown below and is tailored to data scientists who want to automate basic R scripts.

The cronR (https://github.com/bnosac/cronR) and taskscheduleR (https://github.com/bnosac/taskscheduleR) R packages are distributed on CRAN and provide the basic functionality to schedule R code at your specific time points of interest. The taskscheduleR R package is designed to schedule processes on Windows; the cronR R package allows you to schedule your jobs on Linux or Mac. Hope you enjoy the packages.
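As a minimal sketch of what scheduling with cronR looks like on Linux or Mac (the script path, schedule, and job id are illustrative):

```r
# Sketch: schedule an R script with cronR (path and timing are illustrative)
library(cronR)

cmd <- cron_rscript("/home/user/scripts/report.R")   # command to run the script
cron_add(cmd, frequency = "daily", at = "7AM",
         id = "daily_report", description = "Daily report")

cron_ls()                      # inspect the scheduled jobs
cron_rm(id = "daily_report")   # remove the job again
```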

If you need support in automating and integrating more complex R flows in your current architecture, feel free to get in contact here.


(This article was first published on **Easy Guides**, and kindly contributed to R-bloggers)

**Regression analysis** consists of a set of *machine learning* methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x).

Briefly, the goal of a regression model is to build a mathematical equation that defines y as a function of the x variables. Next, this equation can be used to predict the outcome (y) on the basis of new values of the predictor variables (x).

**Linear regression** is the simplest and most popular technique for predicting a continuous variable. It assumes a linear relationship between the outcome and the predictor variables.

The linear regression equation can be written as `y = b0 + b*x + e`

, where:

- b0 is the intercept;
- b is the regression weight or coefficient associated with the predictor variable x;
- e is the residual error.

Technically, the linear regression coefficients are determined so that the error in predicting the outcome value is minimized. This method of computing the beta coefficients is called the **Ordinary Least Squares** method.
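As a small illustration (the data set and model formula are just examples), the coefficients that `lm()` estimates by Ordinary Least Squares coincide with the closed-form solution of the normal equations, b = (X'X)⁻¹X'y:

```r
# Sketch: lm() coefficients versus the closed-form OLS solution
fit <- lm(mpg ~ wt, data = mtcars)

X <- cbind(1, mtcars$wt)                        # design matrix with intercept column
beta <- solve(t(X) %*% X, t(X) %*% mtcars$mpg)  # solve the normal equations

all.equal(as.numeric(coef(fit)), as.numeric(beta))
# TRUE
```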

When you have multiple predictor variables, say x1 and x2, the regression equation can be written as `y = b0 + b1*x1 + b2*x2 + e`. In some situations, there might be an **interaction effect** between some predictors: for example, increasing the value of a predictor variable x1 may increase the effectiveness of the predictor x2 in explaining the variation in the outcome variable.

Note also that, linear regression models can incorporate both continuous and **categorical predictor variables**.

When you build a linear regression model, you need to **diagnose** whether a linear model is suitable for your data.

In some cases, the relationship between the outcome and the predictor variables is not linear. In these situations, you need to build a **non-linear regression**, such as *polynomial and spline regression*.

When you have multiple predictors in the regression model, you might want to select the best combination of predictor variables to build an optimal predictive model. This process, called **model selection**, consists of comparing multiple models containing different sets of predictors in order to select the best-performing model, that is, the one that minimizes the prediction error. Linear model selection approaches include **best subsets regression** and **stepwise regression**.

In some situations, such as in genomic fields, you might have a large multivariate data set containing correlated predictors. In this case, the information in the original data set can be summarized into a few new variables (called principal components) that are linear combinations of the original variables. These few principal components can be used to build a linear model, which might perform better for your data. This approach is known as **principal component-based methods**, which include **principal component regression** and **partial least squares regression**.

An alternative method to simplify a large multivariate model is to use **penalized regression**, which penalizes the model for having too many variables. The best-known penalized regression methods include **ridge regression** and **lasso regression**.
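As a brief sketch of how penalized regression is typically fitted in R (using the `glmnet` package; the data set and predictors chosen here are arbitrary examples):

```r
# Sketch: ridge (alpha = 0) and lasso (alpha = 1) regression with glmnet
library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat")])
y <- mtcars$mpg

cvLasso <- cv.glmnet(x, y, alpha = 1)   # lasso; lambda chosen by cross-validation
coef(cvLasso, s = "lambda.min")         # some coefficients shrunk to exactly zero

cvRidge <- cv.glmnet(x, y, alpha = 0)   # ridge; coefficients shrunk, none zeroed
```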

You can apply all these different regression models to your data, compare the models, and finally select the approach that best explains your data. To do so, you need some statistical metrics to compare the performance of the different models in explaining your data and in predicting the outcome of new test data.

The best model is defined as the model that has the lowest prediction error. The most popular metrics for comparing regression models, include:

- **Root Mean Squared Error (RMSE)**, which measures the model prediction error. It corresponds to the square root of the average squared difference between the observed known values of the outcome and the values predicted by the model. RMSE is computed as `RMSE = mean((observeds - predicteds)^2) %>% sqrt()`. The lower the RMSE, the better the model.
- **Adjusted R-square**, representing the proportion of variation (i.e., information) in your data that is explained by the model. This corresponds to the overall quality of the model. The higher the adjusted R2, the better the model.

Note that the above-mentioned metrics should be computed on new test data that has not been used to train (i.e., build) the model. If you have a large data set with many records, you can randomly split the data into a training set (80%, for building the predictive model) and a test set or validation set (20%, for evaluating the model performance).

One of the most robust and popular approaches for estimating model performance is **k-fold cross-validation**. It can be applied even on a small data set. k-fold cross-validation works as follows:

- Randomly split the data set into k subsets (or folds) (for example, 5 subsets).
- Reserve one subset and train the model on all the other subsets.
- Test the model on the reserved subset and record the prediction error.
- Repeat this process until each of the k subsets has served as the test set.
- Compute the average of the k recorded errors. This is called the cross-validation error and serves as the performance metric for the model.
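The steps above can be sketched in a few lines of base R (the data set and model formula are just examples):

```r
# Sketch: 5-fold cross-validation of a linear model, base R only
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # random fold assignment

rmsePerFold <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]     # train on all folds but i
  test  <- mtcars[folds == i, ]     # hold out fold i
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))   # RMSE on the held-out fold
})

mean(rmsePerFold)                   # the cross-validation error
```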

Taken together, the best model is the model that has the lowest cross-validation error, RMSE.

In this Part, you will learn different methods for regression analysis, and we’ll provide practical examples in **R**.

The content is organized as follows:
