(This article was first published on ** Mathew Analytics » R**, and kindly contributed to R-bloggers)

My previous post covered the basics of logistic regression. We must now examine the model to understand how well it fits the data and generalizes to other observations. The evaluation process involves the assessment of three distinct areas – goodness of fit, tests of individual predictors, and validation of predicted values – in order to produce the most useful model. While the following content isn’t exhaustive, it should provide a compact ‘cheat sheet’ and guide for the modeling process.

**Goodness of Fit: Likelihood Ratio Test**

A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is assessed by comparing the likelihood of the data under the full model against the likelihood of the data under the model with fewer predictors. The null hypothesis, H0, holds that the reduced model is true, so a p-value for the overall model fit statistic that is less than 0.05 would compel us to reject H0 in favor of the full model.
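In R, the likelihood ratio test can be run with `anova()` on nested `glm` fits; the data below are simulated purely for illustration.

```r
# Simulated data; compare a reduced model against the full model with a
# likelihood ratio (chi-square) test.
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x1 + d$x2))

mod_reduced <- glm(y ~ x1, data = d, family = binomial)
mod_full    <- glm(y ~ x1 + x2, data = d, family = binomial)

# The test statistic is 2 * (logLik(full) - logLik(reduced)):
anova(mod_reduced, mod_full, test = "Chisq")
```

A small p-value in the `Pr(>Chi)` column favors the full model.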

**Goodness of Fit: Pseudo R²**

With linear regression, the R² statistic tells us the proportion of variance in the dependent variable that is explained by the predictors. While no exact equivalent exists for logistic regression, a number of pseudo-R² measures can be informative. Most notable is McFadden’s R², defined as 1 − ln(L_M)/ln(L_0), where ln(L_M) is the log likelihood value for the fitted model and ln(L_0) is the log likelihood for the null model with only an intercept as a predictor. The measure ranges from 0 to just under 1, with values closer to zero indicating that the model has little predictive power.
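McFadden's R² can be computed directly from the fitted and null model log likelihoods (data simulated here; the `pscl` package's `pR2()` reports the same quantity among others):

```r
# McFadden's pseudo R-squared: 1 - lnL(fitted) / lnL(intercept-only).
set.seed(1)
d <- data.frame(x = rnorm(500))
d$y <- rbinom(500, 1, plogis(d$x))

mod  <- glm(y ~ x, data = d, family = binomial)
null <- glm(y ~ 1, data = d, family = binomial)

mcfadden <- 1 - as.numeric(logLik(mod)) / as.numeric(logLik(null))
mcfadden  # values closer to 0 indicate little predictive power
```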

**Goodness of Fit: Hosmer-Lemeshow Test**

The Hosmer-Lemeshow test examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the dataset, using a Pearson chi-square statistic from the 2 × g table of observed and expected frequencies. Small chi-square values with large p-values indicate a good fit to the data, while large chi-square values with p-values below 0.05 indicate a poor fit. The null hypothesis holds that the model fits the data, so a small p-value would compel us to reject H0.
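A minimal base-R sketch of the test (the `ResourceSelection` package's `hoslem.test()` is a packaged alternative); grouping into g deciles and the g − 2 degrees of freedom follow the usual convention:

```r
# Minimal Hosmer-Lemeshow test: bin observations by predicted probability,
# compare observed vs. expected event counts per bin.
hosmer_lemeshow <- function(y, p, g = 10) {
  grp <- cut(p, breaks = quantile(p, probs = seq(0, 1, length.out = g + 1)),
             include.lowest = TRUE)
  obs <- tapply(y, grp, sum)     # observed events per group
  exp <- tapply(p, grp, sum)     # expected events per group
  n   <- tapply(y, grp, length)  # group sizes
  chisq <- sum((obs - exp)^2 / (exp * (1 - exp / n)))
  list(statistic = chisq, p.value = 1 - pchisq(chisq, df = g - 2))
}

set.seed(1)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(x))
mod <- glm(y ~ x, family = binomial)
hosmer_lemeshow(y, fitted(mod))
```

A correctly specified model should typically yield a large p-value here.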

**Tests of Individual Predictors: Wald Test**

A Wald test is used to evaluate the statistical significance of each coefficient in the model and is calculated by taking the ratio of the square of the regression coefficient to the square of the standard error of the coefficient. The idea is to test the null hypothesis that the coefficient of an independent variable in the model is zero. If the test fails to reject the null hypothesis, this suggests that removing the variable from the model will not substantially harm the fit of that model.
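The Wald statistic can be computed from the `glm` summary (simulated data; the squared z-statistic is compared against a chi-square distribution with 1 degree of freedom):

```r
# Wald statistic for each coefficient: (estimate / std. error)^2.
set.seed(1)
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(1.5 * d$x))

mod <- glm(y ~ x, data = d, family = binomial)
est <- summary(mod)$coefficients

wald   <- (est[, "Estimate"] / est[, "Std. Error"])^2
p_vals <- 1 - pchisq(wald, df = 1)
cbind(wald, p_vals)  # p_vals match the Pr(>|z|) column of summary(mod)
```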

**Tests of Individual Predictors: Variable Importance**

To assess the relative importance of individual predictors in the model, we can also look at the absolute value of the t-statistic for each model parameter. This technique is utilized by the varImp function in the caret package for general and generalized linear models. The t-statistic for each model parameter helps us determine if it’s significantly different from zero.
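For a `glm`, `caret::varImp()` reports the absolute value of each coefficient's z-statistic; the same quantity can be pulled from `summary()` directly (data simulated for illustration):

```r
# Variable importance as |z-statistic|, excluding the intercept.
set.seed(1)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- rbinom(300, 1, plogis(d$x1))  # only x1 drives the outcome

mod <- glm(y ~ x1 + x2, data = d, family = binomial)
importance <- abs(summary(mod)$coefficients[-1, "z value"])
sort(importance, decreasing = TRUE)  # x1 should rank above x2
```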

**Validation of Predicted Values: Classification Rate**

With predictive models, the most critical metric is how well the model predicts the target variable on out-of-sample observations. The process involves using the model estimates to predict values on a hold-out test set, then comparing the predicted values of the target variable against the observed values for each observation.
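A sketch of the process on simulated data (a hypothetical 70/30 train/test split):

```r
# Fit on the training set, then measure the classification rate on held-out data.
set.seed(1)
d <- data.frame(x = rnorm(1000))
d$y <- rbinom(1000, 1, plogis(2 * d$x))
train <- d[1:700, ]
test  <- d[701:1000, ]

mod  <- glm(y ~ x, data = train, family = binomial)
prob <- predict(mod, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

acc <- mean(pred == test$y)                 # classification rate
table(predicted = pred, observed = test$y)  # confusion matrix
acc
```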

**Validation of Predicted Values: ROC Curve**

The receiver operating characteristic (ROC) curve is a measure of classifier performance. It is based on the proportion of positive data points that are correctly classified as positive (the true positive rate) and the proportion of negative data points that are accurately classified as negative (the true negative rate). These metrics are expressed through a graphic that shows the trade-off between them across score thresholds. Ultimately, we’re concerned with the area under the ROC curve, or AUROC. That metric ranges from 0.5 (no discrimination) to 1.0, and values above 0.80 indicate that the model does a good job of discriminating between the two categories which comprise our target variable.

```
library(pROC)
# Compute AUC for predicting Class with the variable CreditHistory.Critical
f1 = roc(Class ~ CreditHistory.Critical, data=training)
plot(f1, col="red")

library(ROCR)
# Compute AUC for predicting Class with the model
prob <- predict(mod_fit_one, newdata=testing, type="response")
pred <- prediction(prob, testing$Class)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

This post has provided a quick overview of how to evaluate logistic regression models in R. If you have any comments or corrections, please comment below.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Mathew Analytics » R**.


(This article was first published on ** Xi'an's Og » R**, and kindly contributed to R-bloggers)

“…likelihood inference is in a fundamental way more complicated than the classical method of moments.”

**C**arlos Amendola, Mathias Drton, and Bernd Sturmfels arXived a paper this Friday on “maximum likelihood estimates for Gaussian mixtures are transcendental”. By which they mean that trying to solve the five likelihood equations for a two-component Gaussian mixture does not lead to an algebraic function of the data. (When excluding the trivial global maxima spiking at any observation.) This is not highly surprising when considering two observations, 0 and x, from a mixture of N(0,1/2) and N(μ,1/2), because the likelihood equation involves both exponential and algebraic terms. While this result does not directly impact (statistical) inference, it has the computational consequence that the number of critical points ‘and also the maximum number of local maxima, depends on the sample size and increases beyond any bound’, which means that EM faces increasing difficulties in finding a global finite maximum as the sample size increases…

Filed under: Books, R, Statistics Tagged: algebraic geometry, computational statistics, EM, mixtures of distributions, transcendental equations


(This article was first published on ** Win-Vector Blog » R**, and kindly contributed to R-bloggers)

Authors: John Mount (more articles) and Nina Zumel (more articles).

“Essentially, all models are wrong, but some are useful.”

George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Geocentric illustration Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris)

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a *negotiable* business goal (one of the many ways data science differs from pure machine learning).

To organize the ideas into digestible chunks, we are presenting this article as a four part series. This part (part 1) sets up the specific problem.

Let’s use a single example to make things concrete. We have used the 2009 KDD Cup dataset to demonstrate estimating variable significance, so we will use it again here to demonstrate model evaluation. The contest task was supervised machine learning. The goal was to build scores that predict things like churn (account cancellation) from a data set consisting of about 50,000 rows (representing credit card accounts) and 234 variables (both numeric and categorical facts about the accounts). An IBM group won the contest with an AUC (“area under the curve”) of 0.76 in predicting churn on held-out data. Using R we can get an AUC of 0.71 on our own hold-out set (meaning we used less data for training) using automated variable preparation, standard gradient boosting, and essentially no parameter tuning (which itself can be automated as it is in packages such as caret).

Obviously a 0.71 AUC would not win the contest. But remember the difference between 0.76 and 0.71 may or may not be statistically significant (something we will touch on in this article) and may or may not *make a business difference*. Typically a business combines a score with a single threshold to convert it into an operating classifier or decision procedure. The threshold is chosen as a business driven compromise between domain driven precision and recall (or sensitivity and specificity) goals. Businesses do not directly experience AUC which summarizes facts about the classifiers the score would induce at many different threshold levels (including ones that are irrelevant to the business). A scoring system whose ROC curve contains another scoring system’s ROC curve is definitely the better classifier, but small increases in AUC don’t always ensure such containment. AUC is an acceptable proxy score when choosing among classifiers (however, it does not have a non-strained reasonable probabilistic interpretation, despite such claims), and it should not be your final business metric.

For this article, however, we will stick with the score evaluation measures: deviance and AUC. But keep in mind that in an actual data science project you are much more likely to quickly get a reliable 0.05 increase in AUC by working with your business partners to transform, clean, or find more variables than by tuning your post-data-collection machine learning procedure. So we feel score tuning is already over-emphasized and don’t want to dwell too much more on it here.

One way a data science project differs from a machine learning contest is that the choice of score or utility metric is an important choice made by the data scientist, and not a choice supplied by a competition framework. The metric or score must map to utility for the business client. The business goal in a supervised machine learning project is usually either classification (picking a group of accounts at higher risk of churn) or sorting (ordering accounts by predicted risk).

Choice of experimental design, data preparation, and choice of metric can be a big driver of project success or failure. For example in hazard models (such as predicting churn) the items that are easiest to score are items that have essentially already happened. You may have call-center code that encodes “called to cancel” as one of your predictive signals. Technically it is a great signal, the person certainly hasn’t cancelled prior to the end of the call. But it is useless to the business. The data-scientist has to help re-design the problem definition and data curation to focus in on customers that are going to cancel soon, but to indicate some reasonable time before they cancel (see here for more on the issue). The business goal is to change the problem to a more useful business problem that may induce a harder machine learning problem. The business goal is not to do as well as possible on a single unchanging machine learning problem.

If the business needs a decision procedure: then part of the project is picking a threshold that converts the scoring system into a classifier. To do this you need some sort of business sensitive pricing of true-positives, false-positives, true-negatives, and false-negatives or working out appropriate trade-offs between precision and recall. While tuning scoring procedures we suggest using one of deviance or AUC as a proxy measure until you are ready to try converting your score into a classifier. Deviance has the advantage that it has nice interpretations in terms of log-likelihood and entropy, and AUC has the advantage that is invariant under any one-to-one monotone transformation of your score.
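The threshold-picking step can be sketched as follows; the scores, outcomes, and error costs here are all hypothetical stand-ins for business-supplied values:

```r
# Pick the score cutoff minimizing expected cost on a validation set,
# given assumed per-error prices (cost_fn >> cost_fp here).
set.seed(1)
score <- runif(500)                 # toy calibrated score
y     <- rbinom(500, 1, score)      # toy outcome
cost_fp <- 1                        # assumed cost of a false positive
cost_fn <- 10                       # assumed cost of a false negative

thresholds <- seq(0.05, 0.95, by = 0.05)
cost <- sapply(thresholds, function(t) {
  pred <- score >= t
  sum(pred & y == 0) * cost_fp + sum(!pred & y == 1) * cost_fn
})
thresholds[which.min(cost)]  # expensive false negatives push the cutoff low
```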

A classifier is best evaluated with precision and recall or sensitivity and specificity. Order evaluation is best done with an AUC-like score such as the Gini coefficient or even a gain curve.

In most applications the cost of false-positives (accounts the classifier thinks will churn, but do not) is usually very different than the cost of false-negatives (accounts the classifier thinks will not churn, but do). This means a measure that prices these two errors identically is almost never the right final utility score. Accuracy is exactly one such measure. Understand that most business partners ask for “accurate” classifiers only because it may be the only term they are familiar with. Take the time to discuss appropriate utility measures with your business partners.

Here is an example to really drive the point home. The KDD2009 data set had a churn rate of around 7%. Consider the following two classifiers. Classifier A that predicts “churn” on 21% of the data but captures all of the churners in its positive predictions. Classifier B that predicts “no churn” on all data. Classifier A is wrong 14% of the time and thus has an accuracy of 86%. Classifier B is wrong 7% of the time and thus has an accuracy of 93% and is the more accurate classifier. Classifier A is a “home run” in a business sense (it has recall 1.0 and precision 33%!), Classifier B is absolutely useless. See here, for more discussion on this issue.
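The arithmetic above can be checked directly (the account count `n` is arbitrary):

```r
# 7% churn rate; reproduce the accuracy/precision/recall figures above.
n <- 10000
churners <- 0.07 * n

# Classifier A: flags 21% of accounts and captures every churner.
flagged_A   <- 0.21 * n
precision_A <- churners / flagged_A            # 700 / 2100 = 1/3
recall_A    <- 1.0
accuracy_A  <- 1 - (flagged_A - churners) / n  # wrong on 14% -> 0.86

# Classifier B: predicts "no churn" for everyone.
accuracy_B <- 1 - churners / n                 # wrong on 7% -> 0.93
c(accuracy_A = accuracy_A, accuracy_B = accuracy_B)
```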

In all cases we are going to pick a utility score or statistic. We want to estimate the utility of our model on future data (as our model will hopefully be used on new data in the future). The performance of our model in the future is usually an unknowable quantity. However, we can try to estimate this unknowable quantity by an appeal to the idea of exchangeability. If we had a set of test data that was exchangeable with the unknown future data, then an estimate of our utility on this test set should be a good estimate of future behavior. A similar way to get at this is if future data were independent and identically distributed with the test data then we again could expect to make an estimate.

The issues we run into in designing an estimate of model utility include at least the following:

- Are we attempting to evaluate an actual score or the procedure for building scores? These are two related, but different questions.
- Are we deriving a single point estimate or a distribution of estimates? Are we estimating sizes of effects, significances, or both?
- Are we using data that was involved in the training procedure (which breaks exchangeability!) or fresh data?

Your answers to these questions determine what procedures you should try.

We are going to work through a good number of the available testing and validation procedures. There is no “one true” procedure, so you need to get used to having more than one method to choose from. We suggest you go over each of the upcoming graphs with a ruler and see what conclusions you can draw about the relative utility of each of the models we are demonstrating.

The no-measure procedure is the following: pick a good machine learning procedure, use it to fit the data, and turn that in as your solution. In principle nobody is ever so ill-mannered as to do this.

However, if you only try one modeling technique and don’t base any decision on your measure or score, how does that differ from having made no measurement? Suppose we (as in this R example) made only one try of Random Forest on the KDD2009 problem? We could present our boss with a ROC graph like the following:

Because we only tried one model, the only thing our boss can look for is whether the AUC is above 0.5 (uselessness) or not. They have no idea if 0.67 is large or small. Since our AUC measure drove no decision, it essentially was no measurement.

So at the very least we need to set a sense of scale. We should at least try more than one model.

If we are going to try more than one model, we run into the problem that each model reports different diagnostics. Random forest tends to report error rates, logistic regression reports deviance, GBM reports variable importance. At this point you find you need to standardize on your own quality of score measure and run your own (or library code) on all models.
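One simple standardization is to compute the same score from predicted probabilities regardless of model type. A minimal sketch using mean binomial deviance (log loss × 2), which any probability-producing model can be scored with:

```r
# Mean binomial deviance as a model-agnostic quality-of-score measure.
binom_deviance <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # guard against log(0)
  -2 * mean(y * log(p) + (1 - y) * log(1 - p))
}

set.seed(1)
x <- rnorm(400)
y <- rbinom(400, 1, plogis(x))
mod <- glm(y ~ x, family = binomial)

binom_deviance(y, fitted(mod))  # lower is better
```

For a 0/1 response, this equals `deviance(mod)` divided by the number of observations, so it agrees with the diagnostic logistic regression already reports.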

Now that we have framed the problem, we will continue this series with:

- Part 2: In-training set measures
- Part 3: Out of sample procedures
- Part 4: Cross-validation techniques

The goal is to organize the common procedures into a coherent guide. As we work through the ideas all methods will be shared as R code here.


(This article was first published on ** Mango Solutions**, and kindly contributed to R-bloggers)

On September 14th-16th Mango Solutions are running the EARL (Effective Applications of the R Language) Conference for all users, enthusiasts and beginners of the R programming language.

It is an event not to be missed and here are the Top 10 Reasons why…

**1. The Keynote Speakers**

Alex Bellos @alexbellos Dirk Eddelbuettel @eddelbuettel

Joe Cheng @jCheng Hannah Fry @FryRSquared

With so much experience between them the keynote speeches on Tuesday and Wednesday are not to be missed.

**2. Amazing Venue**

The Tower Hotel in London, with its amazing view, is the venue again for this year's EARL Conference. Nestled between the River Thames and St Katharine’s Dock, alongside two World Heritage Sites – Tower Bridge and the Tower of London – the Tower Hotel is within easy reach of key places in London.

**3. Expert Speakers**

With speakers from companies like Shell, TIBCO, Mango Solutions, Revolution Analytics, KPMG, UCL, Washington University, Allianz, HP, Jacobs Douwe Egberts, Deloitte to name a few, you can expect an extremely high level of speakers.

**4. Varied Range of Topics on the Agenda**

With over 40 presentations, you are sure to find many that suit your needs. See the full agenda here.

**5. The Workshops**

The Conference starts on the Monday 14th September with 4 workshops by:

Joe Cheng from RStudio @Jcheng

Aimee Gott from Mango Solutions @aimeegott_R

Chris Musselle from Mango Solutions @ChrisMusselle

Dirk Eddelbuettel @eddelbuettel

**6. Tower Bridge Walkways Evening Reception**

Spanning the North and South Towers, the award-winning high-level Walkways offer breathtaking panoramic views of London, from 42 metres above the River Thames – an exquisite backdrop for any event.

Guests are treated to the spectacular view of the Shard, the Tower of London, St. Paul’s Cathedral and the City from the West Walkway, and the Docklands, Canary Wharf and Greenwich from the East Walkway.

The venue’s newest “wow” feature: the glass floor also enables guests to enjoy London life beneath their feet from a unique viewpoint.

As the sun sets over London and the skylight becomes illuminated with city lights, the Walkways become a magical setting for our Tuesday Evening Conference Reception.

**7. Networking**

The EARL Conference offers you a great chance to network with a diverse range of R people from a wide range of industries with different levels of expertise.

**8. R Consortium Members**

With several of the board members attending the conference, you will have the chance to meet some R Consortium Board Members. Mango Solutions very own Richard Pugh has recently been announced as inaugural president of the R Consortium. You can read more on this here…

**9. For your Personal Growth and Development**

It goes without saying that you are going to learn so much at the conference.

**10. The Free Cats of course!**

To get your ticket to this great R Conference, click here.

Don’t Miss it!


(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

by Andrie de Vries

Just more than a year ago I cobbled together some code to work with the (then) new version of Google Sheets. You can still find my musings and code at the blog post Reading data from the new version of Google Spreadsheets.

Since then, Jennifer Bryan (@JennyBryan) published a wonderful package that does far more than I ever tried, in a much more elegant way.

This package just **works**!

I read the first few lines of the excellent vignette, pasted the code into my R session, and had a listing of my Google sheets within seconds:

```
library(googlesheets)
library(dplyr)
(my_sheets <- gs_ls())
# (expect a prompt to authenticate with Google interactively HERE)
my_sheets %>% glimpse()
```

The package neatly handles handshaking with Google to get access to your authorisation token. It automatically opens a web browser where you log in to your Google account. Google then presents you with a token that you simply paste into your R session command line, and from there everything just works.

```
(my_sheets <- gs_ls())
Source: local data frame [6 x 10]

             sheet_title        author perm version             updated
1 RRO with MKL benchmark         andrie    r     new 2015-06-03 15:52:21
2 FMH: UAT Test plan (v1.…   des.holmes   rw     new 2015-02-10 00:07:26
3 Hospital Dat 2013_08_22 markratnaraj…   rw     new 2013-10-13 20:57:55
4 Strategy Time TBG        clementdata    rw     new 2013-04-23 19:43:21
5 hspot-rfp-alpha-scoresh… benoit.chovet  rw     new 2013-02-22 16:20:37
6 2012-Jan Mobile interne…   simbamangu    r     new 2013-01-07 06:03:56
Variables not shown: ws_feed (chr), alternate (chr), self (chr), alt_key (chr)
```

Since the googlesheets package is on CRAN, you can install by using:

```
install.packages("googlesheets")
```

I highly recommend reading the following excellent resources:

- The package vignette at https://cran.r-project.org/web/packages/googlesheets/vignettes/basic-usage.html
- The README at the project's github page https://github.com/jennybc/googlesheets
- Jenny's presentation at UseR!2015, available at https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015


(This article was first published on ** fReigeist » R**, and kindly contributed to R-bloggers)

Darin Christensen and Thiemo Fetzer

**tl;dr**: *Fast computation of standard errors that allows for serial and spatial auto-correlation.*

Economists and political scientists often employ panel data that track units (e.g., firms or villages) over time. When estimating regression models using such data, we often need to be concerned about **two forms of auto-correlation: serial (within units over time) and spatial (across nearby units).** As Cameron and Miller (2013) note in their excellent guide to cluster-robust inference, failure to account for such dependence can lead to incorrect conclusions: “[f]ailure to control for within-cluster error correlation can lead to very misleadingly small standard errors…” (p. 4).

Conley (1999, 2008) develops one commonly employed solution. His approach allows for serial correlation over all (or a specified number of) time periods, as well as spatial correlation among units that fall within a certain distance of each other. For example, we can account for correlated disturbances within a particular village over time, as well as between that village and every other village within one hundred kilometers. As with serial correlation, spatial correlation can be positive or negative. It can be made visually obvious by plotting, for example, residuals after removing location fixed effects.

We provide a new function that allows `R` users to more easily estimate these corrected standard errors. (Solomon Hsiang (2010) provides code for STATA, which we used to test our estimates and benchmark speed.) Moreover, using the excellent `lfe`, `Rcpp`, and `RcppArmadillo` packages (and Tony Fischetti’s Haversine distance function), our function is roughly 20 times faster than the STATA equivalent and can scale to handle panels with more units. (We have used it on panel data with over 100,000 units observed over 6 years.)

This demonstration employs data from Fetzer (2014), who uses a panel of U.S. counties from 1999-2012. The data and code can be downloaded here.

We first use Hsiang’s STATA code to compute the corrected standard errors (spatHAC in the output below). This routine takes just over 25 seconds.

```
cd "~/Dropbox/ConleySEs/Data"
clear; use "new_testspatial.dta"
tab year, gen(yy_)
tab FIPS, gen(FIPS_)
timer on 1
ols_spatial_HAC EmpClean00 HDD yy_*FIPS_2-FIPS_362, lat(lat ) lon(lon ) t(year) p(FIPS) dist(500) lag(5) bartlett disp
# *-----------------------------------------------
# * Variable | OLS spatial spatHAC
# *-------------+---------------------------------
# * HDD | -0.669 -0.669 -0.669
# * | 0.608 0.786 0.838
timer off 1
timer list 1
# 1: 24.8 / 3 = 8.2650
```

Using the same data and options as the STATA code, we then estimate the adjusted standard errors using our new R function. This requires us to first estimate our regression model using the `felm` function from the `lfe` package.

```
# Loading sample data:
dta_file <- "~/Dropbox/ConleySEs/Data/new_testspatial.dta"
DTA <-data.table(read.dta(dta_file))
setnames(DTA, c("latitude", "longitude"), c("lat", "lon"))
# Loading R function to compute Conley SEs:
source("~/Dropbox/ConleySEs/ConleySEs_17June2015.R")
ptm <-proc.time()
# We use the felm() from the lfe package to estimate model with year and county fixed effects.
# Two important points:
# (1) We specify our latitude and longitude coordinates as the cluster variables, so that they are included in the output (m).
# (2) We specify keepCx = TRUE, so that the centered data is included in the output (m).
m <-felm(EmpClean00 ~HDD -1 |year +FIPS |0 |lat +lon,
data = DTA[!is.na(EmpClean00)], keepCX = TRUE)
coefficients(m) %>%round(3) # Same as the STATA result.
```

```
HDD
-0.669
```

We then feed this model to our function, as well as the cross-sectional unit (county `FIPS` codes), time unit (`year`), geo-coordinates (`lat` and `lon`), the cutoff for serial correlation (5 years), the cutoff for spatial correlation (500 km), and the number of cores to use.

```
SE <-ConleySEs(reg = m,
unit = "FIPS",
time = "year",
lat = "lat", lon = "lon",
dist_fn = "SH", dist_cutoff = 500,
lag_cutoff = 5,
cores = 1,
verbose = FALSE)
sapply(SE, sqrt) %>%round(3) # Same as the STATA results.
```

```
OLS Spatial Spatial_HAC
0.608 0.786 0.837
```

`proc.time() -ptm`

```
user system elapsed
1.619 0.055 1.844
```

Estimating the model and computing the standard errors requires just over 1 second, making it over 20 times faster than the comparable STATA routine.

Even with a single core, we realize significant speed improvements. However, the gains are even more dramatic when we employ multiple cores. Using 4 cores, we can cut the estimation of the standard errors down to around 0.4 seconds. (These replications employ the Haversine distance formula, which is more time-consuming to compute.)

```
pkgs <-c("rbenchmark", "lineprof")
invisible(sapply(pkgs, require, character.only = TRUE))
bmark <-benchmark(replications = 25,
columns = c('replications','elapsed','relative'),
ConleySEs(reg = m,
unit = "FIPS", time = "year", lat = "lat", lon = "lon",
dist_fn = "Haversine", lag_cutoff = 5, cores = 1, verbose = FALSE),
ConleySEs(reg = m,
unit = "FIPS", time = "year", lat = "lat", lon = "lon",
dist_fn = "Haversine", lag_cutoff = 5, cores = 2, verbose = FALSE),
ConleySEs(reg = m,
unit = "FIPS", time = "year", lat = "lat", lon = "lon",
dist_fn = "Haversine", lag_cutoff = 5, cores = 4, verbose = FALSE))
bmark %>%mutate(avg_eplased = elapsed /replications, cores = c(1, 2, 4))
```

```
replications elapsed relative avg_eplased cores
1 25 23.48 2.095 0.9390 1
2 25 15.62 1.394 0.6249 2
3 25 11.21 1.000 0.4483 4
```

Given the prevalence of panel data that exhibits both serial and spatial dependence, we hope this function will be a useful tool for applied econometricians working in `R`.

This was Darin’s first foray into C++, so we welcome feedback on how to improve the code. In particular, we would appreciate thoughts on how to overcome a memory vs. speed tradeoff we encountered. (You can email Darin at darinc[at]stanford.edu.)

The most computationally intensive chunk of our code computes the distance from each unit to every other unit. To cut down on the number of distance calculations, we can fill the upper triangle of the distance matrix and then copy it to the lower triangle. With [math]N[/math] units, this requires only [math](N (N-1) /2)[/math] distance calculations.

However, as the number of units grows, this distance matrix becomes too large to store in memory, especially when executing the code in parallel. (We tried to use a sparse matrix, but this was extremely slow to fill.) To overcome this memory issue, we can avoid constructing a distance matrix altogether. Instead, for each unit, we compute the vector of distances from that unit to every other unit. We then only need to store that vector in memory. While that cuts down on memory use, it requires us to make twice as many [math](N (N-1))[/math] distance calculations.

As the number of units grows, we are forced to perform more duplicate distance calculations to avoid memory constraints – an unfortunate tradeoff. (See the functions `XeeXhC` and `XeeXhC_Lg` in ConleySE.cpp.)
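A small R sketch of the two strategies (the package's real work happens in C++; `haversine_km` here is a hypothetical helper written for illustration, not the package's own function):

```r
# Haversine great-circle distance in kilometers (vectorized).
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

set.seed(1)
n <- 5
lat <- runif(n, 30, 45)
lon <- runif(n, -120, -80)

# Strategy 1: full N x N distance matrix -- N(N-1)/2 unique calculations,
# but O(N^2) memory.
D <- outer(seq_len(n), seq_len(n),
           function(i, j) haversine_km(lat[i], lon[i], lat[j], lon[j]))

# Strategy 2: one distance vector per unit -- only O(N) memory at a time,
# at the price of computing each pairwise distance twice overall.
d1 <- haversine_km(lat[1], lon[1], lat, lon)
all.equal(D[1, ], d1)
```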

`sessionInfo()`

```
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] RcppArmadillo_0.5.400.2.0 Rcpp_0.12.0
[3] geosphere_1.4-3 sp_1.1-1
[5] lfe_2.3-1709 Matrix_1.2-2
[7] ggplot2_1.0.1 foreign_0.8-65
[9] data.table_1.9.4 dplyr_0.4.2
[11] knitr_1.11
loaded via a namespace (and not attached):
[1] Formula_1.2-1 magrittr_1.5 MASS_7.3-43
[4] munsell_0.4.2 xtable_1.7-4 lattice_0.20-33
[7] colorspace_1.2-6 R6_2.1.1 stringr_1.0.0
[10] plyr_1.8.3 tools_3.2.2 parallel_3.2.2
[13] grid_3.2.2 gtable_0.1.2 DBI_0.3.1
[16] htmltools_0.2.6 yaml_2.1.13 assertthat_0.1
[19] digest_0.6.8 reshape2_1.4.1 formatR_1.2
[22] evaluate_0.7.2 rmarkdown_0.8 stringi_0.5-5
[25] compiler_3.2.2 scales_0.2.5 chron_2.3-47
[28] proto_0.3-10
```


(This article was first published on ** Hyndsight » R**, and kindly contributed to R-bloggers)

I’ve always struggled with using `plotmath` via the `expression` function in R for adding mathematical notation to axes or legends. For some reason, the most obvious way to write something never seems to work for me and I end up using trial and error in a loop with far too many iterations.

So I am very happy to see the new **latex2exp** package available which translates LaTeX expressions into a form suitable for R graphs. This is going to save me time and frustration!

Here is a quick example showing a Lee-Carter decomposition of some mortality data.

```
library(demography)
library(latex2exp)
fit <- lca(fr.mort)
par(mfrow=c(2,2), mar=c(4,5,2,1), family="serif")
plot(fit$age, fit$ax, type="l", ylab=latex2exp("a_x"), xlab="Age: x")
plot(fit$age, fit$bx, type="l", ylab=latex2exp("b_x"), xlab="Age: x")
plot(0, type="n", axes=FALSE, xlab="", ylab="")
text(1, 0, latex2exp("m_{x,t} = a_x + k_tb_x + e_{x,t}"))
plot(fit$kt, ylab=latex2exp("k_t"), xlab="Year: t")
```

There are several more examples in the package documentation.

The results are still a little ugly, but that is because of the limitations of base graphics in R. To get something more LaTeX-like, the **tikzDevice** package can be used as follows.

```
library(demography)
library(tikzDevice)
fit <- lca(fr.mort)
tikz("tikz-test.tex", width=15/2.54, height=12/2.54)
par(mfrow=c(2,2), mar=c(4,5,2,1), family="serif")
plot(fit[["age"]], fit$ax, type="l", ylab="$a_x$", xlab="Age: $x$")
plot(fit[["age"]], fit$bx, type="l", ylab="$b_x$", xlab="Age: $x$")
plot(0, type="n", axes=FALSE, xlab="", ylab="")
text(1, 0, "$m_{x,t} = a_x + k_tb_x + e_{x,t}$")
plot(fit$kt, ylab="$k_t$", xlab="Year: $t$")
dev.off()
```

While the results look much nicer, it is rather slow.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Hyndsight » R**.


(This article was first published on ** The Lab-R-torian**, and kindly contributed to R-bloggers)

Dan continues to crank out book chapter-length posts, which probably means that I should jump in before getting further behind…so here we go.

In the next few posts, I’d like to cover some work to help you to process aggregated proficiency testing (PT) data. Interpreting PT data from groups such as the College of American Pathologists (CAP) is, of course, a fundamental task for lab management. Comparing your lab’s results to peer group data from other users of the same instrumentation helps to ensure that your patients receive consistent results, and it provides at least a crude measure to ensure that your instrument performance is “in the ballpark”. Of course, many assays show significant differences between instrument models and manufacturers that can lead to results that are not comparable as a patient moves from institution to institution (or when your own lab changes instruments!). There are a number of standardization and harmonization initiatives underway (see http://harmonization.net, for example) to address this, and understanding which assays show significant bias compared to benchmark studies or national guidelines is a critical task for laboratorians. All of this is further complicated by the fact that sample matrix can significantly affect assay results, and sample commutability is one important reason why we can’t just take, say, CAP PT survey results (not counting the accuracy-based surveys) and determine which assays aren’t harmonized.

However.

With all of those caveats, it can still be useful to look through PT data in a systematic way to compare instruments. Ideally, we’d like to have everything in an R-friendly format that would allow us to ask systematic questions about data (things like “for how many assays does instrument X differ from instrument Y by >30% using PT material?”, or “how many PT materials give good concordance across all manufacturers?”). If we have good, commutable, accuracy-based testing materials, we can do even better. The first task in all of this fun, however, is getting the data into a format that R is happy with; no one I know likes the idea of retyping numbers from paper reports. I’m hoping to talk more about this in a future post, as there are lots of fun R text processing issues lurking here. In the meantime, though, we have a much more modest preliminary task to tackle.

I’m currently staring at a CAP PT booklet. It happens to be D-dimer, but you can pick your own favorite analyte (and PT provider, for that matter). Some of the results are in ng/mL, some are ug/mL, and one is in mg/L. Let’s create an R function that allows us to convert between sets of comparable units. Now, although I know that Dan is in love with SI units (*#murica*), we’ll start by simply converting molar→molar and gravimetric→gravimetric. Yes, we can add fancy analyte-by-analyte conversion tables in the future…but right now we just want to get things on the same scale. In the process, we’ll cover three useful R command families.

First of all, we should probably decide how we want the final function to look. I’m thinking of something like this:

```
results <- labunit.convert(2.3, "mg/dL", "g/L")
results
```

`## [1] 0.023`

…which converts 2.3 mg/dL to 0.023 g/L. We should also give ourselves bonus points if we can make it work with vectors. For example, we may have this data frame:

`mydata`

```
## Value Units Target.Units
## 1 2.30 g/dL mg/L
## 2 47.00 nmol/mL mmol/dL
## 3 0.19 IU/L mIU/L
```

and we would like to be able to use our function like this:

`labunit.convert(mydata$Value, mydata$Units, mydata$Target.Units)`

`## [1] 2.3e+04 4.7e-03 1.9e+02`

We should also handle simpler, single-unit conversions:

`labunit.convert(0.23, "g", "mg")`

`## [1] 230`

Now that we know where we’re going, let’s start by writing a function that just converts between two units and returns the log difference. We’ll call this function `convert.one.unit()`, and it will take two arguments:

`convert.one.unit("mg", "ng")`

`## [1] 6`

Basically, we want to take a character variable (like, say, “dL”) and break it into two pieces: the metric prefix (“d”) and the base unit (“L”). If it isn’t something we recognize, the function should quit and complain (you could also make it return ‘NA’ and just give a warning instead, but we’ll hold off on that for now). We’ll start with a list of things that we want to recognize.

```
convert.one.unit <- function (unitin, unitout) {
  metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "", "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
  metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)
  units.for.lab <- c("mol", "g", "L", "U", "IU")
```

Notice that the `metric.prefixes` variable contains the appropriate one- or two-character prefixes, and `metric.logmultipliers` has the corresponding log multiplier (for example, `metric.prefixes[8]` = “m”, and `metric.logmultipliers[8]` is -3). It’s also worth noting the `""` (`metric.prefixes[11]`), which corresponds to a log multiplier of 0. The fact that `""` is a zero-length string instead of a null means that we can search for it in a vector…which will be very handy!
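To make that prefix-to-multiplier correspondence concrete, here is a minimal standalone sketch using the same two vectors, looking up a multiplier with `match()`:

```r
# The same lookup vectors as in the post
metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "", "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)

# Find the log multiplier for a given prefix by position
metric.logmultipliers[match("m", metric.prefixes)]   # -3
metric.logmultipliers[match("", metric.prefixes)]    # 0 -- the empty prefix is searchable, too
```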

This is the point where we tackle the first of the three command families that I told you about. If you’re not familiar with “regular expressions” in R or another language (Perl, Python, whatever), this is your entry point into some very useful text searching capabilities. Basically, a regular expression is a way of specifying a search for a matching text pattern, and it’s used with a number of R commands (`grep()`, `grepl()`, `gsub()`, `regexpr()`, `regexec()`, etc.). We’ll use `gsub()` as an example, since it’s one that many people are familiar with. Suppose that I have the character string “This is not a test”, and I want to strip out the “not a ” part. I can feed `gsub()` a pattern that I want to recognize and some text that I want to use to replace the pattern. For example:

```
my.string <- "This is not a test"
my.altered.string <- gsub("not a ", "", my.string) # replace "not a " with an empty string, ""
my.altered.string
```

`## [1] "This is test"`

That’s fine as far as it goes, but we will drive ourselves crazy if we’re limited to explicit matches. What if, for example, we also want to recognize “This is not…a test”, or “This is not my kind of a test”? We could write three different `gsub()` statements, but that would get old fairly quickly. Instead of exactly matching the text, we’ll use a pattern. A regular expression that will match all three of our input statements is `"not.+a "`, so we can do the following:

`gsub("not.+a ", "", "This is not a test")`

`## [1] "This is test"`

`gsub("not.+a ", "", "This is not my kind of a test")`

`## [1] "This is test"`

You can read the regular expression `"not.+a "` as “match the letters ‘not’, followed by one or more characters of any kind (`.` matches any single character, and `+` means ‘one or more of the previous thing’), followed by an ‘a’ and a space”. You can find some very nice tutorials on regular expressions through Google, but for the purposes of this brief lesson I’ll give you a mini-cheat sheet that probably handles 90% of the regular expressions that I have to write:

Special Character | Meaning |
---|---|
. | match any character |
\d | match any digit |
\D | match anything that isn’t a digit |
\s | match white space |
\S | match anything that isn’t white space |
\t | match a tab (less important in R, since you usually already have things in a data frame) |
^ | match the beginning of the string (i.e. “^Bob” matches “Bob my uncle” but not “Uncle Bob”) |
$ | match the end of the string |
* | match the previous thing when it occurs 0 or more times |
+ | match the previous thing when it occurs 1 or more times |
? | match the previous thing when it occurs 0 or 1 times |
( .. ) | (parentheses) enclose a group of choices or a particular substring in the match |
\| | match this OR that (e.g. “(Bob\|Pete)” matches “Dr. Bob Smith” or “Dr. Pete Jones” but not “Dr. Sam Jones”) |

It’s also important to remember for things like `"\d"` that R uses backslashes as the escape character…so you actually have to write a double backslash, like this: `"\\d"`. A regular expression to match one or more digits would be `"\\d+"`.
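A few of the cheat-sheet patterns in action (a quick standalone sketch, with the doubled backslashes that R requires):

```r
# \d+ : one or more digits anywhere in the string
grepl("\\d+", "mg125")       # TRUE

# ^( | )$ : anchored alternation -- exactly "mg" or exactly "g"
grepl("^(mg|g)$", "g")       # TRUE
grepl("^(mg|g)$", "mgX")     # FALSE

# \s : strip all white space, as we'll do with the units shortly
gsub("\\s", "", " d L ")     # "dL"
```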

OK, back to work. Our next step is to remove all white space from the unit text (we want `"dL"` to be handled the same way as `" dL"` or `"dL "`), so we’ll add the following lines:

```
unitin <- gsub("\\s", "", unitin)
unitout <- gsub("\\s", "", unitout)
```

See what we’ve done? We asked `gsub()` to replace every instance of white space (the regular expression is `"\\s"`) with `""`. Easy.

Next, we want to put together a regular expression that will detect any of our `metric.prefixes` or `units.for.lab`. To save typing, we’ll do it with `paste()`, the second of our three R command families for the day. You probably already know about `paste()`, but if not, it’s basically the way to join R character variables into one big string. `paste("Hi", "there")` gives “Hi there” (`paste()` defaults to joining things with a space), while `paste("Super", "cali", "fragi", "listic", sep="")` changes the separator to `""` and gives us “Supercalifragilistic”. `paste0()` does the same thing as `paste(..., sep="")`. The little nuance that it’s worth noting today is that we are going to join together elements from a single vector rather than a bunch of separate variables…so we need to use the `collapse = "..."` option, where we set `collapse` to whatever character we want. You remember from the last section that `|` (OR) lets us put a bunch of alternative matches into our regular expression, so we will join all of the prefixes like this:

```
prefix.combo <- paste0(metric.prefixes, collapse = "|")
prefix.combo
```

`## [1] "y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y"`

What we’re really after is a regular expression that matches the beginning of the string, followed by 0 or 1 matches to one of the prefixes, followed by a match to one of the units. Soooo…

```
prefix.combo <- paste0(metric.prefixes, collapse = "|")
unit.combo <- paste0(units.for.lab, collapse = "|")
unit.search <- paste0("^(", prefix.combo, ")?(", unit.combo, ")$")
unit.search
```

`## [1] "^(y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y)?(mol|g|L|U|IU)$"`

So much nicer than trying to type that by hand. Next we’ll do actual pattern matching using the `regexec()` command. `regexec()`, as the documentation so nicely states, returns a list of vectors of substring matches. This is useful, since it means that we’ll get one match for the prefix (in the first set of parentheses of our regular expression), and one match for the units (in the second set of parentheses of our regular expression). I don’t want to belabor the details of this, but if we feed the output of `regexec()` to the `regmatches()` command, we can pull out one string for our prefix and another for our units. Since these are returned as a list, we’ll also use `unlist()` to coerce our results into one nice vector. If the length of that vector is 0, indicating no match, an error is generated.

```
match.unit.in <- unlist(regmatches(unitin, regexec(unit.search, unitin)))
match.unit.out <- unlist(regmatches(unitout, regexec(unit.search, unitout)))
if (length(match.unit.in) == 0) stop(paste0("Can't parse input units (", unitin, ")"))
if (length(match.unit.out) == 0) stop(paste0("Can't parse output units (", unitout, ")"))
```
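As a standalone sanity check (a sketch, rebuilding the same `unit.search` pattern literally rather than with `paste0()`), here’s what that match vector looks like for `"dL"`:

```r
# The finished regular expression, written out by hand for illustration
unit.search <- "^(y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y)?(mol|g|L|U|IU)$"

# Full match, then one entry per parenthesized group: prefix and base unit
unlist(regmatches("dL", regexec(unit.search, "dL")))
# [1] "dL" "d"  "L"
```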

If we were to take a closer look at `match.unit.in`, we would see that the first entry is the full match, the second entry is the prefix match, and the third entry is the unit match. To make sure that the units agree (i.e. that we’re not trying to convert grams into liters or something similar), we use:

` if (match.unit.in[3] != match.unit.out[3]) stop("Base units don't match")`

…and then finish by using the `match()` command to find the index in the `metric.prefixes` vector corresponding to the correct prefix (note that if there’s no prefix matched, it matches the `""` entry of the vector, which is *very* handy). That index allows us to pull out the corresponding log multiplier, and we then return the difference to get a conversion factor. Our final function looks like this[1]:

```
convert.one.unit <- function (unitin, unitout) {
  # the prefix codes for the metric system
  metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "", "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
  # ...and their corresponding log multipliers
  metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0, 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)
  # The units that we'd like to detect. I guess we could add distance, but that's not too relevant to most of the analytes that I can think of
  units.for.lab <- c("mol", "g", "L", "U", "IU")
  # remove white space
  unitin <- gsub("\\s", "", unitin)
  unitout <- gsub("\\s", "", unitout)
  # build the pieces of our regular expression...
  prefix.combo <- paste0(metric.prefixes, collapse = "|")
  unit.combo <- paste0(units.for.lab, collapse = "|")
  # ...and stitch it all together
  unit.search <- paste0("^(", prefix.combo, ")?(", unit.combo, ")$")
  # identify the matches
  match.unit.in <- unlist(regmatches(unitin, regexec(unit.search, unitin)))
  match.unit.out <- unlist(regmatches(unitout, regexec(unit.search, unitout)))
  if (length(match.unit.in) == 0) stop(paste0("Can't parse input units (", unitin, ")"))
  if (length(match.unit.out) == 0) stop(paste0("Can't parse output units (", unitout, ")"))
  if (match.unit.in[3] != match.unit.out[3]) stop("Base units don't match")
  # get the appropriate log multipliers
  logmult.in <- metric.logmultipliers[match(match.unit.in[2], metric.prefixes)]
  logmult.out <- metric.logmultipliers[match(match.unit.out[2], metric.prefixes)]
  # return the appropriate (log) conversion factor
  return(logmult.in - logmult.out)
}

# Try it out
convert.one.unit("mL", "L")
```

`## [1] -3`

We’re actually most of the way there now. The final family of commands that we’d like to use is `apply()`, with various flavors that allow you to repeatedly apply (no surprise) a function to many entries of a variable. Dan mentioned this in his last post. He also mentioned not understanding the bad press that `for` loops get when they’re small. I completely agree with him, but the issue tends to arise when you’re used to a language like C (yes, I know we’re talking about compiled vs. interpreted in that case), where your loops are blazingly fast. You come to R and try nested loops that run from 1:10000, and then you have to go for coffee. `lapply()`, `sapply()`, `mapply()`, `apply()`, etc. have advantages in the R world. Might as well go with the flow on this one.

We’re going to make a `convert.multiple.units()` function that takes `unitsin` and `unitsout` vectors, binds them together as two columns, and then runs `apply()` to feed them to `convert.one.unit()`. Because `apply()` lets us iterate a function over either dimension of a matrix, we can bind the two columns (a vector of original units and a vector of target units) and then iterate over each pair by rows (that’s what the `1` means as the second argument of `apply()`: it applies the function by row). If the anonymous function syntax throws you off…let us know in the comments, and we’ll cover it some time. For now, just understand that the last part of the line feeds values to the `convert.one.unit()` function.

```
convert.multiple.units <- function (unitsin, unitsout) {
  apply(cbind(unitsin, unitsout), 1, function (x) {convert.one.unit(x[1], x[2])})
}
```
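The `cbind()`-then-`apply(…, 1, …)` row-iteration idiom is easier to see in isolation with a toy example (hypothetical strings, just to show the mechanics):

```r
# Two parallel vectors bound into a two-column matrix
pairs <- cbind(c("a", "b"), c("x", "y"))

# apply(..., 1, ...) hands each ROW to the anonymous function as a vector x
apply(pairs, 1, function(x) paste0(x[1], "->", x[2]))
# [1] "a->x" "b->y"
```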

Finally, we’ll go back to our original `labunit.convert()` function. Our overall plan is to split each unit by recognizing the “/” character using `strsplit()`. This returns a list of vectors of split groups (i.e. “mg/dL” becomes a list whose first element is the character vector (“mg”, “dL”)). We then make sure that the lengths match (i.e. if the input is “mg/dL” and the output is “g/mL” that’s OK, but if the output is “g” then that’s a problem), obtain all the multipliers, and then add them all up. We add because they’re logs…and actually we mostly subtract, because we’re dividing. For cuteness points, we return `2*x[1] - sum(x)`, which will accurately calculate not only conversions like mg→g and mg/dL→g/L, but will even do crazy stuff like U/g/L→mU/kg/dL. Don’t ask me why you’d want to do that, but it works. The final multiplier is used to convert the vector of values (good for you if you notice that we didn’t check to make sure that the length of the `values` vector matched the `unitsin` vector…but we can always recycle our values that way).

```
labunit.convert <- function (values, unitsin, unitsout) {
  insep <- strsplit(unitsin, "/")
  outsep <- strsplit(unitsout, "/")
  lengthsin <- sapply(insep, length)
  lengthsout <- sapply(outsep, length)
  if (!all(lengthsin == lengthsout)) stop("Input and output units can't be converted")
  multipliers <- mapply(convert.multiple.units, insep, outsep)
  final.multiplier <- apply(t(multipliers), 1, function (x) {2*x[1] - sum(x)})
  return(values * 10^final.multiplier)
}
```
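To see why `2*x[1] - sum(x)` does the right thing, here is the arithmetic for the original 2.3 mg/dL → 0.023 g/L example worked in isolation (a sketch; the entries of `x` are the log factors that `convert.one.unit()` would return for each piece of the compound unit):

```r
# mg -> g contributes a log10 factor of -3; dL -> L contributes -1
x <- c(-3, -1)

# numerator minus denominator: x[1] - x[2], which is exactly 2*x[1] - sum(x)
final.multiplier <- 2*x[1] - sum(x)   # -3 - (-1) = -2

2.3 * 10^final.multiplier             # 0.023
```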

OK, enough. Back over to you, Dan. We now have a piece of code that we can use when we start comparing PT data from different instruments. That’s the immediate plan for future posts[2], and before long there may even be an entry with nice graphics like those of my Canadian colleague.

-SRM

1. I received a request to convert “G/L” to “M/mL”, which was interpreted as converting billions/L to millions/mL. This requires changing our `convert.one.unit()` function to handle a “no units” case. Actually, it’s not as difficult as it sounds; if we just add an empty string (i.e. `""`) to the end of the `units.for.lab` vector, our regular expression does the right thing. Your edited line would read `units.for.lab <- c("mol", "g", "L", "U", "IU", "")`. The reason this works, incidentally, is that there’s no overlap (except `""`) between the prefixes and the units, so the pattern match doesn’t have a chance to be confused.↩
2. Following Dan’s lead, I should point out that a major caveat to any such plans is James 4:13-15. Double extra credit if you are interested enough to look it up.↩

To **leave a comment** for the author, please follow the link and comment on his blog: ** The Lab-R-torian**.
