SAS and R

Project MOSAIC migrates to ggformula

2018-08-27T09:47:00.000-04:00

guest entry by Randall Pruim

In 2017, Project MOSAIC announced ggformula, a new package that provides a formula interface to ggplot2 graphics in R. (See, for example, ggformula: another option for teaching graphics in R to beginners.) This package provides a happy medium between lattice and ggplot2 that allows beginners to “do powerful things quickly” by adopting the formula interface of lattice and R’s statistical modeling functions as a means to produce ggplot2 graphics.

Over the past year, our experience with ggformula in our classes and in faculty development workshops, together with the feedback we have received from other users, has demonstrated ggformula to be flexible, yet easy to learn. As part of an ecosystem that emphasizes the formula interface of lattice and the core R statistical modeling functions early on and adds tidyverse concepts later, ggformula fits better with the rest of our toolkit than do either lattice or ggplot2, providing opportunities for more creativity with less volume.

The recent releases of several Project MOSAIC R packages (mosaic, mosaicData, mosaicCore, and ggformula) and the related fastR2 package mark the official migration of Project MOSAIC from lattice to ggformula as its primary graphics system. Future development includes plans to release an updated version of mosaicModel, which will interoperate with ggformula, and a new package called ggformulaExtra (currently only available via GitHub), which adds functionality but relies on packages beyond ggplot2.

Many of the recent changes to the Project MOSAIC suite of packages will go largely unnoticed by most users but were necessary to allow ggformula to interoperate with the newest version of ggplot2. Among the small number of more noticeable changes are a change in gf_smooth() so that it no longer displays confidence bands by default (use se = TRUE to turn them on), expanded support for “rugs”, support for horizontal versions of histograms, boxplots, and violin plots (using the ggstance package), and the addition of gf_sf() for improved support for choropleth maps (based on the new geom_sf() in ggplot2). Along the way, we also did some light housekeeping (improving documentation, etc.) and migrated most of our package examples from lattice to ggformula.
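
As a small illustration of the gf_smooth() change, here is a minimal sketch. It assumes the ggformula and mosaicData packages are installed; the KidsFeet data and the variables plotted are chosen only for illustration and are not part of the release notes.

```r
library(ggformula)   # formula interface to ggplot2 (also provides %>%)
library(mosaicData)  # provides the KidsFeet data (illustrative choice)

# Default behaviour: the smoother is drawn without a confidence band
p_plain <- gf_point(length ~ width, data = KidsFeet) %>%
  gf_smooth()

# se = TRUE turns the confidence band back on
p_band <- gf_point(length ~ width, data = KidsFeet) %>%
  gf_smooth(se = TRUE)

p_band  # print the plot object to display it
```

Comparing the two plots side by side should make the difference in defaults easy to spot.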

The basic form of the formula interface is

goal(y ~ x, data = myData)

which corresponds to SAS code like

PROC GOAL DATA = MYDATA; MODEL Y = X; RUN;

goal() can be replaced by a graphing (e.g., gf_point()) or modeling (e.g., lm()) function with the number of variables involved in the formula varying with the complexity of the plot or model desired.

library(mosaic)              # load the mosaic package (and ggformula)
gf_point(length ~ width, data = KidsFeet)                  # scatter plot 
      lm(length ~ width, data = KidsFeet) %>% msummary()   # linear model

##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.8172     2.9381   3.341  0.00192 ** 
## width         1.6576     0.3262   5.081  1.1e-05 ***
## 
## Residual standard error: 1.025 on 37 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.3951 
## F-statistic: 25.82 on 1 and 37 DF,  p-value: 1.097e-05

Users of lattice-based Project MOSAIC materials should have little trouble migrating to ggformula since the types of plots that were easiest to construct with lattice can be created very similarly using ggformula. For example, the following two commands are essentially equivalent (although the resulting plots have a different appearance).

    histogram( ~ age | sex, data = HELPrct,    width = 2, col  = "navy")
gf_dhistogram( ~ age | sex, data = HELPrct, binwidth = 2, fill = "navy")

It is much simpler, however, to create complex plots using ggformula because multiple layers can be stacked using the magrittr pipe (%>%, which we often read as “then”) familiar to users of the tidyverse suite of packages (and many others as well).

gf_jitter(Sepal.Length ~ Sepal.Width, data = iris, color = ~ Species) %>%
  gf_density2d(alpha = 0.4) %>%
  gf_jitter(geom = "rug", alpha = 0.7) %>%
  gf_lm(linetype = "dashed") %>%
  gf_refine(scale_color_brewer(type = "qual"))

As part of the migration to ggformula, a number of related resources have been or are being converted from lattice to ggformula as well. These include companion volumes for several popular statistics text books, our series of “Little Books”, the Minimal R Vignette, and a side-by-side comparison of lattice and ggformula. In addition, the second edition of Foundations and Applications of Statistics (Pruim, 2018) uses ggformula throughout.

An eventual migration from ggformula to native ggplot2, while not strictly necessary (since the same plots can be made in either system), is easier than the migration from lattice since the underlying grammar and much of the nomenclature of ggformula is borrowed from ggplot2. In the meantime, equivalent ggformula code is generally less verbose and simpler for novices to understand and produce. And the use of %>% for layering avoids the errors that creep in when moving between the tidyverse, which also uses %>%, and ggplot2, which uses +. Indeed, data flows can be directed seamlessly into ggformula plotting commands. This can be useful as a debugging step when creating data pipelines or as a way to create a plot for which there is no need to save the pre-processed data.

Galton %>%
  filter(sex == "M") %>%      # select only male adult children
  group_by(family) %>%        # within each family ...
  sample_n(1) %>%             # ... choose only one male
  ungroup() %>%               # then drop the grouping
  mutate(                     # compute z-scores for parents' heights
    zfather = round(mosaic::zscore(father), 2),
    zmother = round(mosaic::zscore(mother), 2)
  ) %>%
  gf_jitter(zfather ~ zmother, alpha = 0.5,
            title = "Standardized heights of parents",
            caption = "Source: Galton") %>%
  gf_lm()

It has been over a year since I have used either lattice or ggplot2 for anything other than comparison examples. My co-authors and I have found the switch from lattice to ggformula to be both straightforward (for us) and advantageous (for our students). We encourage you to give it a try in your own work and with your students.

ggformula: another option for teaching graphics in R to beginners

2017-09-21T09:03:00.000-04:00

A previous entry (http://sas-and-r.blogspot.com/2017/07/options-for-teaching-r-to-beginners.html) describes an approach to teaching graphics in R that also “get[s] students doing powerful things quickly”, as David Robinson suggested.

In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here's Randall:

For a number of years I and several of my colleagues have been teaching R to beginners using an approach that includes a combination of

the lattice package for graphics,
several functions from the stats package for modeling (e.g., lm(), t.test()), and
the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.

Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.

goal ( y ~ x , data = mydata, ... )

Many data analysis operations can be executed by filling in four pieces of information (goal, y, x, and mydata) with the appropriate information for the desired task. This allows students to become fluent quickly with a powerful, coherent toolkit for data analysis.
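
To make the template concrete, here is a minimal sketch that fills in goal, y, x, and mydata four different ways. It deliberately uses only base R and the built-in mtcars data so that it runs without extra packages; these are stand-ins for the lattice and mosaic functions used in our courses, not the course materials themselves.

```r
# Each call fills the same template: goal(y ~ x, data = mydata)
# mtcars ships with R; mpg, wt, cyl, and am are columns of that data set.

fit   <- lm(mpg ~ wt, data = mtcars)                      # goal: linear model
tt    <- t.test(mpg ~ am, data = mtcars)                  # goal: two-sample t-test
means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)  # goal: group means
boxplot(mpg ~ cyl, data = mtcars)                         # goal: side-by-side boxplots

coef(fit)  # intercept and slope of the fitted line
means      # one row per value of cyl
```

Only the goal and the variable names change from line to line; the shape of the call stays the same.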

Trouble in paradise
As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into sub-plots (facets) is easy, but the labeling of such plots is less convenient (and takes more space) than in the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing.

On the other hand, introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a “Less Volume, More Creativity” approach, ggplot2 is tough to justify.

ggformula: The third-and-a-half way

Danny Kaplan and I recently introduced ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that this provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics).

For simple plots, the only thing that changes is the name of the plotting function. Each of these functions begins with gf. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post.
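
The original plots are not reproduced here, so the following is a sketch of what two such examples might look like. It assumes ggformula, dplyr, and ggplot2 (for the diamonds data) are installed, and it rebuilds the two-color recoded data from the earlier diamonds entry; the object names p_box and p_violin are illustrative only.

```r
library(ggformula)  # gf_boxplot(), gf_violin(), and the %>% pipe
library(dplyr)      # filter(), mutate()

# Rebuild the two-color subset used in the earlier entry
recoded <- ggplot2::diamonds %>%
  filter(color %in% c("D", "J")) %>%
  mutate(col = as.character(color))

p_box    <- gf_boxplot(price ~ col, data = recoded)  # side-by-side boxplots
p_violin <- gf_violin(price ~ col, data = recoded)   # side-by-side violin plots

p_box
p_violin
```

Apart from the function name, each call has the same shape as the lattice call bwplot(col ~ price, data = recoded), although here the response is on the left of the formula.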

We can even overlay these two types of plots to see how they compare. To do so, we simply place what I call the "then" operator (%>%, also commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.
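
A minimal sketch of such an overlay follows, again assuming ggformula and ggplot2 are installed and using the two-color diamonds subset from the earlier entry. The alpha value of 0.4 is an arbitrary choice for illustration.

```r
library(ggformula)

# Two-color subset of the diamonds data (built with base R here)
recoded <- subset(ggplot2::diamonds, color %in% c("D", "J"))
recoded$col <- as.character(recoded$color)

# Violin layer first, "then" a boxplot layer on top.
# The second layer inherits the formula and data from the first.
p_overlay <- gf_violin(price ~ col, data = recoded, alpha = 0.4) %>%
  gf_boxplot(alpha = 0.4)

p_overlay
```

Because both layers are semi-transparent, the violin outline remains visible behind the boxes.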

Comparing groups

Groups can be compared either by overlaying multiple groups distinguishable by some attribute (e.g., color)

or by creating multiple plots arranged in a grid rather than overlaying subgroups in the same space. The ggformula package provides two ways to create these facets. The first uses | very much like lattice does. Notice that the gf_lm() layer inherits information from the gf_point() layer in these plots, saving some typing when the information is the same in multiple layers.

The second way adds facets with gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired.
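
The three approaches described in this section can be sketched as follows. The original figures are not reproduced here, so the KidsFeet data (from mosaicData) and the variables used are assumptions made purely for illustration.

```r
library(ggformula)
library(mosaicData)  # provides KidsFeet (illustrative choice of data)

# 1. Overlay the groups, distinguished by color
p_color <- gf_point(length ~ width, data = KidsFeet, color = ~ sex) %>%
  gf_lm()

# 2. Facets via | in the formula, much as in lattice
p_bar <- gf_point(length ~ width | sex, data = KidsFeet) %>%
  gf_lm()

# 3. Facets added as a separate step in the chain
p_wrap <- gf_point(length ~ width, data = KidsFeet) %>%
  gf_lm() %>%
  gf_facet_wrap(~ sex)

p_wrap
```

In each chain gf_lm() is called with no formula or data of its own; it picks these up from the gf_point() layer before it.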

Fitting into the tidyverse work flow

ggformula also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function and there is no need to switch between %>% and + when moving from data transformations to plot operations.
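
Here is a minimal sketch of such a pipeline, assuming dplyr, ggformula, and mosaicData are installed. The filtering step and the derived variable ratio are hypothetical, invented only to have some data wrangling ahead of the plot.

```r
library(dplyr)
library(ggformula)
library(mosaicData)

p <- KidsFeet %>%
  filter(sex == "G") %>%                # keep one group
  mutate(ratio = length / width) %>%    # derive a new variable (illustrative)
  gf_point(ratio ~ length) %>%          # data arrives through the pipe
  gf_lm()                               # add a layer with the same operator

p
```

The same %>% carries the data through the wrangling steps, into the first plotting call, and on to the additional layer.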

Summary

The “Less Volume, More Creativity” approach is based on a common formula template that has served well for several years, but the arrival of ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactical structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail.

-- Randall Pruim

Options for teaching R to beginners: a false dichotomy?

2017-07-27T15:07:00.003-04:00

I've been reading David Robinson's excellent blog entry "Teach the tidyverse to beginners" (http://varianceexplained.org/r/teach-tidyverse), which argues that a tidyverse approach is the best way to teach beginners. He summarizes two competing curricula:

1) "Base R first": teach syntax such as $ and [[]], built in functions like ave() and tapply(), and use base graphics

2) "Tidyverse first": start from scratch with pipes (%>%) and leverage dplyr and use ggplot2 for graphics

If I had to choose one of these approaches, I'd also go with 2) ("Tidyverse first"), since it helps to move us closer to helping our students "think with data" using more powerful tools (see here for my sermon on this topic).

A third way

Of course, there’s a third option that addresses David’s imperative to "get students doing powerful things quickly". The mosaic package was written to make R easier to use in introductory statistics courses. The package is part of Project MOSAIC (http://mosaic-web.org), an NSF-funded initiative to integrate statistics, modeling, and computing. A paper outlining the mosaic package's "Less Volume, More Creativity" approach was recently published in the R Journal (https://journal.r-project.org/archive/2017/RJ-2017-024). To his credit, David mentions the mosaic package in a response to one of the comments on his blog.

Less Volume, More Creativity

One of the big ideas in the mosaic package is that students build on the existing formula interface in R as a mechanism to calculate summary statistics, generate graphical displays, and fit regression models. Randy Pruim has dubbed this approach "Less Volume, More Creativity".

While teaching this formula interface involves adding a new learning outcome (what is "Y ~ X"?), the mosaic approach simplifies calculation of summary statistics by groups and the generation of two or three dimensional displays on day one of an introductory statistics course (see for example Wang et al., "Data Viz on Day One: bringing big ideas into intro stats early and often" (2017), TISE).

The formula interface also prepares students for more complicated models in R (e.g., logistic regression, classification).

Here's a simple example using the diamonds data from the ggplot2 package. We model the relationships between two colors (D and J), number of carats, and price.

I'll begin with a bit of data wrangling to generate an analytic dataset with just those two colors. (Early in a course I would either hide the next code chunk or make the recoded dataframe accessible to the students to avoid cognitive overload.) Note that an R Markdown file with the following commands is available for download at https://nhorton.people.amherst.edu/mosaic-blog.Rmd.

library(mosaic)
recoded <- diamonds %>%
  filter(color=="D" | color=="J") %>%
  mutate(col = as.character(color))

We first calculate the mean price (in US$) for each of the two colors.

mean(price ~ col, data = recoded)

   D    J 
3170 5324

This call is an example of how the formula interface facilitates calculation of a variable's mean for each of the levels of another variable. We see that D color diamonds tend to cost less than J color diamonds.

A handy function in mosaic is favstats(), which provides a useful set of summary statistics (including sample size and missing values) by group.

favstats(price ~ col, data = recoded)

col	min	Q1	median	Q3	max	mean	sd	n	missing
D	357	911	1838	4214	18693	3170	3357	6775	0
J	335	1860	4234	7695	18710	5324	4438	2808	0

A similar command can be used to generate side by side boxplots. Here we illustrate the use of lattice graphics. (An alternative formula based graphics system (ggformula) will be the focus of a future post.)

bwplot(col ~ price, data = recoded)

The distributions are skewed to the right (not surprisingly since they are prices). If we wanted to formally compare these sample means we could do so with a two-sample t-test (or in a similar fashion, by fitting a linear model).

t.test(price ~ col, data = recoded)

    Welch Two Sample t-test

data:  price by col
t = -20, df = 4000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2336 -1971
sample estimates:
mean in group D mean in group J
           3170            5324

msummary(lm(price ~ col, data = recoded))

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3170.0       45.0    70.4   <2e-16 ***
colJ          2153.9       83.2    25.9   <2e-16 ***

Residual standard error: 3710 on 9581 degrees of freedom
Multiple R-squared: 0.0654, Adjusted R-squared: 0.0653
F-statistic: 670 on 1 and 9581 DF, p-value: <2e-16

The results from the two approaches are consistent: the group differences are highly statistically significant. We could conclude that J diamonds tend to cost more than D diamonds, back in the population of all diamonds.

Let's do a quick review of the mosaic modeling syntax to date:

mean(price ~ col)
bwplot(price ~ col)
t.test(price ~ col)
lm(price ~ col)

See the pattern?

On a statistical note, it's important to remember that the diamonds were not randomized into colors: this is a found (observational) dataset, so there may be other factors at play. The revised GAISE College report reiterates the importance of multivariate thinking in intro stats.

Moving to three dimensions

Let's continue with the "Less Volume, More Creativity" approach to bring in a third variable: the number of carats in each diamond.

xyplot(price ~ carat, groups=col, auto.key=TRUE, type=c("p", "r"), data = recoded)

We see that controlling for the number of carats, the D color diamonds tend to sell for more than the J color diamonds. We can confirm this by fitting a regression model that controls for both variables (and then display the resulting predicted values from this parallel slopes model using plotModel()).
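
The model output is not reproduced here, so the following is a sketch of how such a model could be fit. It uses base R's lm() and needs ggplot2 only for the diamonds data; the object names unadjusted and adjusted are illustrative, and the plotModel() display is left out.

```r
# ggplot2 is needed only for the diamonds data
recoded <- subset(ggplot2::diamonds, color %in% c("D", "J"))
recoded$col <- as.character(recoded$color)

unadjusted <- lm(price ~ col, data = recoded)          # color only
adjusted   <- lm(price ~ carat + col, data = recoded)  # parallel slopes model

coef(unadjusted)["colJ"]  # positive: J diamonds cost more on average
coef(adjusted)["colJ"]    # negative: at a fixed carat, J diamonds cost less
```

The sign of the colJ coefficient flips once carat enters the model, which is the reversal discussed next.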

This is a great example of Simpson's paradox: accounting for the number of carats has yielded opposite results from a model that didn't include carats. If we were to move forward with such an analysis we'd need to be sure to undertake an assessment of our model and verify conditions and assumptions (but for the purpose of the blog entry I'll defer that).

Moving beyond mosaic

The revised GAISE College report enunciated the importance of technology when teaching statistics. Many courses still use calculators or web-based applets to incorporate technology into their classes. R is an excellent environment for teaching statistics, but many instructors feel uncomfortable using it (particularly if they feel compelled to teach the $ and [[]] syntax, which many find offputting). The mosaic approach helps make the use of R feasible for many audiences by keeping things simple. It's unfortunately true that many introductory statistics courses don't move beyond bivariate relationships (so students may feel paralyzed about what to do about other factors). The mosaic approach has the advantage that it can bring multivariate thinking, modeling, and exploratory data tools together with a single interface (and modest degree of difficulty in terms of syntax). I've been teaching multiple regression as a descriptive method early in an intro stat course for the past ten years (and it helps to get students excited about material that they haven't seen before). The mosaic approach also scales well: it's straightforward to teach students dplyr/tidyverse data wrangling by adding in the pipe operator and some key data idioms. (So perhaps the third option should be labeled "mosaic and tidyverse".)

See the following for an example of how favstats() can be replaced by dplyr idioms.

recoded %>%
  group_by(col) %>%
  summarize(meanval = mean(price, na.rm = TRUE))

col	meanval
D	3170
J	5324

That being said, I suspect that many students (and instructors) will still use favstats() for simple tasks (e.g., to check sample sizes, check for missing data, etc). I know that I do. But the important thing is that unlike training wheels, mosaic doesn't hold them back when they want to learn new things. I'm a big fan of ggplot2, but even Hadley agrees that the existing syntax is not what he wants it to be. While it's not hard to learn to use + to glue together multiple graphics commands and to get your head around aesthetics, teaching ggplot2 adds several additional learning outcomes to a course that's already overly pregnant with them.

Side note

I would argue that a lot of what is in mosaic should have been in base R (e.g., formula interface to mean(), data= option for mean()). Other parts are more focused on teaching (e.g., plotModel(), xpnorm(), and resampling with the do() function).

Closing thoughts

In summary, I argue that the mosaic approach is consistent with the tidyverse. It dovetails nicely with David's "Teach tidyverse" as an intermediate step that may be more accessible for undergraduate audiences without a strong computing background. I'd encourage people to check it out (and let Randy, Danny, and me know if there are ways to improve the package).

Want to learn more about mosaic? In addition to the R Journal paper referenced above, you can see how we get students using R quickly in the package's "Less Volume, More Creativity" and "Minimal R" vignettes. We also provide curated examples from commonly used textbooks in the “mosaic resources” vignette and a series of freely downloadable and remixable monographs including The Student’s Guide to R and Start Teaching with R.

thinking with data with "Modern Data Science with R"

2017-07-26T08:39:00.000-04:00

One of the biggest challenges educators face is how to teach statistical thinking integrated with data and computing skills to allow our students to fluidly think with data. Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. For example, how can one model high earnings as a function of other features that might be available for a customer? How do the results of a decision tree compare to a logistic regression model? How does one assess whether the underlying assumptions of a chosen model are appropriate? How are the results interpreted and communicated?

While there are a lot of other useful textbooks and references out there (e.g., R for Data Science, Practical Data Science with R, Intro to Data Science with Python) we saw a need for a book that incorporates statistical and computational thinking to solve real-world problems with data. The result was Modern Data Science with R, a comprehensive data science textbook for undergraduates that features meaty, real-world case studies integrated with modern data science methods. (Figure 8.2 above was taken from a case study in the supervised learning chapter.)

Part I (introduction to data science) motivates the book and provides an introduction to data visualization, data wrangling, and ethics. Part II (statistics and modeling) begins with fundamental concepts in statistics, supervised learning, unsupervised learning, and simulation. Part III (topics in data science) reviews dynamic visualization, SQL, spatial data, text as data, network statistics, and moving towards big data. A series of appendices cover the mdsr package, an introduction to R, algorithmic thinking, reproducible analysis, multiple regression, and database creation.

We believe that several features of the book are distinctive:

minimal prerequisites: while some background in statistics and computing is ideal, appendices provide an introduction to R, how to write a function, and key statistical topics such as multiple regression
ethical considerations are raised early, to motivate later examples
recent developments in the R ecosystem (e.g., RStudio and the tidyverse) are featured

Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in R/RStudio can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling statistical questions.

This book is intended to help readers with some background in statistics and modest prior experience with coding develop and practice the appropriate skills to tackle complex data science projects. We've taught a variety of courses using it, ranging from an introduction to data science, a sophomore level data science course, and as part of the components for a senior capstone class.

We've made three chapters freely available for download: data wrangling I, data ethics, and an introduction to multiple regression. An instructors solution manual is available, and we're working to create a series of lab activities (e.g., text as data). (The code to generate the above figure can be found in the supervised learning materials at http://mdsr-book.github.io/instructor.html.)

Modern Data Science with R

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

RStudio in the cloud with Amazon Lightsail and docker

2016-12-01T16:13:00.000-05:00

About two years ago we published a quick and easy guide to setting up your own RStudio server in the cloud using the Docker service and Digital Ocean. The process is incredibly easy-- about the only cumbersome part is retyping a random password. Today the excitement in virtual private servers is that Amazon is getting into the market, with their Lightsail product. They are not undercutting Digital Ocean entirely-- in fact, their prices look to be just about identical. But Amazon's interface may have some advantages for you, so here's how to get Docker and RStudio running with Amazon Lightsail.

1. Log in to Lightsail

2. Create an Instance; choose the Base OS, and Ubuntu (as of this writing 16.04 LTS)

3. Name it what you like

4. Wait for boot up. Once it's running, click "connect" under the three dots. This opens a console window where you are already logged in, saving some headache vs. Digital Ocean.

5. Time for console commands. Type: sudo apt-get install docker.io Then Y for yes to add the new material.

6. Type: sudo service docker start

7. Now you can start your docker/rstudio container. See our earlier blog post or this link for resources. Shortcuts:

a. Plain RStudio: sudo docker run -d -p 8787:8787 rocker/rstudio

b. All of Hadleyverse: sudo docker run -d -p 8787:8787 rocker/hadleyverse

c. Custom password: sudo docker run -d -p 8787:8787 -e USER=ken -e PASSWORD=ken rocker/hadleyverse

d. Enable root: sudo docker run -d -p 8787:8787 -e ROOT=TRUE rocker/rstudio

8. Important! While the container is starting, go back to the Lightsail tab in your browser and click on the three dots in the "Running" instance to Manage. Then click on the Networking tab. In the table of two enabled ports, click on the plus "Add Another". Leave "Custom" and "All" under "Application" and "Protocol", respectively, and change port range to 8787. Save.

9. The public IP is printed on the Networking page there. Cut and paste into your browser with :8787 appended. Your username and password are both rstudio, unless you changed them. To allow additional users onto your cloud server, see this page.

Set up RStudio in the cloud to work with GitHub

2016-01-17T16:41:00.000-05:00

I love GitHub for version control and collaboration, though I'm no master of it. And the tools for integrating git and GitHub with RStudio are just amazing boons to productivity.

Unfortunately, my University-supplied computer does not play well with GitHub. Various directories are locked down, and I can't push or pull to GitHub directly from RStudio. I can't even use install_github() from the devtools package, which is needed for loading Shiny applications up to Shinyapps.io. I lived with this for a bit, using git from the desktop and rsconnect from a home computer. But what a PIA.

Then I remembered I know how to put RStudio in the cloud-- why not install R there, and make that be my GitHub solution?

It works great. The steps are below. In setting it up, I discovered that Digital Ocean has changed their set-up a little bit, so I updated the earlier post as well.

1. Go to Digital Ocean and sign up for an account. By using this link, you will get a $10 credit. (Full disclosure: I will also get a $25 credit once you spend $25 real dollars there.) The reason to use this provider is that they have a system ready to run with Docker already built in, which makes it easy. In addition, their prices are quite reasonable. You will need to use a credit card or PayPal to activate your account, but you can play for a long time with your $10 credit-- the cheapest machine is $.007 per hour, up to a $5 per month maximum.

2. On your Digital Ocean page, click "Create droplet". Click on "One-click Apps" and select "Docker (1.9.1 on 14.04)". (The numbers in the parentheses are the Docker and Ubuntu versions, and might change over time.) Then choose a size (meaning cost/power) of machine and the region closest to you. You can ignore the settings. Give your new computer an arbitrary name. Then click "Create Droplet" at the bottom of the page.

3. It takes a few seconds for the droplet to spin up. Then you should see your droplet dashboard. If not, click "Droplets" from the top bar. Under "More", click "Access Console". This brings up a virtual terminal to your cloud computer. Log in (your username is root) using the password that Digital Ocean sent you when the droplet spun up.

4. Start your RStudio container by typing: docker run -d -p 8787:8787 -e ROOT=TRUE rocker/hadleyverse

You can replace hadleyverse with rstudio if you like, for a quicker first-time installation, but many R users will want enough of Hadley Wickham's packages that it makes sense to install this version. The -e ROOT=TRUE is crucial for our approach to installing git into the container, but see the comment below from Petr Simicek for another way to do the same thing.

5. Log in to your Cloud-based RStudio. Find the IP address of your cloud computer on the droplet dashboard, and append :8787 to it, and just put it into your browser. For example: http://135.104.92.185:8787. Log in as user rstudio with password rstudio.

6. Install git, inside the Docker container. Inside RStudio, click Tools -> Shell.... Note: you have to use this shell, it's not the same as using the droplet terminal. Type: sudo apt-get update and then sudo apt-get install git-core to install git.

git likes to know who you are. To set git up, from the same shell prompt, type git config --global user.name "Your Handle" and git config --global user.email "an.email@somewhere.edu"

7. Close the shell, and in RStudio, set things up to work with GitHub: Go to Tools -> Global Options -> Git/SVN. Click on create RSA key. You don't need a name for it. Create it, close the window, then view it and copy it.

8. Open GitHub, go to your Profile, click "Edit Profile", "SSH keys". Click "Add key", and just paste in the stuff you copied from RStudio in the previous step.

You're done! To clone an existing repo from GitHub to your cloud machine, open a new project in RStudio, and select Version Control, then Git, and paste in the URL that GitHub provides. Then work away!

R and SAS in the curriculum: getting students to "think with data"

2016-01-06T09:20:00.001-05:00

We're pleased to announce that a special issue of the American Statistician on "Statistics and the Undergraduate Curriculum" (November, 2015) is available at http://amstat.tandfonline.com/toc/utas20/69/4.

Johanna Hardin (Pomona College) and Nick were the guest editors. There are a number of excellent and provocative papers that reinforce the importance of computing using tools such as R and SAS that are likely to be of widespread interest to the community.

Table of Contents

Teaching the Next Generation of Statistics Students to “Think With Data”: Special Issue on Statistics and the Undergraduate Curriculum: Nicholas J. Horton & Johanna S. Hardin, DOI:10.1080/00031305.2015.1094283 (freely available)

Mere Renovation is Too Little Too Late: We Need to Rethink our Undergraduate Curriculum from the Ground Up: George Cobb, DOI:10.1080/00031305.2015.1093029 (freely available for a limited period)

Teaching Statistics at Google-Scale: Nicholas Chamandy, Omkar Muralidharan & Stefan Wager, DOI:10.1080/00031305.2015.1089790

Explorations in Statistics Research: An Approach to Expose Undergraduates to Authentic Data Analysis by Deborah Nolan & Duncan Temple Lang, DOI:10.1080/00031305.2015.1073624

Beyond Normal: Preparing Undergraduates for the Work Force in a Statistical Consulting Capstone by Byran J. Smucker & A. John Bailer, DOI:10.1080/00031305.2015.1077731

A Framework for Infusing Authentic Data Experiences Within Statistics Courses: Scott D. Grimshaw, DOI:10.1080/00031305.2015.1081106

Fostering Conceptual Understanding in Mathematical Statistics: Jennifer L. Green & Erin E. Blankenship, DOI:10.1080/00031305.2015.1069759

The Second Course in Statistics: Design and Analysis of Experiments? by Natalie J. Blades, G. Bruce Schaalje & William F. Christensen, DOI:10.1080/00031305.2015.1086437

A Data Science Course for Undergraduates: Thinking With Data: Ben Baumer, DOI:10.1080/00031305.2015.1081105

Data Science in Statistics Curricula: Preparing Students to “Think with Data”: J. Hardin, R. Hoerl, Nicholas J. Horton, D. Nolan, B. Baumer, O. Hall-Holt, P. Murrell, R. Peng, P. Roback, D. Temple Lang & M. D. Ward, DOI:10.1080/00031305.2015.1077729

Using Online Game-Based Simulations to Strengthen Students’ Understanding of Practical Statistical Issues in Real-World Data Analysis: Shonda Kuiper & Rodney X. Sturdivant, DOI:10.1080/00031305.2015.1075421

Combating Anti-Statistical Thinking Using Simulation-Based Methods Throughout the Undergraduate Curriculum: Nathan Tintle, Beth Chance, George Cobb, Soma Roy, Todd Swanson & Jill VanderStoep, DOI:10.1080/00031305.2015.1081619

What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum: Tim C. Hesterberg, DOI:10.1080/00031305.2015.1089789

Incorporating Statistical Consulting Case Studies in Introductory Time Series Courses: Davit Khachatryan, DOI:10.1080/00031305.2015.1026611

Developing a New Interdisciplinary Computational Analytics Undergraduate Program: A Qualitative-Quantitative-Qualitative Approach: Scotland Leman, Leanna House & Andrew Hoegh, DOI:10.1080/00031305.2015.1090337

From Curriculum Guidelines to Learning Outcomes: Assessment at the Program Level by Beth Chance & Roxy Peck, DOI:10.1080/00031305.2015.1077730

Program Assessment for an Undergraduate Statistics Major: Allison Amanda Moore & Jennifer J. Kaplan, DOI:10.1080/00031305.2015.1087331

The Cobb paper ("Mere Renovation is Too Little Too Late: We Need to Rethink Our Undergraduate Curriculum from the Ground Up") has an associated discussion with 19 provocative responses plus George's spirited rejoinder. These materials can be found on the TAS website or individually at http://www.amherst.edu/~nhorton/mererenovation/.

Discussion (and rejoinder):

Response from Albert and Glickman: Attracting undergraduates to statistics through data science
Response from Chance, Peck, and Rossman: Response to mere renovation is too little too late
Response from De Veaux and Velleman: Teaching statistics algorithmically or stochastically misses the point: why not teach holistically?
Response from Fisher and Bailar: Who, what, when and how: changing the undergraduate statistics curriculum
Response from Franklin: We need to rethink the way we teach statistics at K-12
Response from Gelman and Loken: Moving forward in statistics education while avoiding overconfidence
Response from Gould: Augmenting the vocabulary used to describe data
Response from Holcomb, Quinn, and Short: Seeking the niche for traditional mathematics within undergraduate statistics and data science curricula
Response from Kass: The gap between statistics education and statistical practice
Response from King: Training the next generation of statistical scientist
Response from Lane-Getaz: Stirring the curricular pot once again
Response from Notz: Vision or bad dream?
Response from Ridgway: Data Cowboys and Statistical Indians
Response from Temple Lang: Authentic data analysis experience
Response from Utts: Challenges, changes and choices in the undergraduate statistics curriculum
Response from Ward: Learning communities and the undergraduate statistics curriculum
Response from Wickham: Teaching Safe-Stats, not statistical abstinence
Response from Wild: Further, faster, wider
Response from Zieffler: Teardowns, historical renovation, and paint-and-patch: curricular changes and faculty development
Rejoinder by Cobb

Write in-line equations in your Shiny application with MathJax

2015-12-30T14:58:00.000-05:00

I've been working on a Shiny app and wanted to display some math equations. It's possible to use LaTeX to show math using MathJax, as shown in this example from the makers of Shiny. However, by default, MathJax does not allow in-line equations, because the dollar sign is used so frequently. But I needed to use in-line math in my application. Fortunately, the folks who make MathJax show how to enable the in-line equation mode, and the Shiny documentation shows how to write raw HTML. Here's how to do it.

R

Here I replicated the code from the official Shiny example linked above. The magic code is inserted into ui.R, just below withMathJax().
## ui.R

library(shiny)

shinyUI(fluidPage(
  title = 'MathJax Examples with in-line equations',
  withMathJax(),
  # section below allows in-line LaTeX via $ in MathJax
  tags$div(HTML("<script type='text/x-mathjax-config'>
                MathJax.Hub.Config({
                tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
                });
                </script>
                ")),
  helpText('An irrational number $\\sqrt{2}$
           and a fraction $1-\\frac{1}{2}$'),
  helpText('and a fact about $\\pi$:$\\frac2\\pi = \\frac{\\sqrt2}2 \\cdot
           \\frac{\\sqrt{2+\\sqrt2}}2 \\cdot
           \\frac{\\sqrt{2+\\sqrt{2+\\sqrt2}}}2 \\cdots$'),
  uiOutput('ex1'),
  uiOutput('ex2'),
  uiOutput('ex3'),
  uiOutput('ex4'),
  checkboxInput('ex5_visible', 'Show Example 5', FALSE),
  uiOutput('ex5')
))



## server.R
library(shiny)

shinyServer(function(input, output, session) {
  output$ex1 <- renderUI({
    withMathJax(helpText('Dynamic output 1:  $\\alpha^2$'))
  })
  output$ex2 <- renderUI({
    withMathJax(
      helpText('and output 2 $3^2+4^2=5^2$'),
      helpText('and output 3 $\\sin^2(\\theta)+\\cos^2(\\theta)=1$')
    )
  })
  output$ex3 <- renderUI({
    withMathJax(
      helpText('The busy Cauchy distribution
               $\\frac{1}{\\pi\\gamma\\,\\left[1 +
               \\left(\\frac{x-x_0}{\\gamma}\\right)^2\\right]}\\!$'))
  })
  output$ex4 <- renderUI({
    invalidateLater(5000, session)
    x <- round(rcauchy(1), 3)
    withMathJax(sprintf("If $X$ is a Cauchy random variable, then
                        $P(X \\leq %.03f ) = %.03f$", x, pcauchy(x)))
  })
  output$ex5 <- renderUI({
    if (!input$ex5_visible) return()
    withMathJax(
      helpText('You do not see me initially: $e^{i \\pi} + 1 = 0$')
    )
  })
})

Give it a try (or check out the Shiny app at https://r.amherst.edu/apps/nhorton/mathjax/)! One caveat is that the other means of in-line display, as shown in the official example, doesn't work when the MathJax HTML is inserted as above.

2015.2: Did the New England Patriots experience a decrease in fumbles starting in 2007?

2015-02-01T17:14:00.001-05:00

Here's a timely guest entry from Jeffrey Witmer (Oberlin College).

As the “Deflate Gate” saga was unfolding, Warren Sharp analyzed “touches per fumble” for NFL teams before and after 2006, when a rule was changed so that teams playing on the road could provide their own footballs (http://www.sharpfootballanalysis.com/blog/). Sharp noted that, for whatever reason, the Patriots went from being a typical team, as regards fumbling, to a team with a very low fumble rate. Rather than rely on the data that Sharp provides at his website, I chose to collect and analyze some data on my own. I took a random sample of 30 games played by New England and 30 other games. For each game, I recorded all rushing and passing plays (except for QB kneels), but excluded kicking plays (the NFL, rather than the teams, provides special footballs for those plays). (Data source: http://www.pro-football-reference.com/play-index/play_finder.cgi.) I also recorded the weather for the game. (Data source: http://www.wunderground.com/history/.) Once I had the data (in a file that I called AllBig, which can be downloaded from http://www.amherst.edu/~nhorton/AllBig.csv), I noted whether or not there was a fumble on each play, aided by the grep() command:

grep("Fumb", AllBig$Detail, ignore.case=TRUE)

I labeled each play as Late or not according to whether it happened after the rule change:

AllBig$Late <- ifelse(AllBig$Year > 2006, 1, 0)
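As a minimal sketch of this coding step, the lines below build the fumble and Late indicators on a tiny made-up data frame. The data frame plays and its three rows are hypothetical; only the column names Detail and Year follow the AllBig file described above. Note that grepl() returns TRUE or FALSE for every row, which is convenient for building an indicator, whereas grep() returns the positions of the matching rows.

```r
# Hypothetical toy data; only the column names mirror AllBig
plays <- data.frame(
  Detail = c("Pass complete for 8 yards",
             "Left tackle for 3 yards. FUMBLES, recovered",
             "Up the middle for 2 yards"),
  Year = c(2005, 2007, 2010),
  stringsAsFactors = FALSE
)

# grepl() gives one logical per row; as.numeric() turns it into 0/1
plays$Fumble <- as.numeric(grepl("Fumb", plays$Detail, ignore.case = TRUE))

# 1 for plays after the rule change, 0 otherwise
plays$Late <- ifelse(plays$Year > 2006, 1, 0)

plays[, c("Year", "Fumble", "Late")]
```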

Now for the analysis. My data set has 7558 plays including 145 fumbles (1.9%). I used the mosaic package and the tally() command to see how often teams other than the Patriots fumble:

require(mosaic)
tally(~Fumble+Late, data=filter(AllBig,Pats==0))

             Late 
Fumble     0    1      
 0      2588 2919      
 1        54   65

Then I asked for the data in proportion terms:

tally(Fumble~Late, data=filter(AllBig,Pats==0))

and got

               Late 
Fumble       0      1      
 0      0.9796 0.9782      
 1      0.0204 0.0218

For non-Pats there is a tiny increase in fumbles. This can be displayed graphically using a mosaic plot (though it's not a particularly compelling figure).

mosaicplot(Fumble~Late, data=filter(AllBig,Pats==0))

Repeating this for the Patriots shows a different picture:

tally(~Fumble+Late, data=filter(AllBig,Pats==1))       
         Late 
Fumble   0   1      
 0     996 910      
 1      19   7


tally(Fumble~Late, data=filter(AllBig,Pats==1))       
                Late 
Fumble       0       1      
 0     0.98128 0.99237      
 1     0.01872 0.00763

I fit a logistic regression model with the glm() command:

glm(Fumble~Late*Pats, family=binomial, data=AllBig)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.8697     0.1375  -28.14   <2e-16 ***
Late          0.0650     0.1861    0.35    0.727
Pats         -0.0897     0.2693   -0.33    0.739
Late:Pats    -0.9733     0.4819   -2.02    0.043 *
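Because this model is saturated (four groups, four coefficients), the estimates can be checked directly from the counts in the tally() tables above, without the play-level file. The sketch below is one way to do that; the data frame name counts and its column names are made up for the illustration, while the counts themselves are copied from the tables.

```r
# Fumble / no-fumble counts copied from the tally() tables above
counts <- data.frame(
  Pats     = c(0, 0, 1, 1),
  Late     = c(0, 1, 0, 1),
  fumble   = c(54, 65, 19, 7),
  nofumble = c(2588, 2919, 996, 910)
)

# Grouped-data form of the same logistic regression
fit <- glm(cbind(fumble, nofumble) ~ Late * Pats,
           family = binomial, data = counts)
round(coef(fit), 4)

# Fitted fumble probabilities equal the observed proportions in each cell
counts$p_hat <- fitted(fit)
counts

# Interaction on the odds-ratio scale
exp(coef(fit)["Late:Pats"])
```

On the odds-ratio scale the interaction is exp(-0.9733), or about 0.38: the change in the odds of a fumble from the early to the late period was roughly 0.38 times as large for the Patriots as for the other teams.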

I wanted to control for any weather effect, so I coded the weather as Bad if it was raining or snowing and good if not. This led to a model that includes BadWeather and Temperature – which turn out not to make much of a difference:

AllBig$BadWeather <- ifelse(AllBig$Weather %in% c("drizzle","rain","snow"), 1, 0)

glm(formula = Fumble ~ BadWeather + Temp + Late * Pats, 
  family = binomial, data = AllBig)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.23344    0.43164   -9.81   <2e-16 ***
BadWeather   0.33259    0.29483    1.13     0.26
Temp         0.00512    0.00612    0.84     0.40
Late         0.08871    0.18750    0.47     0.64
Pats        -0.14183    0.27536   -0.52     0.61
Late:Pats   -0.91062    0.48481   -1.88     0.06 .

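Because the rule change concerned the footballs used by teams playing on the road, and there was suspicion that something changed starting in 2007, I added a three-way interaction with an indicator for away games (IsAway):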

glm(formula = Fumble ~ BadWeather + Temp + IsAway * Late * Pats,
  family = binomial, data = AllBig)

Coefficients:                  
                    Estimate Std. Error z value Pr(>|z|)     
(Intercept)      -4.51110    0.47707   -9.46   <2e-16 *** 
BadWeather        0.34207    0.30013    1.14    0.254     
Temp              0.00831    0.00653    1.27    0.203     
IsAway            0.14791    0.27549    0.54    0.591     
Late              0.13111    0.26411    0.50    0.620     
Pats             -0.80019    0.54360   -1.47    0.141     
IsAway:Late      -0.07348    0.37463   -0.20    0.845     
IsAway:Pats       0.94335    0.63180    1.49    0.135     
Late:Pats         0.51536    0.71379    0.72    0.470     
IsAway:Late:Pats -3.14345    1.29480   -2.43    0.015 *

There is some evidence here that the Patriots fumble less than the rest of the NFL and that things changed in 2007. The p-values above are based on asymptotic normality, but there is a cleaner and easier way to think about the Patriots’ fumble rate. I wrote a short simulation that mimics something I do in my statistics classes, where I use a physical deck of cards to show what each step in the R simulation is doing.

#Simulation of deflategate data null hypothesis
Late = rep(1,72)  #creates 72 late fumbles
Early = rep(0,73)   #creates 73 early fumbles
alldata = append(Late,Early)   #puts the two groups together
table(alldata)  #check to see that we have what we want

cards =1:length(alldata)  # creates 145 cards, one "ID number" per fumble

FumbleLate = NULL  # initializes a vector to hold the results
for (i in 1:10000){# starts a loop that will be executed 10,000 times
  cardsgroup1 = sample(cards,119, replace=FALSE) # takes a sample of 119 cards
  cardsgroup2 = cards[-cardsgroup1]  # puts the remaining cards in group 2
  NEPats = (alldata[cardsgroup2])  #reads the values of the cards in group 2
  FumbleLate[i] = sum(NEPats)  # counts NEPats late fumbles (the only stat we need)
}

table(FumbleLate) #look at the results
hist(FumbleLate, breaks=seq(2.5,23.5)) #graph the results

sum(FumbleLate <= 7)/10000 # How rare is 7 (or fewer)? Answer: around 0.0086
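The simulation approximates a hypergeometric probability: 26 Patriots fumbles are drawn without replacement from 145 cards, of which 72 are marked Late. As a cross-check (not part of the original analysis), base R's phyper() gives the exact left-tail probability. The variable names below are only illustrative.

```r
n_late  <- 72   # late fumbles (cards marked 1)
n_early <- 73   # early fumbles (cards marked 0)
n_pats  <- 26   # Patriots fumbles = 145 - 119 cards left for group 2

# P(7 or fewer late fumbles among the 26 Patriots cards) under the null
p_exact <- phyper(7, m = n_late, n = n_early, k = n_pats)
p_exact
```

The exact value should be close to the simulated proportion, which varies a little from run to run.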

Additional note: kudos to Steve Taylor for the following graphical depiction of the interaction.

Example 2015.1: Time to refinance?

2015-01-05T13:45:00.000-05:00

In the US, it's typical to borrow a fairly substantial portion of the cost of a new house from a bank. The cost of these loans, the mortgage rate, varies over time depending on what the financial wizards see in their crystal balls. This means that when mortgage rates go down, the cost of living in your own house magically decreases-- you take a new loan at the lower rate, pay off your old loan with it, and then only have to pay off the new loan at the lower rate. You can find mortgage rate calculators on the web very easily-- if you don't mind their collecting your data and bombarding you with ads if you let their cookies trace you.

Instead, you can use SAS or R to calculate what you might pay for a new loan with various posted rates. There are some sophisticated tools available for either package if you're interested in the remaining principal or the proportion of each payment that's principal. Here, we just want to check the monthly payment.

R
We'll begin by writing a little function to calculate the monthly payment from the principal, interest rate (in per cent), and term (in years) of the loan. This is basic stuff, but the code here is adapted from a function written by Thomas Girke of UC Riverside.

mortgage <- function(principal=300000, rate=3.875, term=30) {
  J <- rate/(12 * 100)             # monthly interest rate as a proportion
  N <- 12 * term                   # number of monthly payments
  M <- principal*J/(1-(1+J)^(-N))  # monthly payment
  return(M)
}

To compare the monthly costs for a series of loans offered by a local bank, we'll input the bank's loans as a data frame. To save typing, we'll use the rep() function to generate the term of the loan and the points.

offers = data.frame(
  principal = rep(275000, times=9),
  term = rep(c(30,20,15), each=3), 
  points = rep(c(0,1,2), times=3),
  rate = c(3.875, 3.75, 3.5, 3.625, 3.5, 3.375, 3, 2.875, 2.75))

> offers

  principal term points  rate
1    275000   30      0 3.875
2    275000   30      1 3.750
3    275000   30      2 3.500
4    275000   20      0 3.625
5    275000   20      1 3.500
6    275000   20      2 3.375
7    275000   15      0 3.000
8    275000   15      1 2.875
9    275000   15      2 2.750

(Points are an up-front cost a borrower can pay to lower the mortgage rate for the loan.) With the data and function in hand, it's easy to add the monthly cost to the data frame:

offers$monthly = with(offers, mortgage(rate=rate, term=term, principal=principal))

> offers

  principal term points  rate  monthly
1    275000   30      0 3.875 1293.152
2    275000   30      1 3.750 1273.568
3    275000   30      2 3.500 1234.873
4    275000   20      0 3.625 1612.610
5    275000   20      1 3.500 1594.889
6    275000   20      2 3.375 1577.282
7    275000   15      0 3.000 1899.100
8    275000   15      1 2.875 1882.611
9    275000   15      2 2.750 1866.210

In theory, each of these costs is fair, and the borrower should choose based on the monthly cost they can afford, as well as whether they see a better value in having money in hand to spend on a better quality of life, to invest in savings, or to pay off their house sooner. Financial professionals often discuss things like the total dollars spent or the total spent on interest vs. principal, as well.
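For instance, total dollars paid and total interest follow directly from the monthly payment. The sketch below rebuilds the offers table so that it runs on its own; the helper name monthly_payment() and the columns total_paid and total_interest are illustrative, and the up-front cost of points is left out because the bank's price per point is not given here.

```r
# Monthly payment for a fixed-rate loan (rate in per cent, term in years)
monthly_payment <- function(principal, rate, term) {
  J <- rate / (12 * 100)   # monthly rate as a proportion
  N <- 12 * term           # number of payments
  principal * J / (1 - (1 + J)^(-N))
}

offers <- data.frame(
  principal = rep(275000, times = 9),
  term      = rep(c(30, 20, 15), each = 3),
  points    = rep(c(0, 1, 2), times = 3),
  rate      = c(3.875, 3.75, 3.5, 3.625, 3.5, 3.375, 3, 2.875, 2.75)
)

offers$monthly        <- with(offers, monthly_payment(principal, rate, term))
offers$total_paid     <- with(offers, monthly * 12 * term)
offers$total_interest <- with(offers, total_paid - principal)

# Sort the offers from least to most total interest
offers[order(offers$total_interest), ]
```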

SAS
The SAS/ETS package provides the LOAN procedure, which can calculate the detailed analyses mentioned above. For simple calculations like this one, we can use the mort function in the data step. It will find and return the missing one of the four parameters-- principal, payment, rate, and term. To enter the data in a manner similar to R, we'll use array statements and do loops.

data t;
principal = 275000;
array te [3] (30,20,15);
array po [3] (0,1,2);
array ra [9] (.03875, .0375, .035, .03625, .035,
              .03375, .03, .02875, .0275);
do i = 1 to 3;
  do j = 1 to 3;
    term = te[i];
    points = po[j];
    rate = ra[ 3 * (i-1) +j];
    monthly = mort(principal,.,rate/12, term*12);
    output;
  end;
end;
run;

proc print noobs data = t; 
var principal term points rate monthly; run;

principal    term    points      rate     monthly

  275000      30        0      0.03875    1293.15
  275000      30        1      0.03750    1273.57
  275000      30        2      0.03500    1234.87
  275000      20        0      0.03625    1612.61
  275000      20        1      0.03500    1594.89
  275000      20        2      0.03375    1577.28
  275000      15        0      0.03000    1899.10
  275000      15        1      0.02875    1882.61
  275000      15        2      0.02750    1866.21

RStudio in the cloud for dummies, 2014/2015 edition

2014-12-02T11:00:00.000-05:00

In 2012, we presented a post showing how to run RStudio in the cloud on an Amazon server. There were 7 steps, including one with 7 sub-steps, one of which had 6 sub-sub-steps. It was still pretty easy, for what it was-- an effectively free computer in the cloud to run R on.

Today, we show the modern-- 3 years later!-- way to get the same result, only this approach is much easier, and the resulting installation includes all the best goodies of RStudio, including Markdown -> PDF and Hadley Wickham's packages pre-installed. Update, 2016: Digital Ocean has changed its set-up slightly. Check out the first step or two of this post in place of the first two steps below, if you're just starting out.

The approach builds on Docker, an infrastructure that saves start-up time and overhead, as well as efforts led by Dirk Eddelbuettel and Carl Boettiger to develop a Docker application of R. This project is called Rocker, and interested readers are encouraged to read the details. But if you want to just get up and running, here are the simple steps to get going.

1. Go to Digital Ocean and sign up for an account. By using this link, you will get a $10 credit. (Full disclosure: Ken will also get a $25 credit once you spend $25 there.) The reason to use this provider is that they have a system ready to run with Docker already built in. In addition, their prices are quite reasonable. You will need to use a credit card or PayPal to activate your account, but you can play for a long time with your $10 credit-- the cheapest machine is $.007 per hour, up to a $5 per month maximum.

2. On your Digital Ocean page, click "Create droplet". Then choose an (arbitrary) name, a size (meaning cost/power) of machine, and the region closest to you. You can ignore the settings. Under "Select Image", choose the "Applications" tab and select "Docker (1.3.2 on 14.04)". (The numbers in the parentheses are the Docker and Ubuntu version, and might change over time.) Then click "Create Droplet" at the bottom of the page.

3. It takes about a minute for the machine to start up. When it's ready, click the "Console Access" button. This opens a text terminal to your Ubuntu machine, inside your web page. Press enter to get a prompt, and log in (your username is root) using the password that was sent to your e-mail. You'll have to change the password.

4a. To start a terminal session of R, type

docker run --rm -ti rocker/r-base

You should see a bunch of messages about pulling and downloading, but eventually you will get the ">" prompt-- you can do R in here, but who would want to?

4b. To get RStudio server running, type

docker run -d -p 8787:8787 rocker/rstudio

But this is really not where you want to be. Instead, run the following command, to get a set-up that includes more useful packages installed in and with R.

docker run -d -p 8787:8787 rocker/hadleyverse

5. Use it! The IP address of your server is displayed below the terminal where you typed in your docker command. Open a new browser tab and go to the address http://(ip address):8787. For example: http://135.104.92.185:8787. You'll see the RStudio login screen, and can enter "rstudio" (without the quotes) as the username and password. The system is well tuned enough that you can open a new file --> markdown --> PDF and immediately click "Knit PDF", and see the example document beautifully presented back to you in moments.

That's it. It's still way cooler than sliced bread. Let us know if you try it, and if you run into any trouble. Oh, and if you're feeling creeped out by the standard username and password in your RStudio, you can set them up from your docker command as follows.

docker run -d -p 8787:8787 -e USER=ken -e PASSWORD=ken rocker/hadleyverse

Other customization details and further information can be found on this Rocker page.

Update
I should perhaps have noted that what you are running here is in fact RStudio Server, and that you can allow additional users on your RStudio using instructions found here.

Example 2014.13: Statistics doesn't have to be so hard! Resampling in R and SAS

2014-11-17T13:30:00.000-05:00

A recent post pointed us to a great talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.

In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.

R

We'll repeat the exercise twice in R: first using the mosaic package that Nick and colleagues have been developing to help teach statistics, and then in base R.

For mosaic, we begin by entering the data and creating a dataframe. The do() operator and the shuffle() function facilitate carrying out a permutation test (see also section 5.4.5 of the book).

beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)

ds = data.frame(y = c(beer, water), 
                x = c(rep("beer", length(beer)), rep("water", length(water))))
require(mosaic)
obsdiff = compareMean(y ~ x, data=ds)
nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds)
histogram(~ result, xlab="permutation differences", data=nulldist)
ladd(panel.abline(v=obsdiff, col="red", lwd=2))

> obsdiff
[1] -4.377778
> tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist)

 TRUE FALSE 
  0.1  99.9

The do() operator evaluates the expression on the right hand side a specified number of times. In this case we shuffle (or permute) the group indicators.

We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results-- plotted with the lattice graphics package that mosaic loads by default-- demonstrates that this result would be highly unlikely to occur by chance: if the null hypothesis that the groups were equal were true, results more extreme than this would happen only about 1 time out of 1000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value by including the observed statistic in both the numerator and the denominator: (1+1)/(999 + 1) = .002.

For those not invested in the mosaic package, base R functions can be used to perform this analysis. We present a version here that begins after making the data vectors.

alldata = c(beer, water)
labels = c(rep("beer", length(beer)), rep("water", length(water)))
obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"])

> obsdiff
[1] 4.377778

The sample() function re-orders the labels, effectively implementing the supposition that the number of bites might have happened under either the water or the beer drinking regimen.

resample_labels = sample(labels)
resample_diff = mean(alldata[resample_labels=="beer"]) - 
                mean(alldata[resample_labels=="water"])

> resample_diff
[1] 1.033333

In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is replicate(). To use it, we make a function of the resampling procedure shown above.

resamp_means = function(data, labs){
  resample_labels = sample(labs)
  resample_diff = mean(data[resample_labels=="beer"]) - 
    mean(data[resample_labels=="water"])
  return(resample_diff)
}

nulldist = replicate(9999,resamp_means(alldata,labels))

hist(nulldist, col="cyan")
abline(v = obsdiff, col = "red")

The histogram is shown above. The p-value is the proportion of statistics (including the actual observed difference) that are at least as large in absolute value as the observed statistic:

alldiffs = c(obsdiff,nulldist)
p = sum(abs(alldiffs) >= abs(obsdiff)) / 10000
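Putting the base R pieces together, a self-contained version of the two-sided permutation test might look like the following. The seed and the function name perm_diff() are additions made here for reproducibility and illustration.

```r
beer  <- c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25,
           31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water <- c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18,
           16, 23, 20)

alldata <- c(beer, water)
labels  <- rep(c("beer", "water"), times = c(length(beer), length(water)))
obsdiff <- mean(alldata[labels == "beer"]) - mean(alldata[labels == "water"])

# One permutation: shuffle the labels and recompute the difference in means
perm_diff <- function(data, labs) {
  shuffled <- sample(labs)
  mean(data[shuffled == "beer"]) - mean(data[shuffled == "water"])
}

set.seed(2014)  # arbitrary seed, only so the run can be repeated
nulldist <- replicate(9999, perm_diff(alldata, labels))

# Two-sided p-value, counting the observed difference as one of 10,000
alldiffs <- c(obsdiff, nulldist)
p <- sum(abs(alldiffs) >= abs(obsdiff)) / length(alldiffs)
p
```

Taking absolute values on both sides of the comparison makes the test two-sided, so it does not matter whether the difference is computed as beer minus water or the reverse.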

SAS

The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the "line hold" double trailing at-sign (@@) and the infile datalines statement that allows us to specify a delimiter (other than a space) when reading data in directly in a data step. This lets us copy the data from the R code. To identify the water and beer regimen subjects, we use the _n_ implied variable that SAS creates but does not save with the data.

The summary procedure generates the mean for each group and saves the results in a data set with a row for each group; the transpose procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use call symput to make it into a macro variable for later use.

data bites;
if _n_ le 18 then drink="water";
  else drink="beer";
infile datalines delimiter=',';
input bites @@;
datalines;
21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20
27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24
28, 24, 29, 21, 21, 18, 27, 20
;
run;

proc summary nway data = bites;
class drink;
var bites;
output out=obsmeans mean=mean;
run;

proc transpose data = obsmeans out=om2;
var mean;
id drink;
run;

data om3; 
set om2;
obsdiff = beer-water;
call symput('obsdiff',obsdiff);
run;

proc print data = om3; var obsdiff; run;

            Obs    obsdiff
             1     4.37778

(Yes, we could have done this with proc ttest and ODS. But the spirit of the video is that we don't understand t-tests, so we want to avoid them.)

To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.

data rerand;
set bites;
randorder = uniform(0);
run;

proc sort data = rerand; by randorder; run;

data rerand2;
set rerand;
if _n_ le 18 then redrink = "water";
  else redrink = "beer";
run;

proc summary nway data = rerand2;
class redrink;
var bites;
output out=rerand3 mean=mean;
run;

proc transpose data = rerand3 out=rerand4;
var mean;
id redrink;
run;

data rrdiff; 
set rerand4;
rrdiff = beer-water;
run;

proc print data = rrdiff; var rrdiff; run;

            Obs     rrdiff
             1     -1.73778

One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting rrdiff into a data set. Instead, we use the surveyselect procedure to do this much more efficiently. The groups option samples groups of 18 and 25 from the data, while the reps option requests that this be done 9,999 times. We can then use the summary and transpose procedures as before, with the addition of the by replicate statement to make a data set with columns for each group mean and a row for each replicate.

proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp; run;

proc summary nway data = ssresamp;
by replicate;
class groupid;
var bites;
output out=ssresamp2 mean=mean;
run;

proc transpose data = ssresamp2 out=ssresamp3 prefix=group;
by replicate;
var mean;
id groupid;
run;

To get a p-value and make a histogram, we use the macro variable created earlier.

data ssresamp4;
set ssresamp3;
diff = group2 - group1;
exceeds = abs(diff) ge &obsdiff;
run;

proc means data = ssresamp4 sum; var exceeds; run;

             The MEANS Procedure
          Analysis Variable : exceeds
                          Sum
                    9.0000000

proc sgplot data = ssresamp4;
histogram diff;
refline &obsdiff /axis=x lineattrs=(thickness=2 color=red);
run;

The p-value is 0.001 (= (9+1)/10000).
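
The arithmetic behind that figure can be written as a one-line helper in R. The name perm_pvalue is ours and does not come from any package; the +1 terms reflect counting the observed arrangement as one of the possible relabelings.

```r
# Permutation p-value: the observed arrangement counts as one of the
# relabelings, hence the +1 in both numerator and denominator
perm_pvalue <- function(n_exceed, n_reps) (n_exceed + 1) / (n_reps + 1)

# 9 of the 9,999 replicates were at least as extreme as the observed difference
perm_pvalue(9, 9999)
```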

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. With exceptions noted above, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Example 2014.12: Changing repeated measures data from wide to narrow format

2014-10-21T14:11:00.000-04:00

Data with repeated measures often come to us in the "wide" format, as shown below for the HELP data set we use in our book. Here we show just an ID, the CESD depression measure from four follow-up assessments, plus the baseline CESD.

Obs    ID    CESD1    CESD2    CESD3    CESD4    CESD

  1     1       7        .        8        5      49
  2     2      11        .        .        .      30
  3     3      14        .        .       49      39
  4     4      44        .        .       20      15
  5     5      26       27       15       28      39
...

Frequently for data analysis we need to convert the data to the "long" format, with a single column for the repeated time-varying CESD measures and a column indicating the time of measurement. This is needed, for example, in SAS proc mixed or in the lme4 package in R. The data should look something like this:

 Obs    ID    time    CESD    cesd_tv

   1     1     1       49         7
   2     1     2       49         .
   3     1     3       49         8
   4     1     4       49         5
...

In section 2.3.7 (2nd Edition) we discuss this problem, and we provide an example in section 7.10.9. Today we're adding a blog post to demonstrate some handy features in SAS and how the problem can be approached using plain R and, alternatively, using the new-ish R packages dplyr and tidyr, contributed by Hadley Wickham.

R
We'll begin by making a narrower data frame with just the columns noted above. We use the select() function from the dplyr package to do this; the syntax is simply to provide the name of the input data frame as the first argument and then the names of the columns to be included in the output data frame. We use this function instead of the similar base R function subset(..., select=) because of dplyr's useful starts_with() function. This operates on the column names as character vectors in a hopefully obvious way.

load("c:/book/savedfile")

library(dplyr)
wide = select(ds, id, starts_with("cesd"))

Now we'll convert to the long format. The standard R approach is to use the reshape() function. The documentation for this is a bit of a slog, and the function can generate error messages that are not so helpful. But for simple problems like this one, it works well.

long = reshape(wide, varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
               v.names = "cesd_tv",
               idvar = c("id", "cesd"), direction="long")
long[long$id == 1,]

       id cesd time cesd_tv
1.49.1  1   49    1       7
1.49.2  1   49    2      NA
1.49.3  1   49    3       8
1.49.4  1   49    4       5

In the preceding, the varying parameter is a vector of the columns which vary over time, while the idvar columns appear at each time. The v.names parameter is the name of the column which will hold the values of the varying variables.
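
To see what each argument does without loading the HELP data, here is a small self-contained sketch. The data frame wide_toy is ours; it holds only the first two rows of the wide listing shown earlier, and the timevar and times arguments are spelled out even though they match the defaults.

```r
# First two subjects from the wide listing above, typed in by hand
wide_toy <- data.frame(id = 1:2, cesd = c(49, 30),
                       cesd1 = c(7, 11), cesd2 = c(NA_real_, NA_real_),
                       cesd3 = c(8, NA), cesd4 = c(5, NA))

long_toy <- reshape(wide_toy,
                    varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
                    v.names = "cesd_tv",      # column that receives the values
                    timevar = "time",         # default name, spelled out here
                    times = 1:4,              # one entry per varying column
                    idvar = c("id", "cesd"),  # carried along unchanged
                    direction = "long")

# One row per subject per time point
long_toy[order(long_toy$id, long_toy$time), ]
```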

Another option would be to use base R knowledge to separate, rename, and then recombine the data as follows. The main hassle here is renaming the columns in each separate data frame so that they can be combined later.

c1 = subset(wide, select= c(id, cesd, cesd1))
c1$time = 1
names(c1)[3] = "cesd_tv"

c2 = subset(wide, select= c(id, cesd, cesd2))
c2$time = 2
names(c2)[3] = "cesd_tv"

c3 = subset(wide, select= c(id, cesd, cesd3))
c3$time = 3
names(c3)[3] = "cesd_tv"

c4 = subset(wide, select= c(id, cesd, cesd4))
c4$time = 4
names(c4)[3] = "cesd_tv"

long = rbind(c1,c2,c3,c4)
long[long$id==1,]

     id cesd cesd_tv time
1     1   49       7    1
454   1   49      NA    2
907   1   49       8    3
1360  1   49       5    4

This is cumbersome, but effective.
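
If the repetition bothers you, the same separate-and-recombine idea can be wrapped in a loop. The following is a sketch on a hand-typed two-row data frame (wide_toy is our name), not a replacement for the code above.

```r
# First two subjects from the wide listing above, typed in by hand
wide_toy <- data.frame(id = 1:2, cesd = c(49, 30),
                       cesd1 = c(7, 11), cesd2 = c(NA_real_, NA_real_),
                       cesd3 = c(8, NA), cesd4 = c(5, NA))

# Build one narrow data frame per time point, then stack them
pieces <- lapply(1:4, function(k) {
  data.frame(id = wide_toy$id,
             cesd = wide_toy$cesd,
             cesd_tv = wide_toy[[paste0("cesd", k)]],
             time = k)
})
long_toy <- do.call(rbind, pieces)
long_toy[long_toy$id == 1, ]
```

Because every piece is built with the same column names, no renaming step is needed before rbind().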

More interesting is to use the tools provided by dplyr and tidyr.

library(tidyr)
gather(wide, key=names, value=cesd_tv, cesd1,cesd2,cesd3,cesd4) %>%
mutate(time = as.numeric(substr(names,5,5))) %>%
arrange(id,time) -> long

head(long)

  id cesd names cesd_tv time
1  1   49 cesd1       7    1
2  1   49 cesd2      NA    2
3  1   49 cesd3       8    3
4  1   49 cesd4       5    4
5  2   30 cesd1      11    1
6  2   30 cesd2      NA    2

The gather() function takes a data frame (the first argument) and returns new columns named in the key and value parameters. The contents of the columns are the names (in the key) and the values (in the value) of the former columns listed. The result is a new data frame with one row for each listed column, for every row in the original data frame. Any columns not named are repeated in the output data frame. The mutate() function is like the base R function transform() but has some additional features and may be faster in some settings. Finally, the arrange() function is a much more convenient sorting facility than is available in standard R. The input is a data frame and a list of columns to sort by, and the output is a sorted data frame. This saves us having to select out a subject to display.

The %>% operator is a "pipe" or "chain" operator that may be familiar if you're a *nix user. It feeds the result of the last function into the next function as the first argument. This can cut down on the use of nested parentheses and may make reading R code easier for some folks. The effect of the piping is that the mutate() function should be read as taking the result of the gather() as its input data frame, and sending its output data frame into the arrange() function. For Ken, the right assignment arrow (-> long) makes sense as a way to finish off this set of piping rules, but Nick and many R users would prefer to write this as long = gather... or long <- gather.. , etc.

SAS
In SAS, we'll make the narrow data set using the keep statement in the data step, demonstrating meanwhile the convenient colon operator, which performs the same function provided by starts_with() in dplyr.

data all;
set "c:/book/help.sas7bdat";
run;

data wide;
set all;
keep id cesd:;
run;

The simpler way to make the desired data set is with the transpose procedure. Here the by statement forces the variables listed in that statement not to be transposed. The notsorted option saves us having to actually sort by those variables. Otherwise the procedure works like gather(): each transposed variable becomes a row in the output data set for every observation in the input data set. SAS uses standard variable names for gather()'s key (SAS: _NAME_) and value (SAS: COL1), though these can be changed.

proc transpose data = wide out = long_a;
by notsorted id notsorted cesd;
run;

data long;
set long_a;
time = substr(_name_, 5);
rename col1=cesd_tv;
run;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv; 
run;

 Obs    ID    time    CESD    cesd_tv

   1     1     1       49         7
   2     1     2       49         .
   3     1     3       49         8
   4     1     4       49         5

As with R, it's trivial, though somewhat cumbersome, to generate this effect using basic coding.

data long;
set wide;
time = 1; cesd_tv = cesd1; output;
time = 2; cesd_tv = cesd2; output;
time = 3; cesd_tv = cesd3; output;
time = 4; cesd_tv = cesd4; output;
run;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv; 
run;

 Obs    ID    time    CESD    cesd_tv

   1     1     1       49         7
   2     1     2       49         .
   3     1     3       49         8
   4     1     4       49         5

Example 2014.11: Contrasts the basic way for R

2014-09-30T11:00:00.000-04:00

As we discuss in section 6.1.4 of the second edition, R and SAS handle categorical variables and their parameterization in models quite differently. SAS treats them on a procedure-by-procedure basis, which leads to some odd differences in capabilities and default parameterizations. For example, in the logistic procedure, the default is effect cell coding, while in the genmod procedure-- which also fits logistic regression-- the default is reference cell coding. Meanwhile, many procedures can only accommodate reference cell coding.

In R, in contrast, categorical variables can be designated as "factors" and the parameterization stored as an attribute of the factor.

In section 6.1.4, we demonstrate how the parameterization of a factor can be easily changed on the fly, in R, in lm(), glm(), and aov(), using the contrasts= option in those functions. Here we show how to set the attribute more generally, for use in functions that don't accept the option. This post was inspired by a question from Julia Kuder, of Brigham and Women's Hospital.

SAS
We begin by simulating censored survival data as in Example 7.30. We'll also export the data to use in R.

data simcox;
  beta1 = 2;
  lambdat = 0.002; *baseline hazard;
  lambdac = 0.004; *censoring hazard;
  do i = 1 to 10000;
    x1 = rantbl(0, .25, .25,.25);
    linpred = exp(-beta1*(x1 eq 4));
    t = rand("WEIBULL", 1, lambdaT * linpred);
    * time of event;
    c = rand("WEIBULL", 1, lambdaC);
           * time of censoring;
    time = min(t, c);    * which came first?;
    censored = (c lt t);
    output;
  end;
run;

proc export data=simcox replace
outfile="c:/temp/simcox.csv"
dbms=csv;
run;

Now we'll fit the data in SAS, using effect coding.

proc phreg data=simcox;
class x1 (param=effect);
model time*censored(0)= x1 ;
run;

We reproduce the rather unexciting results here for comparison with R.

                     Parameter     Standard     
 Parameter     DF     Estimate        Error 

 x1        1    1     -0.02698      0.03471
 x1        2    1     -0.01211      0.03437
 x1        3    1     -0.05940      0.03458

R
In R we read the data in, then use the C() function to assign the contr.sum contrast to a version of the x1 variable that we save as a factor. Once that is done, we can fit the proportional hazards regression with the desired contrast.

library(survival)
simcox<- read.csv("c:/temp/simcox.csv")
sc2 = transform(simcox, x1.eff = C(as.factor(x1), contr.sum(4)))
effmodel <- coxph(Surv(time, censored)~ x1.eff,data= sc2)
summary(effmodel)

We excerpt the relevant output to demonstrate equivalence with SAS.

            coef exp(coef) se(coef)  
x1.eff1 -0.02698   0.97339  0.03471  
x1.eff2 -0.01211   0.98797  0.03437  
x1.eff3 -0.05940   0.94233  0.03458
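
As a side note, the stored coding can be inspected directly. The sketch below uses a made-up four-level factor and the base R replacement function contrasts<-, which is an alternative to the C() call used above. It assumes R's default contrast options are in effect.

```r
# A four-level factor standing in for x1 (made-up values)
x1 <- factor(c(1, 2, 3, 4, 4, 2, 1, 3))

# Under the default options, unordered factors get treatment
# (reference cell) coding
contrasts(x1)

# Store effect (sum-to-zero) coding as an attribute of the factor itself
contrasts(x1) <- contr.sum(nlevels(x1))
contrasts(x1)

# Modeling functions pick up the stored coding without a contrasts= argument;
# in the design matrix the last level is coded -1 in every contrast column
mm <- model.matrix(~ x1)
mm
```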

Example 2014.10: Panel by a continuous variable

2014-08-18T10:30:00.000-04:00

In Example 8.40, side-by-side histograms, we showed how to generate histograms for some continuous variable, for each level of a categorical variable in a data set. An anonymous reader asked how we would do this if both the variables were continuous. Keep the questions coming!

SAS
The SAS solution we presented relied on the sgpanel procedure. There, the panelby statement names a variable for which each distinct value will generate a panel. If there are many values, for example for a continuous variable, there will be many panels generated, which is probably not the desired result. As far as we know, there is no option to automatically categorize a continuous panel variable in proc sgpanel. If this is required, a two-step approach will be needed to first make groups of one of the variables.

We do that below using proc rank. In this approach, the groups option is the number of groups required and the ranks statement names a new variable to hold the group indicator. Once the groups are made, the same code demonstrated earlier can be used. (This is an example of "it's never too late to learn"-- I used to do this via a sort and a data step with implied variables, until I realized that there had to be a way to do it via a procedure. --KK)

In this setting, the panels are another approach to the data we examine in a scatterplot. As an example, we show the physical component score (PCS) by grouping of the mental component score (MCS) in the HELP data set.

proc rank data = 'c:\book\help.sas7bdat' groups = 6 out = catmcs;
var mcs;
ranks mcs_sextile;
run;

title "Histograms of PCS by sextile of MCS";
proc sgpanel data = catmcs;
  panelby mcs_sextile / columns = 3 rows =2;
  histogram pcs;
run;

We also demonstrate the columns and rows options to the panelby statement, which allow control over the presentation of the panel results. The graphic produced is shown above.

R
Our R solution in the earlier entry used the lattice package (written by Deepayan Sarkar) to plot a formula such as histogram(~a | b). A simple substitution of a continuous covariate b into that syntax will also generate a panel for each distinct value of the covariate: a factor is expected. In the package, an implementation of Trellis graphics, the term "shingles" is used to approach the notion of categorizing a continuous variable for making panels. The function equal.count() is provided to make the (possibly overlapping) categories of the variable, and uses the panel headers to suggest the ranges of the continuous covariate that are included in each panel.

ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(lattice)
histogram(~ pcs | equal.count(mcs), 
   main="Histograms of PCS by shingle of MCS",
   index.cond=list(c(4,5,6,1,2,3)),data=ds)

Note that the default ordering of panels in lattice is left to right, bottom to top. The index.cond option here re-orders the panels to go from left to right, top to bottom.

The default behavior of equal.count() is to allow some overlap between the categories, which is a little odd. In addition, there is a good deal of visual imprecision in the method used to identify the panels-- there's no key given, and the only indicator of the shingle value is the shading of the title bars. A more precise method would be to use the quantile() function manually, as we demonstrated in example 8.7, the Hosmer and Lemeshow goodness-of-fit test. We show here how the mutate() function in Hadley Wickham's dplyr package can be used to add a new variable to a data frame.

require(dplyr)

ds = mutate(ds, cutmcs = cut(ds$mcs, 
   breaks = quantile(ds$mcs, probs=seq(0,1, 1/6)), include.lowest=TRUE))
histogram(~ pcs | cutmcs,  main="Histograms of PCS by sextile of MCS",
          index.cond=list(c(4,5,6,1,2,3)), data=ds)

This shows the exact values of the bin ranges in the panel titles, surely a better use of that space. Minor differences in the histograms are due to the overlapping categories included in the previous version.
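
To convince yourself that the cut(..., breaks=quantile(...)) construction gives non-overlapping groups of equal size, it can help to try it on data where the answer is known. The vector score below is made up for that purpose.

```r
# Made-up scores: 60 distinct values, so each sextile should hold 10 of them
score <- 1:60

breaks <- quantile(score, probs = seq(0, 1, length.out = 7))
sextile <- cut(score, breaks = breaks, include.lowest = TRUE)

# Non-overlapping groups, labeled with their ranges
table(sextile)
```

The include.lowest=TRUE option matters here: without it, the smallest observation falls outside the first interval and is set to NA.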

Finally, we also show the approach one might use with the ggplot2 package, an implementation of Leland Wilkinson's Grammar of Graphics, coded by Hadley Wickham. The package includes the useful cut_number() function, which does something similar to the cut(..., breaks=quantile(...)) construction we showed above. In ggplot2, "facets" are analogous to the shingles used in lattice.

library(ggplot2)
ds = mutate(ds, cutmcsgg = cut_number(ds$mcs, n=6))
ggplot(ds, aes(pcs)) + geom_histogram() +
  facet_wrap(~cutmcsgg) + ggtitle("Histograms of PCS by sextile of MCS")

Roughly, we can read the syntax to state: 1) make a plot from the ds dataset in which the primary analytic variable will be pcs; 2) make histograms; 3) make facets of the cutmcsgg variable; 4) add a title. Since the syntax is a little unusual, Hadley provides the qplot() function, a wrapper which operates more like traditional functions. An identical plot to the above can be generated with qplot() as follows:

qplot(data=ds,x=pcs, geom="histogram", facets= ~cutmcsgg, 
   main="Histograms of PCS by sextile of MCS")

Example 2014.9: Rolling averages. Also: Second Edition is shipping!

2014-08-11T14:30:00.000-04:00

As of today, the second edition of "SAS and R: Data Management, Statistical Analysis, and Graphics" is shipping from CRC Press, Amazon, and other booksellers. There are lots of additional examples from this blog, new organization, and other features we hope you'll find useful. Thanks for your support. We'll be continuing to blog.

Now, on to today's main course.

For cyclical data, it's sometimes useful to generate rolling averages-- the average of some number of recent measurements, usually one full cycle. For example, for retail sales, one might want the rolling average of the most recent week. The rolling average will dampen the effects of repeated patterns but still show the location of the data.

In keeping with our habit of plotting personal data (e.g., Example 8.11, Example 8.12, Example 10.1, Example 10.2), I'll use my own weight recorded over the past 6 months. After reading about "alternate day dieting" in The Atlantic, I decided to try the diet described in the book by Varady. I've never really tried to diet for weight loss before, but this diet has worked really well for me over the past six months. The basics are that you eat 500 calories every other day (diet days) and on the non-diet days you eat what you want. There's a little science supporting the approach. I can't really recommend the book, unfortunately, unless you're a fan of the self-help style.

As you can imagine, one's weight tends to fluctuate pretty wildly between diet days and non-diet days. The cycle is just two days, but to get a sense of my weight at any given time, it might be best to use the rolling average of the past, say, four days.

The beginning of the data, available from http://www.amherst.edu/~nhorton/sasr2/datasets/weight.txt, follows.

1/11/14 219
1/12/14 NA
1/13/14 219
1/14/14 NA
1/15/14 221.8
1/16/14 218
...

R
As you can tell from the NAs, I compiled the data with the intent to read it into R.

> weights = read.table("c:/temp/weight.txt")
> head(weights)

       V1    V2
1 1/11/14 219.0
2 1/12/14    NA
3 1/13/14 219.0
4 1/14/14    NA
5 1/15/14 221.8
6 1/16/14 218.0

Note, though, that the date values are just character strings (read in as a factor variable), and not so useful as read in.

> str(weights)
'data.frame': 161 obs. of  2 variables:
 $ V1: Factor w/ 161 levels "1/11/14","1/12/14",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ V2: num  219 NA 219 NA 222 ...

The lubridate package contributed by the invaluable Hadley Wickham contains functions to make it easier to use dates in R. Here, I use its mdy() function to convert character values into R dates.

library(lubridate)
with(weights, plot(V2 ~ mdy(V1), 
  xlim = c(mdy("1/1/14"),mdy("6/30/14")),
  ylab="Weight", xlab="Date"))
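
If lubridate is not available, base R can do the same conversion, at the cost of spelling out the layout of the date string. A minimal sketch, using three date strings in the same month/day/two-digit-year form as the file:

```r
# Base R alternative to mdy(): describe the month/day/two-digit-year layout
dates <- as.Date(c("1/11/14", "1/12/14", "6/30/14"), format = "%m/%d/%y")
dates

# Once converted, date arithmetic works as expected
diff(range(dates))
```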

The simple plot has enough values that you can clearly see the trend of weight loss over time, and perhaps the rolling average exercise is somewhat misplaced, here. To calculate the rolling average, I adapted (below) the lag function from section 2.2.18 (2nd edition; 1.4.17 in the 1st ed.)-- this is a simpler version that does not check for errors. The result of lag(x,k) is a vector with the first k values missing and with the remaining values being the beginning values of x. Thus the ith value of lag(x,k) is x[i-k]. To get the rolling average, I just take the mean of several lags. Here I use the rowMeans() function to do it for all the values at once. The lines() function adds the rolling average to the plot.

lag = function(x,k) {
  return( c(rep(NA,k), x[1:(length(x)-k)]) )
}

y = weights$V2
ra = rowMeans(
  matrix(c(y,lag(y,1),lag(y,2),lag(y,3)),ncol=4,byrow=F),
    na.rm=T)

lines(mdy(weights$V1),ra)

The final plot is shown above. Note that the initial values of the lagged vector are missing, as are weights for several dates throughout this period. The na.rm=T option causes rowMeans() to return the mean of the observed values-- equivalent to a single imputation of the mean of the observed values, which perhaps Nick will allow me in this setting (note from NH: I don't have major issues with this). There are also two periods where I failed to record weights for four days running. For these periods, rowMeans() returns NaN, or "Not a Number". This is usefully converted to regions in the plot where the running average line is not plotted. Compare, for instance, with the default SAS behavior shown below. For the record, I was ill in early May and had little appetite regardless of my dieting schedule.
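
The behavior with missing values is easiest to see on a short made-up series. In this sketch the helper is named lag_k (our name) so that it does not mask stats::lag(), and the vector y is invented to include a gap of four consecutive missing days.

```r
# Same idea as the lag() function above, under a different name
lag_k <- function(x, k) c(rep(NA, k), head(x, -k))

# Made-up weights with four missing days in a row (positions 4 to 7)
y <- c(10, NA, 12, NA, NA, NA, NA, 20)

# Mean of the current value and the three previous ones, ignoring NAs
ra <- rowMeans(cbind(y, lag_k(y, 1), lag_k(y, 2), lag_k(y, 3)), na.rm = TRUE)
ra
```

Only position 7 has a window with no observed values at all, so it is the only NaN; everywhere else the mean is taken over whatever was recorded in the four-day window.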

SAS
The data can be easily read with the input statement. The mmddyy7. informat tells SAS that the data in the first field are as many as 7 characters long and should be read as dates. SAS will store them as SAS dates (section 2.4 in the 2nd edition; 1.6 in the 1st edition). As the data are read in, I use the lagk functions (section 2.2.18 2nd edition; 1.4.17 in the 1st ed.) to recall the values from recent days and calculate the rolling average as I go.

data weights;
infile "c:\temp\weight.txt";
input date mmddyy7. weight;
ra = mean(weight,lag(weight), lag2(weight), lag3(weight));
run;

Note that the input statement expects the weight values to be numbers, and interprets the NAs in the data as "Invalid data". It inserts missing values into the data set, which is what we desire. The mean function provides the mean of the non-missing values. When the weight and all of the lagged values of weight are missing, it will return a missing value. With the rolling average in hand, I can plot the observed weights and the rolling average. To print readable calendar dates rather than the underlying numeric SAS date values, use the format statement to tell SAS that the date variable should be printed using the date. format.

symbol1 i = none v=dot c = blue;
symbol2 i = j v = none c = black w=5;
proc gplot data = weights;
plot (weight ra)*date /overlay;
format date date.;
run;

The results are shown below. The main difference from the R plot is that the gaps in my recording do not appear in the line. The SAS symbol statement, the equivalent of the lines() function, more or less, does not encounter NaNs, but only missing values, and so it connects the points. I think R's behavior is more appropriate here-- there's no particular reason to suppose a linear interpolation between the observed data points is best, and so the line ought to be missing.

Example 2014.8: Estimate power for an interaction, by simulation

2014-06-30T11:30:00.000-04:00

In our last entry, we demonstrated how to simulate data from a logistic regression with an interaction between a dichotomous and a continuous covariate. In this entry we show how to use the simulation to estimate the power to detect that interaction. This is a simple, elegant, and powerful idea: simply simulate data under the alternative, and count the proportion of times the null is rejected. This is an estimate of power. If we lack infinite time to simulate data sets, we can also generate confidence intervals for the proportion.

R
In R, extending the previous example is almost trivially easy. The coef() function, applied to a glm summary object, returns an array with the parameter estimate, standard error, test statistic, and p-value. In one statement, we can extract the p-value for the interaction and return an indicator of a rejected null hypothesis. (This line is commented on below.) Then the routine is wrapped as a trivial function.

logist_inter = function() {
  c = rep(0:1,each=50)  # sample size is 100
  x = rnorm(100)
  lp = -3 + 2*c*x
  link_lp = exp(lp)/(1 + exp(lp))
  y = (runif(100) < link_lp) 

  log.int = glm(y~as.factor(c)*x, family=binomial)
  reject = ifelse( coef(summary(log.int))[4,4] < .05, 1, 0)
      # The coef() function above gets the parameter estimates; the [4,4] 
      # element is the p-value for the interaction.
  return(reject)
}

Running the function many times is also trivial, using the replicate() function.

pow1 = replicate(100, logist_inter())

The result is an array of 1s and 0s. To get the estimated power and confidence limits, we use the binom.test() function.

binom.test(sum(pow1), 100)

The test gives a p-value against the null hypothesis that the probability of rejection is 0.5, which is not interesting. The interesting part is at the end.

95 percent confidence interval:
 0.3219855 0.5228808
sample estimates:
probability of success 
                  0.42
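
Since we only care about the estimate and its confidence limits, they can be pulled out of the test object directly. A short sketch, plugging in the 42 rejections out of 100 from the run shown above:

```r
# 42 rejections in 100 simulated data sets, as in the output above
bt <- binom.test(42, 100)

bt$estimate   # estimated power
bt$conf.int   # exact (Clopper-Pearson) 95% confidence limits
```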

It would be simple to adjust this code to allow a change in the number of subjects or of the effect sizes, etc.
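
One way such an adjustment might look is sketched below. The function sim_reject() and its argument names are ours, the seed is arbitrary, and rbinom() is swapped in for the runif() comparison (the two are equivalent ways to draw the outcome). The interaction p-value is looked up by name rather than by position.

```r
# Hypothetical generalization of logist_inter(); argument names are ours
sim_reject <- function(n_per_group = 50, beta_int = 2, intercept = -3,
                       alpha = 0.05) {
  g <- rep(0:1, each = n_per_group)
  x <- rnorm(2 * n_per_group)
  p <- plogis(intercept + beta_int * g * x)  # inverse logit of the linear predictor
  y <- rbinom(2 * n_per_group, size = 1, prob = p)

  # Warnings about fitted probabilities of 0 or 1 are silenced for the sketch
  fit <- suppressWarnings(glm(y ~ factor(g) * x, family = binomial))
  pval <- coef(summary(fit))["factor(g)1:x", "Pr(>|z|)"]
  as.numeric(pval < alpha)
}

set.seed(2014)  # arbitrary seed
rejections <- replicate(200, sim_reject(n_per_group = 100))
mean(rejections)                                          # estimated power
binom.test(sum(rejections), length(rejections))$conf.int  # 95% limits
```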

SAS
In SAS, generating the data is no trouble, but evaluating the power programmatically requires several relatively cumbersome steps. To generate multiple data sets, we include the data generation loop from the previous entry within another loop. (Note that the number of observations has also been reduced vs. the previous entry.)

data test;
do ds = 1 to 100;  /* 100 data sets */
  do i = 1 to 100; /* 100 obs/data set */
    c = (i gt 50);
    x = normal(0);
    lp = -3 + 2*c*x;
    link_lp = exp(lp)/(1 + exp(lp));
    y = (uniform(0) lt  link_lp); 
    output;
  end;
end;
run;

Then we fit all of the models at once, using the by statement. Here, the ODS system suppresses voluminous output and is also used to capture the needed results in a single data set. The name of the piece of output that holds the parameter estimates (parameterestimates) can be found with the ods trace on statement.

ods select none;
ods output parameterestimates= int_ests;
proc logistic data = test ;
  by ds;
  class c (param = ref desc);
  model y(event='1') = x|c;
run;
ods exclude none;

The univariate procedure can be used to count the number of times the null hypothesis of no interaction would be rejected. To do this, we use the loccount option to request a table of location counts, and the mu0 option to specify that the location of interest is 0.05. As above, since our goal is to use the count programmatically, we also extract the result into a data set. If you're following along at home, it's probably worth your while to print out some of this data to see what it looks like.

ods output locationcounts=int_power;
proc univariate data = int_ests loccount mu0=.05;
  where variable = "x*c";
  var probchisq;
run;

For example, while the locationcounts data set reports the number of observations above and below 0.05, it also reports the number not equal to 0.05. This is not so useful, and we need to exclude this row from the next step. We do that with a where statement. Then proc freq gives us the proportion and (95%) confidence limits we need, using the binomial option to get the confidence limits and the weight statement to convey the fact that the count variable represents the number of observations.

proc freq data = int_power;
  where count ne "Num Obs ^= Mu0";
  tables count / binomial;
  weight value;
run;

Finally, we find our results:

                        Binomial Proportion
                       Count = Num Obs < Mu0

                  Proportion                0.4000
                  ASE                       0.0490
                  95% Lower Conf Limit      0.3040
                  95% Upper Conf Limit      0.4960

                  Exact Conf Limits
                  95% Lower Conf Limit      0.3033
                  95% Upper Conf Limit      0.5028

We estimate our power at only 40%, with a confidence interval of (30%, 50%). This agrees closely enough with R: we don't need to narrow the interval to know that we'll need a larger sample size.


Example 2014.7: Simulate logistic regression with an interaction

2014-06-24T11:30:00.000-04:00

Reader Annisa Mike asked in a comment on an early post about power calculation for logistic regression with an interaction.

This is a topic that has come up with increasing frequency in grant proposals and article submissions. We'll begin by showing how to simulate data with the interaction, and in our next post we'll show how to assess power to detect the interaction using simulation.

As in our earlier post, our method is to construct the linear predictor and the link function separately. This should help to clarify the roles of the parameter values and the simulated data.

SAS
In keeping with Annisa Mike's question, we'll simulate the interaction between a categorical and a continuous covariate. We'll make the categorical covariate dichotomous and the continuous one normal. To keep things simple, we'll leave the main effects null-- that is, the effect of the continuous covariate when the dichotomous one is 0 is also 0.

data test;
do i = 1 to 1000;
    c = (i gt 500);
    x = normal(0);
    lp = -3 + 2*c*x;
    link_lp = exp(lp)/(1 + exp(lp));
    y = (uniform(0) lt link_lp); 
 output;
end;
run;

In proc logistic, unlike many other procedures, the default parameterization for categorical predictors is effect cell coding. This can lead to unexpected and confusing results. To get reference cell coding, use the syntax for the class statement shown below. This is similar to the default result for the glm procedure. If you need identical behavior to the glm procedure, use param = glm. The desc option re-orders the categories to use the smallest value as the reference category.

proc logistic data = test plots(only)=effect(clband);
class c (param = ref desc);
model y(event='1') = x|c;
run;

The plots(only)=effect(clband) construction in the proc logistic statement generates the plot shown above. If c=0, the probability y=1 is small for any value of x, and a slope of 0 for x is tenable. If c=1, the probability y=1 increases as x increases and nears 1 for large values of x.

The parameters estimated from the data show good fidelity to the selected values, though this is merely good fortune-- we'd expect the estimates to often be more different than this.

                              Standard         Wald
Parameter     DF   Estimate      Error   Chi-Square   Pr > ChiSq

Intercept      1    -3.0258     0.2168     194.7353       <.0001
x              1    -0.2618     0.2106       1.5459       0.2137
c         1    1    -0.0134     0.3387       0.0016       0.9685
x*c       1    1     2.0328     0.3168      41.1850       <.0001

R
As sometimes occurs, the R code resembles the SAS code. Creating the data, in fact, is quite similar. The main differences are in the names of the functions that generate the randomness and the vectorized syntax that avoids the looping of the SAS datastep.

c = rep(0:1,each=500)
x = rnorm(1000)
lp = -3 + 2*c*x
link_lp = exp(lp)/(1 + exp(lp))
y = (runif(1000) < link_lp)
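
The inverse-logit step can also be written with R's built-in plogis() function, and the Bernoulli draw with rbinom(). What follows is only a sketch under the same simulation design; the seed and the names p_manual, p_plogis, and y_alt are ours, chosen for illustration.

```r
set.seed(2014)                      # arbitrary seed, only for reproducibility
c = rep(0:1, each = 500)
x = rnorm(1000)
lp = -3 + 2*c*x                     # linear predictor on the logit scale

p_manual = exp(lp)/(1 + exp(lp))    # inverse logit written out, as in the post
p_plogis = plogis(lp)               # the same transformation, built in

# 0/1 outcome drawn directly, instead of comparing a uniform to the probability
y_alt = rbinom(1000, size = 1, prob = p_plogis)
all.equal(p_manual, p_plogis)
```

Either form of the outcome can be passed to glm() with family=binomial.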

We fit the logistic regression with the glm() function, and examine the parameter estimates.

log.int = glm(y~as.factor(c)*x, family=binomial)
summary(log.int)

Here, the estimate for the interaction term is further from the selected value than we lucked into with the SAS simulation, but the truth is well within any reasonable confidence limit.

                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.3057     0.2443 -13.530  < 2e-16 ***
as.factor(c)1     0.4102     0.3502   1.171    0.241    
x                 0.2259     0.2560   0.883    0.377    
as.factor(c)1:x   1.7339     0.3507   4.944 7.66e-07 ***

A simple plot of the predicted values can be made fairly easily. The predicted probabilities of y=1 reside in the fitted model object as log.int$fitted.values. We can color them according to the values of the categorical predictor by defining a color vector and then choosing a value from the vector for each observation. The resulting plot is shown below. If we wanted confidence bands as in the SAS example, we could get standard errors for the (logit scale) predicted values using the predict() function with the se.fit option.

mycols = c("red","blue")
plot(log.int$fitted.values ~ x, col=mycols[c+1])
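
One way the confidence bands might be built with predict() and se.fit is sketched here. It re-simulates the data so that it runs on its own; the seed, the grid of x values, the factor name cf, and the normal-approximation multiplier 1.96 are our assumptions, not part of the original example.

```r
set.seed(2014)                      # arbitrary seed
c = rep(0:1, each = 500)
x = rnorm(1000)
y = runif(1000) < plogis(-3 + 2*c*x)
cf = as.factor(c)
log.int = glm(y ~ cf*x, family = binomial)

# predictions on a grid of x values for each level of the factor
grid = expand.grid(x = seq(-3, 3, by = 0.1), cf = levels(cf))
pr = predict(log.int, newdata = grid, type = "link", se.fit = TRUE)

# approximate 95% limits on the logit scale, then back-transformed
grid$fit   = plogis(pr$fit)
grid$lower = plogis(pr$fit - 1.96*pr$se.fit)
grid$upper = plogis(pr$fit + 1.96*pr$se.fit)

mycols = c("red", "blue")
plot(fit ~ x, data = grid, type = "n", ylim = c(0, 1),
     ylab = "Predicted probability")
for (k in 1:2) {
  g = grid[grid$cf == levels(cf)[k], ]
  lines(g$x, g$fit, col = mycols[k], lwd = 2)      # fitted curve
  lines(g$x, g$lower, col = mycols[k], lty = 2)    # lower band
  lines(g$x, g$upper, col = mycols[k], lty = 2)    # upper band
}
```

Building the limits on the logit scale and transforming afterwards keeps the bands inside (0, 1).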

Example 2014.6: Comparing medians and the Wilcoxon rank-sum test

2014-06-12T09:00:00.000-04:00

A colleague recently contacted us with the following question: "My outcome is skewed-- how can I compare medians across multiple categories?"

What they were asking for was a generalization of the Wilcoxon rank-sum test (also known as the Mann-Whitney-Wilcoxon test, among other monikers) to more than two groups. For the record, the answer is the Kruskal-Wallis test, which generalizes the Wilcoxon rank-sum test in the same way that one-way ANOVA generalizes the t-test.

But this question is based on a false premise: that the Wilcoxon rank-sum test is used to compare medians. The premise is based on a misunderstanding of the null hypothesis of the test. The actual null hypothesis is that there is a 50% probability that a random value from one population exceeds a random value from the other population. The practical value of this is hard to see, and thus in many places, including textbooks, the null hypothesis is presented as "the two populations have equal medians". The actual null hypothesis can be expressed as the latter median hypothesis, but only under the additional assumption that the shapes of the distributions are identical in each group.

In other words, our interpretation of the test as comparing the medians of the distributions requires the location-shift-only alternative to be the case. Since this is rarely true, and never assessed, we should probably use extreme caution in using, and especially in interpreting, the Wilcoxon rank-sum test.

To demonstrate this issue, we present a simple simulation, showing a case of two samples with equal medians but very different shapes. In one group, the values are exponential with mean = 1 and therefore median log 2; in the other they are normal with mean and median = log 2 and variance = 1. We generate 10,000 observations per group and show that the Wilcoxon rank-sum test rejects the null hypothesis. If we interpret the null incorrectly as applying to the medians, we will be misled. If our interest actually centered on the medians for some reason, an appropriate test that would not be sensitive to the shape of the distribution could be found in a quantile regression. Another, of course, would be the median test. We show that the quantile regression does not reject the null hypothesis of equal medians, even with the large sample size.

We leave aside the deeper questions of whether comparing medians is a useful substitute for comparing means, or whether means should not be compared when distributions are skewed.

R
In R, we simulate two separate vectors of data, then feed them directly to the wilcox.test() function (section 2.4.2).

y1 = rexp(10000)
y2 = rnorm(10000) + log(2)

wilcox.test(y1,y2)

This shows a very small p-value, denoting the fact not that the medians are unequal but that one or the other of these distributions generally has larger values. Hint: it might be the one with the long tail and no negative values.

 Wilcoxon rank sum test with continuity correction

data:  y1 and y2
W = 55318328, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
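
The W statistic ties directly to the actual null hypothesis: divided by the number of (y1, y2) pairs, it estimates the probability that a random value from the first group exceeds one from the second. With the W reported above, 55318328/(10000*10000) is about 0.553 rather than 0.5. The sketch below repeats the calculation on freshly simulated data; the seed and the names n1, n2, wt, and phat are ours.

```r
set.seed(2014)                      # arbitrary seed
n1 = 10000; n2 = 10000
y1 = rexp(n1)
y2 = rnorm(n2) + log(2)

wt = wilcox.test(y1, y2)
# W counts the (y1, y2) pairs in which the y1 value is the larger one
phat = unname(wt$statistic)/(n1*n2)
phat

# a rough check from randomly formed pairs
mean(sample(y1, 1e5, replace = TRUE) > sample(y2, 1e5, replace = TRUE))
```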

To set up the quantile regression, we put the observations into a single vector and make a group indicator vector. Then we load the quantreg package and use the rq() function (section 4.4.4) to fit the median regression.

y = c(y1,y2)
c = rep(1:2, each = 10000)

library(quantreg)
summary(rq(y~as.factor(c)))

This shows we would fail to reject the null of equal medians.

Call: rq(formula = y ~ as.factor(c))

tau: [1] 0.5

Coefficients:
              Value    Std. Error t value  Pr(>|t|)
(Intercept)    0.68840  0.01067   64.53047  0.00000
as.factor(c)2 -0.00167  0.01692   -0.09844  0.92159
Warning message:
In rq.fit.br(c, y, tau = tau, ...) : Solution may be nonunique
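
The median test mentioned earlier can be sketched with base R alone: classify each observation as above or not above the pooled median, then test whether that classification is independent of group. This is one common form of the test, written by hand rather than taken from a package; the seed and the names grp, above, tab, and mt are ours.

```r
set.seed(2014)                      # arbitrary seed
y1 = rexp(10000)
y2 = rnorm(10000) + log(2)
y = c(y1, y2)
grp = rep(1:2, each = 10000)

# classify each observation as above or not above the pooled median
above = y > median(y)
tab = table(group = grp, above = above)
tab

# chi-square test of independence between group and above/below status
mt = chisq.test(tab)
mt$p.value
```

Because both distributions have median log 2, about half of each group should fall above the pooled median.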

SAS
In SAS, we make a data set with a group indicator and use it to generate data conditionally.

data wtest;
do i = 1 to 20000;
  c = (i gt 10000);
  if c eq 0 then y = ranexp(0);
    else y = normal(0) + log(2);
  output;
  end;
run;

The Wilcoxon rank-sum test is in proc npar1way (section 2.4.2).

proc npar1way data = wtest wilcoxon;
class c;
var y;
run;

As with R, we find a very small p-value.

                       The NPAR1WAY Procedure

             Wilcoxon Scores (Rank Sums) for Variable y
                      Classified by Variable c

                     Sum of     Expected      Std Dev         Mean
  c          N       Scores     Under H0     Under H0        Score

  0      10000    106061416    100005000   408258.497   10606.1416
  1      10000     93948584    100005000   408258.497    9394.8584


                      Wilcoxon Two-Sample Test

                Statistic             106061416.0000

                Normal Approximation
                Z                            14.8348
                One-Sided Pr >  Z             <.0001
                Two-Sided Pr > |Z|            <.0001

                t Approximation
                One-Sided Pr >  Z             <.0001
                Two-Sided Pr > |Z|            <.0001

             Z includes a continuity correction of 0.5.


                        Kruskal-Wallis Test

                Chi-Square                  220.0700
                DF                                 1
                Pr > Chi-Square               <.0001

The median regression is in proc quantreg (section 4.4.4). As in R, we fail to reject the null of equal medians.

proc quantreg data =wtest;
class c;
model y = c;
run;

                       The QUANTREG Procedure

                  Quantile and Objective Function

              Predicted Value at Mean          0.6909


                        Parameter Estimates

                            Standard    95% Confidence
    Parameter   DF Estimate    Error        Limits       t Value

    Intercept    1   0.6909   0.0125    0.6663    0.7155   55.06
    c         0  1   0.0165   0.0160   -0.0148    0.0479    1.03
    c         1  0   0.0000   0.0000    0.0000    0.0000     .

                        Parameter Estimates

                        Parameter   Pr > |t|

                        Intercept     <.0001
                        c         0   0.3016
                        c         1    .

Example 2014.5: Simple mean imputation

2014-04-25T10:22:00.000-04:00

We're both users of multiple imputation for missing data. We believe it is the most practical principled method for incorporating the most information into data analysis. In fact, one of our more successful collaborations is a review of software for multiple imputation.

But, for me at least, there are times when a simpler form of imputation may be useful. For example, it may be desirable to calculate the mean of the observed values and substitute it for any missing values. Typically it would be unwise to attempt to use a data set completed in this way for formal inference, but it could be convenient under deadline pressure or for a very informal overview of the data.

Nick disagrees. He finds it hard to imagine any setting in which he would ever use such a primitive approach. He passes on to the reader the sage advice he received in graduate school: that making up data in such an ad-hoc fashion might be construed as dereliction or even misconduct. Use of single imputation approaches (which yield bias in many settings and attenuate estimates of variance) seems hard to justify in 2014. But one of the hallmarks of our partnership is that we can agree to disagree on an absolute ban, while jointly advising the reader to proceed with great caution.

SAS
In SAS, it would be possible to approach this using proc means to find the means and then add them back into the data set in a data step. But there is a simpler way, using proc standard.

proc standard data=indata out=outdata replace; 
run;

This will replace all missing values of the numeric variables in the indata data set with the mean of the observed values for that variable, and save the result in a new data set, outdata. To restrict the operation to specific variables, use a var statement.

R
There may be a function designed to do this in R, but it's simple enough using the features of the language. We provide an option using the bracket ([) extractor operator and another using the ifelse() function. The latter may be more approachable for those less familiar with R.

df = data.frame(x = 1:20, y = c(1:10,rep(NA,10)))
df$y[is.na(df$y)] = mean(df$y, na.rm=TRUE)

# alternative

df = transform(df, y = ifelse(is.na(y), mean(y, na.rm=TRUE), y))

In the first example, we identify the elements of y that are NA and replace them with the mean of the observed values. In the second, we test each element of y; if it is NA, we replace it with the mean, otherwise we keep the original value.
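
To mimic the proc standard behavior of treating every numeric variable at once, the bracket approach can be wrapped in a small helper and applied column by column. This is a sketch only: impute_mean is a hypothetical helper name, and the example data frame (with an added character column g) is made up for illustration.

```r
df = data.frame(x = c(1:19, NA), y = c(1:10, rep(NA, 10)),
                g = rep(c("a", "b"), 10))

# hypothetical helper: replace NAs in one numeric vector by the observed mean
impute_mean = function(v) {
  v[is.na(v)] = mean(v, na.rm = TRUE)
  v
}

num = sapply(df, is.numeric)            # which columns are numeric?
df[num] = lapply(df[num], impute_mean)  # non-numeric columns are left alone
```

The same cautions about inference from singly imputed data apply here as well.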

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Example 2014.4: Hilbert Matrix

2014-04-14T09:22:00.000-04:00

Rick Wicklin showed how to make a Hilbert matrix in SAS/IML. Rick has a nice discussion of these matrices and why they might be interesting; the value of H_{r,c} is 1/(r+c-1). We show how to make this matrix in the data step and in R. We also show that Rick's method for displaying fractions in SAS/IML works in PROC PRINT, and how they can be displayed in R.

SAS
In the SAS data step, we'll use an array to make the columns of the matrix and do loops to put values into each cell in a row and output the row into the data set before incrementing the row value. This is a little awkward, but does at least preserve the simple 1/(r+c-1) function of the cell values. We set the dimension with a global macro variable to make the approach more general.

%let n = 5;
data h;
array val [&n] v1 - v&n;
  do r = 1 to &n;
    do c = 1 to &n;
      val[c] = 1/(r+c-1);
   end;
    output;
  end;
run;

To print the resulting matrix, we use the fract format, as Rick demonstrated. Pretty nice! The noobs option in the proc print statement suppresses the row number that would otherwise be shown.

proc print data = h noobs; 
var v1 - v&n;
format _all_ fract.; 
run;

 v1    v2     v3     v4     v5

  1    1/2    1/3    1/4    1/5
1/2    1/3    1/4    1/5    1/6
1/3    1/4    1/5    1/6    1/7
1/4    1/5    1/6    1/7    1/8
1/5    1/6    1/7    1/8    1/9

R
As is so often the case in R, a solution can be generated in one line using nesting. Also as usual, though, it's a bit unnatural, and we don't deconstruct it here.

n = 5
1/sapply(1:n,function(x) x:(x+n-1))

A more straightforward approach follows. It's certainly less efficient, though efficiency would seem a non-issue in any application of this code. It's also the kind of code that R aficionados would scoff at. It does have the attractiveness of perfect clarity, however. We begin by defining an empty matrix, then simply loop through the cells of the matrix, assigning values one by one.

n=5
h1 = matrix(nrow=n,ncol=n)
for (r in 1:n) {
  for (c in 1:n)
    h1[r,c] = 1/(r+c-1)
}
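
A middle ground between the one-liner and the explicit loop is the outer() function, which applies a function to every (row, column) pair and so keeps the 1/(r+c-1) form visible. The sketch below compares it with the two versions above; the name h2 is ours.

```r
n = 5
# outer() evaluates the function over all combinations of r and c
h2 = 1/outer(1:n, 1:n, function(r, c) r + c - 1)

# the loop version, for comparison
h1 = matrix(nrow = n, ncol = n)
for (r in 1:n) {
  for (c in 1:n)
    h1[r, c] = 1/(r + c - 1)
}

all.equal(h1, h2)
all.equal(h2, 1/sapply(1:n, function(x) x:(x + n - 1)))
```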

To display the fractions, we use the fractions() function in the MASS package that's distributed with R.

> library(MASS)
> fractions(h1)
     [,1] [,2] [,3] [,4] [,5]
[1,]   1  1/2  1/3  1/4  1/5 
[2,] 1/2  1/3  1/4  1/5  1/6 
[3,] 1/3  1/4  1/5  1/6  1/7 
[4,] 1/4  1/5  1/6  1/7  1/8 
[5,] 1/5  1/6  1/7  1/8  1/9

Example 2014.3: Allow different variances by group

2014-02-27T11:27:00.000-05:00

One common violation of the assumptions needed for linear regression is heteroscedasticity by group membership. Both SAS and R can easily accommodate this setting.

Our data today comes from a real example of vitamin D supplementation of milk. Four suppliers claimed that their milk provided 100 IU of vitamin D. The null hypothesis is that they all deliver this accurately, but there was some question about whether the variance was the same between the suppliers. Unfortunately, there are only four observations for each supplier.

SAS
In SAS, we'll read the data in with a simple input statement.

data milk;
input value milk_cat;
cards;
77               1
85               1
91               1
88               1
93               2
101              2
126              2
103              2
103              3
88               3
109              3
85               3
95               4
83               4
91               4
86               4
;;
run;

To fit the model, we'll use the group option to the repeated statement in proc mixed. This is specifically designed to allow differing values for groups sharing the same covariance structure. In this case, it's a simple structure: no correlation, constant value on the diagonal. The key pieces of output are selected out using ODS. To assess variance terms, it's best to use maximum likelihood, rather than REML fitting.

ods select covparms lrt;
proc mixed data = milk method = ml;
class milk_cat;
model value = milk_cat/solution;
repeated/group=milk_cat type = simple;
run;

  Covariance Parameter Estimates

Cov Parm     Group         Estimate

Residual     milk_cat 1     27.1875
Residual     milk_cat 2      150.69
Residual     milk_cat 3      100.69
Residual     milk_cat 4     21.1875


  Null Model Likelihood Ratio Test

    DF    Chi-Square      Pr > ChiSq

     3          5.13          0.1623

There's not too much evidence to support different variances, but the power is likely quite small, so we'll retain this model when assessing the null hypothesis of equal means. A REML fit ought to be preferable here, but to agree with the R results, we fit again with ML. Note that proc mixed is not smart enough to accurately determine the degrees of freedom remaining (16 observations - 4 variance parameters - 4 mean parameters) so we need to manually specify the denominator degrees of freedom using the ddf option to the model statement.

ods select solutionf tests3;
proc mixed data = milk method = ml;
class milk_cat;
model value = milk_cat/solution ddf=8;
repeated/group=milk_cat type = simple;
run;

                   Solution for Fixed Effects

                               Standard
Effect     milk_cat  Estimate     Error    DF  t Value  Pr > |t|

Intercept             88.7500    2.3015    12    38.56    <.0001
milk_cat   1          -3.5000    3.4776     8    -1.01    0.3437
milk_cat   2          17.0000    6.5551     8     2.59    0.0319
milk_cat   3           7.5000    5.5199     8     1.36    0.2113
milk_cat   4                0         .     .      .       .


        Type 3 Tests of Fixed Effects

              Num     Den
Effect         DF      DF    F Value    Pr > F

milk_cat        3       8       3.86    0.0563

So there's some reason to suspect that the suppliers may be different.

R
In R, we'll re-type this simple data set and assign the group labels manually.

value = c(77,85,91,88,93,101,126,103,103,88,109,85,95,83,91,86)
mc = as.factor(rep(1:4, each=4))
milk= data.frame(value, mc)

To fit the model with unequal variances, we'll use the gls() function in the nlme library. (Credit department: Ben Bolker discusses this and more complex models in an Rpubs document.) (Note that we use maximum likelihood, as we did in SAS.)

library(nlme)
mod = gls(value~mc, data=milk, weights = varIdent(form = ~1|mc), 
   method="ML")

To assess the hypothesis of equal variances, we'll compare to the homoscedastic model using the anova() function.

mod2 = gls(value~mc, data=milk, method="ML")
anova(mod,mod2)

     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod      1  8 125.3396 131.5203 -54.66981                        
mod2     2  5 124.4725 128.3355 -57.23625 1 vs 2 5.132877  0.1623

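This result is identical to the Null Model Likelihood Ratio Test from SAS: both are likelihood ratio tests comparing the model with group-specific variances to the model with a single residual variance.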
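
The likelihood ratio statistic can also be computed by hand from the two log-likelihoods, which makes the arithmetic behind the anova() table explicit. The sketch refits both models so that it runs on its own; the names lrt and pval are ours, and the last line is a check on the ML variance estimates by group.

```r
library(nlme)
value = c(77,85,91,88,93,101,126,103,103,88,109,85,95,83,91,86)
mc = as.factor(rep(1:4, each = 4))
milk = data.frame(value, mc)

mod  = gls(value ~ mc, data = milk, weights = varIdent(form = ~1|mc),
           method = "ML")
mod2 = gls(value ~ mc, data = milk, method = "ML")

# likelihood ratio statistic: twice the difference in log-likelihoods
lrt = as.numeric(2*(logLik(mod) - logLik(mod2)))
# 8 - 5 = 3 extra variance parameters in the heteroscedastic model
pval = pchisq(lrt, df = 3, lower.tail = FALSE)
c(lrt = lrt, p = pval)

# ML variance estimates by group (divisor n, not n - 1)
tapply(milk$value, milk$mc, function(v) mean((v - mean(v))^2))
```

The group variances should match the covariance parameter estimates in the SAS output, for example 27.1875 for the first supplier.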

To assess whether the suppliers are different, we need to compare to the model with just an intercept. When using REML for both models, there was a warning message and a different answer than SAS (using REML in SAS as well). So we'll stick with maximum likelihood here.

mod3  = gls(value~1,data=milk, weights = varIdent(form = ~1|mc), method="ML")
anova(mod3,mod)

With ML in both programs we get essentially the same p-value (0.0563 from the F test in SAS, 0.0564 from the likelihood ratio test in R).

     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod3     1  5 126.8858 130.7488 -58.44292                        
mod      2  8 125.3396 131.5203 -54.66981 1 vs 2 7.546204  0.0564

Example 2014.2: Block randomization

2014-01-22T11:20:00.000-05:00

This week I had to block-randomize some units. This is ordinarily the sort of thing I would do in SAS, just because it would be faster for me. But I had already started work on the project in R, using knitr/LaTeX to make a PDF, so it made sense to continue the work in R.

R
As is my standard practice now in both languages, I set things up to make it easy to create a function later. I do this by creating variables with the ingredients to begin with, then call them as variables, rather than as values, in my code. In the example, I assume 40 assignments are required, with a block size of 6.
I generate the blocks themselves with the rep() function, calculating the number of blocks needed to ensure at least N items will be generated. Then I make a data frame with the block numbers and a random variate, as well as the original order of the envelopes. The only possibly confusing part of the sequence is the use of the order() function. What it returns is a vector of integer values with the row numbers of the original data set sorted by the listed variables. So the expression a1[order(a1$block,a1$rand),] translates to "from the a1 data frame, give me the rows ordered by sorting the rand variable within the block variable, and all columns." I assign the arms in a systematic way to the randomly ordered units, then re-sort them back into their original order.

seed=42
blocksize = 6
N = 40
set.seed(seed)
block = rep(1:ceiling(N/blocksize), each = blocksize)
a1 = data.frame(block, rand=runif(length(block)), envelope= 1: length(block))
a2 = a1[order(a1$block,a1$rand),]
a2$arm = rep(c("Arm 1", "Arm 2"),times = length(block)/2)
assign = a2[order(a2$envelope),]

> head(assign,12)
   block       rand envelope   arm
1      1 0.76450776        1 Arm 1
2      1 0.62361346        2 Arm 2
3      1 0.14844661        3 Arm 2
4      1 0.08026447        4 Arm 1
5      1 0.46406955        5 Arm 1
6      1 0.77936816        6 Arm 2
7      2 0.73352796        7 Arm 2
8      2 0.81723044        8 Arm 1
9      2 0.17016248        9 Arm 2
10     2 0.94472033       10 Arm 2
11     2 0.29362384       11 Arm 1
12     2 0.14907205       12 Arm 1

It's trivial to convert this to a function-- all I have to do is omit the lines where I assign values to the seed, sample size, and block size, and make the same names into parameters of the function.

blockrand = function(seed,blocksize,N){
  set.seed(seed)
  block = rep(1:ceiling(N/blocksize), each = blocksize)
  a1 = data.frame(block, rand=runif(length(block)), envelope= 1: length(block))
  a2 = a1[order(a1$block,a1$rand),]
  a2$arm = rep(c("Arm 1", "Arm 2"),times = length(block)/2)
  assign = a2[order(a2$envelope),]
  return(assign)
}
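
A quick usage check of the function follows, as a sketch. It confirms that complete blocks are generated (so more than N envelopes can result) and that each block is balanced. The alternating assignment of arms balances within blocks only when the block size is even, which we assume here; the name res is ours.

```r
blockrand = function(seed, blocksize, N){
  set.seed(seed)
  block = rep(1:ceiling(N/blocksize), each = blocksize)
  a1 = data.frame(block, rand = runif(length(block)),
                  envelope = 1:length(block))
  a2 = a1[order(a1$block, a1$rand),]
  a2$arm = rep(c("Arm 1", "Arm 2"), times = length(block)/2)
  assign = a2[order(a2$envelope),]
  return(assign)
}

res = blockrand(seed = 42, blocksize = 6, N = 40)
nrow(res)                    # 7 complete blocks of 6, so 42 envelopes
table(res$block, res$arm)    # each block should split 3 and 3
```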

SAS
This job is also pretty simple in SAS. I use the do loop, twice, to produce the blocks and items (or units) within block, assign the arm systematically, and generate the random variate which will provide the sort order within block. Then I sort on the random order within block, and use the "Obs" (observation number) that's printed with the data as the envelope number.

%let N = 40;
%let blocksize = 6;
%let seed = 42;
data blocks;
call streaminit(&seed);
do block = 1 to ceil(&N/&blocksize);
  do item = 1 to &blocksize;
     if item le &blocksize/2 then arm="Arm 1";
    else arm="Arm 2";
     rand = rand('UNIFORM');
  output;
  end;
end;
run;

proc sort data = blocks; by block rand; run;

proc print data = blocks (obs = 12) obs="Envelope"; run;

Envelope    block    item     arm       rand

    1         1        3     Arm 1    0.13661
    2         1        1     Arm 1    0.51339
    3         1        5     Arm 2    0.72828
    4         1        2     Arm 1    0.74696
    5         1        4     Arm 2    0.75284
    6         1        6     Arm 2    0.90095
    7         2        2     Arm 1    0.04539
    8         2        6     Arm 2    0.15949
    9         2        4     Arm 2    0.21871
   10         2        1     Arm 1    0.66036
   11         2        5     Arm 2    0.85673
   12         2        3     Arm 1    0.98189

It's also fairly trivial to make this into a macro in SAS.

%macro blockrand(N, blocksize, seed); 
data blocks;
call streaminit(&seed);
do block = 1 to ceil(&N/&blocksize);
  do item = 1 to &blocksize;
     if item le &blocksize/2 then arm="Arm 1";
    else arm="Arm 2";
     rand = rand('UNIFORM');
  output;
  end;
end;
run;

proc sort data = blocks; by block rand; run;
%mend blockrand;

Example 2014.1: "Power" for a binomial probability, plus: News!

2014-01-14T14:06:00.000-05:00

Hello, folks! I'm pleased to report that Nick and I have turned in the manuscript for the second edition of SAS and R: Data Management, Statistical Analysis, and Graphics. It should be available this summer. New material includes some of our more popular blog posts, plus reproducible analysis, RStudio, and more.

To celebrate, here's a new example. Parenthetically, I was fortunate to be able to present my course: R Boot Camp for SAS users at Boston University last week. One attendee cornered me after the course. She said: "Ken, R looks great, but you use SAS for all your real work, don't you?" Today's example might help a SAS diehard to see why it might be helpful to know R.

OK, the example: A colleague contacted me with a typical "5-minute" question. She needed to write a convincing power calculation for the sensitivity-- the probability that a test returns a positive result when the disease is present, for a fixed number of cases with the disease. I don't know how well this has been explored in the peer-reviewed literature, but I suggested the following process:
1. Guess at the true underlying sensitivity
2. Name a lower bound (less than the truth) which we would like the observed CI to exclude
3. Use basic probability results to report the probability of exclusion, marginally across the unknown number of observed positive tests.

This is not actually a power calculation, of course, but it provides some information about the kinds of statements that it's likely to be possible to make.

R

In R, this is almost trivial. We can get the probability of observing x positive tests simply, using the dbinom() function applied to a vector of numerators and the fixed denominator. Finding the confidence limits is a little trickier. Well, finding them is easy, using lapply() on binom.test(), but extracting them requires using sapply() on the results from lapply(). Then it's trivial to generate a logical vector indicating whether the value we want to exclude is in the CI or not. Our desired result is the sum of the probabilities of those numbers of positive tests for which the CI includes this value.

> truesense = .9
> exclude = .6
> npos = 20
> probobs = dbinom(0:npos,npos,truesense)
> cis = t(sapply(lapply(0:npos,binom.test, n=npos), 
               function(bt) return(bt$conf.int)))
> included = cis[,1] < exclude & cis[,2] > exclude
> myprob = sum(probobs*included)
> myprob
[1] 0.1329533

(Note that I calculated the inclusion probability, not the exclusion probability.)

Of course, the real beauty and power of R is how simple it is to turn this into a function:

> probinc = function(truesense, exclude, npos) {
  probobs = dbinom(0:npos,npos,truesense)
  cis = t(sapply(lapply(0:npos,binom.test, n=npos), 
               function(bt) return(bt$conf.int)))
   included = cis[,1] < exclude & cis[,2] > exclude
   return(sum(probobs*included))
}
> probinc(.9,.6,20)
[1] 0.1329533
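
With the function in hand, the complementary exclusion probability can be tabulated over several candidate numbers of cases. This is a sketch only: the sample sizes in ns are arbitrary choices of ours, and "exclusion" here means the CI misses the target value on either side.

```r
probinc = function(truesense, exclude, npos) {
  probobs = dbinom(0:npos, npos, truesense)
  cis = t(sapply(lapply(0:npos, binom.test, n = npos),
                 function(bt) return(bt$conf.int)))
  included = cis[,1] < exclude & cis[,2] > exclude
  return(sum(probobs*included))
}

# probability that the CI excludes the lower bound, for several sample sizes
ns = c(10, 20, 30, 40, 50)
probexc = 1 - sapply(ns, function(n) probinc(.9, .6, n))
data.frame(npos = ns, probexc = round(probexc, 3))
```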

SAS

My SAS process took about 4 times as long to write.
I begin by making a data set with a variable recording both the number of events (positive tests) and non-events (false negatives) for each possible value. These serve as weights in the proc freq I use to generate the confidence limits.

%let truesense = .9;
%let exclude = .6;
%let npos = 20;

data rej;
do i = 1 to &npos;
  w = i; event = 1; output;
  w = &npos - i; event = 0; output;
  end;
run;

ods output binomialprop = rej2;
proc freq data = rej;
by i;
tables event /binomial(level='1');
weight w;
run;

Note that I repeat the proc freq for each number of events using the by statement. After saving the results with the ODS system, I have to use proc transpose to make a table with one row for each number of positive tests-- before this, every statistic in the output has its own row.

proc transpose data = rej2 out = rej3;
where name1 eq "XL_BIN" or name1 eq "XU_BIN";
by i;
id name1;
var nvalue1;
run;

In my fourth data set, I can find the probability of observing each number of events and multiply this with my logical test of whether the CI included my target value or not. But here there is another twist. The proc freq approach won't generate a CI for both the situation where there are 0 positive tests and the setting where all are positive in the same run. My solution to this was to omit the case with 0 positives from my do loop above, but now I need to account for that possibility. Here I use the end= option to the set statement to figure out when I've reached the case with all positive (sensitivity = 1). Then I can use symmetry to find the confidence limits for the case with 0 events: they are 1 minus the limits for the all-positive case, in reverse order. Then I'm finally ready to sum up the probabilities associated with the number of positive tests where the CI includes the target value.

data rej4;
set rej3 end = eof;
prob = pdf('BINOMIAL',i,&truesense,&npos);
prob_include = prob * ((xl_bin < &exclude) and (xu_bin > &exclude));
output;
if eof then do;
   prob = pdf('BINOMIAL',0,&truesense,&npos);
   prob_include = prob * (((1 - xu_bin) < &exclude) and ((1 - xl_bin) > &exclude));
   output;
   end;
run;

proc means data = rej4 sum;
var prob_include;
run;

Elegance is a subjective thing, I suppose, but to my eye, the R solution is simple and graceful, while the SAS solution is rather awkward. And I didn't even make a macro out of it yet!

Example 10.8: The upper 95% CI is 3.69

2012-12-10T10:08:00.000-05:00

Apologies for the long and unannounced break-- the longest since we started blogging, three and a half years ago. I was writing a 2-day course for SAS users to learn R. Contact me if you're interested. And Nick and I are beginning work on the second edition of our book-- look for it in the fall. Please let us know if you have ideas about what we omitted last time or would otherwise like to see added. In the mean time, we'll keep blogging, though likely at a reduced rate.

Today: what can you say about the probability of an event if the observed number of events is 0? It turns out that the upper 95% CI for the probability is 3.69/N. There's a sweet little paper with some rationale for this, but it's in my other office. And I couldn't recall the precise value-- so I used SAS and R to demonstrate it to myself.

R

The R code is remarkably concise. After generating some Ns, we write a little function to perform the test and extract the (exact) upper 95% confidence limit. This is facilitated by the "..." notation, which passes along unused arguments to functions. Then we use apply() to call the new function for each N, passing the numerator 0 each time. Note that apply() needs a matrix argument, so the simple vector of Ns is converted to a matrix before use. [The sapply() function will accept a vector input, but took about 8 times as long to run.] Finally, we plot the upper limit * N against N, showing the asymptote. A log scaled x-axis is useful here, and is achieved with the log='x' option. (Section 5.3.12.) The result is shown above.

bin.m = seq(10, 10000, by=5)
mybt = function(...) { binom.test(...)$conf.int[2] }
uci = apply(as.matrix(bin.m), 1, mybt, x=0)
plot(y=bin.m * uci, x=bin.m, ylim=c(0,4), type="l", 
     lwd=5, col="red", cex=5, log='x',  
     ylab="Exact upper CI", xlab="Sample size", 
     main="Upper CI when there are 0 cases observed")
abline(h=3.69)
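
The asymptote also has a closed form. When 0 events are observed, the exact (Clopper-Pearson) upper limit is 1 - (alpha/2)^(1/N), and N times this tends to -log(alpha/2), which is about 3.689 for alpha = 0.05. The sketch below checks the closed form against binom.test() for a few values of N that we chose arbitrarily; the names exact and closed are ours.

```r
alpha = 0.05
N = c(10, 100, 1000, 10000)

exact  = sapply(N, function(n) binom.test(0, n)$conf.int[2])
closed = 1 - (alpha/2)^(1/N)       # upper limit when x = 0

cbind(N, N_times_exact = N*exact, N_times_closed = N*closed)
-log(alpha/2)                      # limiting value, about 3.689
```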

SAS

In SAS, the data, really just the N and a numerator of 0, are generated in a data step. The CI are found using the binomial option in the proc freq tables statement and saved using the output statement. Note that the weight statement is used here to avoid having a row for each Bernoulli trial.

data binm;
do n = 10 to 10000 by 5;
  x=0;
  output;
  end;
run;

ods select none;
proc freq data=binm;
by n;
weight n;
tables x / binomial;
output out=bp binomial;
run;
ods select all;

To calculate the upper limit*N, another data step is needed-- note that in this setting SAS reports limits for the probability of the observed level, x = 0, which every observation shares, so the upper limit for the event probability is 1 minus the reported lower limit, as shown below. The log scale x-axis is obtained with the logbase option to the axis statement. (Section 5.3.12.) The result is shown below.

data uci;
set bp;
limit = (1-xl_bin) * n;
run;

axis1 order = (0 to 4 by 1);
axis2 logbase=10 logstyle=expand;
symbol1 i = j v = none c = red w=5 l=1;
proc gplot data=uci;
plot limit * n / vref=3.69 vaxis=axis1 haxis=axis2;
label n="Sample size" limit="Exact upper CI";
run;
quit;

It's clear that the upper 95% limit on the number of successes asymptotes to about 3.69. Thus the upper limit on the binomial probability p is 3.69/N.
