Many univariate descriptive statistics are intuitive. However, weighted statistics are less intuitive. A weight variable changes the computation of a statistic by giving more weight to some observations than to others. This article shows how to compute and visualize weighted percentiles, also known as weighted quantiles, as computed by PROC MEANS and PROC UNIVARIATE in SAS. Recall that percentiles and quantiles are the same thing: the 100*p*^{th} percentile is equal to the *p*^{th} quantile.

I do not discuss survey data in this article. Survey statisticians use weights to make valid inferences from survey data, and you can see the SAS documentation to learn how to use weights to estimate variance in complex survey designs.


Before we calculate a weighted statistic, let's remember that a weight variable is not the same as a frequency variable. A frequency variable, which associates a positive integer with each observation, specifies that each observation is replicated a certain number of times. There is nothing unintuitive about the statistics that arise from including a frequency variable: they are the same statistics you would obtain by duplicating each record according to the value of the frequency variable.

Weights are not frequencies. Weights can be fractional values. When comparing weighted and unweighted analyses, the key idea is this: an unweighted analysis is equivalent to a weighted analysis for which the weights are all 1. An "unweighted analysis" is really a misnomer; it should be called an "equally weighted" analysis!

In the computational formulas that SAS uses for weighted percentiles, the weights are divided by the sum of the weights. Therefore only relative weights are important, and the formulas simplify if you choose weights that sum to 1. For the remainder of this article, assume that the weights sum to unity and that an unweighted analysis has weights equal to 1/*n*, where *n* is the number of observations.
To understand how weights change the computation of percentiles, let's review the standard unweighted computation of the empirical percentiles (or quantiles) of a set of *n* numbers. First, sort the data values from smallest to largest. Then construct the empirical cumulative distribution function (ECDF). Recall that the ECDF is a piecewise-constant step function that increases by 1/*n* at each data point. The quantity 1/*n* represents the fact that each observation is weighted equally in this analysis.

The quantile function is derived from the CDF function, and
the quantile function for a discrete distribution is also a step function.
You can use the graph of the ECDF to compute the quantiles. For example, suppose your data are

{ 1, 1.9, 2.2, 3, 3.7, 4.1, 5 }

The following graph shows the ECDF for these seven values:

The data values are indicated by tick marks along the horizontal axis. Notice that the ECDF jumps by 1/7 at each data value because there are seven unique values.

I should really show you the graph of the quantile function (an "inverse function" to the CDF), but you can visualize the graph of the quantile function if you rotate your head clockwise by 90 degrees. To find a quantile, start at some value on the Y axis, move across until you hit a vertical line, and then drop down to the X axis to find the datum value. For example, to find the 0.2 quantile (=20th percentile), start at Y=0.2 and move right horizontally until you hit the vertical line over the datum 1.9. Thus 1.9 is the 20th percentile. Similarly, the 0.6 quantile is the data value 3.7. (I omit details about what to do if you hit a horizontal line.)

Of course, SAS can speed up this process. The following call to PROC MEANS displays the 20th, 40th, 60th, and 80th percentiles:

```
data A;
input x wt;
datalines;
1    0.25
1.9  0.05
2.2  0.15
3.0  0.25
3.7  0.15
4.1  0.10
5    0.05
;

proc means data=A p20 p40 p60 p80;
   var x;                      /* unweighted analysis: data only */
run;
```
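The graphical read-off can also be mechanized. The following sketch (illustrative Python, not SAS internals; SAS supports several percentile definitions that differ in how ties and horizontal segments are handled, and this is the simplest) returns the first data value whose cumulative proportion reaches *p*:

```python
def ecdf_quantile(data, p):
    """First sorted value whose ECDF height reaches p.

    Mirrors reading the ECDF graph: go up the y axis to p, move
    right to the first vertical jump, and drop down to the x axis.
    """
    xs = sorted(data)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n >= p:          # the ECDF jumps by 1/n at each value
            return x
    return xs[-1]

data = [1, 1.9, 2.2, 3, 3.7, 4.1, 5]
print(ecdf_quantile(data, 0.2))   # 1.9  (the 20th percentile)
print(ecdf_quantile(data, 0.6))   # 3.7  (the 60th percentile)
```

The printed values match the percentiles read off the graph in the previous section.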

The previous section shows the relationship between percentile values and the graph of the ECDF. This section describes how the ECDF changes if you specify unequal weights for the data. The change is that the weighted ECDF jumps by the (standardized) weight at each data value. Because the weights sum to unity, the CDF is still a step function that rises from 0 to 1, but now the steps are not uniform in height. Instead, data that have relatively large weights produce large steps in the graph of the ECDF.

In the previous section, the DATA step defined a weight variable. The weights for this example are

{ 0.25, 0.05, 0.15, 0.25, 0.15, 0.10, 0.05 }

The following graph shows the weighted ECDF for these weights:

By using this weighted ECDF, you can read off the weighted quantiles. For example, to find the 0.2 weighted quantile, start at Y=0.2 and move horizontally until you hit a vertical line, which is over the datum 1.0. Thus 1.0 is the 20th weighted percentile. Similarly, the 0.6 quantile is the data value 3.0. You can confirm this by calling PROC MEANS with a WEIGHT statement, as shown below:

```
proc means data=A p20 p40 p60 p80;
   weight wt;                  /* weighted analysis */
   var x;
run;
```
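The weighted read-off can be mechanized the same way. This sketch (a hand-rolled Python illustration of the weighted-ECDF definition, not SAS's exact tie-handling) accumulates the standardized weights instead of 1/*n*:

```python
def weighted_quantile(data, weights, p):
    """First value, in sorted order, at which the cumulative
    standardized weight reaches p.  Weights are divided by their
    sum, so only relative weights matter."""
    total = sum(weights)
    cum = 0.0
    for x, w in sorted(zip(data, weights)):
        cum += w / total        # weighted ECDF jumps by w/total here
        if cum >= p:
            return x
    return max(data)

x  = [1.0, 1.9, 2.2, 3.0, 3.7, 4.1, 5.0]
wt = [0.25, 0.05, 0.15, 0.25, 0.15, 0.10, 0.05]
print(weighted_quantile(x, wt, 0.2))   # 1.0  (20th weighted percentile)
print(weighted_quantile(x, wt, 0.6))   # 3.0  (60th weighted percentile)
```

With all weights equal, the function reproduces the unweighted percentiles, which is the sense in which an "unweighted" analysis is an equally weighted one.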

You can use a physical model to intuitively understand weighted percentiles. The model is the same as I used to
visualize a weighted mean. Namely, imagine a point-mass of *w*_{i} concentrated at position *x*_{i} along a massless rod. Finding a weighted percentile *p* is equivalent to finding the first location along the rod (moving from left to right) at which the proportion of the weight is greater than *p*. (I omit how to handle special percentiles for which the proportion is equal to *p*.)

The physical model looks like the following:

From the figure you can see that *x*_{1} is the *p*^{th} quantile for *p* < 0.25. Similarly, *x*_{2} is the *p*^{th} quantile for 0.25 < *p* < 0.30, and so forth.

If you want to apply these concepts to your own data, you can download the SAS program that generates the CDF graphs and computes the weighted percentiles.

The post Weighted percentiles appeared first on The DO Loop.

**Please comment on the article here:** **The DO Loop**

The post Weighted percentiles appeared first on All About Statistics.


Programming languages use words and symbols to represent structures like blocks and conditions. A visual representation of these structures seems useful to keep track of all the different cases, see the scope of variables, etc. Nassi-Shneiderman diagrams offer just such a representation.

The structure of programs is sometimes shown using flow charts: decisions create branches, repetitions can be seen by arrows that point back, etc. Flow charts are an adequate representation of assembly language programs, but they're a poor fit for the structured, high-level programming languages almost everybody has been using for many decades now.

In a paper published in 1973, Isaac Nassi and Ben Shneiderman described the idea for a more structured visual approach. Their diagrams are sometimes called *structured flowcharts*, but are much more commonly known as *Nassi-Shneiderman diagrams.*

For some reason, they caught on much more in the German-speaking world than in the U.S. or elsewhere. I remember being taught the diagrams, and how they were much better suited for structured programming than flowcharts, in high school. As a teenager, I drew a good number of these diagrams of sorting algorithms (in particular the confusing *QuickSort*) to figure out how they worked.

Structured flowcharts have some interesting properties. They are a much better structural fit for high-level programming languages than standard flow charts. High-level languages like C, Pascal, Java, Python, etc. have constructs like loops, blocks that define scope, ifs with a defined else structure (and often cascading if-elseif-elseif-…else structures), switch/case statements, function calls (including recursion), etc.

Assembly has no such thing, all it knows are conditional and unconditional jumps. Those are nicely captured in flow charts, but whether a jump is part of a loop or a condition is only apparent when looking at the larger structure of the program. A Nassi-Shneiderman diagram immediately shows you what the high-level construct is.

By keeping the structure compact, structured flowcharts also help you see when you're missing cases. Granted, this isn't practical to do while debugging a complex program (though it could be if automated tools existed for that), but it's great for teaching and for beginning programmers to increase their understanding of what is going on (and reduce their frustration and random guessing when trying to fix things).

It's hard to find examples of Nassi-Shneiderman diagrams on the web. Here's a function printing out the Fibonacci sequence that I grabbed from an outdated COMP101 class website. It should be fairly easy to figure out what the different elements mean from the context.

There is a short but helpful Wikipedia page on Nassi-Shneiderman diagrams, and Ben Shneiderman has a little page up as well (though most of the links there are dead). An article about how to create these diagrams in Excel has some helpful illustrations as well. The original paper is also available in all its 1970s typewriter glory, hand-drawn illustrations and all.

**Please comment on the article here:** **eagereyes**

The post Nassi-Shneiderman Diagrams appeared first on All About Statistics.


I am not a lawyer (“IANAL” in web-speak); but even if I were, you should take this with a grain of salt (same way you take everything you hear from anyone). If you want the straight dope for U.S. law, see the U.S. government Copyright FAQ; it’s surprisingly clear for government legalese.

**What is copyrighted?**

Computer code and written material such as books, journals, and web pages, are subject to copyright law. Copyright is for the expression of an idea, not the idea itself. If you want to protect your ideas, you’ll need a patent (or to be good at keeping secrets).

**Who owns copyrighted material?**

In the U.S., copyright is automatically assigned to the author of any text or computer code. But if you want to sue someone for infringing your copyright, the government recommends registering the copyright. And most of the rest of the world respects U.S. copyright law.

Most employers require as part of their employment contract that copyright for works created by their employees be assigned to the employer. Although many people don’t know this, most universities require the assignment of copyright for code written by university research employees (including faculty and research scientists) to the university. Typically, universities allow the author to retain copyright for books, articles, tutorials, and other traditional written material. Web sites (especially with code) and syllabuses for courses are in a grey area.

The copyright holder may assign copyright to others. This is what authors do for non-open-access journals and books—they assign the copyright to the publisher. That means that even they may not be able to legally distribute copies of the work to other people; some journals allow crippled (non-official) versions of the works to be distributed. The National Institutes of Health require all research to be distributed openly, but they don’t require the official version to be so, so you can usually find two versions (pre-publication and official published version) of most work done under the auspices of the NIH.

**What protections does copyright give you?**

You can dictate who can use your work and for what. There are fair use exceptions, but I don’t understand the line between fair use and infringement (like other legal definitions, it’s all very fuzzy and subject to past and future court decisions).

**Licensing**

For others to be able to use copyrighted text or code legally, the copyrighted material must be explicitly licensed for such use by the copyright holder. Just saying “public domain” or “this is trivial” isn’t enough. Just saying “do whatever you want with it” is in a grey area again, because it’s not a recognized license and presumably that “whatever you want” doesn’t involve claiming copyright ownership. The actual copyright holder needs to explicitly license the material.

There is a frightening degree of non-conformance among open-source contributors, largely, I suspect, due to misunderstandings of the author’s employment contract and copyright law.

**Derived works**

Most of the complication from software licensing comes from so-called derived works. For example, I download open-source package A, then extend it to produce open-source package B that includes open-source package A. That’s why most licenses explicitly state what happens in these cases. The reason we don’t like the Gnu Public Licenses (GPL) is that they restrict derived works with copyleft (forcing package B to adopt the same license, or at best one that’s compatible). That’s why I insisted on the BSD license for Stan—it’s maximally open in terms of what it allows others to do with the code, and it’s compatible with GPL. R’s licensed under the GPL, which means projects built on R, such as RStan, must also be released under the GPL plus whatever license the project is released under (we just went GPL for RStan).

**Where does Stan stand?**

Columbia owns the copyright for all code written by Columbia research staff (research faculty, postdocs, and research scientists). It’s less clear (from our reading of the faculty handbook) who owns works created by Ph.D. students and teaching faculty. For non-Columbia contributions, the author (or their assignee) retains copyright for their contribution. The advantage of this distributed copyright is that ownership isn’t concentrated with one company or person; the disadvantage is that we’ll never be able to contact everyone to change licenses, etc.

The good news is that Columbia’s Tech Ventures office (the controller of software copyrights at Columbia), has given the Stan project a signed waiver that allows us to release all past and future work on Stan under open source licenses. They maintain the copyright, though, under our employment contracts (at least for the research faculty and research scientists).

For other contributors, we now require them to explicitly state who owns the copyrighted contribution and to agree that the copyright holder gives permission to license the material under the relevant license (BSD for most of Stan, GPL or MIT for some of the interfaces).

The other good news is that most universities and companies are coming around and allowing their employees to contribute to open-source projects. The Gnu Public License (GPL) is often an exception for companies, because they are afraid of its copyleft properties.

**C.Y.A.**

The Stan project is trying to cover our asses from being sued in the future by a putative copyright holder, though we don’t like having to deal with all this crap (pun intended).

Luckily, most universities these days seem to be opening up to open source (no, that wasn’t intended to continue the metaphor of the previous paragraph).

**But what about patents?**

Don’t get me started on software patents. Or patent trolls. Like copyrights, patents protect the owner of intellectual property against its illegal use by others. Unlike copyright, which is about the realization of an idea (such as a way of writing a recipe for chocolate chip cookies), patents are more abstract and are about the right to realize ideas (such as making a chocolate chip cookie in any fashion). If you need to remember one thing about patent law, it’s that a patent lets you stop others from using your patented technology—it doesn’t let you use it (your patent B may depend on some other patent A).

**Or trademarks?**

Like patents, trademarks prevent other people from (legally) using your intellectual property without your permission, such as building a knockoff logo or brand. The trademark itself can involve names, fonts, color schemes, etc. But trademarks tend to be limited to areas, so we could register a trademark for Stan (which we’re considering doing) without running afoul of the down-under Stan.

There are also unregistered trademarks, but I don’t know all the subtleties about what rights registered trademarks grant you over the unregistered ones. Hopefully, we’ll never be writing that little R in a circle above the Stan name, Stan^{®}; even if you do register a trademark, you don’t have to use that annoying mark, which is just there to remind people that the item in question is trademarked.

The post Who owns your code and text and who can use it legally? Copyright and licensing basics for open-source appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Who owns your code and text and who can use it legally? Copyright and licensing basics for open-source appeared first on All About Statistics.


If any of you are members of the Marketing Research Association, could you please contact them and ask them to change their position on this issue:

I have a feeling they won’t mind if you call them at home. With an autodialer. “Pollsters now must hand-dial cellphones, at great expense,” indeed. It’s that expensive to pay people to push a few buttons, huh?

Those creepy lobbyists are so creepy. Yeah, yeah, I know they’re part of the political process, but I don’t have to like them or their puppets in Congress.

The post Oooh, it burns me up appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Oooh, it burns me up appeared first on All About Statistics.


I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as one would hope (they in fact lack a lot of the monotone structure or submodular structure that would make things easy).

That being said, we have a lot of powerful and effective heuristics to discuss in upcoming articles. I am going to leave such positive results for my later articles and here concentrate on an instructive technical negative result: picking a good subset of variables is theoretically quite hard.

When we say something is “theoretically hard” we mean we can contrive examples of it that encode instances of other problems thought to be hard. Thus the ability to solve arbitrary instances of our problem would let us solve arbitrary instances of the problem thought to be hard. This is a technical statement and doesn’t mean we don’t know how to do a good job on the problem. It just means it would be incredibly noteworthy to claim efficient complete optimality in *all* possible cases.

Let `Z` denote the set of integers and `Q` denote the rational numbers. The problem we are considering is:

INSTANCE: An integer `T`, an integer `K`, and a data set `x(i), y(i)` with `x(i) in Z^n` and `y(i) in Z` for `i = 1, ..., m`.

QUESTION: Is there `B0 in Q` and `B in Q^n` such that `sum_{i=1,...,m} (B0 + B.x(i) - y(i))^2 ≤ T` and no more than `K` entries of `B` are non-zero?

Call this problem “size `K` regression model of quality `T`” (or “sKrT” for short). Phrasing sKrT as a decision problem is a mere technical detail; we consider answering whether there is a sKrT solution to be pretty much equivalent to finding such solutions. The input is taken to be integers for technical reasons, and one can approximate various real-number problems by scaling and rounding.

The hope is that sKrT makes precise the goal one hopes stepwise regression is approximating: finding a good model for `y` using only `K` of the `x` variables (at least on training data; there are also issues of multiple comparisons to consider).

What I would like to point out is: solving sKrT is at least as hard as NP. That is: if we could always answer sKrT question quickly and correctly we could solve arbitrary problems in the complexity class NP (itself thought to be difficult).

The quickest way to see that sKrT is likely hard is through the classic reference “Computers and Intractability: A Guide to the Theory of NP-Completeness”, Michael R. Garey, David S. Johnson, W. H. Freeman and Company, 1979. Their problem number MP5, “Minimum Weight Solution to Linear Equations”, would be easy to solve given the ability to quickly solve sKrT instances. Formally MP5 is defined as:

INSTANCE: Finite set `X` of pairs `(x, b)` where `x` is an `m`-tuple of integers and `b` is an integer, and a positive integer `K ≤ m`.

QUESTION: Is there an `m`-tuple `y` with rational entries such that `y` has at most `K` non-zero entries and such that `x . y = b` for all `(x, b) in X`?

The encoding is trivial. We encode an MP5 instance as an analogous sKrT instance with `T = 0` and one additional row of the form `(x(i) = 0 in Z^n, y(i) = 0 in Z)` (which pushes `B0` to `0`). It should be obvious that checking for a zero sum of squared error linear regression is at least as powerful as checking for solvability of linear equations.

This means an intuition that MP5 may be hard becomes an intuition that sKrT may be hard.
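To make the decision problem concrete, here is a brute-force sketch in Python (the function name `skrt` and the toy data are mine, not from any library). It answers the sKrT question by fitting least squares on every subset of at most `K` variables; that exhaustive search over subsets is exactly the exponential cost the hardness result suggests cannot, in general, be avoided:

```python
from itertools import combinations

import numpy as np

def skrt(X, y, K, T):
    """Decide the sKrT question by exhaustive search: is there an
    intercept B0 and a coefficient vector B with at most K non-zero
    entries whose sum of squared errors is <= T?

    Fits least squares on every subset of up to K columns, so the
    work grows like C(n,1) + ... + C(n,K) -- exponential in K.
    """
    m, n = X.shape
    for k in range(K + 1):
        for subset in combinations(range(n), k):
            # design matrix: intercept column plus the chosen variables
            A = np.column_stack([np.ones(m)] + [X[:, j] for j in subset])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            sse = float(np.sum((A @ coef - y) ** 2))
            if sse <= T + 1e-9:     # small tolerance for round-off
                return True
    return False

# Toy data: y depends on columns 0 and 2 only, so K=2 suffices.
X = np.array([[0, 1, 0],
              [1, 0, 0],
              [2, 1, 1],
              [3, 0, 1],
              [4, 1, 2],
              [5, 0, 2]], dtype=float)
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 1.0
print(skrt(X, y, K=2, T=0))   # True:  subset {0, 2} fits exactly
print(skrt(X, y, K=1, T=0))   # False: no single column reaches SSE 0
```

The heuristics discussed in the follow-up articles exist precisely because this exhaustive loop becomes infeasible as the number of variables grows.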

All a hardness result seems to prohibit is a “magic wand” approach that always returns perfect answers quickly (and it doesn’t actually prohibit it, but it means it would be very big news to find and certify such a magic wand). In many cases one can find approximately best solutions with high probability. Essentially this is a signal that in discussing variable selection it makes sense to consider heuristic and empirical results (trust methods that have tended to work well). In our follow-up articles we will discuss why you would want to prune down to `K` variables (speed up algorithms, cut down on over-fit, and more) and effective pruning techniques (though Nina Zumel has already shared useful notes here).

**Please comment on the article here:** **Statistics – Win-Vector Blog**

The post Variable pruning is NP hard appeared first on All About Statistics.


Mike Carniello writes:

This article in the NYT leads to the full text, in which these statements are buried (no pun intended):

What is the probability that two given texts were written by the same author? This was achieved by posing an alternative null hypothesis H0 (“both texts were written by the same author”) and attempting to reject it by conducting a relevant experiment. If its outcome was unlikely (P ≤ 0.2), we rejected the H0 and concluded that the documents were written by two individuals. Alternatively, if the occurrence of H0 was probable (P > 0.2), we remained agnostic.

See the footnote to this table:

Ahhh, so horrible. The larger research claims might be correct, I have no idea. But I hate to see such crude statistical ideas being used, it’s like using a pickaxe to dig for ancient pottery.

The post Better to just not see the sausage get made appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Better to just not see the sausage get made appeared first on All About Statistics.


I've just heard the very sad news that Richard Nixon has passed away this morning. I can't say I knew Richard very well, but I thought he really was a lovely guy and I am very saddened.

I knew of him (among other things) through his work on covariate adjustment in health economic evaluations, which I think was part of his PhD at the MRC Cambridge. I then got in contact with him more closely when I was thinking of organising the short course based on BMHE, since he and Chris were already doing something like that. I suggested we did the course together and he was very enthusiastic about it. In fact, when he was asked to teach a short course at the University of Alberta, he said the three of us should have a go, which we did. Then we taught the course at Bayes 2014, UCL and at a one-day workshop organised by the RSS. He fell ill just before the last edition of the course.

Tonight I have a very vivid memory of the time we were in Edmonton, having dinner after the first night of the course, when I told him that for some reason Italians usually get really cross about chicken on pizza, and how he used to tease me with that every time we met since, saying that he would love a pizza with chicken. And how we used to introduce ourselves to the audience $-$ and how sometimes people were too young to get the references. I'll miss you, Richard.

**Please comment on the article here:** **Gianluca Baio's blog**

The post Sad night appeared first on All About Statistics.


I got a book in the mail attached to some publicity material that began:

Over the last several years, a different kind of science book has found a home on consumer bookshelves. Anchored by meticulous research and impeccable credentials, these books bring hard science to bear on the daily lives of the lay reader; their authors—including Malcolm Gladwell . . .

OK, then.

The book might be ok, though. I wouldn’t judge it on its publicity material.

The post Letters we never finished reading appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Letters we never finished reading appeared first on All About Statistics.

The post How to create a free distributed data collection "app" with R and Google Sheets appeared first on All About Statistics.

Jenny Bryan, developer of the googlesheets R package, gave a talk at useR! 2015 about the package.

One of the things that got me most excited about the package was an example she gave in her talk of using the Google Sheets package for data collection at ultimate frisbee tournaments. One reason is that I used to play a little ultimate back in the day.

Another is that her idea is an amazing one for producing cool public health applications. One of the major issues with public health is being able to do distributed data collection cheaply, easily, and reproducibly. So I decided to write a little tutorial on how one could use Google Sheets and R to create a free distributed data collection “app” for public health (or anything else really). To follow along, you will need:

- A Google account and access to Google Sheets
- R and the googlesheets package.

What we are going to do is collect data in a Google Sheet or sheets. This sheet can be edited by anyone with the link using their computer or a mobile phone. Then we will use the `googlesheets` package to pull the data into R and analyze it.

After you have a Google account, the first thing to do is go to Google Sheets. I suggest bookmarking this page: https://docs.google.com/spreadsheets/u/0/, which skips the annoying splash screen.

Create a blank sheet and give it an appropriate title for whatever data you will be collecting.

Next, we need to make the sheet *public on the web* so that the *googlesheets* package can read it. This is different from the sharing settings you set with the big button on the right. To make the sheet public on the web, go to the “File” menu and select “Publish to the web…”. Like this:

Then it will ask you if you want to publish the sheet; just hit “Publish”.

Copy the link it gives you and you can use it to read in the Google Sheet. If you want to see all the Google Sheets you can read in, you can load the package and use the `gs_ls` function.

```
library(googlesheets)
sheets = gs_ls()
sheets[1,]
```

```
## # A tibble: 1 x 10
## sheet_title author perm version updated
## <chr> <chr> <chr> <chr> <time>
## 1 app_example jtleek rw new 2016-08-26 17:48:21
## # ... with 5 more variables: sheet_key <chr>, ws_feed <chr>,
## # alternate <chr>, self <chr>, alt_key <chr>
```

It will pop up a dialog asking you to authorize the `googlesheets` package to read from your Google Sheets account. Then you should see a list of spreadsheets you have created.

In my example I created a sheet called “app_example” so I can load the Google Sheet like this:

```
## Identifies the Google Sheet
example_sheet = gs_title("app_example")
```

```
## Sheet successfully identified: "app_example"
```

```
## Reads the data
dat = gs_read(example_sheet)
```

```
## Accessing worksheet titled 'Sheet1'.
```

```
## No encoding supplied: defaulting to UTF-8.
```

```
head(dat)
```

```
## # A tibble: 3 x 5
## who_collected at_work person time date
## <chr> <chr> <chr> <chr> <chr>
## 1 jeff no ingo 13:47 08/26/2016
## 2 jeff yes roger 13:47 08/26/2016
## 3 jeff yes brian 13:47 08/26/2016
```

In this case the data I’m collecting is about who is at work right now as I’m writing this post :). But you could collect whatever you want.

Now that you have the data published to the web, you can read it into R. Also, anyone with the link will be able to view the Google Sheet. But if you don’t change the sharing settings, you are the only one who can edit the sheet.

This is where you can make your data collection distributed if you want. If you go to the “Share” button and then click on “Advanced”, you will get a screen like this with some options.

*Private data collection*

In the example I’m using I haven’t changed the sharing settings, so while you can *see* the sheet, you can’t edit it. This is nice if you want to collect some data and allow other people to read it, but you don’t want them to edit it.

*Controlled distributed data collection*

If you just enter people’s emails, then you can open the data collection to just those individuals you have shared the sheet with. Be careful though: if they don’t have Google email addresses, they get a link which they could share with other people, and this could lead to open data collection.

*Uncontrolled distributed data collection*

Another option is to click on “Change” next to “Private - Only you can access”, then select “On - Anyone with the link”, which defaults to “Can view”.

You can then modify it to say “Can edit” and hit “Save”. Now anyone who has the link can edit the Google Sheet. This means that you can’t control who will be editing it (careful!), but you can really widely distribute the data collection.

Once you have distributed the link either to your collaborators or more widely it is time to collect data. This is where I think that the “app” part of this is so cool. You can edit the Google Sheet from a Desktop computer, but if you have the (free!) Google Sheets app for your phone then you can also edit the data on the go. There is even an offline mode if the internet connection isn’t available where you are working (more on this below).

One of the major issues with distributed data collection is quality control. If possible you want people to input data using (a) a controlled vocabulary/system and (b) the same controlled vocabulary/system. My suggestion here depends on whether you are using a controlled distributed system or an uncontrolled distributed system.

For the controlled distributed system you are specifically giving access to individual people - you can provide some training or a walk through before giving them access.

For the uncontrolled distributed system you should create a *very* detailed set of instructions. For example, for my sheet the instructions might look like:

- Every data point must have a label of who collected it in the `who_collected` column. You should pick a username that does not currently appear in the sheet and stick with it. Use all lower case for your username.
- You should report either “yes” or “no”, in lower case, in the `at_work` column.
- You should report the name of the person, in all lower case, in the `person` column. Search the sheet to make sure that the person you are reporting on doesn’t already appear before introducing a new name. If the name already exists, use the name spelled exactly as it appears in the sheet.
- You should report the `time` in the format hh:mm on a 24-hour clock in the eastern time zone of the United States.
- You should report the `date` in the mm/dd/yyyy format.

You could be much more detailed depending on the case.
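To make those instructions easier to enforce, you could check each new batch of data against them in R. The `check_data()` function below is just a sketch of my own for illustration; it is not part of the googlesheets package, and the exact checks are assumptions based on the instructions above.

```
## Sketch of a quality-control check for the collected data.
## Assumes the five columns described above; check_data() is a
## hypothetical helper, not part of the googlesheets package.
check_data = function(dat) {
  problems = character(0)
  if (!all(dat$at_work %in% c("yes", "no"))) {
    problems = c(problems, "at_work must be 'yes' or 'no'")
  }
  if (!all(grepl("^[0-9]{2}:[0-9]{2}$", dat$time))) {
    problems = c(problems, "time must be in hh:mm format")
  }
  if (!all(grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", dat$date))) {
    problems = c(problems, "date must be in mm/dd/yyyy format")
  }
  if (!all(dat$who_collected == tolower(dat$who_collected))) {
    problems = c(problems, "who_collected must be in lower case")
  }
  problems
}

## Run right after gs_read(); a length-zero result means all rows pass
check_data(dat)
```

Running something like this each time you pull the data gives you a quick way to catch entries that don’t follow the instructions.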

One of the cool things about Google Sheets is that they can even be edited without an internet connection. This is particularly useful if you are collecting data in places where internet connections may be spotty. But that may generate conflicts if you use only one sheet.

There may be different ways to handle this, but one I thought of is to just create one sheet for each person collecting data (if you are using controlled distributed data collection). Then each person only edits the data in their sheet, avoiding potential conflicts if multiple people are editing offline and non-synchronously.

Anyone with the link can now read the most up-to-date data with the following simple code.

```
## Identifies the Google Sheet
example_sheet = gs_url("https://docs.google.com/spreadsheets/d/177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o/pubhtml")
```

```
## Sheet-identifying info appears to be a browser URL.
## googlesheets will attempt to extract sheet key from the URL.
```

```
## Putative key: 177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o
```

```
## Sheet successfully identified: "app_example"
```

```
## Reads the data
dat = gs_read(example_sheet, ws="Sheet1")
```

```
## Accessing worksheet titled 'Sheet1'.
```

```
## No encoding supplied: defaulting to UTF-8.
```

```
dat
```

```
## # A tibble: 3 x 5
## who_collected at_work person time date
## <chr> <chr> <chr> <chr> <chr>
## 1 jeff no ingo 13:47 08/26/2016
## 2 jeff yes roger 13:47 08/26/2016
## 3 jeff yes brian 13:47 08/26/2016
```

Here the url is the one I got when I went to the “File” menu and clicked on “Publish to the web…”. The argument `ws` in the `gs_read()` command is the name of the worksheet. If you have multiple sheets assigned to different people, you can read them in one at a time and then merge them together.
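For example, if each collaborator has their own worksheet, the merge might look like the following sketch. The worksheet names here are hypothetical; this assumes you have already identified the sheet as `example_sheet` as above.

```
## Hypothetical example: one worksheet per collector, stacked into
## a single data set. Assumes worksheets named "jeff", "roger",
## and "brian" exist in example_sheet.
ws_names = c("jeff", "roger", "brian")
all_dat = do.call(rbind, lapply(ws_names, function(w) {
  gs_read(example_sheet, ws = w)
}))
```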

So that’s it; it’s pretty simple. But as I gear up to teach advanced data science here at Hopkins, I’m thinking a lot about Sean Taylor’s awesome post, Real scientists make their own data.

I think this approach is a super cool/super lightweight system for collecting data either on your own or as a team. As I said I think it could be really useful in public health, but it could also be used for any data collection you want.

**Please comment on the article here:** **Simply Statistics**

The post How to create a free distributed data collection "app" with R and Google Sheets appeared first on All About Statistics.

The post Not So Standard Deviations Episode 21 – This Might be the Future! appeared first on All About Statistics.

Hilary and I are apart again, and this time we’re talking about political polling. We also discuss Trump’s tweets and the fact that Hilary owns a bowling ball.

Also, Hilary and I have just published a new book, Conversations on Data Science, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!

If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes or Google Play.

Please leave us a review on iTunes!

Support us through our Patreon page.

Show Notes:

Download the audio for this episode.



