
Goodness me! November went by really quickly!

- Bagnato, L., L. De Capitani, & A. Punzo, 2016. Testing for serial independence: Beyond the portmanteau approach. *American Statistician*, in press.
- Aastveit, K.A., C. Foroni, & F. Ravazzolo, 2016. Density forecasts with MIDAS models. *Journal of Applied Econometrics*, in press.
- Chang, C-L. & M. McAleer, 2016. A simple test for causality in volatility. Discussion Paper TI 2016-094/III, Tinbergen Institute.
- Frank, J. & B. Klar, 2016. Methods to test for equality of two normal distributions. *Statistical Methods and Applications*, 25, 581-599.
- Gao, J., G. Pan, & Y. Yang, 2016. Estimation of structural breaks in large panels with cross-sectional dependence. Working Paper 12/16, Department of Econometrics and Business Statistics, Monash University.
- Skeels, C.L. & F. Windmeijer, 2016. On the Stock-Yogo tables. Discussion Paper 16/679, Department of Economics, University of Bristol.

**Please comment on the article here:** **Econometrics Beat: Dave Giles' Blog**

The post December Reading List appeared first on All About Statistics.


Paul Alper points to this excellent news article by Aaron Carroll, who tells us how little information is available in studies of diet and public health. Here’s Carroll:

Just a few weeks ago, a study was published in the Journal of Nutrition that many reports in the news media said proved that honey was no better than sugar as a sweetener, and that high-fructose corn syrup was no worse. . . .

Not so fast. A more careful reading of this research would note its methods. The study involved only 55 people, and they were followed for only two weeks on each of the three sweeteners. . . . The truth is that research like this is the norm, not the exception. . . .

Readers often ask me how myths about nutrition get perpetuated and why it’s not possible to do conclusive studies to answer questions about the benefits and harms of what we eat and drink.

Good question. Why is it that supposedly evidence-based health recommendations keep changing?

Carroll continues:

Almost everything we “know” is based on small, flawed studies. . . . This is true not only of the newer work that we see, but also the older research that forms the basis for much of what we already believe to be true. . . .

The honey study is a good example of how research can become misinterpreted. . . . A 2011 systematic review of studies looking at the effects of artificial sweeteners on clinical outcomes identified 53 randomized controlled trials. That sounds like a lot. Unfortunately, only 13 of them lasted for more than a week and involved at least 10 participants. Ten of those 13 trials had a Jadad score — which is a scale from 0 (minimum) to 5 (maximum) to rate the quality of randomized control trials — of 1. This means they were of rather low quality. None of the trials adequately concealed which sweetener participants were receiving. The longest trial was 10 weeks in length.

According to Carroll, that’s it:

This is the sum total of evidence available to us. These are the trials that allow articles, books, television programs and magazines to declare that “honey is healthy” or that “high fructose corn syrup is harmful.” This review didn’t even find the latter to be the case. . . .

My point is not to criticize research on sweeteners. This is the state of nutrition research in general. . . .

I just have one criticism. Carroll writes:

The outcomes people care about most — death and major disease — are actually pretty rare.

Death isn’t so rare. Everyone dies! Something like 1/80 of the population dies every year. The challenge is connecting the death to a possible cause such as diet.

Carroll also talks about the expense and difficulty of doing large controlled studies. Which suggests to me that we should be able to do better in our observational research. I don’t know exactly how to do it, but there should be some useful bridge between available data, on one hand, and experiments with N=55, on the other.

**P.S.** I followed a link to another post by Carroll which includes this crisp graph:

The post So little information to evaluate effects of dietary choices appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post So little information to evaluate effects of dietary choices appeared first on All About Statistics.


One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio (about 4.448 newton-seconds per pound-second), and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other, and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
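To see the scale-invariance concretely, here is a small sketch (the impulse values are invented; the only fact used is the standard conversion of roughly 4.448 newtons per pound-force):

```r
# Correlation cannot tell a measurement apart from a re-scaled copy of it:
# multiplying by any fixed positive constant leaves the correlation at 1.
impulse_lbfs <- c(10, 25, 40, 55)   # hypothetical impulses, pound-seconds
k <- 4.448                          # approx. newton-seconds per pound-second
impulse_ns <- k * impulse_lbfs      # the same impulses, newton-seconds
cor(impulse_lbfs, impulse_ns)       # 1, even though the numbers differ
```

A score that cannot distinguish the two unit systems cannot catch this class of error.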

The need for a convenient, direct F-test that does not accidentally trigger the implicit re-scaling associated with calculating a correlation is one of the reasons we supply the `sigr` R library. However, even then things can become confusing.

Please read on for a nasty little example.

Consider the following “harmless data frame.”

```
d <- data.frame(prediction = c(0, 0, -1, -2, 0, 0, -1, -2),
                actual     = c(2, 3, 1, 2, 2, 3, 1, 2))
```

The recommended test for checking the quality of “`prediction`” relative to “`actual`” is an F-test (this is the test `stats::lm` uses). We can directly run this test with `sigr` (assuming we have installed the package) as follows:

```
sigr::formatFTest(d, 'prediction', 'actual', format = 'html')$formatStr
```

**F Test** summary: (*R*^{2}=-16, *F*(1,6)=-5.6, *p*=n.s.).

`sigr` reports an R-squared of -16 (please see here for some discussion of R-squared). This may be confusing, but it correctly communicates that we have no model and in fact “`prediction`” is worse than just using the average (a very traditional null model).
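We can check those numbers by hand. This is our own arithmetic, not `sigr` code: with 8 rows and one model parameter, R² = 1 − RSS/TSS and F(1,6) = R² / ((1 − R²)/6).

```r
# Recomputing sigr's R-squared and F statistic directly from the data.
prediction <- c(0, 0, -1, -2, 0, 0, -1, -2)
actual     <- c(2, 3, 1, 2, 2, 3, 1, 2)
rss <- sum((actual - prediction)^2)     # residual sum of squares: 66
tss <- sum((actual - mean(actual))^2)   # total sum of squares: 4
r2  <- 1 - rss / tss                    # -15.5, reported as -16
f   <- (r2 / 1) / ((1 - r2) / 6)        # about -5.6, matching the report
c(R2 = r2, F = f)
```

A negative R² (and hence a negative “F”) is exactly how a prediction worse than the grand mean shows up in this scoring.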

However, `cor.test` appears to think “`prediction`” is a usable model:

```
cor.test(d$prediction, d$actual)

	Pearson's product-moment correlation

data:  d$prediction and d$actual
t = 1.1547, df = 6, p-value = 0.2921
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3977998  0.8697404
sample estimates:
      cor 
0.4264014 
```

This is all for a prediction where `sum((d$actual-d$prediction)^2)==66`, which is larger than `sum((d$actual-mean(d$actual))^2)==4`. We concentrate on effect measures (such as R-squared), as we can drive the p-values wherever we want just by adding more data rows. Our point is: you are worse off using this model than using the mean value of the actual (2) as your constant predictor. To my mind that is not a good prediction. And `lm` seems similarly excited about “`prediction`”.

```
summary(lm(actual ~ prediction, data = d))

Call:
lm(formula = actual ~ prediction, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.90909 -0.43182  0.09091  0.52273  0.72727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.2727     0.3521   6.455 0.000655 ***
prediction    0.3636     0.3149   1.155 0.292121    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7385 on 6 degrees of freedom
Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04545 
F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921
```

One reason not to trust the `lm` result is that it didn’t score the quality of “`prediction`”; it scored the quality of “`0.3636*prediction + 2.2727`”. It can be the case that “`0.3636*prediction + 2.2727`” is in fact a good predictor. But that doesn’t help us if it is “`prediction`” we are showing to our boss or putting into production. We can *try* to mitigate this by insisting `lm` stay closer to the original prediction, turning off the intercept (or offset) with the “`0+`” notation. That looks like the following.

```
summary(lm(actual ~ 0 + prediction, data = d))

Call:
lm(formula = actual ~ 0 + prediction, data = d)

Residuals:
   Min     1Q Median     3Q    Max 
  0.00   0.00   1.00   2.25   3.00 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
prediction  -1.0000     0.6094  -1.641    0.145

Residual standard error: 1.927 on 7 degrees of freedom
Multiple R-squared:  0.2778,	Adjusted R-squared:  0.1746 
F-statistic: 2.692 on 1 and 7 DF,  p-value: 0.1448
```

Even the `lm(0+)`’s adjusted prediction is bad, as we see below:

```
d$lmPred <- predict(lm(actual ~ 0 + prediction, data = d))
sum((d$actual - d$lmPred)^2)

[1] 26
```

Yes, the `lm(0+)` found a way to improve the prediction; but the improved prediction is still very bad (worse than using a well chosen constant). And it is hard to argue that “`-prediction`” is the same model as “`prediction`”.

Now `sigr` is fairly new code, so it is a bit bold to say it is right when it disagrees with the standard methods. However, `sigr` is right in this case. The standard methods are not so much wrong as different, for two reasons:

- They are answering different questions. The F-test is designed to check if the predictions in hand are good or not; “`cor.test`” and “`lm %>% summary`” are designed to check if any re-scaling of the prediction is in fact good. These are different questions. Using “`cor.test`” or “`lm %>% summary`” to test the utility of a potential variable is a good idea: the reprocessing hidden in these tests is consistent with the later use of a variable in a model. Using them to score model results that are supposed to be used directly is wrong.
- From the standard R code’s point of view it isn’t obvious what the right “null model” is. Remember our original point: the quality measures on `lm(0+)` are designed to see how well `lm(0+)` is working. This means `lm(0+)` scores the quality of its output (not its inputs), so it gets credit for flipping the sign on the prediction. Also, it considers the natural null model to be one it can form where there are no variable-driven effects. Since there is no intercept or “dc-term” in these models (caused by the “`0+`” notation), the grand average is not considered a plausible null model, as it isn’t in the concept space of the modeling situation the `lm` was presented with. Or, from `help(summary.lm)`:

R^2, the ‘fraction of variance explained by the model’,

R^2 = 1 – Sum(R[i]^2) / Sum((y[i]- y*)^2),

where y* is the mean of y[i] if there is an intercept and zero otherwise.

I admit, this *is* very confusing. But it corresponds to the documentation and makes sense from a modeling perspective. It is correct. The silent switching of the null model from the average to zero makes sense in the context it is defined in. It does not make sense for testing our prediction, but that is just one more reason to use the proper F-test directly instead of trying to hack “`cor.test`” or “`lm(0+) %>% summary`” into calculating it for you.

And that is what `sigr` is about: standard tests (using `R`-supplied implementations) with a slightly different calling convention to better document intent (which in our case is almost always measuring the quality of a model separately from model construction). It is a new library, so it doesn’t yet have the documentation needed to achieve its goal, but we will eventually get there.

**Please comment on the article here:** **Statistics – Win-Vector Blog**

The post Be careful evaluating model predictions appeared first on All About Statistics.


Ari Lamstein writes:

I chuckled when I read your recent “R Sucks” post. Some of the comments were a bit … heated … so I thought to send you an email instead.

I agree with your point that some of the datasets in R are not particularly relevant. The way that I’ve addressed that is by adding more interesting datasets to my packages. For an example of this you can see my blog post choroplethr v3.1.0: Better Summary Demographic Data. By typing just a few characters you can now view eight demographic statistics (race, income, etc.) of each state, county and zip code in the US. Additionally, mapping the data is trivial.

I haven’t tried this myself, but assuming it works . . . that’s great to be able to make maps of American Community Survey data at the zipcode level!

The post Some U.S. demographic data at zipcode level conveniently in R appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Some U.S. demographic data at zipcode level conveniently in R appeared first on All About Statistics.


Lately, I've been publicising quite heavily our Summer school and new MSc, but of course we're not the only ones planning interesting things worth mentioning; well, of course this is highly subjective... But then again, this blog is (mainly) about Bayesian stuff, so what's the problem with that?...

Anyway, I know of at least a couple of very interesting events:

1) Petros' course on Decision modeling using R, in Toronto, in February 2017. Last year he kindly invited me and I gave some sort of BCEA tutorial, which I really enjoyed.

2) Emmanuel's summer school on advanced Bayesian methods, in Leuven, in September 2017 (I think their website is not live yet, but info will be available at the i-Biostat website). I think they'll do a three-day course on non-parametric Bayesian methods and then a two-day course on Bayesian clinical trials.

**Please comment on the article here:** **Gianluca Baio's blog**

The post Good stuff around appeared first on All About Statistics.


Nate Silver agrees with me that much of that shocking 2% swing can be explained by systematic differences between sample and population: survey respondents included too many Clinton supporters, even after corrections from existing survey adjustments.

In Nate’s words, “Pollsters Probably Didn’t Talk To Enough White Voters Without College Degrees.” Last time we looked carefully at this, my colleagues and I found that pollsters weighted for sex x ethnicity and age x education, but not by ethnicity x education.

I could see that this could be an issue. It goes like this: Surveys typically undersample less-educated people, I think even relative to their proportion of voters. So you need to upweight the less-educated respondents. But less-educated respondents are more likely to be African Americans and Latinos, so this will cause you to upweight these minority groups. Once you’re through with the weighting (whether you do it via Mister P or classical raking or Bayesian Mister P), you’ll end up matching your target population on ethnicity and education, but not on their interaction, so you could end up with too few low-income white voters.
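A toy illustration of the issue (all numbers invented): two tables of ethnicity-by-education proportions can agree on both margins, as raking forces, yet still disagree cell by cell.

```r
# Rows: white / nonwhite; columns: no college degree / college degree.
pop <- matrix(c(0.40, 0.20,
                0.25, 0.15), nrow = 2, byrow = TRUE)  # population cells
adj <- matrix(c(0.35, 0.25,
                0.30, 0.10), nrow = 2, byrow = TRUE)  # a margin-matched sample
rowSums(pop); rowSums(adj)  # ethnicity margins agree: 0.60, 0.40
colSums(pop); colSums(adj)  # education margins agree: 0.65, 0.35
c(pop[1, 1], adj[1, 1])     # but white non-graduates: 0.40 vs 0.35
```

Weighting on each margin separately leaves the interaction cell free to drift, which is exactly the "too few less-educated white voters" failure mode.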

There’s also the gender gap: you want the right number of low-income white male and female voters in each category. In particular, we found that in 2016 the gender gap increased with education, so if your sample gets some of these interactions wrong, you could be biased.

Also a minor thing: Back in the 1990s the ethnicity categories were just white / other and there were 4 education categories: no HS / HS / some college / college grad. Now we use 4 ethnicity categories (white / black / hisp / other) and 5 education categories (splitting college grad into college grad / postgraduate degree). Still just 2 sexes though. For age, I think the standard is 18-29, 30-44, 45-64, and 65+. But given how strongly nonresponse rates vary by age, it could make sense to use more age categories in your adjustment.

Anyway, Nate’s headline makes sense to me. One thing surprises me, though. He writes, “most pollsters apply demographic weighting by race, age and gender to try to compensate for this problem. It’s less common (although by no means unheard of) to weight by education, however.” Back when we looked at this, a bit over 20 years ago, we found that some pollsters didn’t weight at all, some weighted only on sex, and some weighted on sex x ethnicity and age x education. The surveys that did very little weighting relied on the design to get a more representative sample, either using quota sampling or using tricks such as asking for the youngest male adult in the household.

Also, Nate writes, “the polls may not have reached enough non-college voters. It’s a bit less clear whether this is a longstanding problem or something particular to the 2016 campaign.” All the surveys I’ve seen (except for our Xbox poll!) have massively underrepresented young people, and this has gone back for decades. So no way it’s just 2016! That’s why survey organizations adjust for age. There’s always a challenge, though, in knowing what distribution to adjust to, as we don’t know turnout until after the election—and not even then, given all the problems with exit polls.

**P.S.** The funny thing is, back in September, Sam Corbett-Davies, David Rothschild, and I analyzed some data from a Florida poll and came up with the estimate that Trump was up by 1 in that state. This was a poll where the other groups analyzing the data estimated Clinton up by 1, 3, or 4 points. So, back then, our estimate was that a proper adjustment (in this case, using party registration, which we were able to do because this poll sampled from voter registration lists) would shift the polls by something like 2% (that is, 4% in the differential between the two candidates). But we didn’t really do anything with this. I can’t speak for Sam or David, but I just figured this was just one poll and I didn’t take it so seriously.

In retrospect maybe I should’ve thought more about the idea that mainstream pollsters weren’t adjusting their numbers enough. And in retrospect Nate should’ve thought of that too! Our analysis was no secret; it appeared in the New York Times. So Nate and I were both guilty of taking the easy way out and looking at poll aggregates and not doing the work to get inside the polls. We’re doing that now, in December, but we should’ve been doing it in October. Instead of obsessing about details of poll aggregation, we should’ve been working more closely with the raw data.

**P.P.S.** Could someone please forward this email to Nate? I don’t think he’s getting my emails any more!

The post Survey weighting and that 2% swing appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Survey weighting and that 2% swing appeared first on All About Statistics.


**P**ossibly the last post on random number generation by Kinderman and Monahan’s (1977) *ratio-of-uniform method*. After fiddling with the Gamma(a,1) distribution when a<1 for a while, I indeed figured out a way to produce a bounded set with this method: considering an arbitrary cdf Φ with corresponding pdf φ, the uniform distribution on the set Λ of (u,v)’s in **R⁺**×**X** such that

0≤u≤Φ∘**ƒ**[φ∘Φ⁻¹(u)v]

induces the distribution with density proportional to **ƒ** on φ∘Φ⁻¹(U)V. This set Λ has a boundary that is parameterised as

u=Φ∘**ƒ**(x), v=1/φ∘**ƒ**(x), x∈Χ

which remains bounded in u since Φ is a cdf, and in v if φ has fat enough tails, at both 0 and ∞. When **ƒ** is the Gamma(a,1) density this can be achieved if φ behaves like log(x)² near zero and like an inverse power at infinity. Without getting into all the gory details, a closed-form density φ and cdf Φ can be constructed for all a’s, as shown for a=½ by the boundaries in u and v (yellow) below

which leads to a bounded associated set Λ

At this stage, I remain uncertain of the relevance of such derivations, if only because the set Λ thus derived is ill-suited for uniform draws proposed on the enclosing square box. And also because a Gamma(a,1) simulation can rather simply be derived from a Gamma(a+1,1) simulation. But, who knows?!, there may be alternative usages of this representation, such as innovative slice samplers. Which means the ratio-of-uniform method may reappear on the ‘Og one of those days…


**Please comment on the article here:** **R – Xi'an's Og**

The post ratio-of-uniforms [#4] appeared first on All About Statistics.


About two years ago we published a quick and easy guide to setting up your own RStudio server in the cloud using the Docker service and Digital Ocean. The process is incredibly easy-- about the only cumbersome part is retyping a random password. Today the excitement in virtual private servers is that Amazon is getting into the market, with their Lightsail product. They are not undercutting Digital Ocean entirely-- in fact, their prices look to be just about identical. But Amazon's interface may have some advantages for you, so here's how to get Docker and RStudio running with Amazon Lightsail.

1. Log in to Lightsail

2. Create an Instance; choose the Base OS, and Ubuntu (as of this writing 16.04 LTS)

3. Name it what you like

4. Wait for boot up. Once it's running, click "connect" under the three dots. This opens a console window where you are already logged in, saving some headache vs. Digital Ocean.

5. Time for console commands. Type: `sudo apt-get install docker.io`, then `Y` for yes to add the new material.

6. Type: `sudo service docker start`

7. Now you can start your docker/rstudio container. See our earlier blog post or this link for resources. Shortcuts:

a. Plain RStudio: `sudo docker run -d -p 8787:8787 rocker/rstudio`

b. All of the Hadleyverse: `sudo docker run -d -p 8787:8787 rocker/hadleyverse`

c. Custom username and password: `sudo docker run -d -p 8787:8787 -e USER=ken -e PASSWORD=ken rocker/hadleyverse`

d. Enable root: `sudo docker run -d -p 8787:8787 -e ROOT=TRUE rocker/rstudio`

8. The public IP is printed on your instance's Networking page. Cut and paste it into your browser with `:8787` appended. Your username and password are both `rstudio`, unless you changed them. To allow additional users onto your cloud server, see this page.

**Please comment on the article here:** **SAS and R**

The post RStudio in the cloud with Amazon Lightsail and docker appeared first on All About Statistics.


After spending a day the other week struggling to make sense of a federal data set shared in an archaic format (an ASCII fixed-width `dat` file), I started thinking about how we save and share data.

It is essential for the effective distribution and sharing of data that it use the minimum amount of disk space and be rapidly accessible for use by potential users.

In this post I test four different file formats available to R users. These formats are: comma separated values `csv` (`write.csv()`), object representation format as ASCII `txt` (`dput()`), a serialized R object `rds` (`saveRDS()`), and a Stata file `dta` (`write.dta()` from the `foreign` package). For reference, `rds` files seem to be identical to `Rdata` files except that they deal with only one object rather than potentially multiple objects.
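The four writers can be compared in a few lines. This is a minimal sketch, not the post's actual simulation; the data frame here is a small stand-in:

```r
# Write one small data frame in all four formats and compare file sizes.
library(foreign)  # for write.dta()

d <- data.frame(index = 1:10000,
                noise = rnorm(10000),
                fac   = factor(sample(c("fdsg", "jfkd", "jfht", "ejft"),
                                      10000, replace = TRUE)))

files <- c(csv = tempfile(fileext = ".csv"),
           txt = tempfile(fileext = ".txt"),
           rds = tempfile(fileext = ".rds"),
           dta = tempfile(fileext = ".dta"))

write.csv(d, files["csv"], row.names = FALSE)  # comma separated values
dput(d, files["txt"])                          # ASCII object representation
saveRDS(d, files["rds"])                       # serialized (compressed) R object
write.dta(d, files["dta"])                     # Stata format

sort(file.size(files))  # rds is typically the smallest of the four here
```

Note `saveRDS()` gzip-compresses by default, which is a large part of its advantage in the uncompressed comparisons below.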

In order to get an idea of how and where different formats outperformed each other I simulated a dataset composed of different common data formats. These formats were the following:

**Numeric Formats**

- Index 1 to N - ex. 1,2,3,4,...
- Whole Numbers - ex. 30, 81, 73, 5, ...
- Big Numbers - ex. 36374.989943, 15280.050850, 5.908210, 79.890601, 2.857904, ...
- Continuous Numbers - ex. 1.1681155, 1.6963295, 0.8964436, -0.5227753, ...

**Text Formats**

- String coded factor variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd ...
- String coded factor variables with 16 characters coded as strings
- String coded factor variables with 64 characters coded as strings
- Factor coded variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd - coded as 1,2,4,3,2, ...
- Factor coded variables with 16 characters
- Factor coded variables with 64 characters
- String variables with random 4 characters - ex. jdhd, jdjj, ienz, lsdk, ...
- String variables with random 16 characters
- String variables with random 64 characters

What type of format a variable is in predicts how much space that variable takes up, and therefore how time consuming that variable is to read or write. Variables that are easy to describe tend to take up little space. An index variable is an extreme example: it can take up almost no space, as it can be expressed in an extremely compact form (`1:N`).

In contrast, numbers which are very long or have a great degree of precision tend to carry more information and therefore take more resources to access and store. String variables filled with truly random or unique responses are some of the hardest data to compress, as each value may be sampled from the full character spectrum. There is significant potential for compression when strings are repeated in the variable. These repetitive entries can be coded either as a "factor" variable or as a string variable in R.

As part of this exploration, I look at how string data is stored and saved when coded as either a string or as a factor within R.

Let's first look at space taken when saving uncompressed files.

**Figure 1: File Size**

Figure 1 shows the file size of each of the saved variables when 10,000 observations are generated. The `dataframe` object is the `data.frame` composed of all of the variables. From the height of the `dataframe` bar, we can see that `rds` is the overall winner. Looking at the other variable values, we can see that `csv` appears to consistently underperform for most variable formats except for random strings.

**Figure 2: File Size Log scaled**

In Figure 2 we can see that `rds` consistently outperforms all of the other formats, with the one exception of `index`, for which the `txt` encoding is simply `1:10000`. Apparently even serializing to bytes can't beat that.

Interestingly, there does not appear to be an effective size difference between repetitive strings encoded as factors when accounting for the size of the strings (4, 16, or 64 characters). We can see that the inability of `csv` to compress factor strings dramatically penalizes the efficiency of `csv` relative to the other formats.

But data is rarely shared in uncompressed formats. How does compression change things?

**Figure 3: Zipped File Sizes Logged**

We can see from Figure 3 that if we zip our data after saving, the file size can do pretty much as well as `rds`. Comma delimited `csv` files are a bit of an exception: factor variables suffer under `csv`, yet random strings perform slightly better under `csv` than under the other formats. Interestingly, `rds` files seem slightly larger than the other two file types. Overall though, it is pretty hard to see any significant difference in file size based on format after zipping.

So, should we stick with whatever format we prefer?

Not so fast. Sure, all of the files are similarly sized after zipping. This is useful for sharing files. But having to keep large files on a hard drive is not ideal even if they can be compressed for distribution. There is finite space on any system, and some files can be in the hundreds of MB to hundreds of GB range. Dealing with file formats and multiple file versions this large can easily drain the permanent storage capacity of most systems.

But an equally important concern is how long it takes to write and read different file formats.

In order to test reading speeds, I loaded each of the different full dataframe files fifty times. I also tested how long it would take to unzip and then load each file.
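A self-contained sketch of that read test (the data frame and repetition count here are stand-ins, not the post's simulated data):

```r
# Write one data frame two ways, then time 50 repeated reads of each copy.
d <- data.frame(x = rnorm(10000), y = rnorm(10000))

f_csv <- tempfile(fileext = ".csv"); write.csv(d, f_csv, row.names = FALSE)
f_rds <- tempfile(fileext = ".rds"); saveRDS(d, f_rds)

t_csv <- system.time(for (i in 1:50) read.csv(f_csv))[["elapsed"]]
t_rds <- system.time(for (i in 1:50) readRDS(f_rds))[["elapsed"]]
c(csv = t_csv, rds = t_rds)   # rds reads are typically much faster
```

The gap widens with file size, since `read.csv` must parse text while `readRDS` deserializes binary directly.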

**Figure 4: Reading and unzipping average speeds**

From Figure 4, we can see that read speeds roughly correspond with file sizes. Even a relatively small file (a 30 MB `csv` file) can take as long as 7 seconds to open. Working with large files saved in an inefficient format can be very frustrating.

In contrast, saving files in efficient formats can dramatically cut down on the time taken opening those files. Using the most efficient format (`rds`), files could be 100 times larger than those used in this simulation and still open in less than a minute.

Finding common file formats that any software can access is not easy. As a result many public data sets are provided in archaic formats which are poorly suited for end users.

Such formats do give a wide pool of software suites the ability to access these datasets. However, with inefficient file formats comes a higher demand on the hardware of end users. I am unlikely to be the only person struggling to open some of these large "public access" datasets.

Those maintaining these datasets will argue that sticking with the standard, inefficient format is the best of bad options. However, there is no reason they could not post datasets in `rds` format in addition to the outdated formats they currently exist in.

And no, we need not argue that selecting one software language to save data in will be biased toward that language. Already many federal databases come with code supplements in Stata, SAS, or SPSS. To access these supplements, one is required to have paid access to that software.

Yet R is free and its data format is public domain. Any user could download R, open an `rds` or `Rdata` file, then save that file in a format more suited to their purposes. None of these other proprietary database formats can boast the same.
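That round trip is genuinely two lines. Here `dataset.rds` is a stand-in name we create first so the example runs on its own:

```r
# Create a small rds file standing in for one a data provider might share.
rds_path <- file.path(tempdir(), "dataset.rds")
saveRDS(data.frame(x = 1:3, y = c("a", "b", "c")), rds_path)

# Any R user can then open it and re-save it in whatever format they need.
d <- readRDS(rds_path)
write.csv(d, file.path(tempdir(), "dataset.csv"), row.names = FALSE)
```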

**Please comment on the article here:** **Econometrics By Simulation**

The post Efficiently Saving and Sharing Data in R appeared first on All About Statistics.


Shea Levy writes:

You ended a post from last month [i.e., Feb.] with the injunction to not take the fact of a paper’s publication or citation status as meaning anything, and instead that we should “read each paper on its own.” Unfortunately, while I can usually follow e.g. the criticisms of a paper you might post, I’m not confident in my ability to independently assess arbitrary new papers I find. Assuming, say, a semester of a biological sciences-focused undergrad stats course and a general willingness and ability to pick up any additional stats theory or practice, what should someone in the relevant fields do to get to the point where they can meaningfully evaluate each paper they come across?

My reply: That’s a tough one. My own view of research papers has become much more skeptical over the years. For example, I devoted several posts to the Dennis-the-Dentist paper without expressing any skepticism at all—and then Uri Simonsohn comes along and shoots it down. So it’s hard to know what to say. I mean, even as of 2007, I think I had a pretty good understanding of statistics and social science. And look at all the savvy people who got sucked into that Bem ESP thing—not that they thought Bem had proved ESP, but many people didn’t realize how bad that paper was, just on statistical grounds.

So what to do to independently assess new papers?

I think you have to go Bayesian. And by that I *don’t* mean you should be assessing your prior probability that the null hypothesis is true. I mean that you have to think about effect sizes, on one side, and about measurement, on the other.

It’s not always easy. For example, I found the claimed effect sizes for the Dennis/Dentist paper to be reasonable (indeed, I posted specifically on the topic). For that paper, the problem was in the measurement, or one might say the likelihood: the mapping from underlying quantity of interest to data.

Other times we get external information, such as the failed replications in ovulation-and-clothing, or power pose, or embodied cognition. But we should be able to do better, as all these papers had major problems which were apparent, even before the failed reps.

One cue which we’ve discussed a lot: if a paper’s claim relies on p-values, and they have lots of forking paths, you might just have to set the whole paper aside.

Medical research: I’ve heard there’s lots of cheating, lots of excluding patients who are doing well under the control condition, lots of ways to get people out of the study, lots of playing around with endpoints.

The trouble is, this is all just a guide to skepticism. But I’m not skeptical about *everything*.

And the solution can’t be to ask Gelman. There’s only one of me to go around! (Or two, if you count my sister.) And I make mistakes too!

So I’m not sure. I’ll throw the question to the commentariat. What do you say?

The post How can you evaluate a research paper? appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post How can you evaluate a research paper? appeared first on All About Statistics.
