The post Incentives Matter (Congress and Wall Street edition) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Thomas Ferguson sends along this paper. From the summary:

Social scientists have traditionally struggled to identify clear links between political spending and congressional voting, and many journalists have embraced their skepticism. A giant stumbling block has been the challenge of measuring the labyrinthine ways money flows from investors, firms, and industries to particular candidates. Ferguson, Jorgensen, and Chen directly tackle that classic problem in this paper. Constructing new data sets that capture much larger swaths of political spending, they show direct links between political contributions to individual members of Congress and key floor votes . . .

They show that prior studies have missed important streams of political money, and, more importantly, they show in detail how past studies have underestimated the flow of political money into Congress. The authors employ a data set that attempts to bring together all forms of campaign contributions from any source—contributions to candidate campaign committees, party committees, 527s or “independent expenditures,” SuperPACs, etc.—and aggregate them by final sources in a unified, systematic way. To test the influence of money on financial regulation votes, they analyze U.S. House of Representatives votes on measures to weaken the Dodd-Frank financial reform bill. Taking care to control for as many factors as possible that could influence floor votes, they focus most of their attention on representatives who originally voted in favor of the bill and subsequently voted to dismantle key provisions of it. Because these are the same representatives, belonging to the same political party, in substantially the same districts, many factors normally advanced to explain vote shifts are ruled out from the start. . . .

The authors test five votes from 2013 to 2015, finding the link between campaign contributions from the financial sector and switching to a pro-bank vote to be direct and substantial. The results indicate that for every $100,000 that Democratic representatives received from finance, the odds they would break with their party’s majority support for the Dodd-Frank legislation increased by 13.9 percent. Democratic representatives who voted in favor of finance often received $200,000–$300,000 from that sector, which raised the odds of switching by 25–40 percent. The authors also test whether representatives who left the House at the end of 2014 behaved differently. They find that these individuals were much more likely to break with their party and side with the banks. . . .
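To make the arithmetic behind that odds statement concrete, here is a hedged sketch: only the 13.9% figure comes from the paper, and the 20% baseline probability below is invented for illustration. The point is that an increase in *odds* is not the same as an increase in *probability*; the implied probability change depends on the baseline rate of switching.

```python
import math

# Per the paper, each $100,000 from finance multiplied the odds that a
# Democrat would defect by 1.139 (a 13.9% increase). On the logistic
# scale that is a coefficient of log(1.139) per $100K.
odds_ratio = 1.139
beta = math.log(odds_ratio)  # about 0.130 per $100K on the log-odds scale

# An odds increase is not a probability increase; the probability bump
# depends on the baseline rate. The 20% baseline here is invented for
# illustration, not taken from the paper.
p0 = 0.20
odds0 = p0 / (1 - p0)
odds1 = odds0 * odds_ratio
p1 = odds1 / (1 + odds1)
print(round(p1, 3))  # 0.222: roughly a 2-point bump at this baseline
```

At higher baselines the same odds ratio implies a different probability change, which is worth keeping in mind when reading the $200,000–$300,000 figures.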

I had a quick question: how do you deal with the correlation/causation issue? The idea that Wall St is giving money to politicians who would already support them? That too is a big deal, of course, but it’s not quite the story Ferguson et al. are telling in the paper.

Ferguson responded:

We actually considered that at some length. That’s why we organized the main discussion on Wall Street and Dodd-Frank around looking at Democratic switchers — people who originally voted for passage (against Wall Street, that is), but then switched in one or more later votes to weaken. Nobody is in that particular regression who didn’t already vote against Wall Street once already, when it really counted.

I replied: Sure, but there’s still the correlation problem, in that one could argue that switchers are people whose latent preferences were closer to the middle, so they were just the ones who were more likely to shift following a change in the political weather.

Ferguson:

Conservatism is controlled for in the analysis, using a measure derived from that Congress. This isn’t going to the middle; it’s a tropism for money. The other obvious comment is that if they are really latent Wall Street lovers, they should be moving mostly in lockstep on the subsequent votes. If you look at our summary nos., you can see they weren’t. We could probably mine that point some more.

Short of administering the MMPI for banks in advance, are you prepared to accept any empirical evidence? Voting against banks in the big one is pretty good, I think.

Me: I’m not sure, I’ll have to think about it. One answer, I think, is that if it’s just $ given to pre-existing supporters of Wall St., it’s still an issue, as the congressmembers are then getting asymmetrically rewarded (votes for Wall St get the reward, votes against don’t get the reward), and, as economists are always telling us, Incentives Matter.

Ferguson:

Remember, those folks who turned on Swaps Push Out didn’t necessarily turn out for the banks on other votes. If it’s “weather,” it’s pretty strange weather.

The post Stan Weekly Roundup, 23 June 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

* Lots of people got involved in pushing Stan 2.16 and interfaces out the door; Sean Talts got the math library, Stan library (that’s the language, inference algorithms, and interface infrastructure), and CmdStan out, while Allen Riddell got PyStan 2.16 out and Ben Goodrich and Jonah Gabry are tackling RStan 2.16

* Stan 2.16 is the last series of releases that will not require C++11; let the coding fun begin!

* Ari Hartikainen (of Aalto University) joined the Stan dev team—he’s working with Allen Riddell on PyStan, where judging from the pull request traffic, he put in a lot of work on the 2.16 release. Welcome!

* Imad Ali’s working on adding more cool features to RStanArm including time series and spatial models; yesterday he and Mitzi were scheming to get intrinsic conditional autoregressive models in and I heard all those time series names flying around (like ARIMA)

* Michael Betancourt rearranged the Stan web site with some input from me and Andrew; Michael added more descriptive text and Sean Talts managed to get the redirects in so all of our links aren’t broken; let us know what you think

* Markus Ojala of Smartly wrote a case study on their blog, Tutorial: How We Productized Bayesian Revenue Estimation with Stan

* Mitzi Morris got in the pull request for adding compound assignment and arithmetic; this adds statements such as `n += 1`.

* Lots of chatter about characterization tests and a pull request from Daniel Lee to update some of our existing performance tests

* Roger Grosse from U. Toronto visited to tell us about the 2016 NIPS paper he wrote with Siddharth Ancha and Daniel Roy on testing MCMC using bidirectional Monte Carlo sampling; we talked about how he modified Stan’s sampler to do annealed importance sampling

* GPU integration continues apace

* I got to listen in as Michael Betancourt and Maggie Lieu of the European Space Institute spent a couple of days hashing out astrophysics models; Maggie would really like us to add integrals.

* Speaking of integration, Marco Inacio has updated his pull request; Michael’s worried there may be numerical instabilities, because trying to calculate arbitrary bounded integrals is not so easy in a lot of cases

* Andrew continues to lobby for being able to write priors directly into parameter declarations; for example, here’s what a hierarchical prior for `beta` might look like:

```stan
parameters {
  real mu ~ normal(0, 2);
  real sigma ~ student_t(4, 0, 2);
  vector[N] beta ~ normal(mu, sigma);
}
```

* I got the go-ahead on adding foreach loops; Mitzi Morris will probably be coding them. We’re talking about syntax like:

```stan
real ys[N];
...
for (y in ys)
  target += log_mix(lambda,
                    normal_lpdf(y | mu[1], sigma[1]),
                    normal_lpdf(y | mu[2], sigma[2]));
```
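For readers who haven’t met `log_mix`: it evaluates the log of a two-component mixture density stably on the log scale. A rough pure-Python equivalent, offered as a sketch for intuition rather than as Stan’s actual implementation:

```python
import math

def log_mix(lam, lp1, lp2):
    """log(lam * exp(lp1) + (1 - lam) * exp(lp2)), computed stably
    via the log-sum-exp trick so neither term underflows."""
    a = math.log(lam) + lp1
    b = math.log1p(-lam) + lp2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def normal_lpdf(y, mu, sigma):
    # log density of Normal(mu, sigma) at y
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

# One term of the mixture likelihood from the foreach-loop example,
# with made-up values lambda = 0.3, mu = (0, 5), sigma = (1, 2), y = 1:
lp = log_mix(0.3, normal_lpdf(1.0, 0.0, 1.0), normal_lpdf(1.0, 5.0, 2.0))
```

Working on the log scale like this is what lets Stan sum over mixture components without numerical underflow.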

* Kalman filter case study by Jouni Helske was discussed on Discourse

* Rob Trangucci rewrote the Gaussian processes chapter of the Stan manual; I’m to blame for the first version, writing it as I was learning GPs. For some reason, it’s not up on the web page doc yet.

* This is a very ad hoc list. I’m sure I missed lots of good stuff, so feel free to either send updates to me directly for next week’s letter or add things to comments. This project’s now way too big for me to track all the activity!

The post Best correction ever: “Unfortunately, the correct values are impossible to establish, since the raw data could not be retrieved.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Several errors and omissions occurred in the reporting of research and data in our paper: “How Descriptive Food Names Bias Sensory Perceptions in Restaurants,” Food Quality and Preference (2005) . . .

The dog ate my data. Damn gremlins. I hate when that happens.

As the saying goes, “Each year we publish 20+ new ideas in academic journals, and we appear in media around the world.” In all seriousness, the problem is not that they publish their ideas, the problem is that they are “changing or omitting data or results such that the research is not accurately represented in the research record.” And of course it’s not just a problem with Mr. Pizzagate or Mr. Gremlins or Mr. Evilicious or Mr. Politically Incorrect Sex Ratios: it’s all sorts of researchers who (a) don’t report what they actually did, and (b) refuse to reconsider their flimsy hypotheses in light of new theory or evidence.

The post Question about the secret weapon appeared first on Statistical Modeling, Causal Inference, and Social Science.

Micah Wright writes:

I first encountered your explanation of secret weapon plots while I was browsing your blog in grad school, and later in your 2007 book with Jennifer Hill. I found them immediately compelling and intuitive, but I have been met with a lot of confusion and some skepticism when I’ve tried to use them. I’m uncertain as to whether it’s me that’s confused, or whether my audience doesn’t get it. I should note that my formal statistical training is somewhat limited—while I was able to take a couple of stats courses during my masters, I’ve had to learn quite a bit on the side, which makes me skeptical as to whether or not I actually understand what I’m doing.

My main question is this: when using the secret weapon, does it make sense to subset the data across any arbitrary variable of interest, as long as you want to see if the effects of other variables vary across its range? My specific case concerns tree growth (ring widths). I’m interested to see how the effect of competition (crowding and other indices) on growth varies at different temperatures, and if these patterns change in different locations (there are two locations). To do this, I subset the growth data in two steps: first by location, then by each degree of temperature, which I rounded to the nearest integer. I then ran the same linear model on each subset. The model had growth as the response, and competition variables as predictors, which were standardized. I’ve attached the resulting figure [see above], which plots the change in effect for each predictor over the range of temperature.

My reply: I like these graphs! In future you might try a 6 x K grid, where K is the number of different things you’re plotting. That is, right now you’re wasting one of your directions because your 2 x 3 grid doesn’t mean anything. These plots are fine, but if you have more information for each of these predictors, you can consider plotting the existing information as six little graphs stacked vertically and then you’ll have room for additional columns. In addition, you should make the tick marks much smaller, put the labels closer to the axes, and reduce the number of axis labels, especially on the vertical axes. For example, (0.0, 0.3, 0.6, 0.9) can be replaced by labels at 0, 0.5, 1.

Regarding the larger issue of, what is the secret weapon, as always I see it as an approximation to a full model that bridges the different analyses. It’s a sort of nonparametric analysis. You should be able to get better estimates by using some modeling, but a lot of that smoothing can be done visually anyway, so the secret weapon gets you most of the way there, and in my view it’s much much better than the usual alternative of fitting a single model to all the data *without* letting all the coefficients vary.
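The mechanics of the secret weapon are simple enough to sketch. Here is a hedged toy example with simulated data (the variable names and numbers are invented, not Wright’s tree-ring analysis): fit the identical model separately within each level of the subsetting variable and collect the coefficients, which is exactly what gets plotted.

```python
import random

random.seed(0)

# Simulated data: the effect of predictor x on outcome y varies with
# an integer "temperature" bin (names invented for illustration).
data = []
for _ in range(600):
    t = random.randint(10, 15)
    x = random.gauss(0, 1)
    y = 0.1 * t * x + random.gauss(0, 0.5)
    data.append((t, x, y))

# The secret weapon: fit the same simple regression within each bin
# and collect the slope estimates.
results = []
for t in sorted({row[0] for row in data}):
    xs = [x for (tt, x, _) in data if tt == t]
    ys = [y for (tt, _, y) in data if tt == t]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    results.append((t, slope))

# `results` is the plot data: one slope estimate per temperature bin,
# ideally drawn with standard-error bars on the vertical axis.
```

Each `(bin, slope)` pair becomes one point in the secret-weapon plot, so the eye does the smoothing that a full hierarchical model would do formally.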

The post “Developers Who Use Spaces Make More Money Than Those Who Use Tabs” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Rudy Malka writes:

I think you’ll enjoy this nice piece of pop regression by David Robinson: developers who use spaces make more money than those who use tabs. I’d like to know your opinion about it.

At the above link, Robinson discusses a survey that allows him to compare salaries of software developers who use tabs to those who use spaces. The key graph is above. Robinson found similar results after breaking down the data by country, job title, or computer language used, and it also showed up in a linear regression controlling in a simple way for a bunch of factors.

As Robinson put it in terms reminiscent of our Why Ask Why? paper:

This is certainly a surprising result, one that I didn’t expect to find when I started exploring the data. . . . I tried controlling for many other confounding factors within the survey data beyond those mentioned here, but it was difficult to make the effect shrink and basically impossible to make it disappear.

Speaking with the benefit of hindsight—that is, seeing Robinson’s results and assuming they are a correct representation of real survey data—it all makes sense to me. Tabs seem so amateurish, I much prefer spaces—2 spaces, not 4, please!!!—so from that perspective it makes sense to me that the kind of programmers who use tabs tend to be programmers with poor taste and thus, on average, of lower quality.

I just want to say one thing. Robinson writes, “Correlation is not causation, and we can never be sure that we’ve controlled for all the confounding factors present in a dataset.” But this isn’t quite the point. Or, to put it another way, I think he has the right instinct here but isn’t quite presenting the issue precisely. To see why, suppose the survey had only 2 questions: How much money do you make? and Do you use spaces or tabs? And suppose we had no other information on the respondents. And, for that matter, suppose there was no nonresponse and that we had a simple random sample of all programmers from some specified set of countries. In that case, we’d know for sure that there are no other confounding factors in the dataset, as the dataset is nothing but those two columns of numbers. But we’d still be able to come up with a zillion potential explanations.
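The point that a clean two-column dataset rules out confounders *in the data* but not alternative explanations is easy to demonstrate by simulation. This is a toy sketch with invented names and numbers, not Robinson’s survey: a latent trait drives both the editor habit and the salary, while the habit itself has zero causal effect by construction.

```python
import math
import random

random.seed(1)

# Latent "experience" (invented here) drives both editor habit and
# salary; tabs vs. spaces has no causal effect at all in this world.
rows = []
for _ in range(10_000):
    experience = random.gauss(0, 1)
    uses_spaces = random.random() < 1 / (1 + math.exp(-experience))
    salary = 60_000 + 10_000 * experience + random.gauss(0, 5_000)
    rows.append((uses_spaces, salary))

spaces = [s for u, s in rows if u]
tabs = [s for u, s in rows if not u]
gap = sum(spaces) / len(spaces) - sum(tabs) / len(tabs)
# The two-column dataset (uses_spaces, salary) contains no confounders
# to control for, yet shows a sizable raw salary gap with no causation.
```

The resulting `gap` is large and positive even though switching a programmer’s editor settings would change nothing, which is exactly why the descriptive-versus-causal wording matters.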

To put it another way, the descriptive comparison is interesting in its own right, and we just should be careful about misusing causal language. Instead of saying, “using spaces instead of tabs leads to an 8.6% higher salary,” we could say, “comparing two otherwise similar programmers, the one who uses spaces has, on average, an 8.6% higher salary than the one who uses tabs.” That’s a bit of a mouthful—but such a mouthful is necessary to accurately describe the comparison that’s being made.

The post Time-sharing Experiments for the Social Sciences appeared first on Statistical Modeling, Causal Inference, and Social Science.

Time-sharing Experiments for the Social Sciences (TESS) is an NSF-funded initiative. Investigators propose survey experiments to be fielded using a nationally representative Internet platform via NORC’s AmeriSpeak® Panel (see http://tessexperiments.org for more information).

In an effort to enable younger scholars to field larger-scale studies than what TESS normally conducts, we are pleased to announce a Special Competition for Young Investigators. While anyone can submit at any time through TESS’s regular proposal mechanism, this Special Competition is limited to graduate students and individuals who are no more than 3 years post-PhD. Winning projects will be allowed to be fielded at up to twice the budget of a regular TESS study.

We will begin accepting proposals for the Special Competition on August 1, 2017, and the deadline is October 1, 2017. Full details about the competition, including what is required of proposals and how to submit, are available at http://www.tessexperiments.org/yic.html; that page should be reviewed by anyone entering the competition.

The post After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project. appeared first on Statistical Modeling, Causal Inference, and Social Science.

A new paper in the prestigious journal PNAS contains a rather glaring blooper. . . . right there in the abstract, which states that “three neuropeptides (β-endorphin, oxytocin, and dopamine) play particularly important roles” in human sociality. But dopamine is not a neuropeptide. Neither are serotonin or testosterone, but throughout the paper, Pearce et al. refer to dopamine, serotonin and testosterone as ‘neuropeptides’. That’s just wrong. A neuropeptide is a peptide active in the brain, and a peptide in turn is the term for a molecule composed of a short chain of amino acids. Neuropeptides include oxytocin, vasopressin, and endorphins – which do feature in the paper. But dopamine and serotonin aren’t peptides, they’re monoamines, and testosterone isn’t either, it’s a steroid. This isn’t a matter of opinion, it’s basic chemistry.

The error isn’t just an isolated typo: ‘neuropeptide’ occurs 27 times in the paper, while the correct terms for the non-peptides are never used.

Neuroskeptic speculates on how this error got in:

It’s a simple mistake; presumably whoever wrote the paper saw oxytocin and vasopressin referred to as “neuropeptides” and thought that the term was a generic one meaning “signalling molecule.” That kind of mistake could happen to anyone, so we shouldn’t be too harsh on the authors . . .

The authors of the paper work in a psychology department, so I guess they’re rusty on their organic chemistry.

Fair enough; I haven’t completed a chemistry class since 11th grade, and I didn’t know what a peptide is, either. Then again, I’m not writing articles on peptides for the National Academy of Sciences.

But how did this get through the review process? Let’s take a look at the published article:

Ahhhh, now I understand. The editor is Susan Fiske, notorious as the person who opened the gates of PPNAS for the articles on himmicanes, air rage, and ages ending in 9. I wonder who were the reviewers of this new paper. Nobody who knows what a peptide is, I guess. Or maybe they just read it very quickly, flipped through to the graphs and the conclusions, and didn’t read a lot of the words.

Did you catch that? Neuroskeptic refers to “the prestigious journal PNAS.” That’s PPNAS for short. This is fine, I guess. Maybe the science is ok. Based on a quick scan of the paper, I don’t think we should take a lot of the specific claims seriously, as they seem to be based on the difference between “significant” and “non-significant.”

In particular, I’m not quite sure what their support is for the statement from the abstract that “each neuropeptide is quite specific in its domain of influence.” They’re rejecting various null hypotheses, but I don’t know that this supports their substantive claims in the way that they’re saying.
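A quick numerical illustration of why leaning on the significant/non-significant boundary is shaky (the numbers here are invented, not from the paper; this is the familiar Gelman and Stern point that the difference between “significant” and “not significant” is not itself statistically significant):

```python
import math

# Invented estimates: the first is "significant" (z = 2.5), the second
# "not significant" (z = 1.0), yet their difference is far from
# significant, so the contrast between them carries little evidence.
est1, se1 = 0.25, 0.10
est2, se2 = 0.10, 0.10

z_diff = (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)
print(round(z_diff, 2))  # 1.06
```

So a pattern of one rejected null and one non-rejected null, on its own, does not establish that one neurochemical matters in a domain and another does not.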

I might be missing something here—I might be missing a lot—but in any case there seem to be some quality control problems at PPNAS. This should be no surprise: PPNAS is a huge journal, publishing over 3000 papers each year.

On their website they say, “PNAS publishes only the highest quality scientific research,” but this statement is simply false. I can’t really comment on this particular paper—it doesn’t seem like “the highest quality scientific research” to me, but, again, maybe I’m missing something big here. But I can assure you that the papers on himmicanes, air rage, and ages ending in 9 are *not* “the highest quality scientific research.” They’re not high quality research at all! What they are, is low-quality research that happens to be high-quality clickbait.

OK, let’s be fair. This is not a problem unique to PPNAS. The Lancet publishes crap papers, Psychological Science published crap papers, even JASA and APSR have their share of duds. Statistical Science, to its eternal shame, published that Bible Code paper in 1994. That’s fine, it’s how the system operates. Editors are only human.

But, really, do we have to make statements that we know are false? **Platitudes are fine but let’s avoid intentional untruths.**

**So, instead of “PNAS publishes only the highest quality scientific research,” how about this: “PNAS aims to publish only the highest quality scientific research.”** That’s fair, no?

**P.S.** Here’s a fun little graphics project: Redo Figure 1 as a lineplot. You’ll be able to show a lot more comparisons much more directly using lines rather than bars. The current grid of barplots is not the worst thing in the world—it’s much better than a table—but it could be much improved.

**P.P.S.** Just to be clear: (a) I don’t know anything about peptides so I’m offering no independent judgment of the paper in question; (b) whatever the quality is of this particular paper, does not affect my larger point that PPNAS publishes some really bad papers and so they should change their slogan to something more accurate.

**P.P.P.S.** The relevant Pubpeer page pointed to the following correction note that was posted on the PPNAS site after I wrote the above post but before it was posted:

The authors wish to note, “We used the term ‘neuropeptide’ in referring to the set of diverse neurochemicals that we examined in this study, some of which are not peptides; dopamine and serotonin are neurotransmitters and should be listed as such, and testosterone should be listed as a steroid. Our usage arose from our primary focus on the neuropeptides endorphin and oxytocin. Notwithstanding the biochemical differences between these neurochemicals, we note that these terminological issues have no implications for the significance of the findings reported in this paper.”

The post On deck through the rest of the year (and a few to begin 2018) appeared first on Statistical Modeling, Causal Inference, and Social Science.

- After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.
- “Developers Who Use Spaces Make More Money Than Those Who Use Tabs”
- Question about the secret weapon
- Incentives Matter (Congress and Wall Street edition)
- Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.
- Problems with the jargon “statistically significant” and “clinically significant”
- Capitalist science: The solution to the replication crisis?
- Bayesian, but not Bayesian enough
- Let’s stop talking about published research findings being true or false
- Plan 9 from PPNAS
- No, I’m not blocking you or deleting your comments!
- “Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.”
- “The Null Hypothesis Screening Fallacy”?
- What is a pull request?
- Turks need money after expensive weddings
- Statisticians and economists agree: We should learn from data by “generating and revising models, hypotheses, and data analyzed in response to surprising findings.”
- My unpublished papers
- Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories
- Night Hawk
- Why they aren’t behavioral economists: Three sociologists give their take on “mental accounting”
- Further criticism of social scientists and journalists jumping to conclusions based on mortality trends
- Daryl Bem and Arthur Conan Doyle
- Classical statisticians as Unitarians
- Slaying Song
- What is “overfitting,” exactly?
- Graphs as comparisons: A case study
- Should we continue not to trust the Turk? Another reminder of the importance of measurement
- “The ‘Will & Grace’ Conjecture That Won’t Die” and other stories from the blogroll
- His concern is that the authors don’t control for the position of games within a season.
- How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?
- “Bayes factor”: where the term came from, and some references to why I generally hate it
- A stunned Dyson
- Applying human factors research to statistical graphics
- Recently in the sister blog
- Adding a predictor can increase the residual variance!
- Died in the Wool
- “Statistics textbooks (including mine) are part of the problem, I think, in that we just set out ‘theta’ as a parameter to be estimated, without much reflection on the meaning of ‘theta’ in the real world.”
- An improved ending for The Martian
- Delegate at Large
- Iceland education gene trend kangaroo
- Reproducing biological research is harder than you’d think
- The fractal zealots
- Giving feedback indirectly by invoking a hypothetical reviewer
- It’s hard to know what to say about an observational comparison that doesn’t control for key differences between treatment and control groups, chili pepper edition
- PPNAS again: If it hadn’t been for the jet lag, would Junior have banged out 756 HRs in his career?
- Look. At. The. Data. (Hollywood action movies example)
- “This finding did not reach statistical significance, but it indicates a 94.6% probability that statins were responsible for the symptoms.”
- Wolfram on Golomb
- Irwin Shaw, John Updike, and Donald Trump
- What explains my lack of openness toward this research claim? Maybe my cortex is just too damn thick and wrinkled
- I love when I get these emails!
- Consider seniority of authors when criticizing published work?
- Does declawing cause harm?
- Bird fight! (Kroodsma vs. Podos)
- The Westlake Review
- “Social Media and Fake News in the 2016 Election”
- Also holding back progress are those who make mistakes and then label correct arguments as “nonsensical.”
- Just google “Despite limited statistical power”
- It is somewhat paradoxical that good stories tend to be anomalous, given that when it comes to statistical data, we generally want what is typical, not what is surprising. Our resolution of this paradox is . . .
- “Babbage was out to show that not only was the system closed, with a small group controlling access to the purse strings and the same individuals being selected over and again for the few scientific honours or paid positions that existed, but also that one of the chief beneficiaries . . . was undeserving.”
- Irish immigrants in the Civil War
- Mixture models in Stan: you can use log_mix()
- Don’t always give ’em what they want: Practicing scientists want certainty, but I don’t want to offer it to them!
- Cumulative residual plots seem like they could be useful
- Sucker MC’s keep falling for patterns in noise
- Nice interface, poor content
- “From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotter’s Club on the grounds that Jonathan Coe was just making it all up.”
- Chris Moore, Guy Molyneux, Etan Green, and David Daniels on Bayesian umpires
- Using statistical prediction (also called “machine learning”) to potentially save lots of resources in criminal justice
- “Mainstream medicine has its own share of unnecessary and unhelpful treatments”
- What are best practices for observational studies?
- The Groseclose endgame: Getting from here to there.
- Causal identification + observational study + multilevel model
- All cause and breast cancer specific mortality, by assignment to mammography or control
- Iterative importance sampling
- Rosenbaum (1999): Choice as an Alternative to Control in Observational Studies
- Gigo update (“electoral integrity project”)
- How to design and conduct a subgroup analysis?
- Local data, centralized data analysis, and local decision making
- Too much backscratching and happy talk: Junk science gets to share in the reputation of respected universities
- Selection bias in the reporting of shaky research: An example
- Self-study resources for Bayes and Stan?
- Looking for the bottom line
- “How conditioning on post-treatment variables can ruin your experiment and what to do about it”
- Trial by combat, law school style
- Causal inference using data from a non-representative sample
- Type M errors studied in the wild
- Type M errors in the wild—really the wild!
- Where does the discussion go?
- Maybe this paper is a parody, maybe it’s a semibluff
- As if the 2010s never happened
- Using black-box machine learning predictions as inputs to a Bayesian analysis
- It’s not enough to be a good person and to be conscientious. You also need good measurement. Cargo-cult science done very conscientiously doesn’t become good science, it just falls apart from its own contradictions.
- Air rage update
- Getting the right uncertainties when fitting multilevel models
- Chess records page
- Weisburd’s paradox in criminology: it can be explained using type M errors
- “Cheerleading with an agenda: how the press covers science”
- Automated Inference on Criminality Using High-tech GIGO Analysis
- Some ideas on using virtual reality for data visualization: I don’t really agree with the details here but it’s all worth discussing
- Contribute to this pubpeer discussion!
- For mortality rate junkies
- The “fish MRI” of international relations studies.
- “5 minutes? Really?”
- 2 quick calls
- Should we worry about rigged priors? A long discussion.
- I’m not on twitter
- I disagree with Tyler Cowen regarding a so-called lack of Bayesianism in religious belief
- “Why bioRxiv can’t be the Central Service”
- Sudden Money
- The house is stronger than the foundations
- Please contribute to this list of the top 10 do’s and don’ts for doing better science
- Partial pooling with informative priors on the hierarchical variance parameters: The next frontier in multilevel modeling
- Does racquetball save lives?
- When do we want evidence-based change? Not “after peer review”
- “I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.”
- “Bayesian evidence synthesis”
- Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??”
- Beyond forking paths: using multilevel modeling to figure out what can be learned from this survey experiment
- From perpetual motion machines to embodied cognition: The boundaries of pseudoscience are being pushed back into the trivial.
- Why I think the top batting average will be higher than .311: Over-pooling of point predictions in Bayesian inference
- “La critique est la vie de la science”: I kinda get annoyed when people set themselves up as the voice of reason but don’t ever get around to explaining what’s the unreasonable thing they dislike.
- How to discuss your research findings without getting into “hypothesis testing”?
- Does traffic congestion make men beat up their wives?
- The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and reporting
- I think it’s great to have your work criticized by strangers online.
- In the open-source software world, bug reports are welcome. In the science publication world, bug reports are resisted, opposed, buried.
- If you want to know about basketball, who ya gonna trust, the Irene Blecker Rosenfeld Professor of Psychology at Cornell University and author of “The Wisest One in the Room: How You Can Benefit from Social Psychology’s Most Powerful Insights,” . . . or that poseur Phil Jackson??
- Quick Money
- An alternative to the superplot
- Where the money from Wiley Interdisciplinary Reviews went . . .
- Retract or correct, don’t delete or throw into the memory hole
- Using Mister P to get population estimates from respondent driven sampling
- “Americans Greatly Overestimate Percent Gay, Lesbian in U.S.”
- “It all reads like a classic case of faulty reasoning where the reasoner confuses the desirability of an outcome with the likelihood of that outcome.”
- Pseudoscience and the left/right whiplash
- The time reversal heuristic (priming and voting edition)
- The Night Riders
- Why you can’t simply estimate the hot hand using regression
- Stan to improve rice yields
- When people proudly take ridiculous positions
- “A mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state,” and other notes on “Whither Science?” by Danko Antolovic
- Noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger.
- What should this student do? His bosses want him to p-hack and they don’t even know it!
- Fitting multilevel models when predictors and group effects correlate
- I hate that “Iron Law” thing
- High five: “Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking.”
- “What is a sandpit?”
- No no no no no on “The oldest human lived to 122. Why no person will likely break her record.”
- Tips when conveying your research to policymakers and the news media
- Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs.
- Spatial models for demographic trends?
- A pivotal episode in the unfolding of the replication crisis
- We start by talking reproducible research, then we drift to a discussion of voter turnout
- Wine + Stan + Climate change = ?
- Stan is a probabilistic programming language
- Using output from a fitted machine learning algorithm as a predictor in a statistical model
- Poisoning the well with a within-person design? What’s the risk?
- “Dear Professor Gelman, I thought you would be interested in these awful graphs I found in the paper today.”
- I know less about this topic than I do about Freud.
- Driving a stake through that ages-ending-in-9 paper
- What’s the point of a robustness check?
- Oooh, I hate all talk of false positive, false negative, false discovery, etc.
- Trouble Ahead
- A new definition of the nerd?
- Orphan drugs and forking paths: I’d prefer a multilevel model but to be honest I’ve never fit such a model for this sort of problem
- Popular expert explains why communists can’t win chess championships!
- The four missing books of Lawrence Otis Graham
- “There was this prevalent, incestuous, backslapping research culture. The idea that their work should be criticized at all was anathema to them. Let alone that some punk should do it.”
- Loss of confidence
- “How to Assess Internet Cures Without Falling for Dangerous Pseudoscience”
- Ed Jaynes outta control!
- A reporter sent me a Jama paper and asked me what I thought . . .
- Workflow, baby, workflow
- Two steps forward, one step back
- Yes, you can do statistical inference from nonrandom samples. Which is a good thing, considering that nonrandom samples are pretty much all we’ve got.
- The Night Riders
- The piranha problem in social psychology / behavioral economics: The “take a pill” model of science eats itself
- Ready Money
- Stranger than fiction
- “The Billy Beane of murder”?
- Red doc, blue doc, rich doc, rich doc
- Working Class Postdoc
- “We wanted to reanalyze the dataset of Nelson et al. However, when we asked them for the data, they said they would only share the data if we were willing to include them as coauthors.”
- UNDER EMBARGO: the world’s most unexciting research finding
- Setting up a prior distribution in an experimental analysis
- Walk a Crooked Mile
- It’s . . . spam-tastic!
- The failure of null hypothesis significance testing when studying incremental changes, and what to do about it
- Robust standard errors aren’t for me
- Stupid-ass statisticians don’t know what a goddam confidence interval is
- Forking paths plus lack of theory = No reason to believe any of this.
- Turn your scatterplots into elegant apparel and accessories!
- Your (Canadian) tax dollars at work

And a few to begin 2018:

- The Ponzi threshold and the Armstrong principle
- I’m with Errol: On flypaper, photography, science, and storytelling
- Politically extreme yet vital to the nation
- How does probabilistic computation differ in physics and statistics?
- “Each computer run would last 1,000-2,000 hours, and, because we didn’t really trust a program that ran so long, we ran it twice, and it verified that the results matched. I’m not sure I ever was present when a run finished.”

Enjoy.

We’ll also intersperse topical items as appropriate.

The post On deck through the rest of the year (and a few to begin 2018) appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Not everyone’s aware of falsificationist Bayes appeared first on Statistical Modeling, Causal Inference, and Social Science.

Daniel Lakens recently blogged about philosophies of science and how they relate to statistical philosophies. I thought it may be of interest to you. In particular, this statement:

From a scientific realism perspective, Bayes Factors or Bayesian posteriors do not provide an answer to the main question of interest, which is the verisimilitude of scientific theories. Belief can be used to decide which questions to examine, but it can not be used to determine the truth-likeness of a theory.

My response, TLDR:

1) frequentism and NP require more subjectivity than they’re given credit for (assumptions, belief in perfectly known sampling distributions, Beta [and thus type-2 error ‘control’] requires subjective estimate of the alternative effect size)

2) Bayesianism isn’t inherently more subjective, it just acknowledges uncertainty given the data [still data-driven!]

3) Popper probably wouldn’t like the NHST ritual, given that we use p-values to support hypotheses, not to refute an accepted hypothesis [the nil-hypothesis of 0 is not an accepted hypothesis in most cases]

4) Refuting falsifiable hypotheses can be done in Bayes, which is largely what Popper cared about anyway

5) Even in a NP or LRT framework, people don’t generally care about EXACT statistical hypotheses, they care about substantive hypotheses, which map to a range of statistical/estimate hypotheses, and YET people don’t test the /range/, they test point values; bayes can easily ‘test’ the hypothesized range.

My [Martin’s] full response is here.
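Martin’s point 5 is easy to make concrete: once you have a posterior, “testing” a substantive range is just a probability calculation. Here is a minimal sketch with a conjugate normal model and made-up numbers (the interval (0.2, 0.5) stands in for whatever range the substantive hypothesis maps to):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Conjugate normal-normal model: prior theta ~ N(0, 10^2),
# data y_i ~ N(theta, 1); suppose n = 100 with sample mean 0.3.
prior_mu, prior_sd = 0.0, 10.0
n, ybar, sigma = 100, 0.3, 1.0

post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mu = post_var * (prior_mu / prior_sd**2 + n * ybar / sigma**2)
post_sd = math.sqrt(post_var)

# Posterior probability that theta falls in the substantive range,
# rather than a point test of theta == 0 exactly.
p_range = normal_cdf(0.5, post_mu, post_sd) - normal_cdf(0.2, post_mu, post_sd)
print(round(p_range, 2))  # about 0.82
```

The point is that nothing special is needed to “test” the range: it is just an ordinary posterior probability.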

I agree with everything that Martin writes above. And, for that matter, I agree with most of what Lakens wrote too. The starting point for all of this is my 2011 article, Induction and deduction in Bayesian data analysis. Also relevant are my 2013 article with Shalizi, Philosophy and the practice of Bayesian statistics and our response to the ensuing discussion, and my recent article with Hennig, Beyond subjective and objective in statistics.

Lakens covers the same Popper-Lakatos ground that we do, although he (Lakens) doesn’t appear to be aware of the falsificationist view of Bayesian data analysis, as expressed in chapter 6 of BDA and the articles listed above. Lakens is stuck in a traditionalist view of Bayesian inference as based on subjectivity and belief, rather than what I consider a more modern approach of conditionality, where Bayesian inference works out the implications of a statistical model or system of assumptions, the better to allow us to reveal problems that motivate improvements and occasional wholesale replacements of our models.

Overall I’m glad Lakens wrote his post because he’s reminding people of important issues that are not handled well in traditional frequentist or subjective-Bayes approaches, and I’m glad that Martin filled in some of the gaps. The audience for all of this seems to be psychology researchers, so let me re-emphasize a point I’ve made many times, the distinction between statistical models and scientific models. A statistical model is necessarily specific, and we should avoid the all-too-common mistake of rejecting some uninteresting statistical model and taking this as evidence for a preferred scientific model. That way lies madness.


The post Breaking the dataset into little pieces and putting it back together again appeared first on Statistical Modeling, Causal Inference, and Social Science.

I was a little surprised that your blog post with the three smaller studies versus one larger study question received so many comments, and also that so many people seemed to come down on the side of three smaller studies. I understand that Stephen’s framing led to some confusion as well as practical concerns, but I thought the intent of the question was pretty straightforward.

At the risk of beating a dead horse, I wanted to try asking the question a different way: if you conducted a study (or your readers, if you want to put this on the blog), would you ever divide up the data into smaller chunks to see if a particular result appeared in each subset? Ignoring cases where you might want to examine qualitatively different groups, of course; would you ever try to make fundamentally homogeneous/equivalent subsets? Would you ever advise that someone else do so?

For those caught up in the details, assume an extremely simple design. A simple comparison of two groups ending in a (Bayesian) t-test with no covariates, nothing fancy. In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment. Would you forego analyzing the entire sample and instead break them into three random chunks?

My personal experience is that empirically speaking, no one does this. Except for cases where people are interested in avoiding model overfitting and so use some kind of cross validation or training set vs testing set paradigm, I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions. The blog comments, however, seem to come down on the side of this being a good practice. Are you (or your readers) going to start doing this?

My reply:

From a Bayesian standpoint, the result is the same, whether you consider all the data at once, or stir in the data one-third at a time. The problem would come if you make intermediate decisions that involve throwing away information, for example if you take parts of the data and just describe them as statistically significant or not.
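That equivalence can be checked directly in a conjugate example: update on the whole sample at once, or stir in the data one-third at a time with each posterior serving as the next prior, and the final posterior is identical. A minimal sketch (normal model with known variance, simulated data, not from any real study):

```python
import random

random.seed(1)
y = [random.gauss(0.5, 1.0) for _ in range(450)]  # 450 per group, as in the question

def update(mu0, sd0, data, sigma=1.0):
    """Conjugate normal update: prior N(mu0, sd0^2), likelihood y_i ~ N(theta, sigma^2)."""
    prec = 1.0 / sd0**2 + len(data) / sigma**2
    var = 1.0 / prec
    mu = var * (mu0 / sd0**2 + sum(data) / sigma**2)
    return mu, var**0.5

# All the data at once ...
mu_all, sd_all = update(0.0, 10.0, y)

# ... versus three chunks, stirred in sequentially.
mu_seq, sd_seq = 0.0, 10.0
for chunk in (y[:150], y[150:300], y[300:]):
    mu_seq, sd_seq = update(mu_seq, sd_seq, chunk)

assert abs(mu_all - mu_seq) < 1e-9 and abs(sd_all - sd_seq) < 1e-9
```

The assertion passes: with no information thrown away between stages, the batched and sequential analyses agree exactly, which is the Bayesian answer to the three-studies-versus-one question.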


The post Don’t say “improper prior.” Say “non-generative model.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

In Bayesian Data Analysis, we write, “In general, we call a prior density p(θ) *proper* if it does not depend on data and integrates to 1.” This was a step forward from the usual understanding, which is that a prior density is improper if it has an infinite integral.

But I’m not so thrilled with the term “proper” because it has different meanings for different people.

Then the other day I heard Dan Simpson and Mike Betancourt talking about “non-generative models,” and I thought, Yes! this is the perfect term! First, it’s unambiguous: a non-generative model is a model for which it is not possible to generate data. Second, it makes use of the existing term, “generative model,” hence no need to define a new concept of “proper prior.” Third, it’s a statement about the model as a whole, not just the prior.

I’ll explore the idea of a generative or non-generative model through some examples:

Classical iid model, y_i ~ normal(theta, 1), for i=1,…,n. This is not generative because there’s no rule for generating theta.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with uniform prior density, p(theta) proportional to 1 on the real line. This is not generative because you can’t draw theta from a uniform on the real line.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with data-based prior, theta ~ normal(y_bar, 10), where y_bar is the sample mean of y_1,…,y_n. This model is not generative because to generate theta, you need to know y, but you can’t generate y until you know theta.

In contrast, consider a Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with non-data-based prior, theta ~ normal(0, 10). This is generative: you draw theta from the prior, then draw y given theta.
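That last model is the only one of the four you can actually simulate end to end, which is the whole point of the term. A quick sketch (illustration only, not code from BDA):

```python
import random

random.seed(42)

def generate(n):
    """Generative model: draw theta ~ N(0, 10), then y_i ~ N(theta, 1)."""
    theta = random.gauss(0.0, 10.0)
    y = [random.gauss(theta, 1.0) for _ in range(n)]
    return theta, y

theta, y = generate(100)

# The non-generative variants fail at step one: you cannot draw theta
# "uniformly on the real line," and a data-based prior needs y_bar
# before any y exists.
```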

Some subtleties do arise. For example, we’re implicitly conditioning on n. For the model to be fully generative, we’d need a prior distribution for n as well.

Similarly, for a regression model to be fully generative, you need a prior distribution on x.

Non-generative models have their uses; we should just recognize when we’re using them. I think the traditional classification of priors, labeling them as improper if they have infinite integral, does not capture the key aspects of the problem.

**P.S.** Also relevant is this comment, regarding some discussion of models for the n:

As in many problems, I think we get some clarity by considering an existing problem as part of a larger hierarchical model or meta-analysis. So if we have a regression with outcomes y, predictors x, and sample size n, we can think of this as one of a larger class of problems, in which case it can make sense to think of n and x as varying across problems.

The issue is not so much whether n is a “random variable” in any particular study (although I will say that, in real studies, n typically is not precisely defined ahead of time, what with difficulties of recruitment, nonresponse, dropout, etc.) but rather that n can vary across the reference class of problems for which a model will be fit.


The post Where’d the $2500 come from? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Sometimes I read the New York Times “Well” articles on science and health. It’s a mixed bag, sometimes it’s quite good and sometimes not. I came across this yesterday:

What’s the Value of Exercise? $2,500

For people still struggling to make time for exercise, a new study offers a strong incentive: You’ll save $2,500 a year.

The savings, a result of reduced medical costs, don’t require much effort to accrue — just 30 minutes of walking five days a week is enough.

The findings come from an analysis of 26,239 men and women, published today in the Journal of the American Heart Association. . . .

I [Buchsbaum] thought: I wonder where the number came from? So I tracked down the paper referred to in the article (which was unhelpfully not linked or properly named).

I was horrified to find that the $2500 figure appears to be nowhere in the paper (see table 2). Moreover, the closest number I could find ($1900) was based on a regression model that did not adjust for age, sex, ethnicity, income, or anything else. Of course older people exercise less and spend more on healthcare!

I sent the following email (see below) to the NYTimes author, but she has not responded.

At any rate, I thought this example of very high-profile science-blogging to be particularly egregious, so I thought I’d bring it to your attention.

The research article is Economic Impact of Moderate-Vigorous Physical Activity Among Those With and Without Established Cardiovascular Disease: 2012 Medical Expenditure Panel Survey, by Javier Valero-Elizondo, Joseph Salami, Chukwuemeka Osondu, Oluseye Ogunmoroti, Alejandro Arrieta, Erica Spatz, Adnan Younus, Jamal Rana, Salim Virani, Ron Blankstein, Michael Blaha, Emir Veledar, and Khurram Nasir.

And here’s Buchsbaum’s letter to Gretchen Reynolds, the author of that news article:

I very much enjoy your health articles for the New York Times. Sometimes I try and find the paper and examine the data, just for my own benefit.

After perusing the paper, I was not quite sure where the $2500 figure came from. In table 2 (see attached paper), the unadjusted expenditures are reported over all subjects.

non-optimal PA: $5397, optimal PA: $3443 for a difference of $1900.

This is close to $2500 but your number is higher.

However, remember, this is an *unadjusted model*. It does not account for age, sex, family income, race/ethnicity, insurance type, geographical location or comorbidity.

In other words, it’s a virtually useless model.

Let’s look at Model 3, which does account for the above factors.

non-optimal PA: $4867, optimal PA: $4153 for a difference of $714

So $714 is closer to the mark.

BUT, this includes ALL subjects, including those with cardiovascular disease (CVD).

If you look at people without CVD then the estimates depend on the cardiovascular risk profile (CRF). If you have an average or optimal profile then the difference is around $430 or $493. If you have a “poor” profile, then the difference is around $1060 (although the 95% confidence intervals overlapped, meaning the effect was not reliable).

What is my conclusion?

I’m afraid the title of your article is misleading, since its figure is larger (by $600) than the $1900 estimate based on the meaningless unadjusted model! Even if the title were “What’s the Value of Exercise? $700”, it would still be misleading, because it implicitly assumes a causal relationship between exercise and expenditure.

Remember also that the adjusted variables are only the measures the authors happened to record. There are dozens of potentially other mediating variables which are related to both physical exercise and health expenditures. Including these other adjusting factors might further reduce the estimates.

Best Regards,

It’s just a news article so some oversimplification is perhaps unavoidable. But I do wonder where the $2500 number came from. I’m guessing it’s from some press release but I don’t know.

Also, I’m surprised the reporter didn’t respond to the email. But maybe New York Times reporters get too many emails to respond to, or even read. I should also emphasize that I did not read that news article or the scientific paper in detail, so I’m not endorsing (or disagreeing with) Buchsbaum’s claim. Here I’m just interested in the general challenge of tracking down numbers like that $2500 that have no apparent source.


The post Stan Weekly Roundup, 16 June 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

- stan-dev GitHub organization: this is the home of all of our source repos; design discussions are on the Stan Wiki

- Stan Discourse Groups: this is the home of our user and developer lists (they’re all open); feel free to join the discussion—we try to be friendly and helpful in our responses, and there is a lot of statistical and computational expertise in the wings from our users, who are increasingly joining the discussion. By the way, thanks for that—it takes a huge load off us to get great answers from users to other user questions. We’re up to about 15 active discussion threads a day (active topics in the last 24 hours include AR(K) models, web site reorganization, ragged arrays, order statistic priors, new R packages built on top of Stan, docker images for Stan on AWS, and many more!)

OK, let’s get started with the weekly review, though this is a special summer double issue, just like the *New Yorker*.

**Your news here**: If you have any Stan news you’d like to share, please let me know at carp@alias-i.com (we’ll probably get a more standardized way to do this in the future).

**New web site**: Michael Betancourt redesigned the Stan web site; hopefully this will be easier to use. We’re no longer trying to track the literature. If you want to see the Stan literature in progress, do a search for “Stan Development Team” or “mc-stan.org” on Google Scholar; we can’t keep up! Do let us know either in an issue on GitHub for the web site or in the user group on Discourse if you have comments or suggestions.

**New user and developer lists**: We’ve shuttered our Google group and moved to Discourse for both our user and developer lists (they’re consolidated now in categories on one list). It’s easy to sign up with GitHub or Google IDs and much easier to search and use online.

See Stan Discourse Groups and, for the old discussions, Stan’s shuttered Google group for users and Stan’s shuttered Google group for developers. We’re not removing any of the old content, but we are prohibiting new posts.

**GPU support**: Rok Cesnovar and Steve Bronder have been getting GPU support working for linear algebra operations. They’re starting with Cholesky decomposition because it’s a bottleneck for Gaussian process (GP) models and because it has the pleasant property of being quadratic in data and cubic in computation.

See math pull request 529
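The “quadratic in data, cubic in computation” point: for n data points a GP covariance matrix has n² entries, and factoring it costs O(n³) flops, so this single operation dominates GP fitting and is a natural first target for a GPU. For reference, here is the factorization itself in a tiny pure-Python sketch (real code would call an optimized BLAS or GPU kernel; this just exposes the nested loops behind the n³):

```python
def cholesky(A):
    """Lower-triangular L with L L^T = A; note the O(n^3) nested loops."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (A[i][i] - s) ** 0.5
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

# A small symmetric positive-definite matrix, e.g. a toy GP kernel matrix.
A = [[4.0, 2.0, 0.6],
     [2.0, 5.0, 1.5],
     [0.6, 1.5, 3.0]]
L = cholesky(A)
```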

**Distributed computing support**: Sebastian Weber is leading the charge into distributed computing using the MPI framework (multi-core or multi-machine) by essentially coding up map-reduce for derivatives inside of Stan. Together with GPU support, distributed computing of derivatives will give us a TensorFlow-like flexibility to accelerate computations. Sebastian’s also looking into parallelizing the internals of the Boost and CVODES ordinary differential equation (ODE) solvers using OpenCL.

See math issue 101 and math issue 551.

**Logging framework**: Daniel Lee added a logging framework to Stan to allow finer-grained control of Stan’s output messages.

**Operands and partials**: Sean Talts finished the refactor of our underlying operands and partials data structure, which makes it much simpler to write custom derivative functions.

See pull request 547

**Autodiff testing framework**: Bob Carpenter finished the first use case for a generalized autodiff tester to test all of our higher-order autodiff thoroughly

See math pull request 562

**C++11**: We’re all working toward the 2.16 release, which will be our last release before we open the gates of C++11 (and some of C++14). This is going to make our code a whole lot easier to write and maintain, and will open up awesome possibilities like having closures to define lambdas within the Stan language, as well as consolidating many of our uses of Boost into the standard template library.

**Append arrays**: Ben Bales added signatures for `append_array`, to work like our appends for vectors and matrices.

See pull request 554 and pull request 550

**ODE system size checks**: Sebastian Weber pushed a bug fix that cleans up ODE system size checks to avoid seg faults at run time.

See pull request 559

**RNG consistency in transformed data**: A while ago we relaxed the generated-quantities-only nature of `_rng` functions by allowing them in transformed data (so you can fit fake data generated wholly within Stan or represent posterior uncertainty of some other process, allowing “cut”-like models to be formulated as a two-stage process); Mitzi Morris just cleaned these up so we use the same RNG seed for all chains so that we can perform convergence monitoring; multiple replications would then be done by running the whole multi-chain process multiple times.

See Stan pull request 2313

**NSF Grant: CI-SUSTAIN: Stan for the Long Run**: We (Bob Carpenter, Andrew Gelman, Michael Betancourt) were just awarded an NSF grant for Stan sustainability. This was a follow-on from the first Computing Research Infrastructure (CRI) grant we got after building the system. Yea! This adds roughly a year of funding for the team at Columbia University. Our goal is to put in governance processes for sustaining the project as well as shore up all of our unit tests and documentation.

**Hiring**: We hired two full-time Stan staff at Columbia: Sean Talts as a developer and Breck Baldwin as a business manager for the project. Sean had already been working as a contractor for us, hence all the pull requests. (Pro tip: The best way to get a foot in the door for an open-source project is to submit a useful pull request.)


The post SPEED: Parallelizing Stan using the Message Passing Interface (MPI) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Sebastian Weber writes:

Bayesian inference has to overcome tough computational challenges and thanks to Stan we now have a scalable MCMC sampler available. For a Stan model running NUTS, the computational cost is dominated by gradient calculations of the model log-density as a function of the parameters. While NUTS is scalable to huge parameter spaces, this scalability becomes more of a theoretical one as the computational cost explodes. Models which involve ordinary differential equations (ODE) are such an example, where the runtimes can be of the order of days.

The obvious speedup when using Stan is to run multiple chains at the same time on different computer cores. However, this cannot reduce the total runtime per chain, which requires within-chain parallelization.

Hence, a viable approach is to parallelize the gradient calculation within a chain. As many Bayesian models facilitate hierarchical models over groupings we can often calculate contributions to the log-likelihood separately for each of these groups.

Therefore, the concept of an embarrassingly parallel program can be applied in this setting, i.e. one can calculate these independent work chunks on separate CPU cores and then collect the results.
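The pattern Sebastian describes is plain map-reduce over groups. This is not the Stan/MPI code itself (that lives in the prototype linked later in the post); it is just a serial Python sketch of the decomposition, with a made-up normal log-likelihood, showing why the per-group terms and their gradients can be computed independently and then summed:

```python
import math

def group_loglik_and_grad(args):
    """Log-likelihood and d(ll)/d(mu) for one group's data under y_i ~ N(mu, 1)."""
    mu, y = args
    ll = sum(-0.5 * (yi - mu) ** 2 - 0.5 * math.log(2 * math.pi) for yi in y)
    grad = sum(yi - mu for yi in y)
    return ll, grad

mu = 1.0
groups = [[0.9, 1.2], [1.4], [0.7, 1.1, 1.0]]  # hypothetical hierarchical groups

# Map step: each call is independent, so MPI can farm these out to workers ...
parts = [group_loglik_and_grad((mu, y)) for y in groups]

# ... reduce step: the root process just sums the returned values and gradients.
total_ll = sum(ll for ll, _ in parts)
total_grad = sum(g for _, g in parts)
```

Swapping the list comprehension for a pool of worker processes (or MPI ranks) changes nothing in the reduce step, which is exactly what makes the problem embarrassingly parallel.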

For reasons implied by Stan’s internals (the gradient calculation must not run in a threaded program) we are restricted in applicable techniques. One possibility is the Message Passing Interface (MPI), which spawns multiple CPU cores by firing off independent processes. A root process sends packets of work (sets of parameters) to the child nodes, which do the work and then send back the results (function return values and the gradients). A first toy example (3 ODEs, 7 parameters) shows dramatic speedups. Starting from a 1-core runtime of 5.2 hours, we can crank it down to just 17 minutes by using all 20 cores of a single machine (an 18x speedup). MPI also scales across machines, and when throwing 40 cores at the problem we are down to 10 minutes, which is “only” a 31x speedup (see the above plot).

Of course, the MPI approach works best on clusters with many CPU cores. Overall, this is fantastic news for big models as this opens the door to scale out large problems onto clusters which are available nowadays in many research facilities. The source code for this prototype is on our github repository. This code should be regarded as working research code and we are currently working on bringing this feature into the main Stan distribution.

Wow. This is a big deal. There are lots of problems where this method will be useful.

**P.S.** What’s with the weird y-axis labels on that graph? I think it would work better to just go 1, 2, 4, 8, 16, 32 on both axes. I like the wall-time markings on the line, though; that helped me follow what was going on.


The post Pizzagate gets even more ridiculous: “Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature . . . in the later study they again found the exact opposite, but did not comment on the discrepancy.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Several months ago, Jordan Anaya, Tim van der Zee, and Nick Brown reported that they’d uncovered 150 errors in 4 papers published by Brian Wansink, a Cornell University business school professor and who describes himself as a “world-renowned eating behavior expert for over 25 years.”

150 errors is pretty bad! I make mistakes myself and some of them get published, but one could easily go through an entire career publishing fewer than 150 mistakes. So many in a set of four papers is kind of amazing.

After the Anaya et al. paper came out, people dug into other papers of Wansink and his collaborators and found lots more errors.

Wansink later released a press release pointing to a website which he said contained data and code from the 4 published papers.

In that press release he described his lab as doing “great work,” which seems kinda weird to me, given that their published papers are of such low quality. Usually we would think that if a lab does great work, this would show up in its publications, but this did not seem to have happened in this case.

In particular, even if the papers in question had no data-reporting errors at all, we would have no reason to believe any of the scientific claims that were made therein, as these claims were based on p-values computed from comparisons selected from uncontrolled and abundant researcher degrees of freedom. These papers are exercises in noise mining, not “great work” at all, not even good work, not even acceptable work.

**The new paper**

As noted above, Wansink shared a document that he said contained the data from those studies. In a new paper, Anaya, van der Zee, and Brown analyzed this new dataset. They report some mistakes they (Anaya et al.) had made in their earlier paper, and many places where Wanink’s papers misreported his data and data collection protocols.

Some examples:

All four articles claim the study was conducted over a 2-week period, however the senior author’s blog post described the study as taking one month (Wansink, 2016), the senior author told Retraction Watch it was a two-month study (McCook, 2017b), a news article indicated the study was at least 3 weeks long (Lazarz, 2007), and the data release states the study took place from October 18 to December 8, 2007 (Wansink and Payne, 2007). Why the articles claimed the study only took two weeks when all the other reports indicate otherwise is a mystery.

Furthermore, articles 1, 2, and 4 all claim that the study took place in spring. For the Northern Hemisphere spring is defined as the months March, April, and May. However, the news report was dated November 18, 2007, and the data release states the study took place between October and December.

And this:

Article 1 states that the diners were asked to estimate how much they ate, while Article 3 states that the amount of pizza and salad eaten was unobtrusively observed, going so far as to say that appropriate subtractions were made for uneaten pizza and salad. Adding to the confusion, Article 2 states:

“Unfortunately, given the field setting, we were not able to accurately measure consumption of non-pizza food items.”

In Article 3 the tables included data for salad consumed, so this statement was clearly inaccurate.

And this:

Perhaps the most important question is why did this study take place? In the blog post the senior author did mention having a “Plan A” (Wansink, 2016), and in a Retraction Watch interview revealed that the original hypothesis was that people would eat more pizza if they paid more (McCook, 2017a). The origin of this “hypothesis” is likely a previous study from this lab, at a different pizza buffet, with nearly identical study design (Just and Wansink, 2011). In that study they found diners who paid more ate significantly more pizza, but the released data set for the present study actually suggests the opposite, that diners who paid less ate more. So was the goal of this study to replicate their earlier findings? And if so, did they find it concerning that not only did they not replicate their earlier result, but found the exact opposite? Did they not think this was worth reporting?

Another similarity between the two pizza studies is the focus on taste of the pizza. Article 1 specifically states: “Our reading of the literature leads us to hypothesize that one would rate pizza from an $8 pizza buffet as tasting better than the same pizza at a $4 buffet.”

Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature, because in that paper they found ratings for overall taste, taste of first slice, and taste of last slice to all be higher in the lower price group, albeit with different levels of significance (Just and Wansink, 2011). However, in the later study they again found the exact opposite, but did not comment on the discrepancy.

Anaya et al. summarize:

Of course, there is a parsimonious explanation for these contradictory results in two apparently similar studies, namely that one or both sets of results are the consequence of modeling noise. Given the poor quality of the released data from the more recent articles . . . it seems quite likely that this is the correct explanation for the second set of studies, at least.

And this:

No good theory, no good data, no good statistics, no problem. Again, see here for the full story.

**Not the worst of it**

And, remember, those 4 pizzagate papers are *not* the worst things Wansink has published. They’re only the first four articles that anyone bothered to examine carefully enough to see all the data problems.

There was this example dug up by Nick Brown:

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results . . . Here is a plot of the number of times each decimal digit appears in the last position in these tables:

These don’t look so much like real data, but they *do* seem consistent with someone making up numbers and not wanting them to seem too round, and not being careful to include enough 0’s and 5’s in the last digits.
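Brown’s last-digit check is easy to reproduce. Here is a minimal sketch in Python, using made-up values rather than Wansink’s actual reported statistics: tally the final printed digits and compare them to a uniform distribution with a chi-square statistic.

```python
from collections import Counter

def last_digit_counts(values):
    """Tally the final printed digit of each reported statistic."""
    return Counter(str(v)[-1] for v in values)

def chi_square_uniform(counts, categories=10):
    """Chi-square statistic for digit counts against a uniform distribution."""
    n = sum(counts.values())
    expected = n / categories
    return sum((counts.get(str(d), 0) - expected) ** 2 / expected
               for d in range(categories))

# Hypothetical reported means and F statistics (illustrative only):
reported = ["2.31", "4.17", "3.89", "1.23", "6.47", "2.96", "5.13",
            "3.72", "4.81", "2.64", "3.19", "5.87", "1.46", "4.93"]
stat = chi_square_uniform(last_digit_counts(reported))
# With 9 degrees of freedom, the 5% critical value is about 16.9. Genuinely
# measured data should have roughly uniform last digits, while invented
# numbers (too few 0's and 5's, say) push the statistic upward.
print(round(stat, 2))
```

The point is not any single test but that the last digits of honestly measured quantities carry essentially no information, so any strong pattern in them is a red flag.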

And this discovery by Tim van der Zee:

Wansink, B., Cheney, M. M., & Chan, N. (2003). Exploring comfort food preferences across age and gender. Physiology & Behavior, 79(4), 739-747.

Citations: 334

Using the provided summary statistics such as mean, test statistics, and additional given constraints it was calculated that the data set underlying this study is highly suspicious. For example, given the information which is provided in the article the response data for a Likert scale question should look like this:

Furthermore, although this is the most extreme possible version given the constraints described in the article, it is still not consistent with the provided information.

In addition, there are more issues with impossible or highly implausible data.
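Checks like van der Zee’s can be partly automated. As one simple example of the genre, here is a sketch of a GRIM-style consistency test (after Brown and Heathers): the mean of n integer Likert responses must equal some integer total divided by n, so many reported rounded means are simply impossible. The numbers below are hypothetical, not from Wansink’s papers.

```python
def grim_consistent(reported_mean, n, lo=1, hi=9, decimals=2):
    """GRIM-style check: the mean of n integer responses on a lo..hi scale
    must be k/n for some integer total k, so a reported rounded mean is
    feasible only if some such k/n rounds to the same value."""
    return any(round(k / n, decimals) == round(reported_mean, decimals)
               for k in range(lo * n, hi * n + 1))

# A mean of 3.57 from n = 28 responses is feasible (100/28 = 3.5714...):
print(grim_consistent(3.57, 28))   # True
# A mean of 3.51 from n = 20 is impossible: means move in steps of 1/20,
# so 3.50 and 3.55 are the nearest achievable values:
print(grim_consistent(3.51, 20))   # False
```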

And:

Sığırcı, Ö, Rockmore, M., & Wansink, B. (2016). How traumatic violence permanently changes shopping behavior. Frontiers in Psychology, 7,

Citations: 0

This study is about World War II veterans. Given the mean age stated in the article, the distribution of age can only look very similar to this:

The article claims that the majority of the respondents were 18 to 18.5 years old at the end of WW2 whilst also having experienced repeated heavy combat. Almost no soldiers could have had any other age than 18.

In addition, the article claims over 20% of the war veterans were women, while women only officially obtained the right to serve in combat very recently.

There’s lots more at the link.

From the NIH guidelines on research misconduct:

Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

The post Pizzagate gets even more ridiculous: “Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature . . . in the later study they again found the exact opposite, but did not comment on the discrepancy.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Ride a Crooked Mile appeared first on Statistical Modeling, Causal Inference, and Social Science.

As many of us rely (in part) on p values when trying to make sense of the data, I am sending a link to a paper Patrick Heck and I published in Frontiers in Psychology. The goal of this work is not to fan the flames of the already overheated debate, but to provide some estimates of what p can and cannot do. Statistical inference will always require experience and good judgment regardless of which school of thought (Bayesian, frequentist, or other) we lean toward.

I have three reactions.

**1.** I don’t think there’s any “overheated debate” about the p-value; it’s a method that has big problems and is part of the larger problem that is null hypothesis significance testing (see my article, The problems with p-values are not just with p-values); also p-values are widely misunderstood (see also here).

From a Bayesian point of view, p-values are most cleanly interpreted in the context of uniform prior distributions—but the setting of uniform priors, where there’s nothing special about zero, is the scenario where p-values are generally irrelevant.

So I don’t have much use for p-values. They still get used in practice—a lot—so there’s room for lots more articles explaining them to users, but I’m kinda tired of the topic.

**2.** I disagree with Krueger’s statement that “statistical inference will always require experience and good judgment.” For better or worse, lots of statistical inference is done using default methods by people with poor judgment and little if any relevant experience. Too bad, maybe, but that’s how it is.

Does statistical inference require experience and good judgment? No more than driving a car requires experience and good judgment. All you need is gas in the tank and the key in the ignition and you’re ready to go. The roads have all been paved and anyone can drive on them.

**3.** In their article, Krueger and Heck write, “Finding p = 0.055 after having found p = 0.045 does not mean that a bold substantive claim has been refuted (Gelman and Stern, 2006).” Actually, our point was much bigger than that. Everybody knows that 0.05 is arbitrary and there’s no real difference between 0.045 and 0.055. Our point was that apparent huge differences in p-values are not actually stable (“statistically significant”). For example, a p-value of 0.20 is considered to be useless (indeed, it’s often taken, erroneously, as evidence of no effect), and a p-value of 0.01 is considered to be strong evidence. But a p-value of 0.20 corresponds to a z-score of 1.28, and a p-value of 0.01 corresponds to a z-score of 2.58. The difference is 1.3, which is not close to statistically significant. (The difference between two independent estimates, each with standard error 1, has a standard error of sqrt(2); thus a difference in z-scores of 1.3 is actually less than 1 standard error away from zero!) So I fear that, by comparing 0.055 to 0.045, they are minimizing the main point of our paper.
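That arithmetic is easy to verify in a few lines, using the normal quantile function:

```python
from statistics import NormalDist

def z_from_p(p):
    """Convert a two-sided p-value to the corresponding |z| score."""
    return NormalDist().inv_cdf(1 - p / 2)

z_weak, z_strong = z_from_p(0.20), z_from_p(0.01)   # 1.28 and 2.58
diff = z_strong - z_weak                            # about 1.30
se_diff = 2 ** 0.5          # standard error of a difference of two
                            # independent z-scores, each with SE 1
print(round(diff / se_diff, 2))   # about 0.92: less than one SE from zero
```

So the gap between a “useless” p = 0.20 and “strong” p = 0.01 is itself nowhere near statistically significant.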

More generally I think that all the positive aspects of the p-value they discuss in their paper would be even more positive if researchers were to use the z-score and not ever bother with the misleading transformation into the so-called p-value. I’d much rather see people reporting z-scores of 1.5 or 2 or 2.5 than reporting p-values of 0.13, 0.05, and 0.01.

The post Kaiser Fung’s data analysis bootcamp appeared first on Statistical Modeling, Causal Inference, and Social Science.

I asked Kaiser if he had anything else he wanted to share, and he wrote:

I think our major differentiation from other bootcamps out there includes:

a. There are lots of jobs in these other business units outside engineering and software development. Hiring managers in marketing, operations, servicing, etc. are looking for the ability to interpret and reason with data, and use data to solve business problems. Our broad-based curriculum caters to this need.

b. I don’t believe that coding is the end-all of data science. Coding schools teach people how to code but knowing what to code is more important. Therefore, our curriculum covers R, Python, and machine learning but also statistical reasoning, survey design, Excel, intro to marketing, intro to finance, etc.

c. We provide quality through small class size, in-person instruction and instructors who are industry practitioners. The average instructor has 10 years of industry experience, and is in a director or higher level position. These instructors know what hiring managers want since they are hiring managers themselves.

d. We are building a diverse class. We take social scientists, designers as well as STEM people. We just require some exposure to programming concepts and data analyses, and a good college degree.

The post Statistical Challenges of Survey Sampling and Big Data (my remote talk in Bologna this Thurs, 15 June, 4:15pm) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, New York

Big Data need Big Model. Big Data are typically convenience samples, not random samples; observational comparisons, not controlled experiments; available data, not measurements designed for a particular study. As a result, it is necessary to adjust to extrapolate from sample to population, to match treatment to control group, and to generalize from observations to underlying constructs of interest. Big Data + Big Model = expensive computation, especially given that we do not know the best model ahead of time and thus must typically fit many models to understand what can be learned from any given dataset. We discuss Bayesian methods for constructing, fitting, checking, and improving such models.

It’ll be at the 5th Italian Conference on Survey Methodology, at the Department of Statistical Sciences of the University of Bologna. A low-carbon remote talk.

The post Criminology corner: Type M error might explain Weisburd’s Paradox appeared first on Statistical Modeling, Causal Inference, and Social Science.

Torbjørn Skardhamar, Mikko Aaltonen, and I wrote this article to appear in the Journal of Quantitative Criminology:

Simple calculations seem to show that larger studies should have higher statistical power, but empirical meta-analyses of published work in criminology have found zero or weak correlations between sample size and estimated statistical power. This is “Weisburd’s paradox” and has been attributed by Weisburd, Petrosino, and Mason (1993) to a difficulty in maintaining quality control as studies get larger, and attributed by Nelson, Wooditch, and Dario (2014) to a negative correlation between sample sizes and the underlying sizes of the effects being measured. We argue against the necessity of both these explanations, instead suggesting that the apparent Weisburd paradox might be explainable as an artifact of systematic overestimation inherent in post-hoc power calculations, a bias that is large with small N. Speaking more generally, we recommend abandoning the use of statistical power as a measure of the strength of a study, because implicit in the definition of power is the bad idea of statistical significance as a research goal.

I’d never heard of Weisburd’s paradox before writing this article. What happened is that the journal editors contacted me suggesting the topic, I then read some of the literature and wrote my article, then some other journal editors didn’t think it was clear enough so we found a couple of criminologists to coauthor the paper and add some context, eventually producing the final version linked here. I hope it’s helpful to researchers in that field and more generally. I expect that similar patterns hold with published data in other social science fields and in medical research too.
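The overestimation mechanism can be seen in a toy simulation (illustrative numbers only, not the paper’s actual analysis): when the true effect is small, the studies that happen to reach significance report inflated effect estimates, so post-hoc power looks high at every sample size and correlates only weakly with N.

```python
import random
from statistics import NormalDist

norm = NormalDist()
random.seed(1)

def post_hoc_power(z_hat):
    """Power computed by plugging the *estimated* z-score back into the
    power formula, as post-hoc power calculations do."""
    return (1 - norm.cdf(1.96 - z_hat)) + norm.cdf(-1.96 - z_hat)

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

true_d = 0.1                        # assumed small true effect per observation
sizes, powers = [], []
for _ in range(5000):
    n = random.randint(20, 500)     # hypothetical spread of study sizes
    z_hat = true_d * n ** 0.5 + random.gauss(0, 1)
    if abs(z_hat) > 1.96:           # only "significant" results get published
        sizes.append(n)
        powers.append(post_hoc_power(abs(z_hat)))
# Among the published studies, estimated power is inflated at every N, so
# its correlation with sample size is far weaker than the true power curve.
print(round(pearson(sizes, powers), 2))
```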

The post PhD student fellowship opportunity! in Belgium! to work with us! on the multiverse and other projects on improving the reproducibility of psychological research!!! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Wolf Vanpaemel and Francis Tuerlinckx write:

We at the Quantitative Psychology and Individual Differences group, KU Leuven, Belgium, are looking for a PhD candidate. The goal of the PhD research is to develop and apply novel methodologies to increase the reproducibility of psychological science. More information can be found on the job website or by contacting us at wolf.vanpaemel@kuleuven.be or francis.tuerlinckx@kuleuven.be. The deadline for application is Monday June 26, 2017.

One of the themes a successful candidate may work on is the further development of the multiverse. I expect to be an active collaborator in this work.

So please apply to this one. We’d like to get the best possible person to be working on this exciting project.

The post Why I’m not participating in the Transparent Psi Project appeared first on Statistical Modeling, Causal Inference, and Social Science.

I would like to ask you to participate in the establishment of the expert consensus design of a large scale fully transparent replication of Bem’s (2011) ‘Feeling the future’ Experiment 1. Our initiative is called the ‘Transparent Psi Project’. [https://osf.io/jk2zf/wiki/home/] Our aim is to develop a consensus design that is mutually acceptable for both psi proponent and mainstream researchers, containing clear criteria for credibility.

I replied:

Thanks for the invitation. I am not so interested in this project because I think that all the preregistration in the world won’t solve the problem of small effect sizes and poor measurements. It is my impression from Bem’s work and others that the field of psi is plagued by noisy measurements and poorly specified theories. Sure, preregistration etc. would stop many of the problems–in particular, there’s no way that Bem would’ve seen 9 out of 9 statistically significant p-values, or whatever that was. But I can’t in good conscience recommend the spending of effort in this way. I think any serious work in this area would have to go beyond the phenomenological approach and perform more direct measurements, as for example here: http://marginalrevolution.com/marginalrevolution/2014/11/telepathy-over-the-internet.html . I’ve not actually read the paper linked there so this may be a bad example but the point is that one could possibly study such things scientifically with a physical model of the process. To just keep taking Bem-style measurements, though, I think that’s hopeless: it’s the kangaroo problem (http://andrewgelman.com/2015/04/21/feather-bathroom-scale-kangaroo/). Better to preregister than not, but better still not to waste time on this or similarly-hopeless problems (studying sex ratios in samples of size 3000, estimating correlations of monthly cycle on political attitudes using between-person comparisons, power pose, etc.). I recognize that some of these ideas, ESP included, had some legitimate a priori plausibility, but, at this point, a Bem-style experiment seems like a shot in the dark. And, of course, even with preregistration, there’s a 5% chance you’ll see something statistically significant just by chance, leading to further confusion. In summary, preregistration and consensus helps with the incentives, but all the incentives in the world are no substitute for good measurements. 
(See the discussion of “in many cases we are loath to recommend pre-registered replication” here: http://andrewgelman.com/2017/02/11/measurement-error-replication-crisis/).

Kekecs wrote back:

Thank you for your feedback. We fully realize the problem posed by small effect size. However, this problem in itself can be solved simply by throwing a larger sample at it. In fact based on our simulations we plan to collect 14,000-60,000 data points (700 – 3,000 participants) using Bayesian analysis and optional stopping, aiming to reach a Bayes factor threshold of 60 or 1/60. Our simulations show that using these parameters we only have a p = 0.0004 false positive chance, so it is highly unlikely that we would accidentally generate more confusion on the field just by conducting the replication. On the contrary, by doing our study, we will effectively more than double the amount of total data accumulated so far by Bem’s and others’ studies using this paradigm, which should help with clarity on the field by introducing good quality, credible data.

You might be right though that the measurement itself is faulty, and that we cannot expect precognition to work in an environmentally invalid situation like this. But in reality, we don’t have any information on how precognition should work if it really does exist, so I am not sure what would be a better way of measuring it than seeing how effective people are at predicting future events.

Our main goal here is not really to see whether precognition exists or not. The ultimate aim of our efforts is to do a proof of concept study where we will see whether it is possible to come to a consensus on criterion of acceptability and credibility in a field this divided, and to come up with ways in which we can negate all possibilities of questionable research practice. This approach can then be transferred to other fields as well.
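As an aside, the kind of false-positive rate Kekecs quotes can be checked by simulation. The sketch below is a toy version, not their registered design: it assumes a plain coin-guessing task, a uniform prior on the hit rate under the alternative, a Bayes factor check every 100 trials up to 14,000 (the low end of their planned sample), and only 200 simulated experiments.

```python
import math, random

def log_bf10(k, n):
    """Log Bayes factor for k hits in n binary guesses: H1 puts a uniform
    prior on the hit rate (marginal = Beta(k+1, n-k+1)), H0 fixes it at 0.5."""
    return (math.lgamma(k + 1) + math.lgamma(n - k + 1)
            - math.lgamma(n + 2) + n * math.log(2))

random.seed(0)
threshold = math.log(60)            # stop and declare evidence if BF10 > 60
runs, false_positives = 200, 0      # toy number of simulated experiments
for _ in range(runs):
    hits = 0
    for n in range(1, 14001):
        hits += random.random() < 0.5   # null world: pure 50/50 guessing
        if n % 100 == 0 and log_bf10(hits, n) > threshold:
            false_positives += 1
            break
# Optional stopping on a Bayes factor keeps the null false-positive rate
# below 1/60 no matter how often you peek, so this fraction stays small.
print(false_positives / runs)
```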

I then responded:

I still think it’s hopeless. The problem (which I’ll say using generic units as I’m not familiar with the ESP experiment) is: suppose you have a huge sample size and can detect an effect of 0.003 (on some scale) with standard error 0.001. Statistically significant, preregistered, the whole deal. Fine. But then you could very well see an effect of -0.002 with different people, in a different setting. And -0.003 somewhere else. And 0.001 somewhere else. Etc. You’re talking about effects that are indistinguishable given various sources of leakage in the experiment.

I support your general goal but I recommend you choose a more promising topic than ESP or power pose or various other topics that get talked about so much.

Kekecs replied:

We are already committed to follow through with this particular setting. But I agree with you that our approach can be easily transferred to the research of other effects and we fully intend to do that.

If you put it that way, your question is all about construct validity. Whether we can detect the effect that we really want to detect, or are there other confounds that bias the measurement. In this particular experimental setting which is simple as stone (basically people are guessing about the outcomes of future coin flips) the types of bias that we can expect are more related to questionable research practices (QRPs) than anything else. The only way other types of bias, such as personal differences in ability (sampling bias), participant expectancy, and demand characteristics, etc., can have an effect is if there is truly an anomalous effect. For example if we detected an effect of 0.003 with 0.001 SE only because we accidentally sampled people with high psi abilities, our conclusion that there is a psi effect would still be true (although our effect size estimate would be slightly off).

That is why in this project we are focusing mainly on negating all possibilities of QRPs and full transparency. I am not sure what other types of leakage we can have in this particular experiment if we addressed all possible QRPs. Would you care to elaborate?

I responded:

Just in answer to that last question: I’m not sure what other types of leakage might exist—it’s my impression that Bem’s experiments had various problems, so I guess it depends how exact a replication you’re talking about. My real point, though, is if we think ESP exists at all, then an effect that’s +0.003 on Monday and -0.002 on Tuesday and +0.001 on Wednesday probably isn’t so interesting. This becomes clearer if we move the domain away from possible null phenomena such as ESP or homeopathy, to things like social priming, which presumably has some effect, but which varies so much by person and by context to be generally unpredictable and indistinguishable from noise. I don’t think ESP is such a good model for psychology research because it’s one of the few things people study that really could be zero.

And then Kekecs closed out the discussion:

In response, I find doing this effort on the field of ESP interesting exactly because the effect could potentially be zero. Positive findings have an overwhelming dominance in both psi literature, and social sciences literature in general. In the case of most other social science research, it is a theoretical possibility (but unrealistic) that researchers just get lucky all the time and they always ask the right questions, that is why they are so effective in finding positive effects. Again, this obviously cannot be true for the entirety of the literature, but for each topic studied individually, it can be quite probable that there is an effect, if ever so small, which blurs the picture about publication bias and other types of bias in the literature. However, it may be that there is no ESP effect at all. In that case, we would have a field where the effect of bias in research can be studied in its purest form.

From another perspective, precognition in particular is a perfect research topic exactly because these designs by their nature are very well protected from the usual threats to internal validity, at least in the positive direction. It is hard to see what could make a person perform better at predicting the outcome of a state of the art random number generator if there is no psi effect. Bias can always be introduced by different questionable research practices (QRPs), but if we are able to design a study completely immune to QRPs, there is no real possibility of bias toward type I error. Of course, if the effect really exists, all the usual threats to validity can have an influence (for example, it is possible that people can get “psi fatigue” if they perform a lot of trials, or that events and contextual features, or even expectancy can have an effect on performance), but we cannot make a type I error in that case, because the effect exists; we can only make errors in estimating the size of the effect, or a type II error.

So understanding what is underlying the dominance of positive effects in ESP research is very important. If there is no effect, psi literature can serve as a case study for bias in its purest form, which can help us understand it in other research fields. On the other hand, if we find an effect when all QRPs are controlled for, we may need to really rethink our current paradigm.

I continue to think that the study of ESP is irrelevant for psychology, both for substantive reasons—there is no serious underlying theory or clear evidence for ESP, it’s all just hope and intuition—and for methodological reasons, in that zero is a real possibility. In contrast, even silly topics such as power pose and embodied cognition seem to me to have some relevance to psychology and also involve the real challenge that there are no zeroes. Standing in an unusual position for two minutes will have *some* effect on your thinking and behavior; the debate is what are the *consistent* effects, if any. That’s my take, anyway; but I wanted to share Kekecs’s view too, given all the effort he’s putting into this project.

The post Financial anomalies are contingent on being unknown appeared first on Statistical Modeling, Causal Inference, and Social Science.

In retrospect, the anomalies literature is a prime target for p-hacking. First, for decades, the literature is purely empirical in nature, with little theoretical guidance. Second, with trillions of dollars invested in anomalies-based strategies in the U.S. market alone, the financial interest is overwhelming. Third, more significant results make a bigger splash, and are more likely to lead to publications as well as promotion, tenure, and prestige in academia. As a result, armies of academics and practitioners engage in searching for anomalies, and the anomalies literature is most likely one of the biggest areas in finance and accounting. Finally, as we explain later, empiricists have much flexibility in sample criteria, variable definitions, and empirical methods, which are all tools of p-hacking in chasing statistical significance.

Falk writes:

A weakness in this study is that the use of a common data period obscures the fact that financial anomalies are contingent on being unknown: known (true) anomalies will be arbitraged away so that they no longer exist. Their methodology continues to estimate many of these anomalies after the results of the studies were public knowledge and heavily scrutinized. This should attenuate the results. (It would be interesting to see if the results weakened the earlier the study was published. On a low-hanging fruit theory, it should be just the opposite.) It’s as if Power Pose worked until Amy Cuddy wrote about it, at which point everyone wised up and the effect went away. Effects like that are really hard to replicate.

Falk’s comment, about financial anomalies being contingent on being unknown, reminds me of something: In finance (so I’m told), when someone has a great idea, they keep it secret and try to milk all the advantage out of it that they can. This also happens in some fields of science: we’ve all heard of bio labs that refuse to share their data or their experimental techniques because they want to squeeze out a couple more papers in Nature and Cell. Given all the government funding involved, that’s not cool, but it’s how it goes. But in statistics, when we think we have a good idea, we put it out there for free, we scream about it and get angry that other people aren’t using our wonderful methods and our amazing free software. Funny, that.

**P.S.** For an image, I went and googled *cat anomaly*. I recommend you don’t do that. The pictures were really disturbing to me.

The post UK election summary appeared first on Statistical Modeling, Causal Inference, and Social Science.

The Conservative party got 42% of the vote, Labour got 40% of the vote, and all the other parties received 18% between them. The Conservatives ended up with 51.5% of the two-party vote, just a bit less than Hillary Clinton’s share last November.

In the previous U.K. general election, two years ago, Conservative beat Labour, 37%-30%, that’s 55% of the two-party vote.

The time before, the Conservatives received 36%, compared to Labour’s 29%. The Conservatives again had received 55% of the two-party vote.
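These two-party shares are just each party’s vote divided by the Conservative-plus-Labour total. For instance, using the rounded figures above plus the more precise 2017 Great Britain numbers quoted later in the post:

```python
def two_party_share(con, lab):
    """Conservative share of the combined Con + Lab vote, in percent."""
    return round(100 * con / (con + lab), 1)

print(two_party_share(43.5, 41.0))   # 2017 (GB figures): 51.5
print(two_party_share(37, 30))       # 2015: 55.2
print(two_party_share(36, 29))       # 2010: 55.4
```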

As with the Clinton-Trump presidential election and the “Brexit” election in the U.K. last year, the estimates from the polls turned out to give pretty good forecasts.

The predictions were not perfect—the 318-262 split in parliament was not quite the 302-269 that was predicted, and the estimated 42-38 vote split didn’t quite predict the 43.5-41.0 split that actually happened (those latter figures, for Great Britain only, come from the Yougov post-election summary). And the accuracy of the seat forecast has to be attributed in part to luck, given the wide predictive uncertainty bounds (based on pre-election polls, the Conservatives were forecast to win between 269 and 334 seats). The predictions were done using Mister P and Stan.

The Brexit and Clinton-Trump poll forecasts looked bad at the time because they got the outcome wrong, but as forecasts of public opinion they were solid, only off by a percentage point or two in each case. In general we’d expect polls to do better in two-party races or, more generally, in elections with two clear options, because then there are fewer reasons for prospective voters to change their opinions. In most parts of the U.K., this 2017 election was a two-party affair, hence it should be no surprise that the final polls were accurate (after suitable adjustment for nonresponse), even if, again, there was some luck that they were as close as shown in these graphs by Jack Blumenau:

**P.S.** I like Yougov and some of our research is supported by Yougov, but I’m kinda baffled cos when I googled I found this page by Anthony Wells, which estimates 42% for the Conservatives, 35% for Labour, and a prediction of “an increased Conservative majority in the Commons,” which seems to contradict their page that I linked to above, with that prediction of a hung parliament. That’s the forecast I take seriously because it used MRP, but then it makes me wonder why their “Final call” was different. Once you have a model and a series of polls, why throw all that away when making your final call?

The post The (Lance) Armstrong Principle appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post “Bombshell” statistical evidence for research misconduct, and what to do about it? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Here’s Brown:

[Carlisle] claims that he has found statistical evidence that a surprisingly high proportion of randomised controlled trials (RCTs) contain data patterns that cannot have arisen by chance. . . . the implication is that some percentage of these impossible numbers are the result of fraud. . . .

I thought I’d spend some time trying to understand exactly what Carlisle did. This post is a summary of what I’ve found out so far. I offer it in the hope that it may help some people to develop their own understanding of this interesting forensic technique, and perhaps as part of the ongoing debate about the limitations of such “post publication analysis” techniques . . .

I agree with Brown that these things are worth studying. The funny thing is, it’s hard for me to get excited about this particular story, even though Brown, who I respect, calls it a “bombshell” that he anticipates will “have quite an impact.”

There are two reasons this new paper doesn’t excite me.

1. Dog bites man. By now, we know there’s lots of research misconduct in published papers. I use “misconduct” rather than “fraud” because, from the user’s perspective, I don’t really care so much whether Brian Wansink, for example, was fabricating data tables, or had students make up raw data, or was counting his carrots in three different ways, or was incompetent in data management, or was actually trying his best all along and just didn’t realize that it can be detrimental to scientific progress to be fast and loose with your data. Or some combination of all of these. Clarke’s Law.

Anyway, the point is, it’s no longer news when someone goes into a literature of p-value-based papers in a field with noisy data, and finds that people have been “manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.” At this point, it’s to be expected.

2. As Stalin may have said, “When one man dies it’s a tragedy. When thousands die it’s statistics.” Similarly, the story of Satoshi Kanazawa or Brian Wansink or Daryl Bem has human interest. And even the stories without direct human interest have some sociological interest, one might say. For example, I can’t even remember who wrote the himmicanes paper or the ages-ending-in-9 paper, but in each case I’m still interested in the interplay between the plausible-but-oh-so-flexible theory, the weak data analysis, the poor performance of the science establishment, and the media hype. This new paper by Carlisle, though, is so general that it’s hard to grab onto the specifics of any single paper or set of papers. Also, for me, medical research is less interesting than social science.

Finally, I want to briefly discuss the current and future reactions to this study. I did a quick google and found it was covered on Retraction Watch, where Ivan Oransky quotes Andrew Klein, editor of Anaesthesia, as saying:

No doubt some of the data issues identified will be due to simple errors that can be easily corrected such as typos or decimal points in the wrong place, or incorrect descriptions of statistical analyses. It is important to clarify and correct these in the first instance. Other data issues will be more complex and will require close inspection/re-analysis of the original data.

This is all fine, and, sure, simple typos should just be corrected. But . . . if a paper has real mistakes I think the entire paper should be flagged as suspect. If the authors have so little control over their data and methods, then we may have no good reason to believe their claims about what their data and methods imply about the external world.

One of the frustrating things about the Richard Tol saga was that we became aware of more and more errors in his published article, but the journal never retracted it. Or, to take a more mild case, Cuddy, Norton, and Fiske published a paper with a bunch of errors. Fiske assures us that correction of the errors doesn’t change the paper’s substantive conclusions, and maybe that’s true and maybe it’s not. But . . . why should we believe her? On what evidence should we believe the claims of a paper where the data are mishandled?

To put it another way, I think it’s unfortunate that retractions and corrections are considered to be such a big deal. If a paper has errors in its representation of data or research procedures, that should be enough for the journal to want to put a big fat WARNING on it. That’s fine, it’s not so horrible. I’ve published mistakes too. Publishing mistakes doesn’t mean you have to be a bad person, nobody’s perfect.

So, if Anaesthesia and other journals want to correct incorrect descriptions of statistical analyses, numbers that don’t add up, etc., that’s fine. But I hope that when making these corrections—and when identifying suspicious patterns in reported data—they also put some watermark on the article so that future readers will know to be suspicious. Maybe something like this:

The authors of the present paper were not careful with their data. Their main claims were supported by t statistics reported as 5.03 and 11.14, but the actual values were 1.8 and 3.3.

Or whatever. The burden of proof should not be on the people who discovered the error to demonstrate that it’s consequential. Rather, the revelation of the error provides information about the quality of the data collection and analysis. And, again, I say this as a person who’s published erroneous claims myself.

The post “Bombshell” statistical evidence for research misconduct, and what to do about it? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Workshop on reproducibility in machine learning appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>My colleagues and I are organizing a workshop on reproducibility and replication for the International Conference on Machine Learning (ICML). I’ve read some of your blog posts on the replication crisis in the social sciences and it seems like this workshop might be something that you’d be interested in.

We have three main goals in holding a workshop on reproducing and replicating results:

1. Provide a venue in Machine Learning for publishing replications, both successful and unsuccessful. This helps to give credit and visibility to researchers who work on replicating results as well as researchers whose results are replicated.

2. A place to share new ideas about software and tools for making research more reproducible.

3. A forum for discussing how reproducing research and replication affects different parts of the machine learning community. For example, what does it mean to reproduce the results of a recommendation engine that interacts with live humans?

I agree that this is a super-important topic because the fields of statistical methodology and machine learning are full of hype. Lots of algorithms work on the test examples but then fail on new problems. This happens even with my own papers!

The post Workshop on reproducibility in machine learning appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post No conf intervals? No problem (if you got replication). appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>What I’m saying is, use the secret weapon.

The post No conf intervals? No problem (if you got replication). appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and media exposure appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>But what I want to say here is that even serious research is subject to exaggeration and distortion, partly through the public relations machine and partly because of basic statistics. The push to find and publicize so-called statistically significant results leads to overestimation of effect sizes (type M errors), and crude default statistical models lead to broad claims of general effects based on data obtained from poor measurements and nonrepresentative samples.

One example we’ve discussed a lot is that claim of the effectiveness of early childhood intervention, based on a small-sample study from Jamaica. This study is *not* “junk science.” It’s a serious research project with real-world implications. But the results still got exaggerated. My point here is not to pick on those researchers. No, it’s the opposite: *even top researchers exaggerate in this way* so we should be concerned in general.

What to do here? I think we need to proceed on three tracks:

1. Think more carefully about data collection when designing these studies. Traditionally, design is all about sample size, not enough about measurement.

2. In the analysis, use Bayesian inference and multilevel modeling to partially pool estimated effect sizes, thus giving more stable and reasonable output.

3. When looking at the published literature, use some sort of Edlin factor to interpret the claims being made based on biased analyses.

The above remarks are general, but they were inspired by yesterday’s discussion about the design and analysis of psychology experiments, as I think there’s some misunderstanding in which people don’t see where assumptions are coming into various statistical analyses (see for example this comment).

A big big problem here, I think, is that many people seem to have the impression that, if you have a randomized experiment (or its quasi-randomized equivalent), then comparisons in your data can be given general interpretation in the outside world, with the only concern being “statistical significance.” But that view is not correct. You can have a completely clean randomized experiment, but if your measurements are not good enough, you can’t make general claims at all. Indeed, standard methods yield overestimates of effect sizes.
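A toy simulation makes that last point concrete (every number below is invented for illustration): if the true effect is small relative to the measurement noise, then the estimates that happen to reach statistical significance must be large overestimates, even though the experiment is perfectly randomized and the estimator is unbiased.

```python
import numpy as np

# Invented numbers: true effect 0.2, standard error 0.5 (noisy measurements).
rng = np.random.default_rng(0)
true_effect, se, n_sims = 0.2, 0.5, 100_000

est = rng.normal(true_effect, se, n_sims)   # unbiased estimates of the effect
significant = np.abs(est / se) > 1.96       # the replications reaching p < 0.05

# Conditional on significance, the estimate greatly exaggerates the true effect
# (a type M error).
exaggeration = np.abs(est[significant]).mean() / true_effect
print(f"exaggeration factor among significant results: {exaggeration:.1f}")
```

Only a small fraction of replications come out significant here, and those that do overstate the effect several-fold; no amount of randomization fixes this, only better measurement or more data.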

And, again, this is not just a problem with junk science. Naive overinterpretation of results from randomized comparisons: this is a problem with lots of serious work too in the human sciences.

The post The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and media exposure appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post How has my advice to psychology researchers changed since 2013? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>– Analyze all your data.

– Present all your comparisons.

– Make your data public.

And, for journal editors, I wrote, “if a paper is nothing special, you don’t have to publish it in your flagship journal.”

**Changes since then?**

The above advice is fine, but it’s missing something—a big something—regarding design and data collection. So let me add two more tips, arguably the most important pieces of advice of all:

– Put in the effort to take accurate measurements. Without accurate measurements (low bias, low variance, and a large enough sample size), you’re drawing dead. That’s what happened with the beauty-and-sex-ratio study, the ovulation-and-clothing study, the ovulation-and-voting study, the fat arms study, etc etc. All the analysis and shared data and preregistration in the world won’t save you if your data don’t closely address the questions you’re trying to answer.

– Do within-person comparisons where possible; that is, cross-over designs. Don’t worry about poisoning the well; that’s the least of your worries. Generally it’s much more important to get those direct comparisons.

The post How has my advice to psychology researchers changed since 2013? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post A quote from William James that could’ve come from Robert Benchley or S. J. Perelman or Dorothy Parker appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Is life worth living? It all depends on the liver.

The post A quote from William James that could’ve come from Robert Benchley or S. J. Perelman or Dorothy Parker appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Using external C++ functions with PyStan & radial velocity exoplanets appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I [Mackey] demonstrate how to use a custom C++ function in a Stan model using the Python interface PyStan. This was previously only possible using the R interface RStan (see an example here) so I hacked PyStan to make this possible in Python as well. . . .

I have some existing C++ code that implements my model and its derivatives so I don’t want to have to re-write the model in the Stan language. Furthermore, part of the model requires solving a transcendental equation numerically; it’s not obvious that applying autodiff to an iterative solver is a great idea, but the analytic gradients are trivial to evaluate.

The example that we’ll use is fitting radial velocity observations of an exoplanet. In particular, we’ll fit recent observations of 51 Peg b, the first exoplanet discovered around a main sequence star.

An exoplanet! Cool.

Mackey continues with tons of details. Great stuff.
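One concrete note on the transcendental-equation point: in radial-velocity models the equation in question is Kepler’s equation, M = E − e·sin(E), and the derivative of its solution is analytic, dE/dM = 1/(1 − e·cos(E)), which is why hand-written gradients can beat running autodiff through the iterative solver. Here is a minimal Newton-iteration sketch (this is not Mackey’s actual C++ code, just an illustration of the idea):

```python
import math

def solve_kepler(M, e, tol=1e-12, max_iter=50):
    """Solve Kepler's equation M = E - e*sin(E) for E by Newton's method."""
    E = M if e < 0.8 else math.pi   # a common starting guess
    for _ in range(max_iter):
        step = (E - e * math.sin(E) - M) / (1.0 - e * math.cos(E))
        E -= step
        if abs(step) < tol:
            break
    return E

E = solve_kepler(M=0.5, e=0.3)
# The gradient needed for HMC is available in closed form, no autodiff through
# the iteration required:
dE_dM = 1.0 / (1.0 - 0.3 * math.cos(E))
```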

I have some issues with his Stan model; in particular he uses priors with hard constraints which I think is generally a bad idea. For example, he has a parameter with a uniform (-10, 5) prior. I can’t imagine this is the right thing to do. From basic recommendations it would be better to do something like normal (-2.5, 7.5), but really I have a feeling he could do a lot more regularization here. The uniform prior might work in the particular example that he was using, but in general it would be safer to control the inference a bit more. Mackey’s got a bunch of these uniform priors in his code and I think he should look carefully at all of them.
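To make the arithmetic behind that recommendation concrete: normal(-2.5, 7.5) centers the prior at the midpoint of the old hard interval (-10, 5), with sd equal to half its width, so about 68% of the prior mass falls inside the old bounds while still allowing values outside them. A quick check:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

lo, hi = -10.0, 5.0                     # the hard uniform bounds
mu, sd = (lo + hi) / 2, (hi - lo) / 2   # midpoint -2.5 and half-width 7.5
mass_inside = Phi((hi - mu) / sd) - Phi((lo - mu) / sd)
print(round(mass_inside, 3))  # 0.683
```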

The real point of Mackey’s post, though, is that he’s hacking Stan to solve his problem. And that’s great.

The post Using external C++ functions with PyStan & radial velocity exoplanets appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post A collection of quotes from William James that all could’ve come from . . . Bill James! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Faith means belief in something concerning which doubt is theoretically possible.

A chain is no stronger than its weakest link, and life is after all a chain.

A great many people think they are thinking when they are merely rearranging their prejudices.

A man has as many social selves as there are individuals who recognize him.

Acceptance of what has happened is the first step to overcoming the consequences of any misfortune.

An act has no ethical quality whatever unless it be chosen out of several all equally possible.

Be willing to have it so. Acceptance of what has happened is the first step to overcoming the consequences of any misfortune.

Belief creates the actual fact.

Compared to what we ought to be, we are half awake.

Do something everyday for no other reason than you would rather not do it, so that when the hour of dire need draws nigh, it may find you not unnerved and untrained to stand the test.

Great emergencies and crises show us how much greater our vital resources are than we had supposed.

Human beings can alter their lives by altering their attitudes of mind.

If any organism fails to fulfill its potentialities, it becomes sick.

If you want a quality, act as if you already had it.

In the dim background of mind we know what we ought to be doing but somehow we cannot start.

Individuality is founded in feeling; and the recesses of feeling, the darker, blinder strata of character, are the only places in the world in which we catch real fact in the making, and directly perceive how events happen, and how work is actually done.

It is only by risking our persons from one hour to another that we live at all. And often enough our faith beforehand in an uncertified result is the only thing that makes the result come true.

It is our attitude at the beginning of a difficult task which, more than anything else, will affect its successful outcome.

It is wrong always, everywhere, and for everyone, to believe anything upon insufficient evidence.

No matter how full a reservoir of maxims one may possess, and no matter how good one’s sentiments may be, if one has not taken advantage of every concrete opportunity to act, one’s character may remain entirely unaffected for the better.

Nothing is so fatiguing as the eternal hanging on of an uncompleted task.

Some are more Bill-James-like than others, but, as a whole, this list is kind of amazing.

The post A collection of quotes from William James that all could’ve come from . . . Bill James! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Hey—here are some tips on communicating data and statistics! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Here’s the rough course plan. I’ll tinker with it between now and September but this is the basic idea. (The course listing is here, but that online description is out of date; the course plan linked above is more accurate.)

Here are the topics for the 13 weeks of the course:

1. Introducing yourself and telling a story

2. Principles of statistical graphics

3. Teaching

4. Making effective graphs

5. Communicating variation and uncertainty

6. Displaying fitted models

7. Giving a presentation

8. Dynamic graphics

9. Writing

10. Collaboration and the scientific community

11. Data processing and programming

12. Student projects

13. Student projects

Communication is central to your job as a quantitative researcher. Our goal in this course is for you to improve at all aspects of statistical communication, including writing, public speaking, teaching, informal conversation and collaboration, programming, and graphics.

Always keep in mind your *goals* and your *audience*.

Never forget, as one of our blog commenters reminds us: your closest collaborator is you six months ago . . . and she doesn’t reply to email!

The course is intended for Ph.D. students from all departments on campus; it is also open to some masters and undergraduate students who have particular interest in the topic.

This is my favorite course ever. As a student, you’ll get practice in all sorts of useful skills that are central to data and statistics, and you’ll participate in fast-moving conversations with fellow students with different backgrounds and experiences. In the interstices, you’ll learn all sorts of important ideas and methods in statistical design and analysis that you’d never learn anywhere else. It’s the course where we first introduced statistics diaries. It’s the course where (a prototype version of) ShinyStan was one of the final projects!

You don’t want to miss this one.

The class will meet twice a week.

The post Hey—here are some tips on communicating data and statistics! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post You’ll never guess this one quick trick to diagnose problems with your graphs and then make improvements appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Here’s the story. This post from several years ago shows a confusing and misleading pair of pie charts from a Kenyan election:

The quick reaction would be to say, ha ha, pie charts. But that’s not my point here. Sure, pie charts have problems and I think they’re almost never the right way to share data. But sometimes they’re not so bad.

To see the problem with the above display, we have to go back to first principles. And, with graphs, the first principle is always to consider what comparisons you would like to display, and what comparisons the graph facilitates.

In this example, the main goal seems to be to compare the official results to the new exit poll. Thus, there are four comparisons to be made in parallel. A second goal is to compare the four categories within each dataset.

What about the pie charts? What comparisons do they easily allow?

The most salient point of any pie chart is that the percentages add up to 100%. So the first thing the graphs do is allow each of the proportions to be visually compared to the reference points of 100%, 50%, and 0%, and also 25% and 75%. We can see pretty clearly that in each dataset there are no candidates who received more than 50%, and there are two candidates who received between 25% and 50%.

But the pair of pie charts do not make it at all easy to compare each candidate from official results to the new poll. Part of that is perverse design choices, as the locations of the four “slices” have been permuted from one pie to another. But even if that had not been done—even if the four categories had been kept in pretty much the same position in each graph, and even if the coloring had been kept consistent—it would still be very difficult to compare angles of the different pies.

Basically the only way you can do it is to compare the numbers. And in that case a table would be clearer, as the relevant pairs could be written side by side. I’d prefer a dot plot.
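A minimal sketch of what I mean by a dot plot, with made-up percentages standing in for the actual Kenyan figures (which aren’t reproduced here):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt

# Made-up vote shares for illustration only.
candidates = ["Candidate A", "Candidate B", "Candidate C", "Candidate D"]
official = [46, 44, 6, 4]
exit_poll = [41, 47, 7, 5]

# One row per candidate puts the two numbers to be compared right next to
# each other, which is the key comparison the pie charts obscure.
fig, ax = plt.subplots()
ypos = range(len(candidates))
ax.plot(official, ypos, "o", label="Official results")
ax.plot(exit_poll, ypos, "s", label="Exit poll")
ax.set_yticks(list(ypos))
ax.set_yticklabels(candidates)
ax.set_xlabel("Vote share (%)")
ax.legend()
fig.savefig("dotplot.png")
```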

Why, then, the pie? Part of this must be sheer convenience: whoever made these plots happened to know to make pies. But I think it’s more than that. I think the real appeal of the pie is that it graphically shows the percentages adding up to 100. And, sure, that’s something, but in this case it doesn’t really help with what I think is the key comparison of interest.

I had the same problem with Florence Nightingale’s clock plot. People think that graph is the coolest thing in the world—the wheel of time and all that—but it doesn’t facilitate the time-series comparisons that are the real goal in that example.

So, again, the point is not simply Don’t do pie charts (although I would endorse that message).

Rather, the point is: When making graphs, think about the comparisons you want readers to be able to visualize, and then evaluate possible displays based on their performance in making such comparisons clear.

The post You’ll never guess this one quick trick to diagnose problems with your graphs and then make improvements appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post U.K. news article congratulates YouGov on using modern methods in polling inference appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Mike Betancourt pointed me to this news article by Alan Travis that is refreshingly positive regarding the use of sophisticated statistical methods in analyzing opinion polls.

Here’s Travis:

Leading pollsters have described YouGov’s “shock poll” predicting a hung parliament on 8 June as “brave” and the decision by the Times to splash it on its front page as “even braver”.

It is certainly rare for a polling company to produce a seats prediction. They usually leave that to psephologists and political scientists.

Good catch: YouGov’s chief scientist is Doug Rivers, who’s a political scientist. And the new method they’re using is multilevel regression and poststratification (MRP or Mister P), which was developed by . . . me! And I’m a political scientist too.
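For readers new to the method, the core of MRP is: estimate opinion within many small demographic and geographic cells using a multilevel model that partially pools the cell estimates, then weight those estimates by each cell’s known population share. A deliberately crude sketch with invented numbers (real MRP fits a full multilevel regression; the pseudo-count shrinkage below just stands in for the partial pooling):

```python
import numpy as np

# Invented poll data for three demographic cells.
yes = np.array([8, 30, 14])           # "yes" responses per cell
n = np.array([10, 60, 20])            # respondents per cell
share = np.array([0.5, 0.3, 0.2])     # cells' population shares (from the census)

overall = yes.sum() / n.sum()
k = 20                                # pseudo-count: pooling strength toward the mean
pooled = (yes + k * overall) / (n + k)   # small cells get shrunk the most

mrp_estimate = (pooled * share).sum()    # poststratify by population shares
print(round(mrp_estimate, 3))
```

The point of the two steps: the multilevel model stabilizes estimates in cells with few respondents, and the poststratification corrects for the sample not matching the population.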

Travis continues:

But it is even more unusual for a company to suddenly employ a new polling model 10 days before a British general election.

Good point! That really would be an unusual practice. But MRP is not so new—our first paper on the method (“Poststratification into many categories using hierarchical logistic regression,” with Tom Little) was published 20 years ago.

Travis then supplies some details of YouGov’s forecast:

The Tories are in line to lose 20 seats, giving them 310, and Labour is set to gain 30, giving it 257 . . . But as Stephan Shakespeare, YouGov’s chief executive, notes in an accompanying analysis, that is only a central projection that “allows for a wide range of error” and he concedes: “However, these are just the midpoints and, as with any model, there is leeway either side. The Tories could end up with as many as 345 and Labour as few as 223, leading to an increased Conservative majority.” . . . The Times says the projection means the Tories “could get as many as 345 seats on a good night . . . and as few as 274 on a bad night”. That is a pretty wide range.

It’s good to see Travis explaining that realistic forecasts have wide ranges of uncertainty.

He continues with some details:

The methodology involved is described as “multi-level regression and post-stratification” analysis and is based on a substitute for traditional constituency polling, which it regards as “prohibitively expensive”. Shakespeare claimed YouGov tested it during last year’s EU referendum campaign and it produced leave leads every time. What a shame YouGov did not feel like sharing it with voters while their own published referendum polls showed a remain lead right up to polling day.

Travis missed a beat on this one! YouGov *did* share their MRP estimates during the EU referendum campaign. Ummm . . . let me google that . . . here it is: “YouGov uses Mister P for Brexit poll”: it’s an article by Ben Lauderdale and Doug Rivers which includes this graph showing Leave in the lead:

OK, it’s good we got that settled.

Travis concludes:

In an industry already suffering from an existential crisis of public confidence it is indeed a “brave” decision to come up with this one 10 days before a general election and promise to publish its results several times more before polling day.

“Brave” is perhaps overstating things, but, yes, I agree with Travis that presenting MRP estimates is the honorable way to go. Using the best available methods based on the best available knowledge, and giving appropriately wide uncertainty intervals: that’s how to do it.

It’s indeed a shame that Travis was not aware that YouGov had released a report with MRP in the lead-up to the Brexit vote. Also Travis, coming from the U.K., might not have realized that MRP was used in U.S. polls too. For example there was this Florida poll from September, 2016, which was analyzed by four different groups:

Charles Franklin, who estimated Clinton +3

Patrick Ruffini, who estimated Clinton +1

New York Times, who estimated Clinton +1

Sam Corbett-Davies, David Rothschild, and me, who estimated Trump +1. We used MRP.

That said, I don’t have any crystal ball. Shortly before the election I was promoting a forecast that gave Clinton a 92% chance of winning the election. That forecast *didn’t* use MRP; it aggregated information from state polls without appropriately accounting for expected levels of systematic error. I should’ve used MRP myself—I would’ve, but I wasn’t working with the raw data. In this upcoming U.K. election, YouGov does have the raw polling data so of course they are using MRP; it would at this point be a strange choice not to.
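The systematic-error point is worth spelling out with a toy simulation (every number below invented): averaging many state polls shrinks independent sampling error toward zero, but a shared polling error hits every state alike and doesn’t average away, so ignoring it yields overconfident win probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
lead, n_states, n_sims = 2.0, 10, 50_000   # invented: 2-point lead in 10 state polls

# Each poll has 2 points of independent sampling error; averaging 10 of them
# cuts that error by sqrt(10).
sampling = rng.normal(lead, 2.0, size=(n_sims, n_states)).mean(axis=1)

# Now add a shared (systematic) polling error of 1.5 points common to all states.
systematic = rng.normal(0.0, 1.5, size=n_sims)
with_shared = sampling + systematic

p_naive = (sampling > 0).mean()       # independent errors only: near-certainty
p_honest = (with_shared > 0).mean()   # shared error acknowledged: much less sure
print(f"naive {p_naive:.3f}, with systematic error {p_honest:.3f}")
```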

Anyway, it’s cool that Alan Travis conveyed some subtleties of polling to the British newspaper audience, even if he did miss one or two things.

Full disclosure: YouGov gives some financial support to the Stan project.

**P.S.** Thanks to Steven Johnson for the photo at the top of this post.

**P.P.S.** More on YouGov’s use of MRP here.

The post U.K. news article congratulates YouGov on using modern methods in polling inference appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Another serious error in my published work! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Here’s the background. When I talk about my serious published errors, I talk about my false theorem, I talk about my empirical analysis that was invalidated by miscoded data, I talk about my election maps whose flaws were pointed out by an angry blogger, and I talk about my near-miss regarding Portugal.

OK, fine. But then I was going through old blog posts and I found a published error of mine that I’d completely forgotten about! It was a statement in one of my most influential papers—just a small part of the paper, it didn’t change our main results, but we really were wrong, and arguably our mistake misled people. I’m glad that later researchers were suspicious of our statement, checked it, and pointed out how we’d been wrong.

The post Another serious error in my published work! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post More graphs of mortality trends appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In late March you released a series of plots visualizing mortality rates over time by race and gender. For almost a year now, we’ve been working on a similar project and have compiled all of our findings into an R shiny web app here, with a preprint of our first manuscript here. I’m also attaching our preprint’s web appendix as it fails to load on bioRxiv at the moment. The web app takes about 30 seconds to load, but if you can stand the wait, you’ll be able to interact with the app to investigate changes to life expectancy over time by state, race, and cause of death. In the spirit of eliciting peer review, I would be happy to have you and your readers have a look at the app and/or the preprint.

In the spirit of providing feedback, I started clicking through and first came across this graph:

Uh oh! Something went terribly wrong here. I reloaded the page and got this:

I like certain aspects of the visual layout but to my eye there’s just too much going on with the colors, the dotted lines, the multiple columns, all happening at once. Also I don’t like how the top and bottom graphs blur together: visually it doesn’t work at all for me. I think they could have a much cleaner display here.

The next one I also think is just way too tricky:

I find myself doing lots of mental arithmetic trying to add and subtract and compare these different numbers.

In any case, I think Riddell and her colleagues are doing it exactly right: they’re putting their graphs out there and getting comments so they can do better. I certainly don’t think my own graphs are perfect and I’m glad to see various people taking their cracks at making these data accessible in different ways.

The post More graphs of mortality trends appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Hello, world! Stan, PyMC3, and Edward appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>No, I’m not going to take sides—I’m on a fact-finding mission. We (the Stan development team) have been trying to figure out whether we want to develop a more “pythonic” interface to graphical modeling in Stan. By the way, if anyone knows any guides to what people mean by “pythonic”, please let me know—I’m looking for something like Bloch’s *Effective Java* or Meyers’s *Effective C++*. Rather than reinventing the wheel, I’ve been trying to wrap my head around how PyMC3 and Edward work. This post just highlights some of my observations. Please use the comments to let me know if I’ve misunderstood or misrepresented something in PyMC3 or Edward; I really want to understand what they’re doing more clearly. If I’m way off base, I’ll update the main post.

**Bayesian linear regression**

I’ll use a simple linear regression in all three frameworks to give you an idea of what it looks like. That is, we’ll use the sampling distribution

y ~ normal(alpha + x * beta, sigma)

and the priors

alpha ~ normal(0, 10), beta ~ normal(0, 10), sigma ~ normal(0, 1), restricted to sigma > 0.

In Edward, I went with their example prior, which is a lognormal on the variance,

log(sigma^2) ~ normal(0, 1), again restricted to sigma > 0.

**Hello, world! Stan**

data {
  int N;
  vector[N] y;
  matrix[N, 2] x;
}
parameters {
  real alpha;
  vector[2] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma ~ normal(0, 1);
  y ~ normal(alpha + x * beta, sigma);
}

**Hello, world! Edward**

```python
import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal

X = tf.placeholder(tf.float32, [N, 2])
sigma = tf.sqrt(tf.exp(tf.Variable(tf.random_normal([]))))
beta = Normal(loc=tf.zeros(2), scale=tf.ones(2))
alpha = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, beta) + alpha, scale=sigma * tf.ones(N))
```

I’m not 100% sure this would work—I borrowed pieces of examples from their Supervised Learning (Regression) tutorial and their Linear Mixed Effects Models tutorial.

**Hello, world! PyMC3**

This is copied directly from the official Getting Started with PyMC3 tutorial:

```python
from pymc3 import Model, Normal, HalfNormal

basic_model = Model()

with basic_model:
    # Priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta = Normal('beta', mu=0, sd=10, shape=2)
    sigma = HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
```

It’s not quite standalone as is—there are free variables `Y`, `X1`, and `X2`. In the tutorial, these data variables are defined in the environment before the model is built. So they start their tutorial by running this simulation first:

```python
import numpy as np
import matplotlib.pyplot as plt

# Initialize random number generator
np.random.seed(123)

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variables
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2

# Simulate outcome variable
Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma
```

So you really need to consider the definitions of `Y, X1, X2` as part of the model specification. And keep in mind that `alpha, sigma, beta` here are not the same variables as defined in the model scope previously; they are masked in the `basic_model` environment by the definitions of those variables there.

**Model size**

Both the Edward and PyMC3 model definitions are substantially shorter than Stan’s. That’s largely because of Stan’s standalone static type definitions—the actual model density is line-for-line similar in all three interfaces.

What’s happening in both PyMC3 and Edward is that the distribution functions are defining the shapes and sizes of the variables. And Python itself is dynamically typed, so variables just get their types from what is assigned to them.

**Joint density model abstraction and data binding**

In both Stan and Edward, the program defining a model defines a joint log density that acts as a function from data sets to concrete posterior densities. In both Stan and Edward, the language distinguishes data variables from parameter values and provides an object-level representation of data variables.

In PyMC3, the data is included as simple Python types in the model objects as the graph is built. So to get a model abstract, you’d have to write a function that takes the variables as arguments then returns the model instantiated with data. The definition of the deterministic node `mu` here is in terms of the actual data vectors `X1` and `X2`—these aren’t placeholders, their values are used from the containing environment. The definition of the stochastic node `Y_obs` explicitly includes actual data through `observed=Y`. This maneuver is necessary because they couldn’t assign a stochastic node to `Y` (well, they could, seeing as it’s Python, but it wouldn’t do what a naive user might expect). The parameter values from the simulation (`alpha, beta, sigma`) are not used in the model definition—rather, they’re masked (redefined) in the scope of the `basic_model` using the with statement.
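To see what that rebinding means concretely, here’s a plain-Python sketch (no PyMC3 at all; `fake_model` is a hypothetical stand-in for the Model context manager) showing that a `with` block introduces no new scope in Python:

```python
from contextlib import contextmanager

@contextmanager
def fake_model():
    # Stands in for PyMC3's Model context; it does nothing at all.
    yield

alpha, sigma = 1.0, 1.0  # "true" values from the simulation script
with fake_model():
    # Stands in for alpha = Normal('alpha', mu=0, sd=10).  Because a
    # with-block introduces no new scope in Python, this simply rebinds
    # the module-level name; the simulation value is gone afterwards.
    alpha = "model variable"
```

After the block, `alpha` refers to the model-variable stand-in, while `sigma` (never reassigned inside the block) still holds its simulation value.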

I wonder how much of this behavior of PyMC3 is due to building on top of Theano (the symbolic differentiation package they use for gradients). I couldn’t figure out how to do code generation autodiff or symbolic autodiff any other way back when I was thinking about it and it’s one of the main reasons we went with templated autodiff instead.

Like Stan, Edward defines a function rather than binding in the data directly to the model. The `tf.placeholder()` business is defining lazy arguments to the model (a lambda abstraction in more formal terms). As with Stan, Edward then lets you plug in data for the placeholders to produce an instantiated model implementing the posterior density determined by conditioning on that data. (Here’s the TensorFlow doc on placeholder.)
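The same compile-then-bind idea is easy to mimic in plain Python with a closure. This is just a sketch, not Stan or Edward API: a hypothetical `make_posterior` binds the data and returns a concrete log density over the parameters (simplified here to a single predictor, and dropping normalizing constants):

```python
import math

def make_posterior(x, y):
    # Binding the data (x, y) yields a concrete posterior log density.
    def log_density(alpha, beta, sigma):
        lp = -0.5 * (alpha / 10) ** 2   # alpha ~ normal(0, 10), up to a constant
        lp += -0.5 * (beta / 10) ** 2   # beta ~ normal(0, 10)
        lp += -0.5 * sigma ** 2         # sigma ~ normal(0, 1), with sigma > 0
        for xn, yn in zip(x, y):        # y ~ normal(alpha + beta * x, sigma)
            lp += -math.log(sigma) - 0.5 * ((yn - alpha - beta * xn) / sigma) ** 2
        return lp
    return log_density

posterior = make_posterior([1.0, 2.0], [1.1, 2.3])  # bind data once
```

After binding, `posterior(alpha, beta, sigma)` can be handed to any sampler or optimizer without dragging the data along.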

In BUGS and JAGS, the model definition doesn’t specify which variables are data and which are parameters; that’s fixed only when the model is run, by inspecting the data.

The compilation bottleneck for Stan is when the program is compiled into the joint density (a C++ executable in CmdStan, an Rcpp object in RStan, or a Cython object in PyStan). Adding the data is fast, as it only binds the data in the joint density to create a concrete posterior density; in implementation terms, we construct an instance of an immutable C++ class from the data. I’m pretty sure Edward would work the same way. In PyMC3, the compilation down to Theano must only happen after the data is provided; I don’t know how long that takes (it seems like forever sometimes in Stan—we really need to work on speeding up compilation).

**Variable sizes and constraints inferred from distributions**

In PyMC3, `shape=2` is what determines that `beta` is a 2-vector.

The PyMC3 program also explicitly uses the half-normal distribution because PyMC3 uses a parameter’s sampling distribution to infer constraints on that parameter; this lets it apply the same kind of underlying unconstraining transforms as Stan in order to run HMC on an unconstrained space.
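As a sketch of the general trick (this is the standard log-transform approach, not PyMC3’s actual code): work with an unconstrained value on the whole real line, map it to the constrained space, and add the log-Jacobian of the transform to the target density:

```python
import math

def half_normal_lpdf_unconstrained(sigma_unc):
    # Map the unconstrained value to sigma > 0.
    sigma = math.exp(sigma_unc)
    # half-normal(0, 1) log kernel on the constrained scale,
    # dropping the normalizing constant.
    lp = -0.5 * sigma ** 2
    # Log-Jacobian of sigma = exp(sigma_unc):
    # log |d sigma / d sigma_unc| = log(sigma) = sigma_unc.
    lp += sigma_unc
    return lp
```

HMC can then explore `sigma_unc` over all of the real line without ever proposing an invalid `sigma`.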

Edward doesn’t seem to support constraints (at least not in their intro tutorials—they transform variables manually, which is easy for lognormal, but substantially harder elsewhere). At least that’s what I’m guessing is the reason behind their formulation of the prior on scale, `tf.sqrt(tf.exp(tf.Variable(tf.random_normal([]))))`.

Edward is also different from PyMC3 and Stan in that it broadcasts up the parameters so that they are all the same size. That’s the purpose of the size in `scale=tf.ones(2)` in the expression defining `beta` and also the purpose of the multiplication in `scale=sigma * tf.ones(N)` in the expression defining `y`.

Stan and PyMC3 use more standard vectorized functions that broadcast internally. Also, Stan, like BUGS and JAGS, allows truncations for univariate distributions. It means Stan can get away without a separate half-normal distribution. I don’t know if truncation is possible in either Edward or PyMC3.
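Truncation just renormalizes the density by the probability mass left in the allowed region. Here’s a hand-rolled sketch for a lower-truncated normal (the function names are mine, not any library’s); with `mu = 0` and `lower = 0` it reduces to the half-normal, and the correction is exactly `log(2)`:

```python
import math

def normal_lpdf(y, mu, sigma):
    return -math.log(sigma * math.sqrt(2 * math.pi)) - 0.5 * ((y - mu) / sigma) ** 2

def truncated_normal_lpdf(y, mu, sigma, lower):
    # normal(mu, sigma) restricted to y >= lower: subtract the log of
    # the upper-tail mass P(Y >= lower) to renormalize.
    if y < lower:
        return float("-inf")
    tail = 0.5 * math.erfc((lower - mu) / (sigma * math.sqrt(2)))
    return normal_lpdf(y, mu, sigma) - math.log(tail)
```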

**Non-trivial use of embedding?**

One nice aspect of embedded languages is that you can use all of the host language’s tooling, and the host language itself. One commonly mentioned benefit is autocompletion in the REPL environment (the interactive Python interpreter). We can do a bit of that for Stan in Emacs and in RStudio for R, but it’s hardly the smooth embedding of PyMC3 or Edward.

I couldn’t find examples in either Edward or PyMC3 that make non-trivial use of the embedding in Python. All the examples are just scripts (sequences of Python statements). I’d like to see an example in which they take advantage of being embedded in Python to build something like a hierarchical model component that could be used with other models. We can’t do that with a function in Stan because functions can’t introduce new parameters. I don’t quite know enough Python to pull off an example myself, but it should be possible. I asked a few of the PyMC3 developers and never got a concrete reference, so please comment if you have a working example somewhere I can cite.

**Type checking and data types**

Stan enforces static typing on all of its variables. It then provides a matrix arithmetic style that is like that of R and MATLAB. So you just multiply a matrix times a vector and you get a vector. Argument types like requiring a matrix argument as the covariance parameter in a multivariate normal are enforced statically, too. Size constraints and content constraints like positive definiteness are enforced at run time.

With PyMC3 and Edward, we’re in a slightly different situation. Although Python itself is dynamically typed, the random variables in these languages resolve as class instances in their respective libraries. I don’t know how they do error checking—hopefully at object construction time rather than when the model is executed.

Stan has Cholesky factor parameters for correlation and covariance matrices, simplex parameters, ordered and unit-vector parameters. I’m not sure if or how these would be handled in PyMC3 or Edward. Having the Cholesky factors is key for efficient multivariate covariance estimation that’s numerically stable.

One place Stan is lacking is in sparse and ragged data structures, and in tuple data structures for multiple returns. We have a single sparse operation (sparse matrix times dense vector, which is the one you need for efficient sparse GLM predictor matrices), but no direct sparse data types. We require users to code their own raggedness in long (melted) form. Is there a way to do any of that more easily in PyMC3 or Edward?
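For what it’s worth, coding raggedness in long (melted) form is mechanical. Here’s a plain-Python sketch (variable names are mine): one flat value vector plus a parallel group-index vector, as you’d pass to Stan:

```python
# Ragged data: three groups of unequal size.
ragged = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# Long form: flat values plus a parallel group index.
y = [v for group in ragged for v in group]
g = [i for i, group in enumerate(ragged) for _ in group]

# Group-level quantities are then computed by indexing,
# e.g. per-group sums recovered from the long form:
sums = [0.0] * len(ragged)
for gi, yn in zip(g, y):
    sums[gi] += yn
```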

Stan has rather clunky handling of missing data compared to BUGS or JAGS (two languages I know pretty well, unlike PyMC3 and Edward!). PyMC3 gives you a halfway house: you pass in a mask over an observation vector or matrix when the missing data are missing at random (Python doesn’t have R’s undefined (NA) value). I’ve been trying to figure out how to make this easier in Stan from the get-go.
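The mask idea itself is simple. Here’s a plain-Python sketch (not the PyMC3 API) of a log likelihood in which masked-out entries just drop out, which is the right thing under a missing-at-random assumption:

```python
import math

def masked_normal_loglik(y, observed, mu, sigma):
    # observed[n] is True where y[n] was actually measured; missing
    # entries contribute nothing to the likelihood.
    lp = 0.0
    for yn, obs in zip(y, observed):
        if obs:
            lp += -math.log(sigma) - 0.5 * ((yn - mu) / sigma) ** 2
    return lp
```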

**Whose functions?**

Stan uses its own math libraries and all functions resolve to the math lib equivalents (these often delegate to Eigen or Boost) which bottom out in Stan’s automatic differentiation. PyMC3 and Edward functions need to bottom out in Theano and TensorFlow functions to allow analytic derivatives and automatic differentiation respectively. I know that Theano uses NumPy, but I’m not sure if that’s also the case with TensorFlow (there seem to be multiple options for data representations in Edward).

Both PyMC3 and Edward use a class named `Normal` to represent a variable with a normal sampling distribution in the model. These classes are not the same in the two interfaces, nor are they the normal distributions built into TensorFlow (`tf.Normal`, not `ed.Normal`) or into SciPy (`scipy.stats.norm`).

Thus none of PyStan, PyMC3, or Edward can just call arbitrary Python functions as parts of their models. I don’t know where the languages stand in terms of functions available. Stan has a library of linear algebra, probability, differential equation, and general math functions listed in the back of our manual, but I’m not sure where to find a list of functions or distributions supported in PyMC3 or Edward (partly because I think some of this delegates to Theano and TensorFlow).

**Extending with user-defined functions**

Stan provides a language for writing new functions in the Stan language. I couldn’t see how you’d define new densities in either PyMC3 or Edward directly in their APIs. Of course, you could use functions to put models together (that’s what I expected to see exemplified in their tutorials).

PyMC3 has instructions for adding new functions, but they don’t work for autodiff (presumably you’d need to go into Theano to do that the way you have to go into C++ to add a new underlying distribution as a Stan built-in). So you’re presumably either out of luck or you try to use something like the zero-trick of BUGS and JAGS.

I have no idea what you’d do in Edward to write a new density function.

In all three languages, you can go deep and add new differentiable functions to the underlying math library. I don’t know if that requires C++ in either Theano or TensorFlow, or if you could do it in Python (perhaps with some reduced efficiency).

**Minor nitpicks with PyMC3**

I’m not a big fan of duplicating all the variable names in quotes. My guess is that there’s not enough reflection inside the PyMC3 model class to avoid it.
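Here’s a stripped-down sketch (my own toy classes, not PyMC3 internals) of why the string is needed: the constructor registers the variable with the active model, and the only name it can see is the one you pass in; the Python variable name on the left of the `=` is invisible to it:

```python
_MODEL_STACK = []  # the innermost active model context

class Model:
    def __init__(self):
        self.named_vars = {}
    def __enter__(self):
        _MODEL_STACK.append(self)
        return self
    def __exit__(self, *exc):
        _MODEL_STACK.pop()
        return False

class Normal:
    def __init__(self, name, mu=0.0, sd=1.0):
        self.name, self.mu, self.sd = name, mu, sd
        # Register under the quoted name; the assignment target
        # ("alpha" below) is unknown to the constructor.
        if _MODEL_STACK:
            _MODEL_STACK[-1].named_vars[name] = self

with Model() as m:
    alpha = Normal('alpha', mu=0.0, sd=10.0)
```

The model object `m` knows the variable as `'alpha'` only because the string was passed in explicitly.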

The PyMC3 argument naming `mu, sd` bothers me because I’m a neat freak like every other low-level API designer. For consistency, the naming should be one of

- `mu, sigma`
- `location, scale`
- `mean, sd`

The first one seems opaque. The second one is what I’d have used and what Edward uses. The last choice is what R uses, but I don’t like it because it confuses the parameter with a property of a random variable.

**Efficiency and sampling**

This post isn’t about efficiency of the model’s ability to calculate values and derivatives. In Stan (and presumably the other interfaces if they’re at all efficient), that time is all taken up doing gradient calculations.

I’m more concerned with user efficiency and clarity in writing models, especially when you have a bunch of models you want to build in a sequence as you’re doing an applied project.

Until Edward implements something like NUTS, it’s not going to be very effective at solving general-purpose practical problems with full Bayes, because HMC is so hard to tune in general and other approaches are so slow to mix. (See the Hoffman and Gelman NUTS paper in JMLR for a sequence of striking examples; those even downplay how hard the tuning is, by collapsing one dimension of the grid that was searched for tuning parameters.) There’s no fundamental obstacle other than that NUTS is a fussy recursive algorithm; I just don’t think the Edward devs care much about doing practical MCMC.

One thing that’s very nice about PyMC3 (don’t know much about inference in Edward) is that you can piece together block samplers. The rest of the PyMC3 interface is very nice for doing lots of things that wind up being pretty clunky in Stan. Some of these design issues are very relevant for deciding if we want to try to embed a graphical modeling language for Stan in Python.

**What might an embedded Stan graphical modeling language look like?**

We could define a Python API the following way, which would make the way a model is built up in the embedded language very similar to how it’s built up in Stan, but with quite a few limitations arising from the restriction to graphical modeling.

```python
joint = StanModel()
with joint:
    N = data(integer())
    y = data(vector(N))
    x = data(matrix(N, 2))
    alpha = parameter(real())
    beta = parameter(vector(2))
    sigma = parameter(real(lower=0))
    alpha.normal(0, 10)
    beta.normal(0, 10)
    sigma.normal(0, 1, lower=0)
    y.normal(alpha + x * beta, sigma)

data = {N: 10, y: np.random.randn(N), x: np.random.randn(N, 2)}
posterior = joint.condition(data)
sample = posterior.sample(...)
```

That’d require a lot of import statements, or you could prefix all those functions/constructors with “st.” or maybe something else (I have much less experience with Python than R).

And all the arguments would have names, but do we really want to write `normal(loc=mu, scale=sigma)` rather than `normal(mu, sigma)`?

**Other thoughts?**

I’ll leave the floor open at this point. As you can see, I didn’t do a very deep dive into PyMC3 or Edward and am happy to clarify vague descriptions or correct misconceptions in how these languages would be used in practice. Like I said up front, we’re thinking about adding a graphical modeling language to Stan, so I want to understand all of this better. No better way to hold my feet to the fire than blog about it!

NIMBLE is an API in R that’s very much like PyMC. The big drawback there is that they don’t have autodiff (so it’s like PyMC, not PyMC3). Without autodiff or symbolic diff, it’s pretty much impossible to implement HMC, L-BFGS, or gradient descent.

The post Hello, world! Stan, PyMC3, and Edward appeared first on Statistical Modeling, Causal Inference, and Social Science.


We discuss a shiny information visualization and propose “the click-through solution”: Start with a visually grabby graphic like the one on the linked page, something that takes advantage of some mystery to suck the viewer in. Then click and get a suite of statistical graphs that allow more direct visual comparisons of the different countries and different sectors of the economy. Then click again to get a spreadsheet with all the numbers and a list of sources.

I’m always talking about this click-through solution but I don’t have any great examples of it. If anyone’s interested, I’d love to see it in this case. Just set it up as a webpage with three steps—the original infograph, the set of statistical graphics, and the spreadsheet. Put it all together, send it to me, and I’ll post it.

This could be a great example for future data journalists and researchers to emulate.

The post Click-through graphics: A demonstration visualization project for someone appeared first on Statistical Modeling, Causal Inference, and Social Science.


The Staff Associate must have training in computer science, statistics, or a related field, and be knowledgeable in the following: C++ programming, R package development, or Python package development; experience in web development and in parallel, distributed, and high-performance computing. The incumbent should have excellent writing and editing skills, as well as excellent communication, interpersonal, analytical, and organizational skills.

You can apply for the job here.

The post Stan programmer position at Columbia appeared first on Statistical Modeling, Causal Inference, and Social Science.


Dear Professor Gelman,

Hi. My name is ** and I am a 5th grade student at ** School in **, NY. I am in Mr. **’s class and we are working on our graduating project called Capstone! I am studying how how President Trump’s business life affects his job as President. I know you are an expert in this field and I am looking to interview someone to help me understand more. I chose this topic because I was fascinated by the election and I remained hooked on the news. Also, the more I read, the stranger the story gets.. I am hoping you can answer my questions. Thank you so much for taking the time to read my email:

1.

How do you think his business will be affected by his views on mexicans, muslims, etc?

2.

Do you think that his business will still be doing well when he gets back?

3.

How do you think his business ties are affecting his policies?

4.

Based on his foreign policies do you think he could possibly start a war, if so who?

5.

How do you think his business personality is affecting the way he governs?

I responded to the kid as follows:

I don’t really know the answer to these questions. I think that if you ask someone in the real estate business, they will be better able to answer all these questions (except for #4). Good luck!

The post Try asking someone in the real estate business appeared first on Statistical Modeling, Causal Inference, and Social Science.


Our colleague Jon Wakefield in the Department of Biostatistics at the University of Washington is interested in supervising a 2-year postdoc through this training program.

We’re interested in finding someone who would work with Jon and another faculty member (assigned on the basis of interests) on exciting projects in spatio-temporal modeling and the environmental health sciences; the successful applicant will also be working with us as a core Stan developer. There are different ways this project can go, based on your interests and expertise, but we emphasize the training aspects. Before applying to the program, we recommend you first email me with a CV and brief cover letter. This position is only open to US citizens and green card holders.

**P.S.** Steven Johnson sent in the above picture of a cat who would like to come inside and start working on Stan full time.

The post Come to Seattle to work with us on Stan! appeared first on Statistical Modeling, Causal Inference, and Social Science.


*This post is by Mike.*

We had so much fun at StanCon 2017 that we decided to do it again!

This year’s conference will take place over three days, from Wednesday January 10, 2018 to Friday January 12, 2018, at the beautiful Asilomar Conference Grounds in Pacific Grove, California. In addition to talks and open discussion, this year we’ll also have dedicated time for collaborative Stan coding with other attendees and the Stan dev team.

Detailed information about registration and accommodation at Asilomar, including fees and instructions, can be found on the event website. Early registration ends on Friday November 10, 2017 and no registrations will be accepted after Wednesday December 20, 2017.

This year we are going to try to support as many student scholarships as we can — if you are a student who would love to come but may not have the funding then don’t hesitate to submit a short application!

Contributed talks will proceed as last year, with each submission consisting of self-contained knitr or Jupyter notebooks that will be made publicly available after the conference. Last year’s contributed talks were awesome and we can’t wait to see what users will submit this year. For details on how to submit see the submission website. The final deadline for submissions is Saturday September 16, 2017 5:00:00 AM GMT.

Finally, we are actively looking for sponsors! If you are interested in supporting StanCon 2018, or know someone who might be, then please contact the organizing committee.

I’ll keep an eye on this post to answer any questions that you might have, and otherwise I hope to see everyone in January!

The post StanCon 2018 is live! appeared first on Statistical Modeling, Causal Inference, and Social Science.


After this discussion, I pointed Ryan Giordano, Tamara Broderick, and Michael Jordan to Figure 4 of this paper with Bois and Jiang as an example of “static sensitivity analysis.” I’ve never really followed up on this idea but I think it could be useful for many problems.

Giordano replied:

Here’s a copy of Basu’s robustness paper, contemporaneous with your 1996 paper, which we talked about. I think it’s a nice, easy-to-understand way to get at what you were also aiming for. In the spirit of your diagnostic graphs, I think a better use of the idea is to report a bunch of “slopes” rather than just look for the worst-case direction (there’s no reason to think the unit ball means anything when you’re comparing a large number of different prior parameters), but the basic idea of replacing derivatives with covariances seems like a good one to me.

I like this and I’d like to incorporate it into our statistical workflow.

**P.S.** The “static” in static sensitivity analysis refers to the idea that we’re doing the computations using the results of a single fitted model, rather than performing sensitivity analysis by re-fitting the model several times under different assumptions.

The post Static sensitivity analysis appeared first on Statistical Modeling, Causal Inference, and Social Science.


– is presented as having an empirical basis—

– but where the empirical support comes from a series of statistical analyses—that is, no clear pattern in any individual case but only in averages—

– where the evidence is a set of p-values that are subject to forking paths—

– and where a design analysis suggests large type M and type S errors—

– where replications are nonexistent, or equivocal, or clearly negative—

– where theory is general enough that it can support empirical claims from any direction—

– and the theory has some real-world implication on how we do or should live our lives—

– and the result is one or more publications in prestigious general-science or field journals—

– with respectful publicity by the likes of NPR and the New York Times—

– and out-and-out hype by the likes of Ted, Gladwell, and Freakonomics.

Before going on, let me emphasize that science is not a linear process, and mistakes will slip in, even under best practices. There can always be problems with data collection and analysis, and conditions can change, so that an effect can be present today but not next year. And even a claim that is supported by poor research can still happen to be correct. Also there’s no clean distinction between good science and junk science.

So here I’m talking about junk science that, however it was viewed by its practitioners when it was being done, can in retrospect be viewed as flawed, with those flaws being inherent in the original design and analysis. I’m not just talking about good ideas that happened not to work out, and I recognize that even bad ideas can stimulate later work of higher quality.

In our earlier discussions of the consequences of junk science, we’ve talked about the waste of resources as researchers pursue blind alleys, going in random directions based on their overinterpretation of chance patterns in data; we’ve talked about the waste of effort of peer reviewers, replicators, etc.; we’ve talked about the harm done by people trying therapies that don’t work, and also the opportunity cost of various good ideas that don’t get tried out because they’re lost in the noise, crowded out by the latest miracle claim.

On the other side, there’s the idea that bad science could still have some positive effects: fake miracle cures can still give people hope; advice on topics such as “mindfulness” could still motivate people to focus on their goals, even if the particular treatments being tried are no better than placebo; and, more generally, the openness to possible bad ideas also allows openness to unproven but good new ideas.

So it’s hard to say.

But more recently I was thinking about a different cost of junk science, which is as a drag on the economy.

The example I was thinking about in particular was an argument by Ujjval Vyas, which seemed plausible to me, that there’s this thing called “evidence-based design” in architecture which, despite its name, doesn’t seem to have much to do with evidence. Vyas writes:

The field is at such a low level that it is not worth mentioning in many ways except that it is deeply embedded in a $1T industry for building and construction as well as codes and regulations based on this junk. . . .

And here’s from the wikipedia page on evidence-based design:

The Evidence Based Design Accreditation and Certification (EDAC) program was introduced in 2009 by The Center for Health Design to provide internationally recognized accreditation and promote the use of EBD in healthcare building projects, making EBD an accepted and credible approach to improving healthcare outcomes. The EDAC identifies those experienced in EBD and teaches about the research process: identifying, hypothesizing, implementing, gathering and reporting data associated with a healthcare project.

So this is a different cost from those discussed before. It’s a sort of tax that goes into every hospital that gets built. Somebody, somewhere, has to pay for these Evidence Based Design consultants, and somebody has to pay extra to build the buildings just so.

For any particular case, the advice probably seems innocuous. For example, “People recover from surgery faster if they have a view of nature out the window.” Sure, why not? Nature’s a good thing. But add up this and various other rules and recommendations and requirements, and it sounds like you’re driving up the cost of the building, not to mention the money and effort that gets spent filling out forms, paying the salaries and hotel bills of consultants, etc. It’s a kind of rent-seeking bureaucracy, even if all or many of the people involved are completely sincere.

OK, I don’t know anything about evidence-based design in architecture so maybe I’m missing a few tricks in this particular example. I still think the general point holds, that one hidden cost of junk science is that it fills the world with a bureaucracy of consultants and payoffs and requirements.

The post All the things we have to do that we don’t really need to do: The social cost of junk science appeared first on Statistical Modeling, Causal Inference, and Social Science.


Regarding the Vox graph on federal tax brackets, here is a quick-and-dirty visualization of effective tax rates for a given taxable income and year.

However, there is a big caveat: estimating the effective tax rate based on actual income is much harder since it depends on the claimed deductions. This could be estimated empirically, but the IRS doesn’t publish the data (AFAIK).

Bostock writes:

I’ve recreated the graphic [by Alvin Chang for Vox, criticized in my earlier post] below, substituting a log scale for the y-axis. It readily conveys the Reagan-era simplification of tax brackets, as well as the disappearance of tax brackets for the ultra-rich. (In 1936, the highest tax bracket applied to those making more than $83M in 2013-equivalent dollars!)

Yet fewer tax brackets do not imply the overall tax code is simpler; if anything, the tax code continues to get more complex. And looking only at bracket thresholds does not consider the effective rate at different income levels. . . . It is hard to estimate effective tax rates, especially now, because they depend greatly on the amount of itemized deductions. But ignoring that substantial caveat—and that this analysis only considers federal-reported income and not capital gains, the alternative minimum tax, and countless other forms of state and local taxes—we can compute the effective federal income tax rate for a given taxable income (after any deductions) and a given year.

Amounts are in 2013-equivalent dollars when filing as the head-of-household.

Good job.

Here are some relevant principles of statistical graphics:

1. Static graphs can do a lot. Dynamic graphics are fine, but in some settings they do little more than add confusion.

2. The log transform really works.

3. No need to try to cram all the information into one graph. Bostock made one graph of tax *brackets*, another of tax *rates*. Someone could come along and make a third graph including other taxes, not just federal income tax.

Also, I don’t think graphics need to be so big. I display Bostock’s graphs above in a more compressed format than were on his page. I think that’s fine; actually I think these smaller versions are easier to read because I can see the whole graph more clearly in my visual field. In general I recommend that people make their graphs smaller, which implies that their labels should be larger relative to the original graphs. For Bostock, I’d actually recommend just putting x-axis labels every 20 years, percentage labels at every 25%, and income labels at 1, 3, 10, 30, etc. Some of this is a matter of taste, but I do think there are general issues of readability, and tradeoffs in that more labels make it harder to see the big picture but easier to identify exactly what is happening when.

The post Mike Bostock graphs federal income tax brackets and tax rates, and I connect to some general principles of statistical graphics appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Taxes and data visualization appeared first on Statistical Modeling, Causal Inference, and Social Science.

Vox has a graph of tax rates over time.

Their visualizations do convey that tax rates for high earners have declined over time and tax brackets are fewer now, but it seems like there are more appealing and intuitive ways to display that.

I agree. This visualization reminds me of the data visualizations that Antony Unwin criticized in our paper from a few years ago: a lot of effort has gone into making an attractive display (in this case a dynamic visualization) tailored to this particular configuration of *data*, without a clear sense of how the graph helps answer the *questions* we might ask of those data, in this case trends in differential tax rates.

**P.S.** More here.

The post Theoretical Statistics is the Theory of Applied Statistics: How to Think About What We Do appeared first on Statistical Modeling, Causal Inference, and Social Science.

Above is my talk at the 2017 New York R conference. Look, no slides!

The talk went well. I think the video would be more appealing to listen to if they’d mixed in more of the crowd noise. Then you’d hear people laughing at all the right spots.

**P.S.** Here’s my 2016 NYR talk, and my 2015 NYR talk.

Damn! I’m giving away all my material for free. I’ll have to come up with some entirely new bits when they call me up to give that TED talk.

The post Support for presidential candidates at elite law firms in 2012 and 2016 appeared first on Statistical Modeling, Causal Inference, and Social Science.

Thought these data were extreme enough to be of general interest.

OK, before you click on the link, here’s the story: Campos looked up the presidential campaign contributions at 11 top law firms. (I’m not sure where his data came from; maybe the same source as here?) Guess what percentage of contributions went to Mitt Romney in 2012? What about Donald Trump in 2016?

Make your guesses, then click on the link above to find out the answer.

The numbers are indeed striking, and I have nothing to add—really there’s nothing I can say at all, given that no data or link have been supplied. I do, however, wonder what would happen if we took the people in the comments section at the above-linked post, and put them in the same room as the commenters at Marginal Revolution. Matter and anti-matter (or maybe it’s the other way around). I can’t even imagine.

**P.S.** Campos added a link to the data in his post.

The post Visualizing your fitted Stan model using ShinyStan without interfering with your Rstudio session appeared first on Statistical Modeling, Causal Inference, and Social Science.

But it turns out that it doesn’t have to be that way. Imad explains:

You can open up a new session via the RStudio menu bar (Session >> New Session), which should have the same working directory as wherever you were prior to running launch_shinystan(). This will let you work on whatever you need to work on while simultaneously running ShinyStan (albeit via two RStudio sessions).

OK, good to know.
