The post Statistics and data science, again appeared first on Statistical Modeling, Causal Inference, and Social Science.

(1) Is HTML, markdown, or some other formatting syntax usable in comments? If so, what are the tags I may use?

(2) What are your views on the role of statistics in the evolution of the various strands of convergent science? For example, there is upon us this rather nebulously defined thing called data science, which roughly equates to an amalgam of comp sci and stats. I tend to interpret it as part computational stats, part AI – or maybe it’s just an evolution of data mining; I don’t know yet.

The comp sci camp is marketing its brand in this area quite well. But is it just me, or is the statistical community taking a bit of a back seat? This feels more like competition than an invitation to converge ideas (so where do computational and graphical stats sit, then?). I had a brief conversation with Prof. Joe Blitzstein about this some time ago, and he expressed a number of concerns as well.

There are those who believe statisticians as a collective have become less important in the work of translating data into meaning/knowledge than this (for now) unicorn, the ‘data scientist’. I believe part of the reason is that this new discipline has quite a bit of public ‘glitter’ and a few other shiny objects embedded (the term ‘big data’ for one, and a larger paycheck for another – quite the economically driven discipline), though it hasn’t in my opinion been fully defined. So what do you believe will become of it? Is it faddish, or the storming/norming stage of a particular convergence between comp sci and stats? Or more?

I don’t see much official discussion by groups like the ASA or any of the data-science groups/societies springing up. What do you make of this? As much as I see the need for convergence between Bayesian and frequentist philosophies in statistics, I see the need for convergence between the statistical and comp-sci camps to solve problems which each alone may not be good at. Is that even a reasonable expectation?

Physics has been doing this quite rapidly in recent years (converging with medicine, biology, finance, economics, etc.), and it has integrated fairly seamlessly, I believe. I could be making the wrong comparison here, but it would seem to me statistics converges with, well… everything, and in no small way. Yet it seems to me there are forces ‘placing’ statistics as a whole into somewhat of a basement, or at least at the cusp of an identity crisis.

My reply:

(1) You can use html in comments. I’m not sure which are allowed and which aren’t, but, as I tell everyone, if your comment gets caught in the spam filter, just drop me an email. It happens.

(2) As I’ve written before (see also here), I think statistics (and machine learning, which I take to be equivalent to statistical inference but done by different people) is the least important part of data science. But I do think statistics has a lot to contribute, in design of data collection, model building, inference, and a general approach to making assumptions and evaluating their implications. Beyond this, I don’t really know what to say. I agree these things are important but somehow I feel I lack the big picture—having only enough of this picture to recognize that I lack it!

Perhaps others have more useful thoughts to add.


The post The Ben Geen case: Did a naive interpretation of a cluster of cases send an innocent nurse to prison until 2035? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Statistical analysis of monthly rates of events in around 20 hospitals and over a period of about 10 years shows that respiratory arrest, though only about one-fifth as frequent as cardio-respiratory arrest, is a common occurrence in the Emergency Department of a typical smaller UK hospital.

He has lots of detailed and commonsensical (but hardly routine) data analysis. Those of you who read my recent posts (here and here) on the World Cup might be interested in this, because it has lots of practical details of statistical analysis but of a different sort (and it’s a topic that’s more important than soccer). The analysis is interesting and clearly written.

But the subject seems pretty specialized, no? Why am I sharing with you an analysis of respiratory arrests (whatever that is) in emergency departments? The background is a possibly huge miscarriage of justice.

Here’s how Gill told it to me in an email:

I’m wondering if you can do anything to help Ben Geen – a British nurse sentenced to 30 years for a heap of crimes that were not committed by anyone. Show the world that in good hands, statistics can actually be useful!

http://bengeen.wordpress.com/

There are important statistical issues and they need to be made known to the public.

There is in fact an international epidemic of falsely accused health care serial killers.

In the Netherlands: Lucia de Berk

In the UK: Colin Norris, Ben Geen

In Canada: Susan Nelles

In the US: https://en.wikipedia.org/wiki/Ann_Arbor_Hospital_Murders

So I [Gill] got involved in the Ben Geen case (asked by the defence lawyer to write an expert statistical report). This is it:

http://arxiv.org/abs/1407.2731

I became convinced that most of what the media repeated (snippets of over the top info from the prosecution, out of context, misinterpreted) was lies, and actually the real evidence was overwhelmingly strong that Ben was completely, totally innocent.

Here’s the scientific side of the story. It’s connected to the law of small numbers (Poisson and super Poisson variation) and to the Baader-Meinhof effect (observer bias). And to the psycho-social dynamics in a present-day, dysfunctional (financially threatened, badly run) hospital.

In the UK, Colin Norris and Ben Geen are serving 30-year sentences and, absolutely clearly, they are completely innocent. Since no one was murdered there never will be a confession by the true murderer. Because there were no murders there won’t ever be new evidence pointing in a different direction. There will never be a new fact, so the system will never allow the cases to be reviewed. Since the medical profession was complicit in putting those guys away, no medical doctor will ever say a word to compromise his esteemed colleagues.

What is going on? Why this international epidemic of falsely accused “health care serial killers”?

Answer: in the UK, the scare which followed the Shipman case triggered increased paranoia in the National Health Service. Already stressed, overburdened, underfunded… managers, nurses, specialists, all with different interests, under one roof in a hospital… different social classes, lack of communication.

So here are the ingredients for a Lucia / Ben / Colin:

(1) a dysfunctional hospital (chaos, stress, short-cuts being taken)

(2) a nurse who is different from the other nurses. Stands out in the crowd. Different sex or age or class. More than average intelligence. Big mouth, critical.

(3) something goes wrong. Someone dies and everyone is surprised. (Why surprised: because of wrong diagnosis, disinformation, ….)

(4) Something clicks in someone’s mind (a paranoid doctor) and the link is made between the scary nurse and the event

(5) Something else clicks in … we had a lot more cases like that recently (eg. the seasonal bump in respiratory arrests. 7 this month but usually 0, 1 or 2)

(6) The spectre of a serial killer has now taken possession of the mind of the first doctor who got alarmed, and he or she rapidly spreads the virus to close colleagues. They start looking at the other recent cases and letting their minds fall back to other odd things which happened in recent months and stuck in their minds. The scary nurse also stuck in their minds, and they connect the two. They go trawling and soon they have 20 or 30 “incidents” which are now bothering them. They check each one for any sign of involvement of the scary nurse, and if he’s involved the incident quickly takes on a very sinister look. On the other hand, if he was on a week’s vacation then obviously everything must have been OK and the case is forgotten.

(7) Another conference; gather some dossiers – half a dozen very suspicious cases to report to the police to begin with. The process of “retelling” the medical history of these “star cases” has already started. Everyone who was involved and does know something about the screw-ups and mistakes says nothing about them but confirms the fears of the others. That’s a relief – there was a killer around; it wasn’t my prescription mistake or my oversight of some complicating condition. The dossiers which will go to the police (and, importantly, the layman’s summary written by the coordinating doctor) do contain “truth” but not the *whole truth*. And there is lots of truth which is not even in hospital dossiers (a culture of lying, of covering up mistakes).

(8) The police are called in; the arrest; there is of course an announcement inside the hospital, and there has to be an announcement to the press. Now of course the director of the hospital is in control – probably misinformed by his doctors, obviously having to show his “damage control” capacities and to minimize any bad PR for his hospital. The whole thing explodes out of control and the media feeding frenzy starts. Witch hunt, and then witch trial.

Then of course there is also the bad luck. The *syringe*, in Ben’s case, which clinches his guilt to anyone who nowadays does a quick Google search.
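To see how an alarming-looking monthly cluster can arise from chance alone, here is a back-of-the-envelope Poisson calculation. The rate and the counts are invented for illustration, loosely echoing the “7 this month but usually 0, 1 or 2” example below:

```python
from math import exp, factorial

# Poisson probability of exactly k events in a month at rate lam.
def pois_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

lam = 1.5  # assumed typical monthly rate ("usually 0, 1 or 2")
p_ge7 = 1.0 - sum(pois_pmf(k, lam) for k in range(7))  # P(7 or more in one month)

# Around 20 hospitals watched for about 10 years gives ~2400 hospital-months;
# the chance that SOME month somewhere shows 7+ events is then substantial.
months = 20 * 12 * 10
p_somewhere = 1.0 - (1.0 - p_ge7) ** months

print(f"P(>=7 events in one given month): {p_ge7:.5f}")
print(f"P(>=7 events in at least one of {months} hospital-months): {p_somewhere:.2f}")
```

A roughly one-in-a-thousand monthly event becomes close to a sure thing once you give it thousands of hospital-months in which to occur; the catch is that investigators only go looking after the cluster has caught someone’s eye.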

This is what Wendy Hesketh (a lawyer who is writing a book on the topic) wrote to me:

“I agree with your view on the “politics” behind incidences of death in the medical arena; that there is a culture endorsing collective lying”

“Inquiries into medico-crime or medical malpractice in the UK seem to have been commandeered for political purposes too: rather than investigate the scale of the actual problem at hand, or learn lessons on how to avoid it in future, the inquiries seem designed only to push through current health policy.”

“The “Establishment” want the public to believe that, since the Shipman case, it is now easier to detect when a health professional kills (or sexually assaults) a patient. It’s good if the public think there will never be “another Shipman,” and Ben Geen and Colin Norris being jailed for 30 years apiece sent out that message, as has the string of doctors convicted of sexual assault. But statistics have shown that a GP would have to have a killing rate to rival Shipman’s in order to have any chance of coming to the attention of the criminal justice system. In fact, the case of the Northumberland GP Dr. David Moor, who openly admitted in the media to killing (sorry, “helping to die”) around 300 patients (he wasn’t “caught”), reflects this. I argue in my book that it is not easier to detect a medico-killer now since Shipman, but it is much more difficult for an innocent person to defend themselves once accused of medico-murder.”

Indeed, the rate of serial killers in the UK’s National Health Service must be tiny and if there are good ones around they won’t even be noticed.

Yet it is so easy in a failing health-care organization for the suspicion to arise that there is one around. And once the chances are aligned and the triggering event has happened there is no going back. The thing snowballs. The “victim” has no chance.

Chance events are clustered!!! Pure chance gives little bunches of tightly clustered events with big gaps between them. When chances are changing (e.g. seasonal variation, changes in hospital policies, staffing, new personnel with new habits when filling in woefully inadequate diagnosis forms) then the phenomenon is stronger still!

E.g., three airliners crashed within a couple of days this week!!!

http://www.bbc.co.uk/news/magazine-28481060

How odd is a cluster of cases? Well, by the law of *small* numbers (Poisson and even super-Poisson variation – Poisson means pure chance; super-Poisson means pure chance but with the “chance per day” slowly varying in time), “short intervals between crashes are more likely than long ones.” (Actually, very short and very long intervals are both common. Pure chance means that accidents are *not* uniformly spread out in time. They are clustered. Big gap, cluster, biggish gap, smallish cluster… that’s pure randomness!!!)
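Gill’s point about clustering is easy to check by simulation. A minimal sketch (the rate is purely illustrative): in a pure-chance Poisson process the waiting times between events are exponential, so short gaps outnumber long ones and events bunch together.

```python
import random

random.seed(1)

# Pure-chance (Poisson-process) events: waiting times between events are
# exponential, so short gaps are more likely than long ones and events
# naturally fall into clusters separated by large voids.
rate = 1.0   # one event per day on average (illustrative)
n = 10000
gaps = [random.expovariate(rate) for _ in range(n)]

mean_gap = sum(gaps) / n
frac_short = sum(g < mean_gap for g in gaps) / n

# For an exponential distribution, P(gap < mean) = 1 - 1/e, about 0.63:
# well over half of all gaps are shorter than the average gap.
print(f"mean gap: {mean_gap:.2f} days")
print(f"fraction of gaps shorter than the mean: {frac_short:.2f}")
```

Roughly 63% of the gaps come in below the average, which is exactly the “big gap, cluster, biggish gap, smallish cluster” pattern: uniform spacing, not clustering, is what would be surprising under pure chance.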

Then there is the Baader-Meinhof phenomenon

http://www.psmag.com/culture/theres-a-name-for-that-the-baader-meinhof-phenomenon-59670/

[I replaced an earlier link for this which pointed to a flaky news site -- AG]

“Baader-Meinhof is the phenomenon where one happens upon some obscure piece of information – often an unfamiliar word or name – and soon afterwards encounters the same subject again, often repeatedly. Anytime the phrase ‘That’s so weird, I just heard about that the other day’ would be appropriate, the utterer is hip-deep in Baader-Meinhof.”

Another name for this is *observer bias*. You (a medical doctor having to fill in a diagnosis for a patient on a standard form, which is totally inadequate for the complexity of medicine) saw one case which you had to give a rather unusual label to, and in the next weeks that “unusual diagnosis” will suddenly come up several times.

Well, Professor Jane Hutton (Warwick University, UK) wrote all these things in her expert report for the appeal six years ago, but the judge said that this kind of statistical evidence “is barely more than common sense” and so refused the request for her to tell this common sense out loud in court.

OK, it’s me again. I haven’t looked at this case in any detail and so can’t add anything of substance to Gill’s analysis. But what I will say is, if Gill is correct, this example demonstrates both the dangers and the potential of statistics. The danger because it is statistical analysis that has been used to convict Geen (both in court and in the “court of public opinion” as measured by what can be found on Google). The potential because a careful statistical analysis reveals the problems with the case (again, I’m relying on Gill’s report here; that is, my comments here are conditional on Gill’s report being reasonable).

Just to be clear, I’m *not* saying that statistical arguments cannot or should not be used to make public health decisions. Indeed, I was involved last year in a case in which the local public health department made a recommendation based on statistical evidence, and this recommendation was questioned, and I (at the request of the public health department, and for no compensation) wrote a brief concurrence saying why I did not agree with the critics. So I am not saying that any statistical argument can be shot down, or that the (inevitable) reliance of any argument on assumptions makes that argument suspect. What I am doing is passing along Richard Gill’s analysis in this particular case, where he has found it possible for people to draw conclusions from noise, to the extent of, in his view, sending an innocent person to jail for 30 years.


The post SciLua 2 includes NUTS appeared first on Statistical Modeling, Causal Inference, and Social Science.

According to the author of SciLua, Stefano Peluchetti:

Should be quite similar to your [Stan's] implementation with some differences in the adaptation strategy.

If you have time to have a look and give it a try I would appreciate your feedback. My website includes the LuaJIT binaries for 3 arches [Windows, Mac OS X, Linux], and the Xsys and Sci libraries needed to run NUTS.

I’ve also been working on a higher-level interface with some test models and datasets (the user can specify a simple script to describe the model, but it’s always Lua syntax, not a custom DSL).

Please notice that multi-pass forward automatic differentiation is used (I experimented with a single-pass, tape based, version but the speed-up was not big and for high-dimensional problems one wants to use reverse diff in any case, on my TODO).

For complex models you might want to increase loopunroll a little, and maybe callunroll as well. Feel free to ask any question.

I haven’t had time to try it myself. Looks like it’s implemented using a just-in-time compiler (LuaJIT), much as Julia is, with models specified directly in Lua rather than a custom modeling language.
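Peluchetti mentions multi-pass forward-mode automatic differentiation. As a rough illustration of the idea (a minimal Python sketch, not SciLua’s actual Lua implementation): forward mode carries a value and its derivative together in a “dual number,” so one pass through the function yields the derivative with respect to one input, hence “multi-pass” for a full gradient.

```python
class Dual:
    """Minimal forward-mode autodiff: carry a value and its derivative together."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)  # sum rule
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)  # product rule
    __rmul__ = __mul__

def grad_component(f, x, i):
    """d f / d x_i at x, via one forward pass (one pass per input for a gradient)."""
    duals = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(x)]
    return f(duals).dot

# f(x, y) = x*y + x, so df/dx = y + 1 and df/dy = x.
f = lambda v: v[0] * v[1] + v[0]
print(grad_component(f, [2.0, 3.0], 0))  # 4.0
print(grad_component(f, [2.0, 3.0], 1))  # 2.0
```

This is also why Peluchetti notes that reverse-mode differentiation is on his TODO list for high-dimensional problems: reverse mode gets the whole gradient in one backward sweep instead of one forward pass per parameter.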


The post Yummy Mr. P! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Chris Skovron writes:

A colleague sent the attached image from Indonesia.

For whatever reason, it seems appropriate that Mr. P is a delicious salty snack with the tagline “good times.”

Indeed. MRP has made the New York Times and Indonesian snack food. What more can we ask for?


The post A linguist has a question about sampling when the goal is causal inference from observational data appeared first on Statistical Modeling, Causal Inference, and Social Science.

Delaney-Busch writes:

I’m a PhD student in cognitive neuroscience at Tufts, and a question came up recently with my colleagues about the difficulty of random sampling in cases of highly controlled stimulus sets, and I thought I would drop a line to see if you had any reading suggestions for us.

Let’s say I wanted to disentangle the effects of word length from word frequency on the speed at which people can discriminate words from pseudowords, controlling for nuisance factors (say, part of speech, age of acquisition, and orthographic neighborhood size – the number of other words in the language that differ from the given word by only one letter). My sample can be a couple hundred words from the English language.

What’s the best way to handle the nuisance factors without compromising random sampling? There are programs that can automatically find the most closely matched subsets of larger databases (if I bin frequency and word length into categories for a factorial experimental design), but what are the consequences of having experimental item essentially be a fixed factor? Would it be preferable to just take a random sample of the English language, then use a hierarchical regression to deal with the nuisance factors first? Are there measures I can use to quantify the extent to which chosen sampling rules (e.g. “nuisance factors must not significantly differ between conditions”) constrain random sampling? How would I know when my constraints start to become a real problem for later generalization?

Another way to ask the same question would be how to handle correlated variables of interest like word length and frequency during sampling. Would it be appropriate to find a sample in which word length and frequency are orthogonal (e.g. if I wrote a script to take a large number of random samples of words and use the one where the two variables of interest are the least correlated)? Or would it be preferable to just take a random sample and try to deal with the collinearity after the fact?
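The “generate many samples and keep the least collinear one” strategy can be tried in simulation. In this sketch (all numbers invented: a hypothetical lexicon in which length and log frequency are strongly correlated), repeated resampling buys only a modest reduction in collinearity:

```python
import random

random.seed(0)

def corr(xs, ys):
    # Pearson correlation, plain Python.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical lexicon: longer words tend to be rarer (negative correlation).
lexicon = []
for _ in range(5000):
    length = random.randint(3, 12)
    log_freq = 8 - 0.5 * length + random.gauss(0, 1.5)
    lexicon.append((length, log_freq))

# Draw many random samples of 200 words; keep the least-correlated one.
best_r = 1.0
for _ in range(500):
    sample = random.sample(lexicon, 200)
    r = abs(corr([w[0] for w in sample], [w[1] for w in sample]))
    best_r = min(best_r, r)

pop_r = abs(corr([w[0] for w in lexicon], [w[1] for w in lexicon]))
print(f"population |r| = {pop_r:.2f}, best sample |r| = {best_r:.2f}")
```

When the population correlation is strong, sample correlations cluster tightly around it, so the “luckiest” of 500 random samples is only slightly less collinear than average; substantial decorrelation requires deliberately constrained (non-random) selection, which is exactly the tension the question raises.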

I replied:

I don’t have any great answers except to say that in this case I don’t know that it makes sense to think of word length *or* word frequency as a “treatment” in the statistical sense of the word. To see this, consider the potential-outcomes formulation (or, as Don Rubin would put it, “the RCM”). Suppose you want the treatment to be “increase word length by one letter.” How do you do this? You need to switch in a new word. But the effect will depend on which word you choose. I guess what I’m saying is, you can see how the speed of discrimination varies by word length and word frequency, and you might find a model that predicts well, in which case maybe the sample of words you use might not matter much. But if you don’t have a model with high predictive power, then I doubt there’s a unique right way to define your sample and your population; it will probably depend on what questions you are asking.

Delaney-Busch then followed up:

For clarification, this isn’t actually an experiment I was planning to run – I thought it would be a simple example that would help illustrate my general dilemma when it comes to psycholinguistics.

Your point on treatments is well taken, though perhaps hard to avoid in research on language processing. It’s actually one of the reasons I’m concerned with the tension between potential collinearity and random sampling in cases where two or more variables correlate in a population. Theoretically, with a large random sample, I should be able to model the random effects of item in the same way I could model the random effects of subject in a between-subjects experiment. But I feel caught between a rock and a hard place when, on the one hand, a random sample of words would almost certainly be collinear in the variables of interest, but on the other hand, sampling rules (such as “generate a large number of potential samples and keep the one that is the least collinear”) undermine the ability to treat item as an actual random effect.

If you’d like, I would find it quite helpful to hear how you address this issue in the sampling of participants for your own research. Let’s say you were interested in teasing apart the effects of two correlated variables – education and median income – on some sort of political attitude. Would you prefer to sample randomly and just deal with the collinearity, or constrain your sample such that recruited participants had orthogonal education and median income factors? How much constraint would you accept on your sample before you start to worry about generalization (i.e. worry that you are simply measuring the fixed effect of different specific individuals), and is there any way to measure what effect your constraints have on your statistical inferences/tests?


The post Stan NYC Meetup – Thurs, July 31 appeared first on Statistical Modeling, Causal Inference, and Social Science.


The third session will focus on using the Stan language. If you’re bringing a laptop, please come with RStan, PyStan, or CmdStan already installed.

We’re going to focus less on the math and more on the usage of the Stan language. We’ll cover:

• Stan language blocks

• Data types

• Sampling statements

• Vectorization


The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** The Ben Geen case: Did a naive interpretation of a cluster of cases send an innocent nurse to prison until 2035?

**Wed:** Statistics and data science, again

**Thurs:** The health policy innovation center: how best to move from pilot studies to large-scale practice?

**Fri:** The “scientific surprise” two-step

**Sat, Sun:** As Chris Hedges would say: It’s too hot for words


The post Stan 2.4, New and Improved appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here are the release notes with a list of what’s new and improved:

**New Features**

* L-BFGS optimization (now the default)
* completed higher-order autodiff (added all probability functions, matrix functions, and matrix operators); tested up to 3rd order
* enhanced effective range of normal_cdf to prevent underflow/overflow
* added von Mises RNG
* added ability to use scalars in all element-wise operations
* allow matrix division for mixing scalars and matrices
* vectorization of outcome variates in multivariate normal with efficiency boosts
* generalization of multivariate normal to allow row vectors as means

**Reorganization**

* moved bin/print and bin/stanc to CmdStan; no longer generating main when compiling model from Stan C++

**New Developers**

* added Jeffrey Arnold as core Stan developer
* added Mitzi Morris as core Stan developer

**Bug Fixes**

* modified error messages so that they’re all 1-indexed instead of 0-indexed
* fixed double print out of times in commands
* const added to iterators to allow VS2008 compiles
* fixed boundary conditions on ordered tests
* fix for pow as ^ syntax to catch illegal use of vectors (which aren’t supported)
* allow zero-length inputs to multi_normal and multi_student_t with appropriate log prob (i.e., 0)
* fixed bug in inverse-Wishart RNG to match MCMCPack results with slightly asymmetric inputs
* fixed problem with compiling user-defined function twice
* fixed problem with int-only parameters for user-defined functions
* fixed NaN init problems for user-defined functions
* added check that user variable doesn’t conflict with user function + doc
* disallow void argument types in user-defined functions

**Code Cleanup and Efficiency Improvements**

* removed main() from models generated from C++ Stan (they are now available only in CmdStan); removed no_main command options
* reserve vector sizes for saving for sample recorder
* removed many instances of std::cout from API (more still to go)
* removed non-functional Nesterov optimization option
* optimization code refactoring for testing ease
* better constant handling in von Mises distribution
* removed tabs from all source files
* massive re-org of testing to remove redundant files and allow traits-based specializations, plus fixes for 1-indexing

**Testing**

* added tests for log_softmax, multiply_lower_tri_self_transpose, tcrossprod
* break out function signature tests into individual files, add many more
* enhanced cholesky factor tests for round trip transforms and derivatives
* extensive unit testing added for optimization
* remove use of std::cout in all tests

**Example Models**

* lots of cleanup in links and models in ARM examples
* added BUGS litter example with more stable priors than in the BUGS version (the original model doesn’t fit well in BUGS as is, either)

**Documentation**

* add infix operators to manual
* categorical_logit sampling statement
* Cholesky factor with unit diagonal transform example
* example of using linear regression for prediction/forecasting with notes
* clarified some relations of naive Bayes to clustering vs. classification and relation to non-identifiability
* new advice on multivariate priors for quad_form_diag
* fixed typo in multiply_lower_self_transpose (thanks to Alexey Stukalov)
* fixed formatting of reserved names in manual
* fixed typo and clarified effective sample size doc


The post Stan found using directed search appeared first on Statistical Modeling, Causal Inference, and Social Science.

X and I did some “Sampling Through Adaptive Neighborhoods” ourselves the other day and checked out the nearby grave of Stanislaw Ulam, who is buried with his wife, Françoise Aron, and others of her family.

The above image of Stanislaw and Françoise Ulam comes from this charming mini-biography from Roland Brasseur, which I found here.

P.S. The Cimetière Montparnasse is full of famous people, including the great political scientist Raymond Aron (who perhaps has no close relation to Françoise Aron, given that his grave is listed as being in a different block of the cemetery), but the only other one we saw that day was Henri Poincaré.


The post NYC workshop 22 Aug on open source machine learning systems appeared first on Statistical Modeling, Causal Inference, and Social Science.

8:55am: Introduction

9:00am: Liblinear by CJ Lin.

9:30am: Vowpal Wabbit and Learning to Search (John Langford and Hal Daumé III)

10:00am: Break w/ refreshments

10:30am: Torch

11:00am: Theano and PyLearn (Bart van Merrienboer and Frederic Bastien)

11:30am: Stan (Bob Carpenter)

Noon: Lunch on your own

1:00pm: Summary: The systems as they exist

1:20pm: Panel: Can projects coordinate?

2:00pm: Break w/ refreshments

2:30pm: Discussion time for individual projects

4:00pm: Liblinear future plans

4:10pm: Vowpal Wabbit future plans

4:20pm: Torch future plans

4:30pm: Theano and PyLearn future plans

4:40pm: Stan future plans

4:50pm: Conclusion

It’s open to all (I think) as long as you let them know you’ll be coming; just email msrnycrsvp@microsoft.com.


The post “An Experience with a Registered Replication Project” appeared first on Statistical Modeling, Causal Inference, and Social Science.

I came across this blog entry, “An Experience with a Registered Replication Project,” and thought that you would find this interesting.

It’s written by Simone Schnall, a social psychologist who is the first author of an oft-cited Psych Science(!) paper (“Cleanliness reduces the severity of moral judgments”) that a group of researchers from Michigan State failed to replicate as part of a big replication project.

Schnall writes about her experience as the subject of a “failed replication.” I like the tone of her blog entry. She discusses some issues she has with how the replication of her work, and the publication of that replication in a special issue of a journal, were handled. A lot of what she writes is very reasonable.

Schnall believes that her finding did not replicate because there were ceiling effects in a lot of the items in the (direct) replication of her study. So far so good; people can make mistakes in analyzing data from replication studies too. But then she writes the following:

It took me considerable time and effort to discover the ceiling effect in the replication data because it required working with the item-level data, rather than the average scores across all moral dilemmas. Even somebody familiar with a research area will have to spend quite some time trying to understand everything that was done in a specific study, what the variables mean, etc. I doubt many people will go through the trouble of running all these analyses and indeed it’s not feasible to do so for all papers that are published.

She may be talking about the peer-review process here, it’s not entirely clear to me from the context. But what am I to think of her own data-analysis skills and strategies if it takes her “considerable time and effort” to discover the ceiling effects in the data? Looking at the distribution of variables is the very first thing that any researcher should do when they start analyzing their data. Step 1: exploratory data analysis! If she’s right about how this aspect of the data accounts for the failure to replicate her original finding, of course that makes for an interesting story—we need to be critical of failed replications too. But the thought that there may be a substantial number of scientists who don’t look in detail at basic aspects of their data before they start working with aggregated data makes me shake my head in disbelief.
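For readers unfamiliar with the issue, here is a small simulation (all numbers invented, not Schnall’s data) of how a ceiling effect hides in averaged scores while being obvious in item-level distributions:

```python
import random

random.seed(42)

# Responses on a hypothetical 1-9 moral-judgment scale.
def respond(true_mean, sd=1.5, lo=1, hi=9):
    return max(lo, min(hi, round(random.gauss(true_mean, sd))))

n = 2000
control = [respond(8.3) for _ in range(n)]  # item mean already near the top
treated = [respond(8.8) for _ in range(n)]  # manipulation shifts latent mean by 0.5

at_ceiling = sum(r == 9 for r in control) / n
obs_diff = sum(treated) / n - sum(control) / n

# Scores compress against the ceiling, so the observed mean difference
# is much smaller than the 0.5-point latent difference. The item-level
# distribution (lots of 9s) reveals the problem; the averages alone don't.
print(f"control responses at ceiling: {at_ceiling:.2f}")
print(f"observed mean difference: {obs_diff:.2f} (latent difference: 0.50)")
```

With nearly half the control responses already pinned at the scale maximum, even a real effect is badly attenuated, which is why looking at the marginal distribution of each item, not just the aggregate scores, is a standard first step in exploratory data analysis.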

My reply:

On the details of the ceiling effect, all I can say is that I’ve made a lot of mistakes in data analysis myself, so if it really took Schnall longer than it should’ve to discover this aspect of her data, I wouldn’t be so hard on her. Exploratory analysis is always a good idea but it’s still easy to miss things.

But speaking more generally about the issues of scientific communication, I disagree with Schnall’s implication that authors of published papers should have some special privileges regarding the discussion of their work. What about all the researchers who did studies that made no dramatic claims, found no statistical significance, and then didn’t get published? Why don’t they get special treatment too? I think a big problem with the current system is that it puts published work on a plateau where it is difficult to dislodge. For example, Schnall writes, “That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to.” My response to this is that, in many areas of research, peer reviewers do not seem to be deserving of that trust. Peer review often seems to be a matter of researchers approving other papers in their subfield, and accepting claims of statistical significance despite all the problems of multiple comparisons and implausible effect size estimates.

Schnall quotes Daniel Kahneman who writes that if replicators make no attempts to work with authors of the original work, “this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication.” I have mixed feelings about this. Maybe Kahneman is correct about *replication* per se, where there is a well-defined goal of reproducing the original study exactly, but I don’t think it should apply more generally to *criticism*.

Regarding the specifics of her blog post, she has a lot of discussion about how the replication differs from her study. This is all fine, but the flip side is: why should her original study be considered representative of the general population? Once you accept that effect sizes vary, this calls into question the paradigm of making quite general claims based on a study of a particular sample. Finally, I was disappointed that nowhere in her blog post does she consider the possibility that maybe, just maybe, her findings were spurious. She writes, “I have worked in this area for almost 20 years and am confident that my results are robust.” The problem is that such confidence can be self-reinforcing: once you think you’ve found something, you can keep finding it. The rules of statistical significance give a researcher enough “outs” that he or she can persist in a dead-end research paradigm for a long time. Just to be clear, I’m *not* saying that’s what’s happening here—I know nothing about Schnall’s work—but it can happen. Consider, for example, the work of Satoshi Kanazawa.

That said, I understand and sympathize with Schnall’s annoyance at some of the criticisms she’s received. I’ve had similar feelings about criticisms of my work: sometimes people point out actual and serious errors, sometimes mistakes of over-generalization, and other times ways in which I’ve misconstrued the literature. Just recently my colleagues and I made some changes in this paper because someone pointed out some small ways in which it was misleading. Luckily he told us while the paper was still being revised for publication. Other times I learn about problems later and need to issue a correction. But sometimes I do get misinformed criticisms, even published papers where someone misunderstands my work and slams it. And I don’t like it. So I see where she’s coming from.

Again, from a statistical perspective, my biggest problem with what Schnall writes is that she doesn’t seem to consider, at all, the possibility that her original results could be spurious and arise from some combination of measurement error and capitalization on noise which, looked at a certain way, might include some statistically significant comparisons. I worry that she is thinking in a binary way: her earlier paper was a success, the replication was a failure, so she seeks to explain the differences. There are certainly differences between any two experiments on the same general phenomenon—that’s why we do meta-analysis—but, as Kahneman and Tversky demonstrated four decades ago, researchers tend to systematically underestimate the importance of variation in interpreting experimental results.
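To see how easily noise alone can produce apparent “successes,” here is a small simulation (Python; the sample sizes and number of outcomes per study are arbitrary choices of mine, not taken from any real study): every true effect is zero, yet a large share of studies turn up at least one statistically significant comparison.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_outcomes, n_studies = 25, 10, 2000

hits = 0
for _ in range(n_studies):
    # Each study measures several noisy outcomes with zero true effect.
    a = rng.normal(size=(n_outcomes, n))
    b = rng.normal(size=(n_outcomes, n))
    # Two-sample t statistic for each outcome.
    t = (a.mean(1) - b.mean(1)) / np.sqrt(a.var(1, ddof=1)/n + b.var(1, ddof=1)/n)
    if np.any(np.abs(t) > 2.01):  # roughly the p < 0.05 cutoff at ~48 df
        hits += 1

print(f"studies with at least one 'significant' comparison: {hits/n_studies:.2f}")
```

With 10 independent null comparisons per study, roughly 1 − 0.95^10 ≈ 40% of pure-noise studies clear the significance bar somewhere, which is the sense in which capitalization on noise can masquerade as a finding.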

Following up, Salverda writes:

In the meantime, I came across two papers. One is by Schnall (in press, I believe), a response to the paper that failed to replicate her work.

And the people who failed to replicate her work [David Johnson, Felix Cheung, and Brent Donnellan] wrote a paper that addresses her criticism.

The post “An Experience with a Registered Replication Project” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post If it was good enough for Martin Luther King and Laurence Tribe . . . appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>P.S. I miss the old days when people would point me to bad graphs.

The post If it was good enough for Martin Luther King and Laurence Tribe . . . appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post NFL players keep getting bigger and bigger appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Aleks points us to this beautiful dynamic graph by Noah Veltman showing the heights and weights of NFL players over time. The color is pretty but I think I’d prefer something simpler, just one dot per player (with some jittering to handle the discrete reporting of heights and weights). In any case, it’s a great graph. Click on the link to see it in action.

P.S. Even better, once we move to a dynamic scatterplot, would be to use different colors for different positions, and to allow the reader of the graph to highlight different positions. On the linked page, Veltman writes, “the blob separating into multiple groups in the 1990s . . . likely reflects increased specialization of body type by position.” But we should be able to see this directly, no need for speculation, right?

The post NFL players keep getting bigger and bigger appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post A world without statistics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>What would be missing, in a world without statistics?

Science would be pretty much ok. Newton didn’t need statistics for his theories of gravity, motion, and light, nor did Einstein need statistics for the theory of relativity. Thermodynamics and quantum mechanics are fundamentally statistical, but lots of progress could’ve been made in these areas without statistics. The second law of thermodynamics is an observable fact, ditto the two-slit experiment and various experimental results revealing the nature of the atom. The A-bomb and, almost certainly, the H-bomb, maybe these would never have been invented without statistics, but on balance I think most people would feel that the world would be a better place without these particular scientific developments. Without statistics, we could forget about discovering the Higgs boson, etc., but that doesn’t seem like such a loss for humanity.

At a more applied level, statistics helped to win World War 2, most notably in cracking the Enigma code but also in various operations-research efforts. And it’s my impression that “our” statistics were better than “their” statistics. So that’s something.

Where would civilian technology be without statistics? I’m not sure. I don’t have a sense of how necessary statistics was for quantum theory. In a world without statistics, would the study of quantum physics have progressed far enough so that transistors were invented? This one, I don’t know. And without statistics we wouldn’t have modern quality control, so maybe we’d still be driving around in AMC Gremlins and the like. Scary thought, but not a huge deal, I’d think. No transistors, though, that would make a difference in my life. No transistors, no blogging! And I guess we could also forget about various unequivocally beneficial technological innovations such as modern pacemakers, hearing aids, cochlear implants, and Clippy.

Modern biomedicine uses lots and lots of statistics, but would medicine be so much worse without it? I don’t think so, at least not yet. You don’t need statistics to see that penicillin works, nor to see that mosquitos transmit disease and that nets keep the mosquitos out. Without statistics, I assume that various mistakes would get into the system, various ineffective treatments that people think are effective, etc. But on balance I doubt these would be huge mistakes, and the big ones would eventually get caught, with careful record-keeping even without statistical inference and adjustments. Without statistics, biologists would not be able to sequence the genome, and I assume they’d be much slower at developing tools such as tests that allow you to check for chromosomal abnormalities in amnio. I doubt all these things add up to much yet, but I guess there’s promise for the future. Statistics is also necessary for a lot of drug development—right now my colleagues and I are working on a pharmacodynamic model of dosing—but, again, without any of this, it’s not clear the world would be *so* much different.

The Poverty Lab team use statistics and randomized experiments to see what works to help the lives of poor people around the world. That’s cool but I’m not ultimately convinced this all makes a difference in the big picture. Or, to put it another way, I suspect that the statistical validation serves mostly as a way to build political consensus for economic policies that will be effective in sharing the wealth. By demonstrating in a scientific way that Treatment X is effective, this supports the idea that there is a way to help the sort of people who live in what Nicholas Wade would describe as “tribal” societies. So, sure, fine, but in this case the benefits of the statistical methods are somewhat indirect.

Without statistics, we wouldn’t have most of the papers in “Psychological Science,” but I could handle that. Piaget didn’t need any statistics, and I think the modern successors of Piaget could’ve done pretty much what they’ve done without statistics, just by careful observation of major transitions.

Careful observation and precise measurement can be done, with or without statistical methods. Indeed, researchers often use statistics as a *substitute* for careful observation and precise measurement. That is a horrible thing to do, and if you have a clear understanding of statistical theory, you can see why. But statistics is hard, and lots of researchers (and journal editors, news reporters, etc.) don’t have that understanding. When statistics is used as a substitute for, rather than an adjunct to, scientific measurement, we get problems.

OK, here’s another one: no statistics, no psychometrics. That’s too bad but one could make the argument that, on the whole, psychometrics has done more harm than good (value-added assessment, anyone?). Don’t get me wrong—I like psychometrics, and a strong argument could be made that it’s done more good than harm—but my point here is that the net benefit is not clear; a case would have to be made.

Polling. Can’t do it well without statistics. But, would a world without polling be so horrible? Much as I hate to admit it, I don’t think so. Don’t get me wrong, I think polling is on balance a good thing—I agree with George Gallup that measurement of public opinion is an important part of the modern democratic process—but I wouldn’t want to hang too much of the benefits of statistics on this one use, given that I expect lots of people would argue that opinion polls do more harm than good in politics.

**The alternative to good statistics is . . .**

Perhaps the most important benefits of statistics come not from the direct use of statistical methods in science and technology, but rather in helping us learn about the world. Statisticians from Francis Galton and Ronald Fisher onward have used statistics to give us a much deeper understanding of human and biological variation. I can’t see how any non-statistical, mechanistic model of the world could reproduce that level of understanding. Forget about p-values, Bayesian inference, and the rest: here I’m simply talking about the nature of correlation and variation.

For a more humble example, consider Bill James. Baseball is a silly example, sure, but the point is to see how much understanding has been gained in this area through statistical measurement and comparison. As James so memorably wrote, the alternative to good statistics is not “no statistics,” it’s “bad statistics.” James wrote about baseball commentators who would make asinine arguments which they would back up by picking out numbers without context. In politics, the equivalent might be a proudly humanistic pundit such as New York Times columnist David Brooks supporting his views by just making up numbers or featuring various “too good to be true” statistics and not checking them.

So here’s one benefit to the formal study of statistics: Without any statistics, there still would be *numbers*, along with people trying to interpret them.

Could governments and large businesses be managed well without statistics? I’m not sure. Given that half the U.S. Congress seems willing to shut down the government from time to time, it’s not clear that any agreement on the numbers will have much to do with political action. Similarly, all the statistics in the world don’t seem to be stopping the euro-zone from drifting. But maybe things would be much worse without a common core of statistical agreement. I don’t know; unfortunately this seems like the sort of causal question that is too difficult for statistics to answer.

Finally, one way that statistics is potentially having a huge impact in our lives is through the measurement of global warming and all the rest. But I’m guessing that a lot of this could be done with a pre-statistical understanding. The basic physics is already there, as would be the careful measurements. Statistical modeling is certainly relevant to the study of climate change—if you’re trying to reconstruct historical climate conditions from tree-ring data, it’s tough enough to do it *with* statistical modeling, I can’t imagine how it could be done otherwise—but the basic patterns of carbon dioxide, temperature, melting ice, etc., are apparent in any case. And, even with statistics, much uncertainty remains.

**Summary**

When I started writing this post, I was thinking that statistics doesn’t really matter, but I think that’s because I was focusing on some of the more highly-publicized but less beneficial applications of statistics: the use of statistical experimentation and inference to get p-values for tabloid-bait scientific papers, or for Google, Amazon, etc., to perfect their techniques for squeezing money out of their customers or, even at best, to test a medical treatment that increases survival rate for some rare disease by 2 percentage points. But statistics is central to how we think about the world. I still think that statistics is much less central to our lives than, say, chemistry. But it ain’t nothing.

The post A world without statistics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Battle of the cozy comedians: What’s Alan Bennett’s problem with Stewart Lee? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>When in London awhile ago I picked up the book, “How I Escaped My Certain Fate: The Life and Deaths of a Stand-Up Comedian,” by Stewart Lee. I’d never heard of the guy but the book was sitting there, it had good blurbs, and from a quick flip-through it looked interesting. Now that I’ve read the whole thing, I can confirm that it really is interesting. I recommend it. Along with transcripts of some of his comedy routines—which aren’t particularly funny most of the time, at least not on the page—he has lots of discussion of what works and what doesn’t work on stage and how he wants to communicate with his audience. It all reminds me a lot of the things I think about when giving statistics talks. I mean, sure, Lee is much much more of a pro than I’ll ever be, but a lot of his issues resonate with me too. In particular there’s the idea of wanting a laugh but not a cheap laugh (which in a technical talk corresponds to the goal of transmitting the excitement and importance of one’s work without lapsing into Ted-talk hype) and various tactics of engaging the audience.

Also the idea that there is no single optimal style, that your approach to presentation, like a diaper, needs to be regularly changed to stay fresh.

Lee’s book was also interesting because he gives off a regular-guy vibe, sort of like the essayist David Owen, who gives the impression of being an earnest person, not a deep or particularly quick thinker, more like a gentle guide who can plod along with the reader at his or her own pace. He’s not a true original like George Carlin or brilliant like Chris Rock, more of a guy who’s doing his best every day, and with a pleasant self-awareness that elevates his work.

So that was that. But then one day I read this offhand remark from Alan Bennett:

Peter [Cook] . . . was already in 1960 established as a successful sketch writer for revues in the West End. This meant that at that time he had no wish to offend an audience and shied away from sketches that did. It was only later in his career that, as his humour became more anarchic and audiences in their turn more fawning and in on the joke, he ceased to care. Showbiz dies hard and in these toothless stand-up days I think Peter might just have liked Jeremy Hardy but would have drawn the line at Stewart Lee.

I can’t be sure, but it sounds like Bennett considers Lee to be a bit tacky. Just as Greg Mankiw used his late grandmother as a mouthpiece for his distaste for Sonia Sotomayor, Bennett seems to be using his late colleague Cook to diss Lee.

I honestly have no idea what’s going on here. To my American eyes, Lee and Bennett seem very similar: two cozy left-wing English comedy writer/performers, successful but self-deprecating . . . really it’s hard for me to see much difference. OK, Bennett is gay while Lee is a sensitive heterosexual, but that can’t be the whole story. There must be something else going on: maybe Lee is too “middlebrow” for Bennett? Or maybe it’s the opposite, that Bennett sees Lee as one of those kids who doesn’t know how it’s really done?

Could any of our English readers inform me on this one? It’s no big deal but I hate being baffled like this.

The post Battle of the cozy comedians: What’s Alan Bennett’s problem with Stewart Lee? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Skepticism about a published claim regarding income inequality and happiness appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I read your Chance article (disproving that no one reads Chance!) re communicating about flawed psychological research. And I know from your other writings of your continuing good fight against misleading quantitative work. I think you and your students might be interested on my recent critique of a 2011 paper published in Psychological Science, “Income Inequality and Happiness” by Shigehiro Oishi, et al. The critique is here.

The blog post demonstrates that treating ordinal numbers with respect along with an eye to robustness leads to contrary conclusions – and a more interesting conjecture. If nothing else for your students, the post is an example of how an applied statistician thinks.

I have emailed the three authors and editor. I don’t expect to see a retraction. But maybe someone will pick up on the recommendations. We’ll see.

De Libero’s critique is worth reading. Lots of interesting points, could be a good example for a statistics class, if the instructor is looking for something that, unlike typical textbook analyses, does not have a simple clean story. Also there seem to be problems with this paper published in Psychological Science, but that’s hardly news. . . .

P.S. In the old days I would’ve crossposted this on the sister blog. But now they don’t like running duplicate material, and so I thought it better to post in this space, since here we get good discussions in the comments.

The post Skepticism about a published claim regarding income inequality and happiness appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Battle of the cozy comedians: What’s Alan Bennett’s problem with Stewart Lee?

**Wed:** A world without statistics

**Thurs:** NFL players keep getting bigger and bigger

**Fri:** “An Experience with a Registered Replication Project”

**Sat, Sun:** As Chris Hedges would say: Don’t worry, baby

The post On deck for the rest of the summer appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>And, scheduled for Labor Day:

- Bad Statistics: Ignore or Call Out?

Enjoy your summer! Unless you live in the southern hemisphere, in which case, Merry Christmas.

The post On deck for the rest of the summer appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Differences between econometrics and statistics: From varying treatment effects to utilities, economists seem to like models that are fixed in stone, while statisticians tend to be more comfortable with variation appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I had an interesting discussion with Peter Dorman (whose work on assessing the value of a life we discussed in this space a few years ago).

The conversation started when Peter wrote me about his recent success using hierarchical modeling for risk analysis. He wrote, “Where have they [hierarchical models] been all my life? In decades of reading and periodically doing econometrics, I’ve never come across this method.”

I replied that it’s my impression that economists are trained to focus on estimating a single quantity of interest, whereas multilevel modeling is appropriate for estimating many parameters. Economists *should* care about variation, of course; indeed, variation could well be said to be at the core of economics, as without variation of some sort there would be no economic exchanges. There are good reasons for focusing on point estimation of single parameters—in particular, if it’s hard to estimate a main effect, it is typically even more difficult to estimate interactions—but if variations are important, I think it’s important to model and estimate them.
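Here is a toy sketch of the partial-pooling idea behind hierarchical modeling (Python; all the parameter values are made up, and for simplicity the population mean and sd are treated as known rather than estimated): many group effects are estimated at once, each shrunk toward the common mean, and the shrunken estimates beat the raw per-group estimates.

```python
import numpy as np

rng = np.random.default_rng(5)

J, sigma = 30, 1.0                    # number of groups, sampling sd (assumed)
mu, tau = 0.0, 0.5                    # population mean and sd (assumed known)
theta = rng.normal(mu, tau, size=J)   # true group effects
y = rng.normal(theta, sigma)          # one noisy estimate per group

# Partial pooling: shrink each y_j toward mu by a factor determined by the
# ratio of between-group variance to total variance.
w = tau**2 / (tau**2 + sigma**2)
theta_hat = mu + w * (y - mu)

err_raw = np.mean((y - theta)**2)
err_pooled = np.mean((theta_hat - theta)**2)
print(f"MSE no pooling: {err_raw:.3f}, partial pooling: {err_pooled:.3f}")
```

This is exactly the setting where focusing on a single parameter misses the point: the gain comes from estimating all J effects jointly.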

Awhile later, Peter sent me this note:

I’ve been mulling the question about economists’ obsession with average effects and posted this on EconoSpeak. I could have said much more but decided to save it for another day. In particular, while the issue of representative agents has come up in the context of macroeconomic models, I wonder how many noneconomists — and even how many economists — are aware that the same approach is used more or less universally in applied micro. The “model” portion of a typical micro paper has an optimization model for a single agent or perhaps a very small number of interacting agents, and the properties of the model are used to justify the empirical specification. This predisposes economists to look for a single effect that variations in one factor have on variations in another. But the deeper question is why these models are so appealing to economists but less attractive (yes?) to researchers in other disciplines.

I responded:

There is the so-called folk theorem which I think is typically used as a justification for modeling variation using a common model. But more generally economists seem to like their models and then give after-the-fact justification. My favorite example is modeling uncertainty aversion using a nonlinear utility function for money, in fact in many places risk aversion is _defined_ as a nonlinear utility function for money. This makes no sense on any reasonable scale (see, for example, section 5 of this little paper from 1998, but the general principle has been well-known forever, I’m sure), indeed the very concept of a utility function for money becomes, like a rainbow, impossible to see if you try to get too close to it—but economists continue to use it as their default model. This bothers me. I don’t think it’s like physicists starting by teaching mechanics with a no-friction model and then adding friction. I think it’s more like, ummm, I dunno, doing astronomy with Ptolemy’s model and epicycles. The fundamentals of the model are not approximations to something real, they’re just fictions.
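The problem with equating risk aversion with a nonlinear utility function for money can be seen with a little arithmetic, along the lines of the well-known calibration argument. Here is a sketch (Python; the gamble and wealth levels are illustrative numbers of mine) with a log-utility agent facing a 50/50 gamble to gain $110 or lose $100:

```python
import numpy as np

def accepts_gamble(wealth, gain=110.0, loss=100.0):
    """Log-utility agent: accept the 50/50 gamble iff expected log wealth rises."""
    eu = 0.5 * np.log(wealth + gain) + 0.5 * np.log(wealth - loss)
    return eu > np.log(wealth)

# Algebra: E[log wealth] > log(W)  <=>  (W+110)(W-100) > W^2  <=>  W > 1100.
for w in [500.0, 1_000.0, 2_000.0, 100_000.0]:
    print(f"wealth ${w:>9,.0f}: {'accept' if accepts_gamble(w) else 'reject'}")
```

So a log-utility agent rejects this gamble only if total wealth is under $1,100; yet plenty of people far wealthier than that decline such bets, which suggests the curvature-of-utility story is not describing what is actually going on.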

Peter answered:

So my deep theory goes like this: the vision behind all of neoclassical economics post 1870 is a unified normative-positive theory. The theory of choice (positive) is at the same time a theory of social optimality. This is extremely convenient, of course. The problem, which has only grown over time, is that the assumptions needed for this convergence, the central role assigned to utility (which is where positive and normative meet) and its maximization, either devolve into tautology or are vulnerable to disconfirmation. I suspect that this is unavoidable in a theory that attempts to be logically deductive, but isn’t blessed, as physics is, by the highly ordered nature of the object of study. (Physics really does seem to obey the laws of physics, mostly.)

I’ve come to feel that utility is the original sin, so to speak. I really had to do some soul-searching when I wrote my econ textbooks, since if I said hostile things about utility no one would use them. I decided to self-censor: it’s simply not a battle that can be won on the textbook front. Rather, I’ve come to think that the way to go at it is to demonstrate that it is still possible to do normatively meaningful work without utility — to show there’s an alternative. I’m convinced that economists will not be willing to give this up as long as they think that doing so means they can’t use economics to argue for what other people should or shouldn’t do. (This also has connections to the way economists see their work in relation to other approaches to policy, but that’s still another topic.)

And I’ve been thinking more about your risk/uncertainty example. Your approach is to look for regularity in the data (observed choices) which best explains and predicts. I’m with you. But economists want a model of choice behavior based on subjective judgments of whether one is “better off”, since without this they lose the normative dimension. This is a costly constraint.

There is an interesting study to be written — maybe someone has already written it — on the response by economists to the flood of evidence for hyperbolic discounting. This has not affected the use of observed interest rates for present value calculation in applied work, and choice-theoretic (positive) arguments are still enlisted to justify the practice. Yet, to a reasonable observer, the normative model has diverged dramatically from its positive twin. This looks like an interesting case of anomaly management.
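For readers who haven’t seen the divergence Peter describes, here is a minimal numerical comparison (Python; the rate r and curvature k are arbitrary values of mine) of the exponential discounting used in present-value calculations with the hyperbolic form suggested by choice data: the hyperbolic discounter is more impatient at short delays but far less so at long ones.

```python
# Exponential discounting: constant per-period rate, as in applied
# present-value work. Hyperbolic: discount factor 1/(1 + k*t).
def exponential(t, r=0.10):
    return (1 + r) ** -t

def hyperbolic(t, k=0.25):
    return 1 / (1 + k * t)

for t in [0, 1, 5, 20, 50]:
    print(f"t={t:>2}: exponential={exponential(t):.3f}  hyperbolic={hyperbolic(t):.3f}")
```

The ratio of adjacent-period discount factors is constant under the exponential form but rises toward 1 under the hyperbolic form, which is what generates the preference reversals the positive model has to explain away.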

Lots to think about here (also related to this earlier discussion).

The post Differences between econometrics and statistics: From varying treatment effects to utilities, economists seem to like models that are fixed in stone, while statisticians tend to be more comfortable with variation appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Ethics and statistics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>As hard as it is to do, I thought it was good to try and define what exactly makes for an ethical violation. Your third point noted that it needed to break some sort of rule. Could you elaborate on this idea in the context of statistical rules? From my understanding, most statistical rules are not 0 or 1, but somewhere in between. (Removing an outlier comes to mind as an example).

He was responding to my statement that “An ethics problem arises when you are considering an action that (a) benefits you or some cause you support, (b) hurts or reduces benefits to others, and (c) violates some rule.”

I thought the bit about violating a rule was necessary because it’s generally considered acceptable to try to get more for yourself, if you’re doing so within the context of an accepted set of rules. Here I wasn’t thinking so much of statistical rules (for example, the idea that for statistical significance you need p=0.05 not p=0.06) but rather social rules. But maybe there’s more to be said on this.

The big new idea in my talk (which, unfortunately, I didn’t get to during the 20 minutes that were allocated to me) is near the end of the presentation, when I suggest that mainstream statistical methods (Bayes included) can themselves be unethical. Maybe this will be the subject of a future Chance column.

P.S. One difficulty in posting slides is that they can be misleading without the accompanying speech. In particular, near the end of the slides I show the notorious third-degree polynomial regression discontinuity fit, under the headline, “Find the ethical problem!” Just to be clear, let me explain that I think the ethical problem here is not with the people who did the analysis and made the graph; rather, I think the ethical problem arises in our scientific publication system itself, which rewards dramatic claims based on statistical significance and dis-incentivizes more realistic, sober assessment of evidence. Also contributing to the ethical problem has been the publication of papers recommending something as goofy as this sort of high-degree polynomial fit.

The post Ethics and statistics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “The Europeans and Australians were too eager to believe in renal denervation” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The story, though, is not boring. Paul Alper writes:

I just came across this in the NYT.

Here is the NEJM article itself:

And here is the editorial in the NEJM:

The gist is that on the basis of previous studies without a control arm, renal denervation was thought to be a blockbuster treatment for those suffering from very high blood pressure. The randomized clinical trial with a sham procedure as the control (placebo) found that the effect seems to be mainly psychological. I suppose the moral of the story is that unless there is a control arm, enthusiasm must be tempered. Note too that the U.S. FDA comes out appearing properly slow and skeptical while the Europeans and Australians were too eager to believe in renal denervation.

Paul followed up with this:

Our local newspaper today had an AP article about drugs which lower cholesterol but speculated that despite the lower cholesterol, the lowering of mortality by the drugs is yet to be proved.

To see whether you have already blogged about “surrogate criteria,” I googled *surrogate Gelman* and came up with this and this, which contains this sentence:

The objective of NRL’s research is to develop tissue surrogate materials that simulate the mechanical and acoustical properties of biological tissues. These are then assembled into an experimental test system of the human thorax, called “GelMan,” for assessing blunt forces and blast dynamics.

Hey, I used to work at NRL!

Paul continued:

While I still don’t know what a “GelMan” is, I also found this by a Gelman which has to do with vomiting.

I assume he was referring to my encounter with the Barfblog.

Paul continued:

Note the reference to a “Fisher,” but not R.A.

Just for giggles, I googled *surrogate Alper* and sure enough found many web sites for conceiving a new life and dealing with the end of life by people who have Alper as a last name. One wonders if *surrogate any-name-what-so-ever* would also produce a bunch of google hits.

Hmmm, let’s see what comes up:

Cool—it works!

P.S. I would’ve given this post the title, “GelMan Thoracic Surrogate for Underwater Threat Neutralization,” but that wouldn’t have been boring enough.

The post “The Europeans and Australians were too eager to believe in renal denervation” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan World Cup update appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**0. Background**

As you might recall, the estimated team abilities were reasonable, but the model did not fit the data, in that when I re-simulated the game outcomes using the retrospectively fitted parameters, my simulations were much closer than the actual games. To put it another way, many more than 1/20 of the games fell outside their 95% predictive intervals.
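That sort of misfit can be quantified directly by computing the empirical coverage of the intervals. Here's a minimal sketch (my illustration, not code from the original analysis; the vectors are hypothetical stand-ins for the simulated interval endpoints and the actual score differentials):

```r
# Check of 95% interval coverage: with a well-calibrated model,
# roughly 5% of observed outcomes should fall outside their intervals.
coverage_check <- function(observed, lower, upper) {
  outside <- observed < lower | observed > upper
  mean(outside)  # proportion of games outside their 95% intervals
}

# Toy example: intervals that are far too narrow for the data
observed <- c(-3, 0, 1, 4, -2, 2, 5, -1)
lower    <- rep(-1, 8)
upper    <- rep( 1, 8)
coverage_check(observed, lower, upper)  # 0.625, far above the nominal 0.05
```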

**1. Re-fitting on the original scale**

This was bugging me. In some way, the original post, which concluded with “my model sucks,” made an excellent teaching point. Still and all, it was a bummer.

So, last night as I was falling asleep, I had the idea of re-fitting the model on the original scale. Maybe the square-root transformation was compressing the data so much that the model couldn’t fit. I wasn’t sure how this could be happening but it seemed worth trying out.

So, the new Stan program, worldcup_raw_matt.stan:

data {
  int nteams;
  int ngames;
  vector[nteams] prior_score;
  int team1[ngames];
  int team2[ngames];
  vector[ngames] score1;
  vector[ngames] score2;
  real df;
}
transformed data {
  vector[ngames] dif;
  dif <- score1 - score2;
}
parameters {
  real b;
  real<lower=0> sigma_a;
  real<lower=0> sigma_y;
  vector[nteams] eta_a;
}
transformed parameters {
  vector[nteams] a;
  a <- b*prior_score + sigma_a*eta_a;
}
model {
  eta_a ~ normal(0,1);
  for (i in 1:ngames)
    dif[i] ~ student_t(df, a[team1[i]]-a[team2[i]], sigma_y);
}

Just the same as the old model but without the square root stuff.

And then I appended to my R script some code to fit the model and display the estimates and residuals:

# New model 15 Jul 2014: Linear model on original (not square root) scale
fit <- stan_run("worldcup_raw_matt.stan", data=data, chains=4, iter=5000)
print(fit)
sims <- extract(fit)
a_sims <- sims$a
a_hat <- colMeans(a_sims)
a_se <- sqrt(colVars(a_sims))
library("arm")
png("worldcup7.png", height=500, width=500)
coefplot(rev(a_hat), rev(a_se), CI=1, varnames=rev(teams),
  main="Team quality (estimate +/- 1 s.e.)\n", cex.var=.9, mar=c(0,4,5.1,2))
dev.off()
a_sims <- sims$a
sigma_y_sims <- sims$sigma_y
nsims <- length(sigma_y_sims)
random_outcome <- array(NA, c(nsims,ngames))
for (s in 1:nsims){
  random_outcome[s,] <- (a_sims[s,team1] - a_sims[s,team2]) + rt(ngames,df)*sigma_y_sims[s]
}
sim_quantiles <- array(NA, c(ngames,2))
for (i in 1:ngames){
  sim_quantiles[i,] <- quantile(random_outcome[,i], c(.025,.975))
}
png("worldcup8.png", height=1000, width=500)
coefplot((score1 - score2)[new_order]*flip, sds=rep(0, ngames),
  lower.conf.bounds=sim_quantiles[new_order,1]*flip,
  upper.conf.bounds=sim_quantiles[new_order,2]*flip,
  varnames=ifelse(flip==1,
    paste(teams[team1[new_order]], "vs.", teams[team2[new_order]]),
    paste(teams[team2[new_order]], "vs.", teams[team1[new_order]])),
  main="Game score differentials\ncompared to 95% predictive interval from model\n",
  mar=c(0,7,6,2), xlim=c(-6,6))
dev.off()

And here's what I got:

And this looks just fine, indeed in many ways better than before: not only does the model now fit, but the team ability parameters are also directly interpretable. According to the fit, Brazil, Argentina, and Germany are estimated to be 1 goal better than the average team (in expectation), with Australia, Honduras, and Cameroon being 1 goal worse than the average.

**2. Debugging**

But this bothered me in another way. Could those square-root-scale predictions have been that bad? I can't believe it. Back to the code. I look carefully at the transformation in the Stan model:

transformed data {
  vector[ngames] dif;
  vector[ngames] sqrt_dif;
  dif <- score1 - score2;
  for (i in 1:ngames)
    sqrt_dif[i] <- (step(dif[i]) - .5)*sqrt(fabs(dif[i]));
}

D'oh! That last line is wrong: it's missing a factor of 2. Stan doesn't have a sign() function so I hacked something together using "step(dif[i]) - .5". But this expression takes on the value +.5 if dif is positive or -.5 if dif is negative (zero doesn't really matter because it all gets multiplied by sqrt(fabs(dif)) anyway), so my "signed square roots" were all half the size they should have been. Nonononononononono.

Nononononononononononononononono.

Nonononononononononononononononononononononono.

Damn.

OK, I fix the code:

transformed data {
  vector[ngames] dif;
  vector[ngames] sqrt_dif;
  dif <- score1 - score2;
  for (i in 1:ngames)
    sqrt_dif[i] <- 2*(step(dif[i]) - .5)*sqrt(fabs(dif[i]));
}
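Just to convince myself, here's the arithmetic checked in R (my own sketch, mimicking Stan's step() function, which is 1 for nonnegative arguments and 0 otherwise):

```r
# step(x) as defined in Stan: 1 if x >= 0, else 0
step <- function(x) as.numeric(x >= 0)

dif <- c(-3, -1, 0, 2, 4)
buggy <- (step(dif) - .5) * sqrt(abs(dif))      # off by a factor of 2
fixed <- 2 * (step(dif) - .5) * sqrt(abs(dif))  # matches sign(dif)*sqrt(abs(dif))

all.equal(fixed, sign(dif) * sqrt(abs(dif)))  # TRUE
all.equal(buggy, fixed / 2)                   # TRUE: the signed roots were halved
```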

I rerun my R script from scratch. Stan crashes R. Interesting---I'll have to track this down. But not right now.

I restart R and run. Here are the results from the fitted model on the square root scale:

And here are the predictions and the game outcomes:

I'm not quite sure about this last graph, but I gotta go now, so I'll post and maybe look at the code later if I have time.

**3. Conclusions**

My original intuition, that I could estimate team abilities by modeling score differentials on the square root scale, seems to have been correct. In my previous post I'd reported big problems with predictions, but that's because I'd dropped a factor of 2 in my code. These things happen. Modeling the score differences on the original scale seems reasonable too. It's easy to make mistakes, and it's good to check one's model in various ways. I'm happy that in my previous post I was able to note that the model was wrong, even if at the time I hadn't yet found the bug. It's much better to be wrong and know you're wrong than to be wrong and not realize it.

Finally, now that the model fits ok, one could investigate all sorts of thing by expanding it in various ways. But that's a topic for another day (not soon, I think).

The post Stan World Cup update appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Building on theories used to describe magnets, scientists have put together a model that captures something very different . . .” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>It’s time to tell you all: Hari Seldon never existed.

Here’s what I think of these stories of physicists who discover the laws of society. I think they’re comparable to the stories we used to hear about, of the eccentric inventor who built a car engine that got 200 mpg, or ran on water, but was then suppressed by GM because they couldn’t handle the competition. Or, if we want to go Heinlein, some guy who builds a time machine in his garage out of random spare parts.

Or maybe I should put it in a way that everyone can understand. Look at Richard Feynman, master swordsman and physicist, the guy who was so much smarter and more charming than everyone else, the guy who bragged that when he took up biology as a hobby he could do better than the biologists, who bragged that when he took up painting as a hobby he was an awesome painter. Even *Richard Feynman* did not think he could discover laws of social behavior. Sure, maybe he had better things to do. But, just maybe, maybe . . . he realized it wasn’t gonna happen.

I was reminded of this recently when Peter Dodds pointed me to this news article by Nathan Collins, entitled “Physics Predicts U.S. Voting Patterns”:

Building on theories used to describe magnets, scientists have put together a model that captures something very different: voting patterns in U.S. presidential elections. . . . For example, a person who voted Republican in one election and lives in a politically neutral county but works in a heavily Democratic county would likely vote for a Democrat in the next election. . . . the researchers focused on the distribution of Republican margins of victory across U.S. counties as well as how correlations between two counties’ vote shares changed with the distance separating them, quantities more commonly used to describe the transition from a demagnetized block of iron to a magnetized one. . . . the researchers predicted a bell curve distribution of county-level margins of victory and surprisingly long-range correlations between counties; that suggests that some counties, at least, could feel the effects of social pressures in counties on the other side of the nation, they report this month in Physical Review Letters. What’s more, those patterns were a close match to the actual election data . . .

I disagree with this last claim. Their model seems to reproduce a normal distribution of county-level vote shares with spatial correlation. All the stuff about magnets and social interactions and a “noisy diffusive process” and “anysotropic coupling topology” are pretty much irrelevant.

The link above goes to an abstract only, but a quick search revealed the preprint, Is the Voter Model a model for voters?, by Juan Fernández-Gracia, Krzysztof Suchecki, José Ramasco, Maxi San Miguel, and Víctor Eguíluz. So you can read it yourself if you’d like (but I don’t recommend it).

It’s a bit sad to see *Science* and *Physical Review Letters* falling for this sort of thing, but I guess the story is so appealing they just can’t help it. It’s a great narrative, that the laws of social nature are just out there, waiting to be figured out by some physicist.

**The big picture**

Maybe it would help to explain what I’m *not* saying here. I’m not saying that outsiders can’t or shouldn’t make contributions to social science, nor am I saying that complicated mathematical models from physical science shouldn’t be used to model social behavior, voting included.

Indeed, I myself am a former physicist, and in 2002 I published a paper (with Jonathan Katz and Francis Tuerlinckx) entitled, “The Mathematics and Statistics of Voting Power,” reproducing certain features of elections using stochastic processes defined on a tree of voters. In retrospect, I think maybe this paper could’ve been published in a physics journal! One of the models even has a phase transition—it’s based on the Ising model from statistical physics—and we have tons of math (see, for example, pages 430 and 432 of the linked paper). So it’s not that it can’t be done, but the work has to be evaluated on its own terms. The physics connection isn’t enough on its own.

To get back to my introduction to this post, there’s sometimes a pattern in the science media of naive beliefs about what mathematical modeling can do. We’ve seen it in the uncritical celebration by Malcolm Gladwell of the uncalibrated claims by John Gottman that he could use mathematical models to predict divorces. We even see it in some of the media treatment of Nate Silver, when people act as if he has some secret formula rather than (as Nate himself insists) simply using good data, sensible models, and hard work. And we saw it a few years earlier in the celebration of Steve Levitt as a guy whose statistical sumo-wrestling skills could be used to catch terrorists. Again, let me be clear: Nate Silver and Steve Levitt do excellent work; I’m not criticizing them; rather, I’m criticizing the attitude that they’re doing something magic.

The post “Building on theories used to describe magnets, scientists have put together a model that captures something very different . . .” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Questions about “Too Good to Be True”

**Wed:** “The Europeans and Australians were too eager to believe in renal denervation”

**Thurs:** Ethics and statistics

**Fri:** Differences between econometrics and statistics: From varying treatment effects to utilities, economists seem to like models that are fixed in stone, while statisticians tend to be more comfortable with variation

**Sat, Sun:** As Jonah Lehrer would say: Something is happening but you don’t know what it is

The post Stan London Meetup 16 July appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The Stan Development Team is happy to announce the first Stan London Meetup,

Wednesday, July 16th, 6-8 PM

Bentham House, Seminar Room 4

4-8 Endsleigh Gardens, London, WC1H 0EG

Nominally the plan is to begin with a casual introduction to Stan and then break out into discussion based on the interests of the attendees. Bring statistical questions, broken models you’d like to debug, and working models you’d like to share.

If you’re planning on attending please send me an email to RSVP so that in the rare case of excessive interest we can move to a bigger room. We look forward to meeting everyone!

The Stan London Meetup is committed to maintaining an inclusive, respectful, and safe environment for all attendees. If you encounter any inconsiderate behavior, please report it to the organizers.

“Bentham House, 4-8 Endsleigh Gardens,” huh? Can’t get much more London than that.

The post Stan London Meetup 16 July appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan goes to the World Cup appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>It didn’t work so well, but I guess that’s part of the story too.

All the data and code are here.

**Act 1: The model, the data, and the fit**

My idea was as follows: I’d fit a simple linear item response model, using the score differentials as data (ignoring the shoot-outs). I also have this feeling that when the game is not close the extra goals don’t provide as much information, so I’ll fit the model on the square-root scale.

The model is simple: if team i and team j are playing and score y_i and y_j goals, respectively, then the data point for this game is y_ij = sign(y_i - y_j)*sqrt(|y_i - y_j|), and the data model is:

y_ij ~ normal(a_i-a_j, sigma_y),

where a_i and a_j are the ability parameters (as they say in psychometrics) for the two teams and sigma_y is a scale parameter estimated from the data. But then before fitting the model I was thinking of occasional outliers such as that Germany-Brazil match so I decided that a t model could make more sense:

y_ij ~ t_df (a_i-a_j, sigma_y),

setting the degrees of freedom to df=7 which has been occasionally recommended as a robust alternative to the normal.
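As a concrete sketch of the data transformation (my own illustration, not code from the analysis), the signed square root and its inverse look like this in R:

```r
# Signed square root: compresses large score differentials
# while preserving which team won
signed_sqrt <- function(y) sign(y) * sqrt(abs(y))

# Its inverse, for mapping model predictions back to the goal scale
signed_square <- function(a) sign(a) * a^2

y <- c(-7, -1, 0, 1, 7)   # e.g., a 7-1 blowout is +/-6 goals or more
z <- signed_sqrt(y)
round(z, 2)                        # -2.65 -1.00  0.00  1.00  2.65
all.equal(signed_square(z), y)     # TRUE: the transform is invertible
```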

(Bob Carpenter hates this sort of notation because I’m using underbar (_) in two different ways: For y_i, y_j, y_ij, a_i, and a_j, the subscripts represent indexes (in this case, integers between 1 and 32). But for sigma_y, the subscript “y” is a name indicating that this is a scale parameter for the random variable y. Really for complete consistency we should use the notation sigma_”y” to clarify that this subscript is the name, not the value, of a variable.)

[Ed. (Bob again): How about y[i] and y[i,j] for the indexing? Then the notation’s unambiguous. I found the subscripts in Gelman and Hill’s regression book very confusing.]

Getting back to the model . . . There weren’t so many World Cup games so I augmented the dataset by partially pooling the ability parameters toward a prior estimate. I used the well-publicized estimates given by Nate Silver from last month, based on something called the Soccer Power Index. Nate never says exactly how this index is calculated so I can’t go back to the original data, but in any case I’m just trying to do something basic here so I’ll just use the ranking of the 32 teams that he provides in his post, with Brazil at the top (getting a “prior score” of 32) and Australia at the bottom (with a “prior score” of 1). For simplicity in interpretation of the parameters in the model I rescale the prior score to have mean 0 and standard deviation 1/2.

The model for the abilities is then simply,

a_i ~ N(mu + b*prior_score_i, sigma_a).

(I’m using the Stan notation where the second argument to the normal distribution is the scale, not the squared scale. In BDA we use the squared scale as is traditional in statistics.)

Actually, though, all we care about are the relative, not the absolute, team abilities, so we can just set mu=0, so that the model is,

a_i ~ N(b*prior_score_i, sigma_a).

At this point we should probably add weakly informative priors for b, sigma_a, and sigma_y, but I didn’t bother. I can always go back and add these to stabilize the inferences, but 32 teams should be enough to estimate these parameters so I don’t think it will be necessary.

And now I fit the model in Stan, which isn’t hard at all. Here’s the Stan program worldcup.stan:

data {
  int nteams;
  int ngames;
  vector[nteams] prior_score;
  int team1[ngames];
  int team2[ngames];
  vector[ngames] score1;
  vector[ngames] score2;
  real df;
}
transformed data {
  vector[ngames] dif;
  vector[ngames] sqrt_dif;
  dif <- score1 - score2;
  for (i in 1:ngames)
    sqrt_dif[i] <- (step(dif[i]) - .5)*sqrt(fabs(dif[i]));
}
parameters {
  real b;
  real<lower=0> sigma_a;
  real<lower=0> sigma_y;
  vector[nteams] a;
}
model {
  a ~ normal(b*prior_score, sigma_a);
  for (i in 1:ngames)
    sqrt_dif[i] ~ student_t(df, a[team1[i]]-a[team2[i]], sigma_y);
}

(Sorry all the indentation got lost. I’d really like to display my code and computer output directly but for some reason the *code* tag in html compresses whitespace. So annoying.)

[Ed. (aka, Bob): use the "pre" tag rather than the "code" tag. I just fixed it for this blog post.]

I didn’t get the model working right away—I had a few typos and conceptual errors (mostly dealing with the signed square root), which I discovered via running the model, putting in print statements, checking results, etc. The above is what I ended up with. The whole process of writing the model and fixing it up took 20 minutes, maybe?

And here’s the R script:

library("rstan")
teams <- as.vector(unlist(read.table("soccerpowerindex.txt", header=FALSE)))
nteams <- length(teams)
prior_score <- rev(1:nteams)
prior_score <- (prior_score - mean(prior_score))/(2*sd(prior_score))
data2012 <- read.table("worldcup2012.txt", header=FALSE)
ngames <- nrow(data2012)
team1 <- match(as.vector(data2012[[1]]), teams)
score1 <- as.vector(data2012[[2]])
team2 <- match(as.vector(data2012[[3]]), teams)
score2 <- as.vector(data2012[[4]])
df <- 7
data <- c("nteams","ngames","team1","score1","team2","score2","prior_score","df")
fit <- stan_run("worldcup.stan", data=data, chains=4, iter=2000)
print(fit)

(I'm using stan_run() which is a convenient function from Sebastian that saves the compiled model and checks for it so I don't need to recompile each time. We're planning to incorporate this, and much more along these lines, into rstan 3.)

This model fits ok and gives reasonable estimates:

Inference for Stan model: worldcup.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean    sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
b         0.46    0.00  0.10   0.26   0.40   0.46   0.53   0.66  1292    1
sigma_a   0.15    0.00  0.06   0.05   0.10   0.14   0.19   0.29   267    1
sigma_y   0.42    0.00  0.05   0.33   0.38   0.41   0.45   0.53  1188    1
a[1]      0.35    0.00  0.13   0.08   0.27   0.36   0.44   0.59  4000    1
a[2]      0.39    0.00  0.12   0.16   0.32   0.39   0.47   0.66  4000    1
a[3]      0.44    0.00  0.14   0.19   0.34   0.43   0.53   0.74  1101    1
a[4]      0.19    0.00  0.17  -0.19   0.10   0.22   0.31   0.47  1146    1
a[5]      0.29    0.00  0.14   0.02   0.20   0.29   0.37   0.56  4000    1
a[6]      0.30    0.00  0.13   0.06   0.22   0.29   0.37   0.56  4000    1
a[7]      0.33    0.00  0.13   0.09   0.24   0.32   0.41   0.62  1327    1
a[8]      0.16    0.00  0.14  -0.16   0.08   0.17   0.25   0.41  4000    1
a[9]      0.06    0.00  0.15  -0.29  -0.03   0.08   0.16   0.32  1001    1
a[10]     0.20    0.00  0.12  -0.02   0.12   0.19   0.27   0.46  4000    1
a[11]     0.27    0.01  0.14   0.04   0.17   0.25   0.36   0.58   746    1
a[12]     0.06    0.00  0.14  -0.23  -0.01   0.07   0.14   0.33  4000    1
a[13]     0.06    0.00  0.13  -0.21  -0.02   0.06   0.14   0.31  4000    1
a[14]     0.03    0.00  0.13  -0.26  -0.05   0.04   0.11   0.29  4000    1
a[15]    -0.02    0.00  0.14  -0.31  -0.09  -0.01   0.07   0.24  4000    1
a[16]    -0.06    0.00  0.14  -0.36  -0.14  -0.05   0.03   0.18  4000    1
a[17]    -0.05    0.00  0.13  -0.33  -0.13  -0.04   0.03   0.21  4000    1
a[18]     0.00    0.00  0.13  -0.25  -0.08  -0.01   0.07   0.27  4000    1
a[19]    -0.04    0.00  0.13  -0.28  -0.11  -0.04   0.04   0.22  4000    1
a[20]     0.00    0.00  0.13  -0.24  -0.09  -0.02   0.08   0.29  1431    1
a[21]    -0.14    0.00  0.14  -0.43  -0.22  -0.14  -0.06   0.14  4000    1
a[22]    -0.13    0.00  0.13  -0.37  -0.21  -0.13  -0.05   0.15  4000    1
a[23]    -0.17    0.00  0.13  -0.45  -0.25  -0.17  -0.09   0.09  4000    1
a[24]    -0.16    0.00  0.13  -0.40  -0.24  -0.16  -0.08   0.12  4000    1
a[25]    -0.26    0.00  0.14  -0.56  -0.34  -0.25  -0.17   0.01  4000    1
a[26]    -0.06    0.01  0.16  -0.31  -0.17  -0.08   0.05   0.28   658    1
a[27]    -0.30    0.00  0.14  -0.59  -0.38  -0.29  -0.21  -0.03  4000    1
a[28]    -0.39    0.00  0.15  -0.72  -0.48  -0.38  -0.29  -0.12  1294    1
a[29]    -0.30    0.00  0.14  -0.57  -0.39  -0.31  -0.22  -0.02  4000    1
a[30]    -0.41    0.00  0.14  -0.72  -0.50  -0.40  -0.32  -0.16  1641    1
a[31]    -0.25    0.00  0.15  -0.50  -0.35  -0.27  -0.15   0.08   937    1
a[32]    -0.40    0.00  0.14  -0.69  -0.48  -0.39  -0.31  -0.13  4000    1
lp__     64.42    0.86 12.06  44.83  56.13  62.52  71.09  93.28   196    1

Really, 2000 iterations are overkill. I have to get out of the habit of running so long straight out of the box. In this case it doesn't matter---it took about 3 seconds to do all 4 chains---but in general it makes sense to start with a smaller number such as 100 which is still long enough to reveal major problems.

In any case, with hierarchical models it's better general practice to use the Matt trick so I recode the model as worldcup_matt.stan, which looks just like the model above except for the following:

parameters {
  real b;
  real<lower=0> sigma_a;
  real<lower=0> sigma_y;
  vector[nteams] eta_a;
}
transformed parameters {
  vector[nteams] a;
  a <- b*prior_score + sigma_a*eta_a;
}
model {
  eta_a ~ normal(0,1);
  for (i in 1:ngames)
    sqrt_dif[i] ~ student_t(df, a[team1[i]]-a[team2[i]], sigma_y);
}

We fit this in R:

fit <- stan_run("worldcup_matt.stan", data=data, chains=4, iter=100) print(fit)

And here's what we get:

Inference for Stan model: worldcup_matt.
4 chains, each with iter=100; warmup=50; thin=1;
post-warmup draws per chain=50, total post-warmup draws=200.

            mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
b           0.46    0.01 0.10   0.27   0.39   0.45   0.53   0.65   200 1.01
sigma_a     0.13    0.01 0.07   0.01   0.08   0.13   0.18   0.28    55 1.05
sigma_y     0.43    0.00 0.05   0.35   0.38   0.42   0.46   0.54   136 0.99
eta_a[1]   -0.24    0.07 0.92  -1.90  -0.89  -0.23   0.35   1.81   200 0.99
eta_a[2]    0.10    0.06 0.83  -1.48  -0.32   0.17   0.65   1.52   200 1.00
eta_a[3]    0.55    0.06 0.82  -0.99   0.00   0.55   1.08   2.06   200 1.00
eta_a[4]   -0.54    0.08 1.10  -2.36  -1.33  -0.58   0.18   1.61   200 1.01
eta_a[5]    0.05    0.07 0.99  -1.64  -0.68   0.05   0.70   1.80   200 0.99
eta_a[6]    0.18    0.06 0.79  -1.36  -0.32   0.17   0.75   1.51   200 0.99
eta_a[7]    0.52    0.07 0.94  -1.30  -0.05   0.62   1.10   2.25   200 0.98
eta_a[8]   -0.27    0.06 0.89  -1.84  -0.95  -0.26   0.36   1.45   200 0.98
eta_a[9]   -0.72    0.06 0.79  -2.25  -1.28  -0.74  -0.17   0.82   200 0.99
eta_a[10]   0.18    0.06 0.86  -1.60  -0.38   0.21   0.81   1.78   200 0.99
eta_a[11]   0.71    0.06 0.90  -0.98   0.08   0.79   1.29   2.54   200 1.00
eta_a[12]  -0.39    0.06 0.89  -2.16  -0.97  -0.36   0.19   1.27   200 0.99
eta_a[13]  -0.18    0.06 0.90  -1.77  -0.78  -0.18   0.44   1.55   200 1.00
eta_a[14]  -0.25    0.06 0.90  -1.86  -0.87  -0.28   0.32   1.52   200 0.99
eta_a[15]  -0.33    0.07 0.99  -2.27  -1.06  -0.33   0.42   1.58   200 0.99
eta_a[16]  -0.43    0.06 0.83  -1.78  -1.06  -0.50   0.22   1.13   200 1.00
eta_a[17]  -0.16    0.07 0.94  -1.90  -0.84  -0.20   0.51   1.65   200 0.98
eta_a[18]   0.18    0.05 0.76  -1.20  -0.28   0.10   0.69   1.59   200 0.99
eta_a[19]   0.12    0.07 0.94  -1.63  -0.47   0.07   0.73   2.10   200 1.00
eta_a[20]   0.41    0.07 0.92  -1.47  -0.24   0.40   0.90   2.19   200 1.01
eta_a[21]  -0.08    0.06 0.85  -1.58  -0.68  -0.10   0.49   1.58   200 0.98
eta_a[22]   0.03    0.06 0.82  -1.58  -0.58   0.05   0.58   1.55   200 1.00
eta_a[23]  -0.12    0.06 0.83  -1.86  -0.58  -0.14   0.49   1.40   200 0.99
eta_a[24]   0.11    0.07 0.94  -1.62  -0.54   0.12   0.78   1.91   200 0.99
eta_a[25]  -0.25    0.07 1.01  -2.09  -0.94  -0.27   0.37   1.99   200 0.99
eta_a[26]   0.98    0.07 0.97  -0.99   0.37   1.04   1.61   2.87   200 1.00
eta_a[27]  -0.14    0.07 0.97  -2.05  -0.79  -0.16   0.55   1.78   200 0.98
eta_a[28]  -0.61    0.07 0.99  -2.95  -1.20  -0.53   0.05   1.18   200 0.99
eta_a[29]   0.07    0.06 0.89  -1.55  -0.46   0.09   0.64   1.84   200 0.99
eta_a[30]  -0.38    0.06 0.89  -1.95  -1.05  -0.37   0.25   1.38   200 1.00
eta_a[31]   0.65    0.07 0.92  -1.22   0.01   0.76   1.35   2.26   200 1.00
eta_a[32]  -0.09    0.06 0.81  -1.72  -0.61  -0.12   0.46   1.36   200 0.99
a[1]        0.35    0.01 0.13   0.09   0.27   0.35   0.42   0.59   200 1.00
a[2]        0.38    0.01 0.11   0.16   0.30   0.38   0.44   0.59   200 1.00
a[3]        0.42    0.01 0.14   0.20   0.33   0.39   0.48   0.73   154 1.01
a[4]        0.21    0.01 0.19  -0.22   0.12   0.25   0.33   0.51   200 1.02
a[5]        0.28    0.01 0.14   0.03   0.20   0.29   0.36   0.59   200 0.99
a[6]        0.29    0.01 0.12   0.07   0.21   0.28   0.36   0.51   200 0.99
a[7]        0.32    0.01 0.15   0.08   0.22   0.30   0.40   0.63   200 0.99
a[8]        0.17    0.01 0.13  -0.13   0.08   0.18   0.24   0.39   200 0.99
a[9]        0.08    0.01 0.12  -0.21   0.01   0.09   0.16   0.28   161 1.00
a[10]       0.19    0.01 0.11  -0.06   0.12   0.18   0.24   0.45   200 0.99
a[11]       0.25    0.01 0.14   0.04   0.15   0.22   0.33   0.61    94 1.02
a[12]       0.05    0.01 0.13  -0.26  -0.02   0.07   0.13   0.26   200 1.00
a[13]       0.06    0.01 0.12  -0.20  -0.01   0.07   0.13   0.26   200 0.99
a[14]       0.03    0.01 0.12  -0.21  -0.03   0.04   0.09   0.31   200 1.00
a[15]      -0.02    0.01 0.13  -0.32  -0.10   0.00   0.07   0.20   200 0.99
a[16]      -0.05    0.01 0.12  -0.31  -0.12  -0.04   0.03   0.14   200 1.00
a[17]      -0.04    0.01 0.14  -0.32  -0.11  -0.02   0.04   0.22   200 0.98
a[18]      -0.01    0.01 0.11  -0.22  -0.07  -0.03   0.04   0.26   200 0.99
a[19]      -0.04    0.01 0.13  -0.31  -0.12  -0.05   0.01   0.24   200 1.01
a[20]      -0.02    0.01 0.14  -0.26  -0.10  -0.04   0.05   0.35   200 1.01
a[21]      -0.13    0.01 0.12  -0.38  -0.20  -0.12  -0.05   0.10   200 0.99
a[22]      -0.13    0.01 0.12  -0.35  -0.20  -0.13  -0.06   0.10   200 1.00
a[23]      -0.18    0.01 0.13  -0.44  -0.24  -0.17  -0.11   0.03   200 1.00
a[24]      -0.17    0.01 0.13  -0.41  -0.25  -0.18  -0.09   0.07   200 1.00
a[25]      -0.25    0.01 0.13  -0.51  -0.33  -0.24  -0.16  -0.02   200 0.99
a[26]      -0.08    0.02 0.17  -0.33  -0.20  -0.11   0.02   0.29    82 1.02
a[27]      -0.29    0.01 0.14  -0.54  -0.36  -0.28  -0.19  -0.06   200 0.98
a[28]      -0.37    0.01 0.16  -0.81  -0.46  -0.35  -0.27  -0.11   200 1.00
a[29]      -0.29    0.01 0.12  -0.55  -0.38  -0.30  -0.23  -0.05   200 0.99
a[30]      -0.39    0.01 0.14  -0.71  -0.47  -0.38  -0.29  -0.17   180 1.00
a[31]      -0.25    0.01 0.15  -0.47  -0.36  -0.27  -0.17   0.06   200 1.01
a[32]      -0.40    0.01 0.14  -0.69  -0.48  -0.40  -0.31  -0.14   200 1.00
lp__       -1.06    0.94 5.57 -12.58  -4.46  -0.73   2.64   9.97    35 1.08

All this output isn't so bad: the etas are the team-level residuals and the a's are the team abilities. The group-level error sd sigma_a is estimated at 0.13, a small value, which indicates that, unsurprisingly, our final estimates of team abilities are not far from the initial ranking. We can attribute this to a combination of two factors: first, the initial ranking is pretty accurate; second, there aren't a lot of data points here (indeed, half the teams only played three games each), so it would barely be possible for there to be big changes.

Now it's time to make some graphs. First a simple list of estimates and standard errors of team abilities. I'll order the teams based on Nate's prior ranking, which makes sense for a couple of reasons. First, this ordering is informative: there's a general trend from good to bad, so it should be easy to understand the results. Second, the prior ranking is what we were pulling toward in the multilevel model, so this graph is equivalent to a plot of estimate vs. group-level predictor, which is the sort of graph I like to make to understand what a multilevel model is doing.

Here's the code:

colVars <- function(a) {
  n <- dim(a)[[1]]
  c <- dim(a)[[2]]
  return(.colMeans(((a - matrix(.colMeans(a, n, c), nrow=n, ncol=c, byrow=TRUE))^2), n, c) * n/(n - 1))
}
sims <- extract(fit)
a_sims <- sims$a
a_hat <- colMeans(a_sims)
a_se <- sqrt(colVars(a_sims))
library("arm")
png("worldcup1.png", height=500, width=500)
coefplot(rev(a_hat), rev(a_se), CI=1, varnames=rev(teams),
  main="Team quality (estimate +/- 1 s.e.)\n",
  cex.var=.9, mar=c(0,4,5.1,2), xlim=c(-.5,.5))
dev.off()

And here's the graph:

At this point we could compute lots of fun things such as the probability that Argentina will beat Germany in the final, etc., but it's clear enough from this picture that the estimate will be close to 50% so really the model isn't giving us anything for this one game. I mean, sure, I could compute such a probability and put it in the headline above, and maybe get some clicks, but what's the point?
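For the record, here's how such a head-to-head probability could be computed from the posterior simulations (a sketch of my own, with a made-up demo at the end; `a_sims` and `sigma_y_sims` are the extracted simulations from above, and the team indices are hypothetical):

```r
# Posterior predictive probability that team i beats team j,
# working on the signed-square-root scale with t_df errors
prob_win <- function(a_sims, sigma_y_sims, i, j, df = 7) {
  nsims <- length(sigma_y_sims)
  sqrt_dif <- (a_sims[, i] - a_sims[, j]) + rt(nsims, df) * sigma_y_sims
  mean(sqrt_dif > 0)  # ignores the possibility of a draw
}

# Made-up demo with two closely matched teams
set.seed(123)
a_sims_fake <- cbind(rnorm(1000, 0.40, 0.1), rnorm(1000, 0.35, 0.1))
sigma_fake  <- rep(0.42, 1000)
prob_win(a_sims_fake, sigma_fake, 1, 2)  # lands near 0.5
```

Which is the point: for two teams this close, the model has essentially nothing to say about a single final.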

One thing I *am* curious about, though, is how much of these estimates comes from the prior ranking. So I refit the model. First I create a new Stan model, "worldcup_noprior_matt.stan", just removing the parameter "b". Then I run it from R:

fit_noprior <- stan_run("worldcup_noprior_matt.stan", data=data, chains=4, iter=100) print(fit_noprior)

Convergence after 100 iterations is not perfect---no surprise, given that with less information available, the model fit is less stable---so I brute-force it and run 1000:

fit_noprior <- stan_run("worldcup_noprior_matt.stan", data=data, chains=4, iter=1000) print(fit_noprior)

Now time for the plot. I could just copy the above code but this is starting to get messy so I encapsulate it into a function and run it:

worldcup_plot <- function(fit, file.png){
  sims <- extract(fit)
  a_sims <- sims$a
  a_hat <- colMeans(a_sims)
  a_se <- sqrt(colVars(a_sims))
  library("arm")
  png(file.png, height=500, width=500)
  coefplot(rev(a_hat), rev(a_se), CI=1, varnames=rev(teams),
    main="Team quality (estimate +/- 1 s.e.)\n",
    cex.var=.9, mar=c(0,4,5.1,2), xlim=c(-.5,.5))
  dev.off()
}
worldcup_plot(fit_noprior, "worldcup1_noprior.png")

I put in the xlim values in that last line to make the graph look nice, after seeing the first version of the plot. And I also played around a bit with the margins (the argument "mar" to coefplot above) to get everything to look good. Overall, though, coefplot() worked wonderfully, straight out of the box. Thanks, Yu-Sung!

Similar to before but much noisier.

**Act 2: Checking the fit of model to data**

OK, fine. Now let's check model fit. For this we'll go back to the model worldcup_matt.stan which includes the prior ranking as a linear predictor. For each match we can compute an expected outcome based on the model and a 95% predictive interval based on the t distribution. This is all on the signed sqrt scale so we can do a signed square to put it on the original scale, then compare to the actual game outcomes.

Here's the R code (which, as usual, took a few iterations to debug; I'm not showing all my steps here):

expected_on_sqrt_scale <- a_hat[team1] - a_hat[team2]
sigma_y_sims <- sims$sigma_y
interval_975 <- median(qt(.975,df)*sigma_y_sims)
signed_square <- function(a) {sign(a)*a^2}
lower <- signed_square(expected_on_sqrt_scale - interval_975)
upper <- signed_square(expected_on_sqrt_scale + interval_975)
png("worldcup2.png", height=1000, width=500)
coefplot(rev(score1 - score2), sds=rep(0, ngames),
  lower.conf.bounds=rev(lower), upper.conf.bounds=rev(upper),
  varnames=rev(paste(teams[team1], "vs.", teams[team2])),
  main="Game score differentials\ncompared to 95% predictive interval from model\n",
  mar=c(0,7,6,2))
dev.off()

In the code above, I gave the graph a height of 1000 to allow room for all the games, and a width of 500 to conveniently fit on the blog page. Also I went to the trouble to explicitly state in the title that the intervals were 95%.

And here's what we get:

Whoa! Some majorly bad fit here. Even forgetting about Brazil vs. Germany, lots more than 5% of the points are outside the 95% intervals. My first thought was to play with the t degrees of freedom, but that can't really be the solution. And indeed switching to a t_4, or going the next step and estimating the degrees of freedom from the data (using the gamma(2,0.1) prior restricted to df>1, as recommended by Aki) didn't do much either.

**Act 3: Thinking hard about the model. What to do next?**

So then I got to thinking about problems with the model. Square root or no square root, the model is only approximate, indeed it's not a generative model at all, as it makes continuous predictions for discrete data. So my next step was to take the probabilities from my model, use them as predictions for each discrete score differential outcome (-6, -5, -4, . . ., up to +6) and then use these probabilities for a multinomial likelihood.
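The discretization step can be sketched as follows (my reconstruction for illustration, not the actual code in the zipfile): take the continuous t predictive density on the signed-square-root scale and integrate it over bins centered at each possible score differential:

```r
signed_sqrt <- function(y) sign(y) * sqrt(abs(y))

# Turn the continuous t predictive distribution (center mu, scale sigma,
# on the signed-sqrt scale) into probabilities for the discrete
# score differentials -k, ..., +k
discrete_probs <- function(mu, sigma, df = 7, k = 6) {
  outcomes <- -k:k
  # bin edges halfway between adjacent outcomes, mapped to the sqrt scale
  edges <- signed_sqrt(c(outcomes - .5, k + .5))
  p <- diff(pt((edges - mu) / sigma, df))
  p / sum(p)  # renormalize: fold in the mass outside [-k, k]
}

p <- discrete_probs(mu = 0.3, sigma = 0.42)
sum(p)  # 1, one probability per outcome from -6 to +6
```

These probabilities could then feed a multinomial likelihood for the observed discrete differentials.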

I won't copy the model in this post but it's included in the zipfile that I've attached here. The model didn't work at all and I didn't get around to fixing it.

Indeed, once we move to directly modeling the discrete data, a latent-data approach seems like it might be more reasonable, perhaps something as simple as an ordered logit or maybe something more tailored to the particular application.

Ordered logit's in the Stan manual so it shouldn't be too difficult to implement. But at this point I was going to bed and I knew I wouldn't have time to debug the model the next day (that is, today) so I resolved to just write up what I *did* do and be open about the difficulties.

These difficulties are not, fundamentally, R or Stan problems. Rather, they represent the challenges that arise when analyzing any new problem statistically.

**Act 4: Rethinking the predictive intervals**

But then, while typing up the above (it's taken me about an hour), I realized that my predictive intervals were too narrow because they're based on point estimates of the team abilities. How could I have been so foolish??

Now I'm excited. (And I'm writing here in real time, i.e. I have not yet cleaned up my code and recomputed those predictive error bounds.) Maybe everything will work out. That would make me so happy! The fractal nature of scientific revolutions, indeed.

OK, let's go to it.

*Pause while I code*

OK, that wasn't too bad. First I refit the model using a zillion iterations so I can get the predictive intervals using simulation without worrying about anything:

    fit <- stan_run("worldcup_matt.stan", data=data, chains=4, iter=5000)
    print(fit)

Now the code:

    sims <- extract(fit)
    a_sims <- sims$a
    sigma_y_sims <- sims$sigma_y
    nsims <- length(sigma_y_sims)
    random_outcome <- array(NA, c(nsims,ngames))
    for (s in 1:nsims){
      random_outcome_on_sqrt_scale <- (a_sims[s,team1] - a_sims[s,team2]) + rt(ngames,df)*sigma_y_sims[s]
      random_outcome[s,] <- signed_square(random_outcome_on_sqrt_scale)
    }
    sim_quantiles <- array(NA, c(ngames,2))
    for (i in 1:ngames){
      sim_quantiles[i,] <- quantile(random_outcome[,i], c(.025,.975))
    }
    png("worldcup3.png", height=1000, width=500)
    coefplot(rev(score1 - score2), sds=rep(0, ngames),
             lower.conf.bounds=rev(sim_quantiles[,1]), upper.conf.bounds=rev(sim_quantiles[,2]),
             varnames=rev(paste(teams[team1], "vs.", teams[team2])),
             main="Game score differentials\ncompared to 95% predictive interval from model\n",
             mar=c(0,7,6,2))
    dev.off()

And now . . . the graph:

Hmm, the uncertainty bars are barely wider than before. So time for a brief re-think. The problem is that the predictions are continuous and the data are discrete. The easiest reasonable way to get discrete predictions from the model is to simply round to the nearest integer. So we'll do this, just changing the appropriate line in the above code to:

sim_quantiles[i,] <- quantile(round(random_outcome[,i]), c(.025,.975))

Aaaaand, here's the new graph:

Still no good. Many more than 1/20 of the points are outside the 95% predictive intervals.

This is particularly bad given that the model was fit to the data and the above predictions are not cross-validated, hence we'd actually expect a bit of overcoverage.
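This coverage claim is easy to check mechanically. A minimal sketch (Python, with toy outcome and interval arrays, not the actual match data): count the fraction of observed score differentials falling inside their 95% intervals and compare with 0.95.

```python
def coverage(outcomes, lower, upper):
    # fraction of observed outcomes inside their predictive intervals
    inside = sum(lo <= y <= up for y, lo, up in zip(outcomes, lower, upper))
    return inside / len(outcomes)

# toy numbers: seven near-zero score differentials and one blowout
outcomes = [0, 1, -2, 3, 1, 0, -1, 7]
lower = [-3, -2, -4, -1, -2, -3, -3, -2]
upper = [3, 4, 2, 5, 4, 3, 3, 2]
print(coverage(outcomes, lower, upper))  # 0.875: the blowout falls outside
```

Empirical coverage far below the nominal 95% is exactly the misfit the graphs above are showing.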

One last thing: Instead of plotting the games in order of the data, I'll order them by expected score differential. I'll do it using difference in prior ranks so as not to poison the well with the estimates based on the data. Also I'll go back to the continuous predictive intervals as they show a bit more information:

    a_hat <- colMeans(a_sims)
    new_order <- order(a_hat[team1] - a_hat[team2])
    for (i in 1:ngames){
      sim_quantiles[i,] <- quantile(random_outcome[,i], c(.025,.975))
    }
    png("worldcup5.png", height=1000, width=500)
    coefplot((score1 - score2)[new_order], sds=rep(0, ngames),
             lower.conf.bounds=sim_quantiles[new_order,1],
             upper.conf.bounds=sim_quantiles[new_order,2],
             varnames=paste(teams[team1[new_order]], "vs.", teams[team2[new_order]]),
             main="Game score differentials\ncompared to 95% predictive interval from model\n",
             mar=c(0,7,6,2))
    dev.off()

Hey, this worked on the first try! Cool.

But now I'm realizing one more thing. Each game has an official home and away team but this direction is pretty much arbitrary. So I'm going to flip each game so it's favorite vs. underdog, with favorite and underdog determined by the prior ranking.

So here it is, with results listed in order from expected blowouts to expected close matches:

Again, lots of outliers, indicating that my model sucks. Maybe there's also some weird thing going on with Jacobians when I'm doing the square root transformations. Really I think the thing to do is to build a discrete-data model. In 2018, perhaps.

Again, all the data and code are here. (I did not include last night's Holland/Brazil game but of course it would be trivial to update the analyses if you want.)

**P.S.** In the 9 June post linked above, Nate wrote that his method estimates that "a 20 percent chance that both the Netherlands and Chile play up to the higher end of the range, or they get a lucky bounce, or Australia pulls off a miracle, and Spain fails to advance despite wholly deserving to." I can't fault the estimate---after all, 1-in-5 events happen all the time---but I can't see why he was so sure that if Spain loses, that they still wholly deserved to win. They lost 5-1 to Holland and 2-0 to Chile so it doesn't seem like just a lucky bounce. At some point you have to say that if you lost, you got what you deserved, no?

P.P.S. The text in those png graphs is pretty hard to read. Would jpg be better? I'd rather not use pdf because then I'd need to convert it into an image file before uploading it onto the blog post.

P.P.P.S. I found a bug! The story is here.

The post Stan goes to the World Cup appeared first on Statistical Modeling, Causal Inference, and Social Science.


The new rules for D&D 5e (formerly known as D&D Next) are finally here:

D&D 5e introduces a new game mechanic, advantage and disadvantage.

**Basic d20 Rules**

Usually, players roll a 20-sided die (d20) to resolve everything from attempts at diplomacy to hitting someone with a sword. Each thing a player tries to do has a difficulty, and rolling greater than or equal to the difficulty (with various modifiers for ability and training and magic items) means the character was successful.

**Advantage and Disadvantage**

As of 5th Edition (5e) rolls can be made with advantage or disadvantage. The rules are:

- Advantage: roll two d20 and take the max
- Normal: roll one d20 and take the result
- Disadvantage: roll two d20 and take the min

So what are the chances that you’ll roll equal to or above a given number with advantage, normally, or with disadvantage? Here’s a table.

roll | disadvantage | normal | advantage
---|---|---|---
20 | 0.002 | 0.050 | 0.098
19 | 0.010 | 0.100 | 0.191
18 | 0.022 | 0.150 | 0.278
17 | 0.039 | 0.200 | 0.359
16 | 0.062 | 0.250 | 0.437
15 | 0.089 | 0.300 | 0.510
14 | 0.123 | 0.350 | 0.576
13 | 0.160 | 0.400 | 0.639
12 | 0.202 | 0.450 | 0.698
11 | 0.249 | 0.500 | 0.751
10 | 0.303 | 0.550 | 0.798
9 | 0.361 | 0.600 | 0.840
8 | 0.424 | 0.650 | 0.877
7 | 0.492 | 0.700 | 0.910
6 | 0.564 | 0.750 | 0.938
5 | 0.640 | 0.800 | 0.960
4 | 0.723 | 0.850 | 0.978
3 | 0.811 | 0.900 | 0.990
2 | 0.903 | 0.950 | 0.998
1 | 1.000 | 1.000 | 1.000

The effect is huge. There’s less than a 9% chance of rolling 15 or higher with disadvantage, whereas there’s a 30% chance normally and a 51% chance with advantage.

[*Update II: Turns out that The Online Dungeon Master, a blog I read regularly for 4e module reviews and play reports, generated the same table over a year ago (using Excel simulations, no less; a commenter provides a good breakdown of the analytic solution). That's been the story of my life coming into a new field. I agree that this tabular form is what you want for reference, but the easiest way to understand the effect is looking at the non-linearity in the graph below.*]

Here’s a plot (apologies for the poor ggplot2() and png() defaults — I don’t understand ggplot2 config well enough to make titles, clean up labels, axes, tick mark labels, boundaries, margins, colors, and so on to make it more readable without spending all night on the project).

The vertical distances at a given horizontal position show you how much of a bonus you get for advantage or disadvantage.

[*Update: There's an alternative plot on the* Roles, Rules, and Rolls *blog that displays the difference between advantage and a simple +3 bonus, as used in previous D&D editions.*]

**Analytic Solution**

The probabilities involved are simple rank statistics for two uniform discrete variables.

You can compute these probabilities analytically, as I show in Part III of the page where I explain the math and stats behind my simple baseball simulation game, *Little Professor Baseball*.
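For the d20 rolls the rank statistics reduce to one-line formulas: the max of two rolls is at least k unless both rolls land below k, and the min is at least k only if both rolls are. A quick sketch (Python), which reproduces the table's columns up to the simulation's Monte Carlo error:

```python
def p_normal(k):
    # one d20: 21 - k of the 20 faces are >= k
    return (21 - k) / 20

def p_advantage(k):
    # max of two d20s is >= k unless both rolls land below k
    return 1 - ((k - 1) / 20) ** 2

def p_disadvantage(k):
    # min of two d20s is >= k only if both rolls are
    return ((21 - k) / 20) ** 2

for k in range(20, 0, -1):
    print(f"{k:2d} {p_disadvantage(k):6.3f} {p_normal(k):6.3f} {p_advantage(k):6.3f}")
```

For example, the exact chance of 15-or-higher is 0.09 with disadvantage, 0.30 normally, and 0.51 with advantage, matching the simulated 0.089/0.300/0.510.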

The basic game is a cross between *All-Star Baseball* and *Strat-o-Matic*. I wanted a rolling system where both players roll, one for the batter and one for the pitcher, and you use the card for the player with the higher roll to resolve the outcome. The winning roll is going to have a distribution like the advantage probabilities above (except that Little Professor Baseball uses rolls from 1–1000 rather than 1–20). As a bonus, earlier sections of the math page explain why *Strat-o-Matic* cards look so extreme on their own (unlike the *All-Star Baseball* spinners).

I developed the game after reading Jim Albert’s most excellent book, *Curve Ball*, and I included the cards for the 1970 Cincinnati Reds and Baltimore Orioles (which, for trivia, were mine and Andrew’s favorite teams as kids).

**Simulation-Based Calculation**

I computed the table with a simple Monte Carlo simulation, the R code for which is as follows.

    oneroll <- function(D) { return(sample(D,1)); }
    advantage <- function() { return(max(oneroll(20), oneroll(20))); }
    disadvantage <- function() { return(min(oneroll(20), oneroll(20))); }
    NUM_SIMS <- 100000;
    advs <- rep(0,NUM_SIMS);
    for (n in 1:NUM_SIMS) advs[n] <- advantage();
    disadvs <- rep(0,NUM_SIMS);
    for (n in 1:NUM_SIMS) disadvs[n] <- disadvantage();
    print("", quote=FALSE);
    print("CCDF (Pr[result >= k])", quote=FALSE);
    print(sprintf("%2s %6s %6s %6s", "k", "disadv", "normal", "advant"), quote=FALSE);
    cumulative_disadv <- 0;
    cumulative_norm <- 0;
    cumulative_adv <- 0;
    for (k in 20:1) {
      cumulative_disadv <- cumulative_disadv + sum(disadvs == k) / NUM_SIMS;
      cumulative_norm <- cumulative_norm + 0.05;
      cumulative_adv <- cumulative_adv + sum(advs == k) / NUM_SIMS;
      print(sprintf("%2d %6.3f %6.3f %6.3f", k, cumulative_disadv, cumulative_norm, cumulative_adv), quote=FALSE);
    }

I'm sure Ben could've written that in two or three lines of R code.

The post D&D 5e: Probabilities for Advantage and Disadvantage appeared first on Statistical Modeling, Causal Inference, and Social Science.


Dear Dr. Gelman,

I am writing to inquire about the availability of obtaining a self-funded visiting scholar position in your institution for one year. I will cover all my expenses during my visit. I have completed a M.A. at Sichuan international Studies University in Comparative Literature and World Literature in 2011. My dissertation is entitled “Edith Wharton’s Views on Female”.

My scholarly interests range widely, from British and American literature and culture to intercultural communication to translation of English and Chinese and to English teaching methods, as well as TESOL, Business English, Linguistics. I have published some relevant papers at home. In a subsequent project, I plan to research American urban literature from the viewpoints of feminism and ethical literary criticism. In addition, I welcome the opportunity to conduct research related to English teaching methods as a foreign language in order to further enhance my teaching experience.

My dissertation investigates Wharton’s views on female from the aspect of feminism based on Wharton’s 4 important novels: The House of Mirth (1905), The Custom of the Country (1913), Summer (1917) and The Age of Innocence (1921). My inquiry focuses on awakening heroines’ images to have broken traditional bondages and developed constantly and explain how the heroines from Lily to Undine, Charity and Allen fulfill the sublimation from embryo, extreme, awakening and freely defining, which is the process of forming Wharton’s views on female and also indicates Wharton’s seeking for ideal woman. Then, from the comparison of heroines’ traits, it’s proved that the harmonious androgynous personality is the ideal pursuit for Wharton characters in her fictional world: and the single male or female disposition is not wholesome for individuals, and both male and female traits in balance seems to be an ideal solution to help free the human personality from sex-role stereotype. Last but not the least, this thesis touches upon the reasons to form the views on female from the social background dealing with social economy, women’s social position, female literature trends and female movements and so on and personal experience including Wharton’s friendship with James, bondage suffered in teenage and love and marriage life.

I am eager to broaden my horizon constantly. And I would appreciate being considered as a visiting scholar to do research under your supervision. I believe I can learn a lot from you, your outstanding team and your distinguished university. I have enclosed a copy of my curriculum vitae. Thank you in advance for your consideration. I look forward to hearing from you. If you would like any additional information, please contact me.

Sincerely,

…

A CV was attached in pdf form but I didn’t click on it. Safety first.

The post Hey—this is a new kind of spam! appeared first on Statistical Modeling, Causal Inference, and Social Science.


And here’s the interview.

The actual paper is called The Great Society, Reagan’s revolution, and generations of presidential voting and was somewhat inspired by this story.

Here’s the set of graphs that got things started:

Here’s a bit more of what we found:

And here’s some more:

I love this paper, I love Yair’s graphs, and I love how we were able to fit this complicated model that addresses the age-period-cohort problem. Stan’s the greatest.

The post Chicago alert: Mister P and Stan to be interviewed on WBEZ today (Fri) 3:15pm appeared first on Statistical Modeling, Causal Inference, and Social Science.


I [Eckles] just wanted to share that in a collaboration between Facebook and Stanford, we have a new paper out about running online field experiments. One thing this paper does is describe some of the tools we use to design, deploy, and analyze experiments, including the 2012 US election voter turnout experiment. And now we have open-sourced an implementation of these ideas.

We were inspired by Fisher’s quote — “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination.”

The idea is that one way to consult a statistician in advance is to have their advice built into tools for running experiments — a similar idea to how you emphasize the importance of defaults in data visualization tools.

We have a shorter blog post about this work, a paper, “Designing and Deploying Online Field Experiments” (to appear in Proc. of WWW very soon), and the software and documentation, PlanOut.

We’d be very interested in your thoughts on any of this, as perhaps would your blog readers. I also think many might be interested in using the software to run experiments themselves — that’s our hope!

Looks good to me. It’s great to see this sort of stuff out there, not just in textbooks but really getting used.

P.S. Brian Keegan writes in:

I’m a post-doc in David Lazer’s computational social science group at Northeastern. I noticed that you were going to discuss open-source tools for running online experiments on Thursday, so I wanted to offer a shameless plug for a platform that we’re developing here in hopes it might fit into the themes of your post.

Volunteer Science (http://www.volunteerscience.com/) is a web laboratory platform for conducting group and network experiments that’s built on open standards and open code. We’re very much interested in recruiting others to develop experiments for the platform as well as expanding the number of users who volunteer.

The post Open-source tools for running online field experiments appeared first on Statistical Modeling, Causal Inference, and Social Science.


I’m working on building a predictive model (not causal) of the onset of diabetes mellitus using electronic medical records from a semi-panel of HMO patients. The dependent variable is blood glucose level. The unit of analysis is the patient visit to a network doctor or hospitalization in a network hospital aggregated to the month-year level. The time frame is from the early 80s to the present. Since my focus is on the onset of the disease, my approach is agnostic and prospective. I would like to derive data-driven answers to questions of co-morbidity, patient health and wellness based on physical measures such as BMI or BP as well as physician and hospital quality as an inherent part of the model output.

To me, addressing these issues with data of this type would require multiple models for full coverage:

1) A survival model to capture censoring and time to disease onset

– Censoring can have multiple causes: diagnosed with diabetes type 1 or 2, lost to followup, death, etc

2) Multiple hierarchical bayesian models for massively categorical variables such as patient, diagnosis, doctor, hospital to capture the differing dependence structures

– Patient within zipcode, community, county, state to capture the social determinants of health

– Patient within a family network, e.g., children, siblings, parents, etc., to reflect familial history of disease

– Patient and diagnoses received — thousands of possible diagnoses which collapse into higher levels

– Patient within HMO doctor and hospital network

– Doctor within specialty — probably 70 or so specialties overall

– Doctor within zipcode, community, county, state

– Hospital within zipcode, community, county, state

3) As available, the impact of programs and interventions designed to promote wellness, mitigate or prevent disease…these could include recommendations regarding exercise, diet, etc.

4) Given the wide time frame, macro-economic indexes to capture the well-known impact of the business cycle on the determinants of medically-related activities

These are preliminary thoughts as I have not yet begun the process of testing the need for specifying all of these hierarchies since I am still in the initial stages of the analysis. Just getting this data lined up and talking together is a significant challenge in and of itself.

My question for you concerns the need for multiple models when the dependence structures overlap and are as messy as in the present case. I’m sure you’re going to advise against such a wide-ranging predictive design, enjoining me to greater research focus and specificity. My preference is to retain an expansive and exploratory stance and not to simplify the in-going hypotheses just for the sake of the modeling. Honestly, I think that there is already too much specificity in the literature which does little or nothing to uncover and identify the broad antecedents of this illness.

What do you think? Am I missing something? Suggestions?

P.S. Is anyone working on hierarchical survival models?

My reply: It does sound kind of appealing to just throw everything into the model and let Stan sort it out. On the other hand, it also seems like the “throw it all in at once” strategy is a recipe for confusion, and it could be hard to interpret the results. So let me give you the generic suggestion that, whatever model you start with, you check it out using fake-data simulation (that is, simulate fake data from the model, then fit the model and check that you can recover the parameters of interest and make good predictions). And I’d suggest starting simple and working up from there. Ultimately I think a more complex model is better and should be more believable, but you might have to work up to it, because of challenges of computation, identification, and understanding of the model.
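To make the fake-data suggestion concrete, here's a minimal sketch with a deliberately simple stand-in model (Python, ordinary least squares on a toy regression, nothing to do with the survival model under discussion): simulate data with known parameters, fit, and check that the estimates recover the truth.

```python
import random

random.seed(1)  # make the fake-data check reproducible

# step 1: simulate fake data from a known model y = a + b*x + noise
a_true, b_true, sigma = 2.0, 0.5, 0.1
n = 1000
x = [random.uniform(0, 10) for _ in range(n)]
y = [a_true + b_true * xi + random.gauss(0, sigma) for xi in x]

# step 2: fit the model (closed-form least squares for this toy case)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a_hat = ybar - b_hat * xbar

# step 3: check recovery; a big discrepancy here would flag a bug
print(a_hat, b_hat)  # should land very close to 2.0 and 0.5
```

With a hierarchical survival model the same loop applies, just with the simulation and fitting steps replaced by draws from the generative model and a Stan fit.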

P.S. Matt Gribble adds:

I wanted to plug some exciting work by Michael Crowther extending generalized gamma regression to have random effects not only on the log-hazards scale (frailty models) but also on the log-relative median survival time scale. The paper’s in press at Statistics in Medicine, and I had nothing to do with it but can’t wait to cite/use it. http://onlinelibrary.wiley.com/doi/10.1002/sim.6191/abstract

Not quite what I plugged (I was basing the plug about survival time random effects on slides I saw of his, not on the actual paper) but I think this ref is still cutting-edge stuff in the theme of hierarchical survival models.

The post “P.S. Is anyone working on hierarchical survival models?” appeared first on Statistical Modeling, Causal Inference, and Social Science.


It would be bad news if a student in the class of Laurence Tribe or Alan Dershowitz or Ian Ayres or Edward Wegman or Matthew Whitaker or Karl Weick or Frank Fischer were to hand in an assignment that is obviously ~~plagiarized~~ copied from another source without attribution. Would the prof have the chutzpah to fail the student, or would he just give the student an A out of fear that the student would raise a ruckus if anything were done about it?

But it would be really bad news if *everyone in the class* were to do this. For example, suppose the students were to do the work for real—that is, individually write their own, non-plagiarized papers and put them in a file somewhere—and then make alternative, plagiarized papers to hand in. Perhaps, just to make sure the prof or the overworked teaching assistant doesn’t miss it, the students would even cite the wikipedia entries they’re copying from. Then they’d sit back and wait and see what happens. It would be important, though, that the students write actual papers on their own, because otherwise they’d be missing out on the chance to learn the material, which after all is the real purpose of taking the course.

In any case, this would be a bad situation. It’s not clear how the prof would have the moral authority to fail a student for an offense that he, the professor, had committed without suffering any penalty. But I wouldn’t recommend the students try it. They might just get expelled anyway for the combined violations of plagiarism and embarrassing the university.

It’s like smoking crack in Toronto. It’s still illegal even if the boss does it.

The post Just wondering appeared first on Statistical Modeling, Causal Inference, and Social Science.


Hi,

My name is Jo Fitzpatrick and I work as an Acquisition Editor for Packt Publishing ( www.packtpub.com ). We recently commissioned a book on Bayesian Data Analysis and I’m currently searching for an author to write this book. You need to have good working knowledge of Bayes and a good level of written English. Please email joannef@packtpub.com for more details.

Thanks,

Joanne Fitzpatrick

Acquisition Editor

[ Packt Publishing ]

Hey, I think I’m qualified for this! Although maybe not the “good level of written English” bit, as I only speak American. . . . In any case, I am happy to see that the term “Bayesian Data Analysis” has become generic.

The post “Bayes Data Analysis – Author Needed” appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** Just wondering

**Wed:** “P.S. Is anyone working on hierarchical survival models?”

**Thurs:** Open-source tools for running online field experiments

**Fri:** Hey—this is a new kind of spam!

**Sat, Sun:** As Chris Hedges would say: That’s the news, and I am outta here!


What do you think of this visualisation from the NYT [in an article by Neil Irwin and Kevin Quealy but I'm not sure if they're the designers of the visualization]? I’m pretty impressed as a method of showing sampling error to a general audience!

I agree.

P.S. In related news, Antony Unwin writes:

A couple of weeks ago you had a discussion on graphics on your blog and it seemed to me that people had very different ideas about what the term “Interactive Graphics” means. For some it is about interacting with presentation graphics on the web, for others it is about using interactive graphics to do data analysis. You really need to see interactive graphics in action to get a feel for it.

I have made a ten minute film to give the flavour of interactive graphics for data analysis with data on last year’s Tour de France and using Martin Theus’s software Mondrian.

The post Visualizing sampling error and dynamic graphics appeared first on Statistical Modeling, Causal Inference, and Social Science.


The art’s by Michael Malecki. The t-shirts and mugs are printed on demand by Spreadshirt. I tried out a sample and the results are great and have held up to machine washing and drying.

There’s a markup of about $4 per item, which is going straight into the Stan slush fund. No promises that it will be spent wisely, but it will go to the developers.

There aren’t a lot of other products from Spreadshirt that we can put logos on — most of the items (hats, tote bags, etc.) are text-only. But if there are other t-shirts or sweatshirts people want, we could easily expand our product line — feel free to drop suggestions in the comment box.

The post Grand Opening: The Stan Shop appeared first on Statistical Modeling, Causal Inference, and Social Science.


Recently in one of your blog posts (“priors I don’t believe”) there was a discussion in which I was advocating the use of dimensional analysis and dimensionless groups to normalize every model, including statistical regression models. People wanted an example that made sense in social sciences, but since I don’t really have any social science examples to hand, I couldn’t provide one. However, I do have an interesting example problem that I just blogged. Although it’s a physics problem, perhaps it illustrates the techniques well enough that one of your readers could run with it on a social science problem, or alternatively perhaps one of your readers would be interested in collaborating with me to develop a social science example.

Lakeland’s post starts with:

I’ve been working on my front-crawl swim stroke as an effort to improve my fitness.

But it doesn’t take too long to get to this:

and this:

So you should enjoy it. This indeed is the sort of analysis we don’t see enough of in statistics, I think.

P.S. Regarding priors, I wanted also to feature this comment from Chris on that earlier post:

This non-technical description of Bayes’ rule bugged me: “Final opinion on headline = (initial gut feeling) * (study support for headline)”. I think I’d have written something like “educated guess” rather than “gut feeling”. Not all gut feelings are of equal merit. One can have gut feelings about technical matters where they have some experience and are reasonably well informed. (I’d call that an educated guess.) Conversely, one can have gut feelings re matters where they are thoroughly uninformed.

I agree. One problem with the whole “subjective Bayes” slogan is the elevation of subjectivity into a principle, which is all too close to an anything-goes view which is contrary to many of the goals of science.

The post Dimensionless analysis as applied to swimming! appeared first on Statistical Modeling, Causal Inference, and Social Science.


And this:

Please can Data Analysts get themselves together again and become whole Statisticians before it is too late? Before they, their employers, and their clients forget the other equally important parts of the job statisticians should be doing, such as designing investigations and building models?

I actually think the current term “data scientist” is an improvement over “data analyst” because the scientist can be involved in data collection and decision making, not just analysis.

Box also wrote:

It is widely recognized that the advancement of learning does not proceed by conjecture alone, nor by observation alone, but by an iteration involving both. Certainly, scientific investigation proceeds by such iteration. Examination of empirical data inspires a tentative explanation which, when further exposed to reality, may lead to its modification. . . .

Now, since scientific advance, to which all statisticians must accommodate, takes place by the alternation of two different kinds of reasoning, we would expect also that two different kinds of inferential process would be required to put it into effect.

The first, used in estimating parameters from data conditional on the truth of some tentative model, is appropriately called *Estimation*. The second, used in checking whether, in the light of the data, any model of the kind proposed is plausible, has been aptly named by Cuthbert Daniel *Criticism*.

Box continued:

While estimation should, I believe, employ Bayes’ Theorem, or (for the fainthearted) likelihood, criticism needs a different approach. In practice, it is often best done in a rather informal way by examination of residuals or other suitable functions of the data. However, when it is done formally, using tests of goodness of fit, it must, I think, employ sampling theory for its justification.

He was writing in 1978, back before people realized the ways in which model criticism, exploratory data analysis, and sampling theory could be incorporated into Bayesian data analysis (see chapters 6-8 of BDA3 for a review).

The post “The great advantage of the model-based over the ad hoc approach, it seems to me, is that at any given time we know what we are doing.” appeared first on Statistical Modeling, Causal Inference, and Social Science.


Dramatically expanded routine adoption of the Bayesian approach has substantially increased the need to assess both the confirmatory and contradictory information in our prior distribution with regard to the information provided by our likelihood function. We propose a diagnostic approach that starts with the familiar posterior matching method. For a given likelihood model, we identify the difference in information needed to form two likelihood functions that, when combined respectively with a given prior and a baseline prior, will lead to the same posterior uncertainty. In cases with independent, identically distributed samples, sample size is the natural measure of information, and this difference can be viewed as the prior data size M(k), with regard to a likelihood function based on k observations. When there is no detectable prior-likelihood conflict relative to the baseline, M(k) is roughly constant over k, a constant that captures the confirmatory information. Otherwise M(k) tends to decrease with k because the contradictory prior detracts information from the likelihood function. In the case of extreme contradiction, M(k)/k will approach its lower bound −1, representing a complete cancelation of prior and likelihood information due to conflict. We also report an intriguing super-informative phenomenon where the prior effectively gains an extra (1+r)^(−1) percent of prior data size relative to its nominal size when the prior mean coincides with the truth, where r is the percentage of the nominal prior data size relative to the total data size underlying the posterior. We demonstrate our method via several examples, including an application exploring the effect of immunoglobulin levels on lupus nephritis. We also provide a theoretical foundation of our method for virtually all likelihood-prior pairs that possess asymptotic conjugacy.

This sounds like it’s potentially very important. As many of you know, I’ve been struggling for a few years with how to generally think about weakly informative priors, and this paper represents a way of looking at the problem from a different direction.

This area also is an example of the complementary nature of applied, methodological, computational, and theoretical research. Our methodological work on weakly informative priors (that is, our papers from 2006, 2008, 2013, and 2014) was motivated by persistent problems that were arising in our applied work on hierarchical models. Then, once we had these methods out there, it became possible for deep thinkers such as XL to make sense of them all. Then people can apply the larger framework to new applications, and so on.

P.S. I spoke on weakly informative priors at Harvard a few years ago (see here for a more recent version). When the talk was over, XL stood up and said, “Thank you for a weakly informative talk.” So I’m hoping the paper above gets published and I can write a discussion beginning, “Thank you for a weakly informative paper.”


The post “Who’s bigger”—the new book that ranks every human on Wikipedia—is more like Bill Simmons than Bill James appeared first on Statistical Modeling, Causal Inference, and Social Science.

Skiena and Ward provide a numerical ranking for every Wikipedia resident who’s ever lived. What a great idea! This book is a guaranteed argument-starter. I found something to argue with on nearly every page.

Here’s an argument for you. Their method ranks obscure U.S. president Chester Arthur as the 499th most historically significant figure who has ever lived. William Henry Harrison, who was president for *one month*, is listed as the 288th most significant person. This seems ridiculous to me. We’re considering all people who have ever lived (who are on Wikipedia), including inventors of drugs, discoverers of physical laws, founders of countries, influential religious leaders, explorers, authors, musicians, newspaper editors, etc etc. Surely there are many thousands of people who are more historically significant than these two minor figures who just happened to be president of the United States for brief periods.

How did this happen? I can’t be sure, but think about this: Right now, as you read this, there’s probably a 7th grader somewhere who’s googling Chester Arthur as part of a class project on American presidents (and is not allowed to write about Washington, Jefferson, or Lincoln). And somewhere else there’s a 5th-grade teacher who has assigned a different president to each of the students in his or her class. These kids are all going on Wikipedia, and lots of other pages link to these presidents’ pages.

In his historical baseball abstract, Bill James ranks approximately 1000 of the best baseball players. The ranking is, of course, ultimately subjective (in that James has to make choices in what data sources to use and how to combine them) but it has a clearly-defined goal: wins. James is rating players based on how many wins they contributed to their team. In his historical basketball book, Bill Simmons ranks approximately 100 of the best pro basketball players. Simmons, like James, is entertaining, intelligent, and thought-provoking, but one thing his ratings don’t have is a clear external goal. One could say that James is ranking based on an (imperfect) estimate of a somewhat well-defined outcome, whereas Simmons is ranking for the sake of ranking. As with the notorious “U.S. News” college rankings, the ratings are not an estimate of anything but themselves. That’s fine, it’s just the way it is, I’m not trying to slam Bill Simmons, I’m just trying to make this distinction.

Skiena and Ward, like Simmons, are producing a ranking that is not an estimate of any externally-defined goal. Again, that’s fine—it’s all one could possibly do here, I think.

But . . . I found it grating how Skiena and Ward kept on referring to their rankings as “long-term historical significance,” “historical greatness,” etc. They write, “Our algorithmic rankings help focus attention where it should properly be focused. Any person interested in livestock production would miss the point if they study how to raise geese and Angora rabbits at the expense of cows, pigs, and chickens. Our rankings help answer the question ‘Where’s the beef?’ in history in a rigorous and effective way.” I don’t think Chester Arthur is very beefy at all, but I do believe that he gets lots of links because the U.S. educational system is so focused on presidents.

The part I like is when they write, “what we are really trying to do here is study what shapes the process of historical recollection.”

In summary, I have mixed feelings about this book. I like the idea of quantitatively studying the process of historical recollection. But it’s hard for me to take seriously any method that represents Grover Cleveland as the 98th most historically significant figure who’s ever lived.


The post Who invented the Metropolis algorithm? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I found this at the 15:57 mark of one of Bill Press’s videos but do not know if this bit of history is well known in the MCMC universe. This is Marshall Rosenbluth criticizing Metropolis (and others). The text is taken from an interview of Rosenbluth in 2003 by Kai-Henrik Barth:

Barth:

So basically the Monte Carlo we use today in high energy physics and so on is based on your early work in the ‘50s?

Rosenbluth:

Well, it’s related to it, certainly. I mean, even before I got into it people were using Monte Carlo to trace neutron tracks in complicated reactors and so on. And photon transport using these statistical randomized methods, which was obviously the way to do complicated multi-dimensional calculations. And then Teller had the idea of maybe one should apply these techniques to statistical ensembles, so then I worked out this appropriate algorithm for doing that. We did the first papers on that. That was almost totally classical physics. It’s been a very widely growing field ever since. This meeting at Los Alamos in June had, maybe 200 or 300 people from all over the world.

Barth:

Your collaborators for these papers were Edward Teller and Nick Metropolis and your former wife?

Rosenbluth:

Yes. She actually did all the coding, which at that time was a new art for these new machines. You know, no compilers or anything like that.

Barth:

And it’s also listing A.H. Teller.

Rosenbluth:

That was Teller’s wife, who during the war had been one of these computer [women]— he wanted her to get back into the work, but she never showed up. So she was basically—

Barth:

Put on the paper for it?

Rosenbluth:

Yes. As was Metropolis, I should say. Metropolis was boss of the computer laboratory. We never had a single scientific discussion with him.

That’s funny, as I’d thought that a lot of it was Stanislaw Ulam’s idea. Ulam is not on the 1953 paper of Metropolis et al., but he published a paper in 1949 with ~~Teller~~ Metropolis on the Monte Carlo method. That earlier paper did not have all the details but I’ve always assumed that the method already existed at the time of the 1949 paper and that it just took them a few years to write it up, perhaps because of security issues (no joke, given that these methods were developed for the H-bomb project and of course the Russians had their own, competing program).

In his book, Ulam gives the impression that Teller had a habit of grabbing credit for others’ successful ideas. Anyway, I searched for *Ulam* in the Rosenbluth interview, and this is what I found:

Barth:

Could you fill us in about your participation, your role, in the development of the Ulam-Teller invention [which made modern hydrogen weapons possible]?

Rosenbluth:

Yes. I was a junior member and I had been doing all the detailed physics calculations on the booster and the GEORGE shot, which were similar, and I wasn’t even aware at the time until months later of this joint Ulam-Teller paper. So I can only give my own impression, which is that Teller really understood all the physics of what was going on—he was really a physicist—and that Ulam didn’t. And what I had seen what Ulam suggested was very vague and it seemed to me didn’t constitute any kind of a real invention, whereas what Teller came up with, with the help of Freddie de Hoffmann, me, and Dick Garwin, and others, was a pretty complete design before even the very detailed calculations had been done. So from my point of view, it really should be called “the Teller invention.” Ulam was a mathematician, not a physicist. A very smart and charming guy, but he was one of these people when he wandered into your office you had a feeling you were going to have a pleasant conversation for the next half hour, but you wouldn’t learn anything and nothing new would get done. On the other hand, when Teller came by you knew you were going to have sort of an intensive grilling and some new and relevant ideas thrown at you. So from this perspective, I couldn’t understand really why Ulam was receiving equal billing with Teller. But I don’t know what went on in the higher level discussions.

Barth:

As far as I understand, from Rhodes’ book, Ulam had the idea of radiation pressure first. But I’m not certain about that.

Rosenbluth:

No, I don’t think he did. . . .

Hmm . . . I *really* wasn’t there, of course, but based on my sympathy with Ulam I’m inclined to think that Rosenbluth is pretty much just reporting what Teller was telling him.

Otherwise, I dunno, maybe we need to rename Stan and call it Ed. I’d hate to do that, though, as people might then think it’s named after Ed Wegman . . .


**Tues:** “Who’s bigger”—the new book that ranks every human on Wikipedia—is more like Bill Simmons than Bill James

**Wed:** “Being an informed Bayesian: Assessing prior informativeness and prior–likelihood conflict”

**Thurs:** “The great advantage of the model-based over the ad hoc approach, it seems to me, is that at any given time we know what we are doing.”

**Fri:** Special July 4th statistics post! Dimensionless analysis as applied to swimming

**Sat, Sun:** As Chris Hedges would say: Don’t worry, baby.

The post Scott Adams blogging appeared first on Statistical Modeling, Causal Inference, and Social Science.

The Pivot: I’m not particularly interested in the topic (rich guys getting richer) but Adams usefully deploys statistical thinking in this one (“Success simply can’t be predicted to any level of statistical comfort”). An appropriate level of sophistication for this devotee of Russell’s paradox (recall the P.P.P.S. here).

Kung Fu Jar-Opening Trick: What is it with Scott Adams and jars?


The post Useless Algebra, Inefficient Computation, and Opaque Model Specifications appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Why not write it in BUGS or Stan?**

Over on the Stan users group, Robert Grant wrote

Hello everybody, I’ve just been looking at this new paper

[ed. paywalled link, didn't read] which proposes yet another slightly novel model for longitudinal + time-to-event data, and wondering why they bothered with extensive first-principles exposition of the likelihoods and then writing a one-off sampler in R. (Of course, the answer probably is a combination of the authors’ mathematical expertise and the journal’s fondness for algebraic exposition.) Why not just write it in a graphical model style like Stan? They report getting 2000 iterations every 24 hours, and I find it hard to believe this is a sensible way to do it. Is there any reason why this sort of model can’t be done in Stan?

I replied

It’s not iterations, but time to convergence and then time per effective sample after that that matters.

Stan’s not just a graphical modeling language — you can write down any expression that’s equal to the log posterior up to an additive constant.

My general experience is that it’s usually easy to fit models that have continuous parameters in Stan. We’ve fit all kinds of models that people have written complicated software for. Most of those models could’ve been fit in BUGS or JAGS, too, which would be easier than writing custom code.

Robert replied

… I feel that journal readers are poorly served by pages of algebra and no straightforward BUGS/Stan code. With good checking, it should be ok to present a new model in this way. After all, I don’t have to write down the likelihood and Newton-Raphson every time I publish a logistic regression.

I am always struck by this same issue. Here’s what I think is going on:

1. **What goes in a paper is up to the author**. If the author struggled with a step or found it a bit tricky to think about themselves, then the struggle goes into the paper. Even if it might be obvious to someone with more experience in a field. I was just reading a paper with a very detailed exposition of EM for a latent logistic regression problem with conditional probability derivations, etc. (JMLR paper Learning from Crowds by Raykar et al., which is an awesome paper, even if it suffers from this flaw and number 3.)

2. **What goes in a paper is up to editors**. If the editors don’t understand something, they’ll ask for details, even if they should be obvious to the entire field. This is agreeing with Robert’s point, I think. Editors like to see the author sweat, because of some kind of no-pain, no-gain esthetic that seems to permeate academic journal publishing. It’s so hard to boil something down, then when you do, you get dinged for it.

I find lots of exposition of basic notions in applied papers in natural language processing, where almost every paper that uses a notion like entropy (including one I just wrote) redefines it from scratch. For that same paper, the editors wanted to know about algorithms for fitting a latent logistic regression. For some journals, you have to insert p-values even if they don’t make sense from a frequentist perspective.

3. **Authors tend toward narratives over crisp expositions**. They want to tell you their story of how they got to the conclusions. (Even me, hence the preface of the Stan manual.) One of the things that frustrates me to no end in papers, particularly in statistics, is that the author tells you a narrative of how they meandered to the model they finally use; in computer science, they meander in the same way to models or algorithm variants. What I want to see is a crisp statement of the model with enough details that I can replicate it. Providing working code is the easiest way to do that. It’s why I liked Jennifer and Andrew’s regression book so much — they gave you (mostly) runnable code.

**The way forward**

I think what applied papers need to do is lay out their model crisply. BUGS is a nice language for this if your model is short and easily expressed as a directed acyclic graph. I think the variable declarations in Stan make it even clearer what’s going on, especially as models go beyond a few lines.

The separation of data and parameters in Stan makes it evident how the model is intended to be used for inference — that information is not part of the model specification in BUGS. But this is also a drawback to Stan’s language in that the distinction between data and parameters is not part of a Bayesian model per se, just a statement of how you’re going to use it. One upshot of this is that missing data models in Stan are awkward, to say the least. (Hopefully the addition of functions should clean this up from a Stan user’s perspective, but it’s still not going to provide an easy-to-follow model spec like many models in BUGS. A possibility we’ve been considering is adding a graphical model specification language that could be combined with data to compile a Stan model — the issue is that we need the graphical model plus the identity of variables provided as data in order to compile and then run a Stan model.)

**Goldilocks and the Three Academics**

For some reason, mathematicians seem to be immune to problems (1) and (3). In fact, I feel they often go too far, which is why this is such a Goldilocks issue. What’s an inscrutable step or bit of math to one person (contact manifolds anyone?) is like counting on your fingers to a more advanced mathematician (or statistician). You can’t write an intro to differentiable manifolds (or hierarchical regression) into every one of your papers. So you have to write books and software that provide an infrastructure for writing simple papers, but then you get dinged because it’s “too easy” or alternatively because people don’t read the book and it’s “too hard.”

P.S. I finally met Phil of “this post is by Phil” fame! He said that no matter how he qualifies his posts, many commenters assume they’re written by Andrew!


The post Comment of the week appeared first on Statistical Modeling, Causal Inference, and Social Science.

Really great, the simple random intercept – random slope mixed model I did yesterday now runs at least an order of magnitude faster after installing RStan 2.3 this morning. You are doing an awesome job, thanks a lot!


The post Quantifying luck vs. skill in sports appeared first on Statistical Modeling, Causal Inference, and Social Science.

If you’ll permit a bit of a diversion, I was wondering if you’d mind sharing your thoughts on how sabermetrics approaches the measurement of luck vs. skill. Phil Birnbaum and Tom Tango use the following method (which I’ve quoted below).

It seems to embody the innovative but often non-intuitive way that sabermetrics approaches problems, but something about it feels off to me.

Causey quotes Phil Birnbaum:

To go from a record of performance to an estimate of a team’s talent, you have to regress its winning percentage towards the mean. How do you figure out how much to regress?

Tango has often given these instructions:

—–

1. First, figure out the standard deviation of team performance. For MLB, for all teams playing at least 160 games up until 2009, that figure is 0.070 (about 11.34 wins per 162 games).

Second, figure out the theoretical standard deviation of luck over a season, using the binomial approximation to normal. That’s estimated by the formula

Square root of (p(1-p)/g))

For baseball, p = .500 (since the average team must be .500), and g = 162. So the SD of luck works out to about 0.039 (6.36 games per season).

So SD(performance) = 0.070, and SD(luck) = 0.039. Square those numbers to get var(performance) and var(luck). Then, if luck is independent of talent, we get

var(performance) = var(talent) + var(luck)

That means var(talent) equals 0.058 squared, so SD(talent) = 0.058.

2. Now, find the number of games for which the SD(luck) equals SD(talent), or 0.058. It turns out that’s about 74 games, because the square root of (p(1-p)/74) is approximately equal to 0.058.

3. That number, 74, is your “answer”. So, now, any time you want to regress a team’s record to the mean, take 74 games of .500 ball (37-37), and add them to the actual performance. The result is your best estimate of the team’s talent.

For instance, suppose your team goes 100-62. What’s its expected talent? Adjust the record to 137-99. That gives an estimated talent of .581, or 94-68.

Or, suppose your team starts 2-6. Adjust it to 39-43. That’s an estimated talent of .476, or 77-85.

—–
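Tango’s three steps are easy to reproduce numerically. A quick check of the arithmetic above, in Python (my code, not part of the quoted post):

```python
import math

# Step 1: split observed variance into talent and luck.
sd_perf = 0.070                          # SD of team winning percentage
sd_luck = math.sqrt(0.5 * 0.5 / 162)     # binomial luck SD over 162 games
sd_talent = math.sqrt(sd_perf**2 - sd_luck**2)  # var(perf) = var(talent) + var(luck)

# Step 2: games g at which SD(luck) = SD(talent), i.e. sqrt(.25/g) = sd_talent.
g_add = 0.25 / sd_talent**2

# Step 3: regress a 100-62 record by adding g_add games of .500 ball.
est = (100 + 0.5 * g_add) / (162 + g_add)

print(round(sd_luck, 3), round(sd_talent, 3), round(g_add))  # 0.039 0.058 74
print(round(est, 3))  # ~.58, close to the post's .581 (which used 74 exactly)
```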

Those estimates seemed reasonable to me, but I often wondered: does this really work? Is it really true that you can add 74 games to a 162 game season, and it’ll work, but you can also add 74 games to an 8 game season, and that’ll work too? Surely you want to add fewer .500 games when your original sample is smaller, no?

And why always add the exact number of games that makes the talent SD equal to the luck SD? Is it a rule of thumb? Is it a guess? Again, that can’t be the mathematically best way, can it?

It can, actually. I spent a couple of hours doing some algebra, and it turns out that Tango’s method is exactly right. I was very surprised. Also, I don’t know how Tango figured it out … maybe he used an easier, more intuitive way to figure out that it works than going through a bunch of algebra.

But I can’t find one, so let me take you through the algebra, if you care. Tango, is there an obvious explanation for why this works, more obvious than what I’ve done?

——-

As I wrote a few paragraphs ago,

var(overall) = var(talent) + var(luck). [Call this "equation 1" for later.]

Let v^2 = var(overall), and let t^2 = var(talent). Also, let “g” be the number of games.

From the binomial approximation to normal, we know var(luck) = (.25/g). So

v = SD(overall)

t = SD(talent)

sqr(.25/g) = SD(luck)

Suppose you run a regression on overall outcome vs. talent. The variance of talent is t^2. The variance of overall outcome is v^2. Therefore, we know that talent will explain t^2/v^2 of the variance of outcome, so the r-squared we get out of the regression will be t^2/v^2. That means the correlation coefficient, “r”, will be equal to the square root of that, or t/v.

There’s a property of regression in general that implies this: If we want to predict talent from outcome, then, if the outcome X is y standard deviations from the mean, talent will be y(t/v) standard deviations from the mean. That’s one of the things that’s true for any regression of two variables.

So:

Expected talent = average + (number of SDs outcome is away from the mean) (t/v) * (SD of talent)

Expected talent = average + [(outcome - mean)/SD of outcome] [t/v] * (SD of talent)

Expected talent = average + (X – mean)/v * (t/v) * t

Expected talent = average + t^2/v^2 (X – mean)

That last equation means that when we look at how far the observation is from average, we “keep” t^2/v^2 of the difference, and regress to the mean by the rest. In other words, we regress to the mean by (1 – t^2/v^2), or “(100 * (1 – t^2/v^2)) percent”.

Now, if we regress to the mean by (1 – t^2/v^2), that’s the exactly the same as averaging

– (1 – t^2/v^2) parts average performance, and

– (t^2/v^2) parts observed performance.

For instance, if you’re regressing one-third of the way to the mean, you can do it two ways. You can (a) move from the average to the observation, and then move the other way by 1/3 of the difference, or (b) you can just take an average of two parts original and one part mean.

But how does that translate, in practical terms, into how many games of average performance we need to add?

From above, we know that:

For every t^2/v^2 games of observed performance, we want (1 – t^2/v^2) games of average performance.

And now a little algebra:

For every 1 game of observed performance, we want (1 – t^2/v^2)/(t^2/v^2) games of average performance.

Simplifying gives,

For every game of observed performance, we want (v^2-t^2)/t^2 games of average performance.

Multiply by g:

For every “g” games of observed performance, we want g(v^2-t^2)/t^2 games of average performance.

But, from equation 1, we know that (v^2-t^2) is just the squared SD of luck, which is .25/g. So,

For every “g” games of observed performance, we want g(.25/g)/t^2 games of average performance.

The “g”s cancel, and we get,

For every “g” games of observed performance, we want .25/t^2 games of average performance.

And that doesn’t depend on g! So no matter whether you’re regressing a team over 1 game, or 10 games, or 20 games, or 162 games, you can always add *the same number of average games* and get the right answer! I wouldn’t have guessed that.

——–

But how many games? Well, it’s (.25/t^2) games.

For baseball, we calculated earlier that t = 0.058. So .25/t^2 equals … 74 games. Exactly as Tango said, the number of games we’re adding is exactly the number of games for which SD(luck) equals SD(talent)!

Is that a coincidence? No, it’s not. It’s the way it has to be. Why? Here’s a semi-intuitive explanation.

As we saw above, the number of games we have to add does NOT depend on the number of games we started with in the observed W-L record. So, we can pick any number of games. Suppose we just happened to start with 74 games — maybe a team that was 40-34, or something.

Now, for that team, the SD of its talent is 0.058. And, the SD of its luck is also 0.058. Therefore, if we were to do a regression of talent vs. observed, we would necessarily come up with an r-squared of 0.5 — since the variances of talent and luck are exactly equal, talent explains half of the total variance.

That means the correlation coefficient, r, is the square root of 0.5, or 1 divided by the square root of 2. For every SD change in performance, we predict 1/sqr(2) SD change in talent. But the SD of talent is exactly 1/sqr(2) times the SD of performance. Multiply those two 1/sqr(2)’s together and you get 1/2, which means for every win change in performance, we predict 1/2 win change in talent.

That’s another way of saying that we want to regress exactly halfway back to the mean. That, in turn, is the equivalent of averaging one part observation, and one part mean. Since we have 74 games of observation, we need to add 74 games of mean.

So, in the case of “starting with 74 games of observation,” the answer is, “we need to add 74 games of .500 to properly regress to the mean.”

However, we showed above that we want to add the *same* number of .500 games regardless of how many observed games we started with. Since this case works out to 74 games, *all* situations must work out to 74 games.
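The conclusion, that you add the same number of .500 games regardless of season length, can also be checked by simulation rather than algebra. A quick sketch (my code; it assumes, as above, normally distributed talent with SD 0.058 and binomial luck):

```python
import numpy as np

rng = np.random.default_rng(0)
t = 0.058               # SD of talent on the winning-percentage scale
n_add = 0.25 / t**2     # ~74 games of .500 ball to add
teams = 200_000

for g in [8, 74, 162]:  # short and long "seasons" alike
    talent = rng.normal(0.500, t, size=teams)
    wins = rng.binomial(g, np.clip(talent, 0.0, 1.0))
    record = wins / g
    # The regression slope of talent on record says how much of a team's
    # deviation from .500 to keep; it should equal g / (g + n_add), which is
    # exactly the shrinkage implied by adding n_add games of .500 ball.
    slope = np.cov(talent, record)[0, 1] / np.var(record)
    assert abs(slope - g / (g + n_add)) < 0.01

print(round(n_add))  # 74
```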

I responded that “Tom Tango” is a great name, in the “Vance Maverick” or “Larry Lancaster” category. But Causey burst my bubble by informing me that he thinks it’s a pseudonym.

In answer to the original question, they’re doing hierarchical Bayes with a point estimate for the group-level variance. We do something similar in chapter 2 of BDA (the cancer-rate example), just with a Poisson rather than a binomial model. It’s good to see people re-deriving this sort of thing from scratch, and justifying it based on it making sense rather than just because it’s Bayesian. Of course if the model is correct the posterior mean will have lowest mean squared error, so all sorts of different derivations will get you the right answer.


The post (Py, R, Cmd) Stan 2.3 Released appeared first on Statistical Modeling, Causal Inference, and Social Science.

Instructions on how to install at:

As always, let us know if you’re having problems or have comments or suggestions.

We’re hoping to roll out the next release a bit quicker this time, because we have lots of good new features that are almost ready to go (vectorizing multivariate normal, higher-order autodiff for probability functions, differential equation solver, L-BFGS optimizer).

Here are the official release notes.

v2.3.0 (18 June 2014)

We had a record number of user-submitted patches this go around. Thanks to everyone!

**New Features**

* user-defined function definitions added to Stan language
* cholesky_corr data type for Cholesky factors of correlation matrices
* a^b syntax for pow(a,b) (thanks to Mitzi Morris)
* reshaping functions: to_matrix(), to_vector(), to_row_vector(), to_array_1d(), to_array_2d()
* matrix operations: quad_form_sym() (x' * Sigma * x), QR decompositions qr_Q(), qr_R()
* densities: Gaussian processes multi_gp_log(), multi_gp(), and alternative negative binomial parameterization neg_binomial_2()
* random number generator: multi_normal_cholesky_rng()
* sorting: sort_indices_*() for returning indexes in sorted order by value
* added JSON parser to C++ (not exposed through interfaces yet; thanks to Mitzi Morris)
* many fixes to I/O for data and inits to check consistency and report errors
* removed some uses of std::cout where they don't belong
* updated parser for C++11 compatibility (thanks to Rob Goedman)

**New Developer**

* added Marco Inacio as core developer

**Optimizations**

* turned off Eigen asserts
* efficiency improvements to posterior analysis print

**Documentation**

* clarified licensing policy for individual code contributions
* huge numbers of fixes to the documentation, including many user-contributed patches (thanks!), fixes to parallelization in CmdStan, Windows install instructions, boundaries for Dirichlet and Beta, removed suggestion to use floor and ceiling as indices, vectorized many models, clarified that && doesn't short circuit, clarified von Mises normalization, updated censoring doc (thanks to Alexey Stukalov), negative binomial doc enhanced, new references, new discussion of hierarchical models referencing Betancourt and Girolami paper
* Avraham Adler was particularly useful in pointing out and fixing documentation errors

**Bug Fixes**

* fixed bug in lkj density
* fixed bug in Jacobian for corr_matrix data type
* fixed cholesky_cov_matrix test code to allow use as parameter
* fixed poisson_rng, neg_binomial_rng
* allow binary operations (e.g., < and >) within range constraints
* support MS Visual Studio 2008
* fixed memory leaks in categorical sampling statement, categorical_log function, and softmax functions
* removed many compiler warnings
* numerous bug fixes to arithmetic test code conditions and messages, including calls from
* fixed model crashes when no parameter specified
* fixed template name conflicts for some compiler bugs (thanks Kevin S. Van Horn)

**Code Reorganizations & Updates**

* CmdStan is now in its own repository on GitHub: stan-dev/cmdstan
* consolidated and simplified error handling across modules
* pulled functionality from CmdStan command class and PyStan and RStan into Stan C++
* generalized some interfaces to allow std::vector as well as Eigen for compatibility
* generalized some I/O CSV writer capabilities
* optimization output text cleaned up
* integer overflow during I/O now raises informative error messages
* const correctness for chains (thanks Kevin S. Van Horn)


The post Estimating a customer satisfaction regression, asking only a subset of predictors for each person appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’d like to speak with you briefly to get your thoughts on the imputation of missing data in a new online web-survey technique I’m developing.

Our survey uses Split Questionnaire Design. The total number of surveys will vary in length with different customers, but will generally be between 5 to 15 questions in length.

We ask only a set of 3 questions from the question pool, consisting of the following: an overall customer satisfaction question that is asked of all customers/visitors, and 2 questions selected at random from the pool of questions.

The goal of collecting this data is to run a linear regression of the eight variables (independent variables) against overall satisfaction (the dependent variable). However, in order to do this, the missing data in the survey must be imputed so the data set is “rectangular”. This is important, as different imputation techniques are ideal for different intended outcomes; the goal here is to result in a data set to allow for a regression against the dependent variable, overall satisfaction.

The motivation was that people won’t want to answer 5 or 10 questions but maybe you can get them to answer 2 or 3.

At first I was stuck—will these subsets give you enough information to estimate the regression?—but after thinking about it for a few moments I realized that yes, you can do it. You can’t estimate the full joint distribution with just the subsets, but a regression model with no interactions is living on a lower-dimensional space.

With a sample in which you ask random pairs of questions to different people, you can estimate the covariance matrix for all your variables, and from this you can reconstruct the linear regression. Perhaps the easiest way to do this is not through imputation but rather to do it directly by constructing a regularized estimate of the covariance matrix.
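Here is a minimal simulation sketch of that covariance-matrix approach. All the specifics (8 predictors, the particular coefficients, exchangeable correlations, 2 predictor questions per respondent) are made up for illustration, and the pairwise-complete covariance estimate below is the simplest possible version, with no regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 8                          # respondents, predictor questions

# Full "true" data that, in practice, we never observe completely
Sigma_x = 0.3 + 0.7 * np.eye(p)            # exchangeable predictor correlations
X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)
beta = np.array([0.5, 0.3, 0.2, 0.1, 0.0, -0.1, -0.2, 0.4])
y = X @ beta + rng.normal(size=n)

# Each respondent answers overall satisfaction (y) plus 2 random predictors
mask = np.zeros((n, p), dtype=bool)
chosen = rng.random((n, p)).argsort(axis=1)[:, :2]
np.put_along_axis(mask, chosen, True, axis=1)

# Pairwise-complete estimate of the joint covariance of (X, y):
# each entry uses only the respondents who answered both questions
Z = np.column_stack([X, y])
obs = np.column_stack([mask, np.ones(n, dtype=bool)])   # y always observed
d = p + 1
means = np.array([Z[obs[:, j], j].mean() for j in range(d)])
S = np.empty((d, d))
for j in range(d):
    for k in range(j, d):
        both = obs[:, j] & obs[:, k]
        S[j, k] = S[k, j] = np.mean(
            (Z[both, j] - means[j]) * (Z[both, k] - means[k]))

# Reconstruct the regression slopes: beta = Cov(X)^{-1} Cov(X, y)
beta_hat = np.linalg.solve(S[:p, :p], S[:p, p])
print(np.round(beta_hat, 2))   # recovers something close to the true beta
```

With enough respondents every pair of questions gets jointly observed often enough to pin down the full covariance matrix, even though no single respondent answers more than 3 questions; in a real application one would want to regularize the estimated covariance matrix, since the raw pairwise-complete version is not guaranteed to be positive definite.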

The short story is that there are four concerns:

1. Do you have enough data to mathematically “identify” (that is, estimate) the regression? It turns out (and it’s easy to show mathematically) that, yes, if you ask at least 2 questions of each person, giving different questions randomly to different groups of people, you do have enough information to estimate the regression.

2. How do you actually do the estimation? There are various ways to estimate the regression. Various missing-data imputation programs will do the job. You can pretty much pick it based on which software you are comfortable with.

3. Efficiency. How many questions should you ask each person (at least 2, but you could ask 3 or 4, for example)? How do you do the design? Do you want some questions included more often than others? For this you could just make up a design, but it should also be possible to do some simulation studies to design something closer to optimal.

4. Practical issues that will come up, for example unintentional missing data (not everyone will answer every question). This can probably be handled easily enough using whatever missing-data imputation algorithm you are using.
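On the efficiency question in point 3, one quick back-of-the-envelope calculation (assuming each respondent's questions are drawn uniformly at random from the pool, as in the design described above) is how often any given pair of questions is jointly answered, since those joint observations are what pin down each covariance:

```python
from math import comb

# Probability that a specific pair of questions is jointly answered when
# each respondent gets k questions drawn uniformly from a pool of p.
# p = 8 matches the eight predictors in the survey described above.
p = 8
for k in (2, 3, 4):
    pr = comb(k, 2) / comb(p, 2)
    print(f"k = {k}: pair jointly observed with probability {pr:.3f}")
```

So moving from 2 to 3 questions per respondent triples the effective sample size behind each covariance estimate, and moving to 4 questions doubles it again, which is one concrete way to trade respondent burden against precision.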

I thought it would make sense to blog this since other people might be interested in doing this sort of thing too.

The post Estimating a customer satisfaction regression, asking only a subset of predictors for each person appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>

]]>Ziliak replied:

Ozler’s post is very good indeed, and well written.

Ozler’s suggestion for improvement of the cowpeas trial does not include investigation of soil, weather, and other exogenous differences, which will, in a random design, bias even the simplest difference, MS – TS, on a single farm. But this is to me a lingering trait of post-Fisherian agronomy, which has not looked into Gosset’s (aka Student’s) work on repeated balanced trials. Nor are current researchers as aware as they might be of the battles between Student and Fisher, and of the fact that Egon Pearson, Neyman, Harold Jeffreys, John Wishart, and others sided with Student’s balance, not Fisher’s randomization.

Here and here are two papers on the topic and history.

Thus in my [Ziliak’s] view, it is misleading to lump “Fisher, Yates, Neyman” et al. together and not mention Gosset. There was lots of conflict between them. All were learning from and responding to Gosset’s work, and Fisher not in the most scientific manner (as I argue in detail in the Review of Behavioral Economics article).

I seem to recall someone (Rubin?) telling me that Gosset was way overrated, but now I’m forgetting all the details.

The post More on those randomistas appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>

]]>Yup, another social psychology researcher from northwestern Europe who got results that people just don’t believe.

I’m a fan of Retraction Watch but not a regular reader so I actually heard about this one indirectly, via this email from Baruch Eitam which contained the above link and the following note:

Of the latest troubles in social psychology you have probably heard. About the others I wasn’t surprised; I “grew up” in this climate. I know these practices, and in a real sense the manner my lab works is molded as a negative to these practices (mostly, shared scripts and data files and replications, replications, replications). I personally wasted ~1.5 years of work and 600(!) participants trying to replicate one of Bargh’s experiments (at Columbia, btw).

Jens’s case surprised me a great deal. I met the guy and he seemed to me the most passionate and honest person. Also not one of these hotshots or ideologists who treat the data as ornamentation to their great ideas. Orthogonally to my prior, I am also disgusted by the manner of the discussion. I find that the whistleblowers have turned into a bunch of bloodthirsty hunters who fail to show even a shred of doubt when at stake (with all due respect to science) is a person’s life’s work, integrity, and livelihood.

I know you pitched in the discussion, and we had an excellent lab [meeting] around your recent paper on the non-fishing version of “researcher degrees of freedom.” I strongly resonate with this rigor squeeze and have intuitively applied many of these procedures, without much guidance, as early as my second year of doctoral studies. Still, to what degree of certainty can we be sure about the conclusions based on the (weird, I must admit) linear pattern found in Forster’s data? Can we quantify the certainty of the conclusion itself? Shouldn’t at least two different statistical approaches be applied when the future of a person is at stake? I also asked a collaborator of mine, Zoltan Dienes, to look into the “statistical accusations,” but I would love to hear your opinion and would be more than glad if you addressed this on your blog, as I feel it is an ethical matter at least as much as it is a professional one (giving the dude a fair hearing, that is).

I have tried to voice this opinion on Retraction Watch but was attacked by a bunch of angry men and lost interest (I am from Israel; there are more serious battles to fight here…)

My reply: Figure 3 on page 9 of this report looks pretty convincing, no?

The only weird thing is, why would someone make such data so linear on purpose? It makes me suspect that perhaps what Forster reported as raw data were not raw data but rather were derived quantities based on a fitted model (perhaps with the rare departures from perfect linearity arising from roundoff issues). I could imagine this happening in a collaborative project where data and analyses are passed around. Forster wrote that he “never manipulated data” but later he said “if data manipulation took place, something that cannot even be decided on the basis of the available data, it cannot be said who did it and how it was done” which makes me wonder whether there were some holes in the collaboration.

At this point, one possible scenario for Forster seems to be that something is very wrong in these papers, he knows it, and he’s trying to give denials that are misleading but literally true. I don’t know for sure, but it does seem like (a) the data are fishy and (b) he must know that these aren’t the raw data. Another possibility is that, if the raw data ever were recovered, the ultimate research conclusions might not change much. I can’t really tell, having not looked at the original papers and indeed having only skimmed the report from the university committee.

It’s hard for me to picture someone going to the trouble of making up data to look exactly linear, but I could easily see a research team, through a mix of sloppiness and misunderstanding of statistics, taking the estimates from a linear model and then plugging them into a later analysis. Sometimes this sort of post-processing makes sense in an analysis; other times it’s not the right thing to do, but people do it without realizing the problem.

The post Too Linear To Be True: The curious case of Jens Forster appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>