The post “Women Respond to Nobel Laureate’s ‘Trouble With Girls'” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Some people are horrified by what the old guy said; other people are horrified by how he was treated. He was clueless in his views about women in science, or he was cluelessly naive about gotcha journalism. I haven’t been following the details, so I’ll express no judgment one way or the other.

My comment on the episode is that I find the whole “Nobel laureate” thing a bit tacky in general. People get the prize and they get attention for all sorts of stupid things, people do all sorts of things in order to try to get it, and, beyond all that, research shows that *not* getting the Nobel Prize reduces your expected lifespan by two years. Bad news all around.

**P.S.** Regarding this case in particular, Basbøll points to this long post from Louise Mensch. Again, I don’t want to get involved in the details here, but I am again reminded how much I prefer blogs to twitter. On the positive side, I prefer a blog exchange to a twitter debate. And, on the negative side, I’d rather see a blogwar than a twitter mob.

The post What’s the stupidest thing the NYC Department of Education and Columbia University Teachers College did in the past decade? appeared first on Statistical Modeling, Causal Inference, and Social Science.

The principal of a popular elementary school in Harlem acknowledged that she forged answers on students’ state English exams in April because the students had not finished the tests . . . As a result of the cheating, the city invalidated several dozen English test results for the school’s third grade.

The school is a new public school—it opened in 2011—that is run jointly by the New York City Department of Education and Columbia University Teachers College.

So far, it just seems like an unfortunate error. According to the news article, “Nancy Streim, associate vice president for school and community partnerships at Teachers College, said Ms. Worrell-Breeden had created a ‘culture of academic excellence'” at the previous school where she was principal. Maybe Worrell-Breeden just cared too much and was under too much pressure to succeed; she cracked and helped the students cheat.

But then I kept reading:

In 2009 and 2010, while Ms. Worrell-Breeden was at P.S. 18, she was the subject of two investigations by the special commissioner of investigation. The first found that she had participated in exercise classes while she was collecting what is known as “per session” pay, or overtime, to supervise an after-school program. The inquiry also found that she had failed to offer the overtime opportunity to others in the school, as required, before claiming it for herself.

The second investigation found that she had inappropriately requested and obtained notarized statements from two employees at the school in which she asked them to lie and say that she had offered them the overtime opportunity.

*After* those findings, we learn, “She moved to P.S. 30, another school in the Bronx, where she was principal briefly before being chosen by Teachers College to run its new school.”

So, let’s get this straight: She was found to be a liar, a cheat, and a thief, and then, with all that known, she was hired for *two* jobs as school principal??

The news article quotes Nancy Streim of Teachers College as saying, “We felt that on balance, her recommendations were so glowing from everyone we talked to in the D.O.E. that it was something that we just were able to live with.”

On balance, huh? Whatever else you can say about Worrell-Breeden, she seems to have had the talent of conning powerful people. Or maybe just one or two powerful people in the Department of Education who had the power to get her these jobs.

This is really bad. Is it so hard to find a school principal that you have no choice but to hire someone who lies, cheats, and steals?

It just seems weird to me. I accept that all of us have character flaws, but this is ridiculous. Principal is a supervisory position. What kind of toxic environment will you have in a school where the principal is in the habit of forging documents and instructing employees to lie? How could this possibly be considered a good idea?

Here’s the blurb on the relevant Teachers College official:

Nancy Streim joined Teachers College in August 2007 in the newly created position of Associate Vice President for School and Community Partnership. . . . Dr. Streim comes to Teachers College after nineteen years at the University of Pennsylvania’s Graduate School of Education where she most recently served as Associate Dean for Educational Practice. . . . She recently completed a year long project for the Bill and Melinda Gates Foundation in which she documented principles underlying successful university-assisted public schools across the U.S. She has served as principal investigator for five major grant-funded projects that address the teaching and learning of math and science in elementary and middle grades.

It’s not clear to me whether Streim actually thought Worrell-Breeden was the best person for the job. Reading between the lines, maybe what happened is that Worrell-Breeden was plugged into the power structure at the Department of Education and someone at the D.O.E. lined up the job for her.

In a talk I found online, Streim says something about “patient negotiations” with school officials. Maybe a few years ago someone in power told her: Yes, we’ll give you a community school to run, but you have to take Worrell-Breeden as principal. I don’t know, but it’s possible.

I guess I’d prefer to think that Teachers College made a dirty but necessary deal. That’s more palatable to me than the idea that the people at the Department of Education and Teachers College thought it was a good idea to hire a liar/cheat/thief as a school principal.

Or maybe I’m missing the point? Perhaps integrity is not so important. The world is full of people with integrity but no competence, and we wouldn’t want that either.

The post “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)” appeared first on Statistical Modeling, Causal Inference, and Social Science.

**The statistics**

The statistical content has to do with a biology paper by M. Kellis, B. W. Birren, and E.S. Lander from 2004 that contains the following passage:

Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions.

Here’s where the 95% came from. In Pachter’s words:

The authors identified 457 duplicated gene pairs that arose by whole genome duplication (for a total of 914 genes) in yeast. Of the 457 pairs 76 showed accelerated (protein) evolution in S. cerevisiae. The term “accelerated” was defined to relate to amino acid substitution rates in S. cerevisiae, which were required to be 50% faster than those in another yeast species, K. waltii. Of the 76 pairs, only four were accelerated in both paralogs. Therefore 72 gene pairs showed acceleration in only one paralog (72/76 = 95%).

In his post on the topic, Pachter asks for a p-value for this 72/76 result, which the authors of the paper in question had called “surprising.”

My first thought on the matter was that no p-value is needed because 72 out of 76 is such an extreme proportion. I guess I’d been implicitly comparing to a null hypothesis of 50%. Or, to put it another way, if you have 76 pairs containing 80 accelerated genes (I think I did this right and that I’m not butchering the technical jargon: I got 80 by taking the 72 pairs with one accelerated paralog each, plus the 4 pairs with two accelerated paralogs each), it would be extremely extremely unlikely to see only four pairs with acceleration in both.

But, then, as I read on, I realized this isn’t an appropriate comparison. Indeed, the clue is above, where Pachter notes that there were 457 pairs in total, thus in a null model you’re working with a probability of 80/(2*457) = 0.087, and when the probability is 0.087, it’s not so unlikely that you’d only see 4 pairs out of 457 with two accelerated paralogs. (Just to get the order of magnitude, 0.087^2 = 0.0077, and 0.0077*457 = 3.5, so 4 pairs is pretty much what you’d expect.)
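The arithmetic above is easy to check in code. Here is a quick sketch (my own variable names; the exact binomial tail probability at the end is my own addition, not a number from Pachter’s post):

```python
import math

n_pairs = 457
n_accel_genes = 72 + 2 * 4                 # 72 single-accelerated pairs, 4 double
p_accel = n_accel_genes / (2 * n_pairs)    # chance a given gene is accelerated, ~0.087

# Expected number of pairs with BOTH paralogs accelerated, under a null in
# which acceleration hits the two paralogs independently:
p_both = p_accel ** 2
expected_double = n_pairs * p_both         # ~3.5, so 4 observed pairs is unsurprising

# Exact binomial tail: probability of seeing 4 or more double-accelerated
# pairs out of 457 under this null.
p_tail = 1 - sum(math.comb(n_pairs, k) * p_both**k * (1 - p_both)**(n_pairs - k)
                 for k in range(4))
print(round(expected_double, 1), round(p_tail, 2))
```

The expected count comes out around 3.5, and the chance of seeing 4 or more double-accelerated pairs under this null is on the order of one half, which is exactly why 4 pairs is pretty much what you’d expect.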

So it sounds like Kellis et al. got excited by this 72 out of 76 number, without being clear on the denominator. I don’t know enough about biology to comment on the implications of this calculation on the larger questions being asked.

Pachter frames his criticisms around p-values, a perspective I find a bit irrelevant, but I agree with his larger point that, where possible, probability models should be stated explicitly.

*The link between the scientific theory and statistical theory is often a weak point in quantitative research.* In this case, the science has something to do with genes and evolution, and the statistical model was what allowed Kellis et al. to consider 72 out of 76 to be “striking” and “surprising.” It is all too common for a researcher to reject a null hypothesis that is not clearly formed, in order to then make a positive claim of support for some preferred theory. But a lot of steps are missing in such an argument.

**The culture**

The cultural issue is summarized in this comment by Michael Eisen:

The more this conversation goes on the more it disturbs me [Eisen]. Lior raised an important point regarding the analyses contained in an influential paper from the early days of genome sequencing. A detailed, thorough and occasionally amusing discussion ensued, the long and the short of which to any intelligent reader should be that a major conclusion of the paper under discussion was simply wrong. This is, of course, how science should proceed (even if it rarely does). People make mistakes, others point them out, we all learn something in the process, and science advances.

However, I find the responses from Manolis and Eric to be entirely lacking. Instead of really engaging with the comments people have made, they have been almost entirely defensive. Why not just say “Hey look, we were wrong. In dealing with this complicated and new dataset we did an analysis that, while perhaps technically excusable under some kind of ‘model comparison defense’ was, in hindsight, wrong and led us to make and highlight a point that subsequent data and insights have shown to be wrong. We should have known better at the time, but we’ve learned from our mistake and will do better in the future. Thanks for helping us to be better scientists.”

Sadly, what we’ve gotten instead is a series of defenses of an analysis that Manolis and Eric – who is no fool – surely know by this point was simply wrong.

In an update, Pachter amplifies upon this point:

One of the comments made in response to my post that I’d like to respond to first was by an author of KBL [Kellis, Birren, and Lander; in this case the comment was made by Kellis] who dismissed the entire premise of my challenge, writing “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become, as Ferguson and Heene noted, a vast graveyard of undead theories?

Indeed. One of the things I’ve been fighting against recently (for example, in my article, It’s too hard to publish criticisms and obtain data for replication, or in this discussion of some controversial comments about replication coming from a cancer biologist) is the idea that, once something is published, it should be taken as truth. This attitude, of raising a high bar to post-publication criticism, is sometimes framed in terms of fairness. But, as I like to say, what’s so special about publication in a journal? Should there be a high barrier to criticisms of claims made in Arxiv preprints? What about scrawled, unpublished lab notes??? Publication can be a good way of spreading the word about a new claim or finding, but I don’t don’t don’t don’t don’t like the norm in which something that is published should not be criticized.

To put it another way: Yes, ha ha ha, let’s spend our time on guitar practice rather than exhuming 11-year-old published articles. Fine—I’ll accept that, as long as you also accept that we should not be *citing* 11-year-old articles.

If a paper is worth citing, it’s worth criticizing its flaws. Conversely, if you *don’t* think the flaws in your 11-year-old article are worth careful examination, maybe there could be some way you could withdraw your paper from the published journal? Not a “retraction,” exactly, maybe just an Expression of Irrelevance? A statement by the authors that the paper in question is no longer worth examining as it does not relate to any current research concerns, nor are its claims of historical interest. Something like that. Keep the paper in the public record but make it clear that the authors no longer stand behind its claims.

**P.S.** Elsewhere Pachter characterizes a different work of Kellis as “dishonest and fraudulent.” Strong words, considering Kellis is a tenured professor at MIT who has received many awards. As an outsider to all this, I’m wondering: Is it possible that Kellis is dishonest, fraudulent, and *also* a top researcher? Kinda like how Linda is a bank teller who is also a feminist? Maybe Kellis is an excellent experimentalist but with an unfortunate habit of making overly broad claims from his data? Maybe someone can help me out on this.

The post Ripped from the pages of a George Pelecanos novel appeared first on Statistical Modeling, Causal Inference, and Social Science.

Did anyone else notice that this DC multiple-murder case seems just like a Pelecanos story?

Check out the latest headline, “D.C. Mansion Murder Suspect Is Innocent Because He Hates Pizza, Lawyer Says”:

Robin Flicker, a lawyer who has represented suspect Wint in the past but has not been officially hired as his defense attorney, says police are zeroing in on Wint because his DNA was found on pizza at the crime scene. The only problem, Flicker said, is that Wint doesn’t like pizza.

“He doesn’t eat pizza,” Flicker told ABC News. “If he were hungry, he wouldn’t order pizza.”

When I saw the DC setting, the local businessman, the manhunt, and the horror/comic story of a pizza-ordering killer, I thought about Pelecanos immediately. And then I noticed that the victim’s family was Greek. Can’t get more Pelecanos than that.

I googled *pizza murder dc pelecanos* but I didn’t see any hits at all. I can’t figure that one out: surely someone would interview him for his thoughts on this one?

The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

**Wed:** What do I say when I don’t have much to say?

**Thurs:** “Women Respond to Nobel Laureate’s ‘Trouble With Girls’”

**Fri:** This sentence by Thomas Mallon would make Barry N. Malzberg spin in his grave, except that he’s still alive so it would just make him spin in his retirement

**Sat:** If you leave your datasets sitting out on the counter, they get moldy

**Sun:** Spam!

The post The 3 Stages of Busy appeared first on Statistical Modeling, Causal Inference, and Social Science.

This week quickly got booked after last week’s NIPS deadline.

So we’re meeting in another week. *That’s* busy for you: after one week off the grid, he had a week’s worth of pent-up meetings! I thought I was busy, but it’s nothing like that.

And this made me formulate my idea of the 3 Stages of Busy. It goes like this:

Stage 1 (early career): Not busy, at least not with external commitments. You can do what you want.

Stage 2 (mid career; my friend described above): Busy, overwhelmed with obligations.

Stage 3 (late career; me): So busy that it’s pointless to schedule anything, so you can do what you want (including writing blogs two months in advance!).

The post Ira Glass asks. We answer. appeared first on Statistical Modeling, Causal Inference, and Social Science.

The celebrated radio quiz show star says:

There’s this study done by the Pew Research Center and Smithsonian Magazine . . . they called up one thousand and one Americans. I do not understand why it is a thousand and one rather than just a thousand. Maybe a thousand and one just seemed sexier or something. . . .

I think I know the answer to this one! The survey may well have aimed for 1000 people, but you can’t know ahead of time exactly how many people will respond. They call people, leave messages, call back, call back again, etc. The exact number of people who end up in the survey is a random variable.
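Here is a toy simulation of that process (entirely my own invention; I have no idea what Pew’s actual fielding procedure looks like): keep dialing in batches until the completed count crosses 1,000, and the final sample size lands somewhere at or just above 1,000 — sometimes exactly 1,001.

```python
import random

random.seed(1)

def completed_interviews(target=1000, batch_size=25, response_rate=0.09):
    """Dial one batch of numbers at a time; stop after the first batch
    that pushes the number of completed interviews past the target."""
    completes = 0
    while completes < target:
        completes += sum(random.random() < response_rate
                         for _ in range(batch_size))
    return completes

final_ns = [completed_interviews() for _ in range(200)]
print(min(final_ns), max(final_ns))   # always >= 1000, but the exact n varies
```

The target is fixed; the realized sample size is a random variable.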

The post 45 years ago in the sister blog appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post More gremlins: “Instead, he simply pretended the other two estimates did not exist. That is inexcusable.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Brandon Shollenberger writes:

I’ve spent some time examining the work done by Richard Tol which was used in the latest IPCC report. I was troubled enough by his work I even submitted a formal complaint with the IPCC nearly two months ago (I’ve not heard back from them thus far). It expressed some of the same concerns you expressed in a post last year.

The reason I wanted to contact you is I recently realized most people looking at Tol’s work are unaware of a rather important point. I wrote a post to explain it which I’d invite you to read, but I’ll give a quick summary to possibly save you some time.

As you know, Richard Tol claimed moderate global warming will be beneficial based upon a data set he created. However, errors in his data set (some of which are still uncorrected) call his results into question. Primarily, once several errors are corrected, it turns out the only result which shows any non-trivial benefit from global warming is Tol’s own 2002 paper.

That is obviously troubling, but there is a point which makes this even worse. As it happens, Tol’s 2002 paper did not include just one result. It actually included three different results. A table for it shows those results are +2.3%, +0.2% and -2.7%.

The 2002 paper does nothing to suggest any one of those results is the “right” one, nor does any of Tol’s later work. That means Tol used the +2.3% value from his 2002 paper while ignoring the +0.2% and -2.7% values, without any stated explanation.

It might be true the +2.3% value is the “best” estimate from the 2002 paper, but even if so, one needs to provide an explanation as to why it should be favored over the other two estimates. Tol didn’t do so. Instead, he simply pretended the other two estimates did not exist. That is inexcusable.

I’m not sure how interested you are in Tol’s work, but I thought you might be interested to know things are even worse than you thought.

This is horrible and also kind of hilarious. We start with a published paper by Tol claiming strong evidence for a benefit from moderate global warming. Then it turns out he had some data errors; fixing the errors led to a weakening of his conclusions. Then more errors came out, and it turned out that there was only one point in his entire dataset supporting his claims—and that point came from his own previously published study. And then . . . even that one point isn’t representative of that paper.

You pull and pull on the thread, and the entire garment falls apart. There’s nothing left.

At no point did Tol apologize or thank the people who pointed out his errors; instead he lashed out, over and over again. Irresponsible indeed.

The post Stan 2.7 (CRAN, variational inference, and much much more) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Stan 2.7 is now available for all interfaces. As usual, everything you need can be found starting from the Stan home page:

- RStan is on CRAN! (1)
- Variational Inference in CmdStan!! (2)
- Two new Stan developers!!!
- A whole new logo!!!!
- Math library with autodiff now available in its own repo!!!!!

(2) Coming soon to an interface near you.

v2.7.0 (9 July 2015)
======================================================================

New Team Members
--------------------------------------------------
* Alp Kucukelbir, who brings you variational inference
* Robert L. Grant, who brings you the StataStan interface

Major New Feature
--------------------------------------------------
* Black-box variational inference, mean field and full rank (#1505)

New Features
--------------------------------------------------
* Line numbers reported for runtime errors (#1195)
* Wiener first passage time density (#765) (thanks to Michael Schvartsman)
* Partial initialization (#1069)
* NegBinomial2 RNG (#1471) and PoissonLog RNG (#1458) and extended range for Dirichlet RNG (#1474) and fixed Poisson RNG for older Mac compilers (#1472)
* Error messages now use operator notation (#1401)
* More specific error messages for illegal assignments (#1100)
* More specific error messages for illegal sampling statement signatures (#1425)
* Extended range on ibeta derivatives with wide impact on CDFs (#1426)
* Display initialization error messages (#1403)
* Works with Intel compilers and GCC 4.4 (#1506, #1514, #1519)

Bug Fixes
--------------------------------------------------
* Allow functions ending in _lp to call functions ending in _lp (#1500)
* Update warnings to catch uses of illegal sampling functions like CDFs and updated declared signatures (#1152)
* Disallow constraints on local variables (#1295)
* Allow min() and max() in variable declaration bounds and remove unnecessary use of math.h and top-level :: namespace (#1436)
* Updated exponential lower bound check (#1179)
* Extended sum to work with zero size arrays (#1443)
* Positive definiteness checks fixed (were > 1e-8, now > 0) (#1386)

Code Reorganization and Back End Upgrades
--------------------------------------------------
* New static constants (#469, #765)
* Added major/minor/patch versions as properties (#1383)
* Pulled all math-like functionality into stan::math namespace
* Pulled the Stan Math Library out into its own repository (#1520)
* Included in Stan C++ repository as submodule
* Removed final usage of std::cout and std::cerr (#699) and updated tests for null streams (#1239)
* Removed over 1000 CppLint warnings
* Remove model write CSV methods (#445)
* Reduced generality of operators in fvar (#1198)
* Removed folder-level includes due to order issues (part of Math reorg) and include math.hpp include (#1438)
* Updated to Boost 1.58 (#1457)
* Travis continuous integration for Linux (#607)
* Add grad() method to math::var for autodiff to encapsulate math::vari
* Added finite diff functionals for testing (#1271)
* More configurable distribution unit tests (#1268)
* Clean up directory-level includes (#1511)
* Removed all lint from new math lib and add cpplint to build lib (#1412)
* Split out derivative functionals (#1389)

Manual and Documentation
--------------------------------------------------
* New Logo in Manual; remove old logos (#1023)
* Corrected all known bug reports and typos; details in issues #1420, #1508, #1496
* Thanks to Sunil Nandihalli, Andy Choi, Sebastian Weber, Heraa Hu, @jonathan-g (GitHub handle), M. B. Joseph, Damjan Vukcevic, @tosh1ki (GitHub handle), Juan S. Casallas
* Fix some parsing issues for index (#1498)
* Added chapter on variational inference
* Added strangely unrelated regressions and multivariate probit examples
* Discussion from Ben Goodrich about reject() and sampling
* Start to reorganize code with fast examples first, then explanations
* Added CONTRIBUTING.md file (#1408)

The post BREAKING . . . Kit Harrington’s height appeared first on Statistical Modeling, Causal Inference, and Social Science.

Rasmus “ticket to” Bååth writes:

I heeded your call to construct a Stan model of the height of Kit “Snow” Harrington. The response on Gawker has been poor, unfortunately, but here it is, anyway.

Yeah, I think the people at Gawker have bigger things to worry about this week. . . .

Here’s Rasmus’s inference for Kit’s height:

And here’s his summary:

From this analysis it is unclear how tall Kit is; there is much uncertainty in the posterior distribution. But according to the analysis (which might be quite off), there’s a 50% probability he’s between 171 and 175 cm tall. It is stated in the article that he is NOT 5’8” (173 cm), but according to this analysis it’s not an unreasonable height, as the mean of the posterior is 173 cm.

His Stan model is at the link. (I tried to copy it here but there was some html crap.)
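Since the model itself didn’t survive the copy-paste, here is a minimal stand-in for the flavor of the thing, in Python rather than Stan, and emphatically NOT Rasmus’s actual model: treat each reported height as a noisy measurement of a true height and compute the posterior on a grid under a flat prior. The three “reports” and the noise level are invented for illustration.

```python
import math

reports_cm = [168.0, 173.0, 178.0]   # hypothetical conflicting height reports
noise_sd = 4.0                        # hypothetical reporting noise, in cm

grid = [150.0 + 0.1 * i for i in range(501)]   # candidate heights, 150-200 cm

def likelihood(h):
    """Normal measurement model: each report is the true height plus noise."""
    return math.prod(math.exp(-0.5 * ((r - h) / noise_sd) ** 2)
                     for r in reports_cm)

weights = [likelihood(h) for h in grid]   # flat prior => posterior ∝ likelihood
total = sum(weights)
post_mean = sum(h * w for h, w in zip(grid, weights)) / total
print(round(post_mean, 1))
```

With symmetric made-up reports, the posterior mean sits at the middle report; the point is only that conflicting reports plus a measurement-error model yield a posterior distribution over heights, not a single number.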

The post A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program appeared first on Statistical Modeling, Causal Inference, and Social Science.

As D.M.C. would say, bad meaning bad not bad meaning good.

Deborah Mayo points to this terrible, terrible definition of statistical significance from the Agency for Healthcare Research and Quality:

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

The definition is wrong, as is the example. I mean, really wrong. So wrong that it’s perversely impressive how many errors they managed to pack into two brief paragraphs:

1. I don’t even know what it means to say “whether the results of a study are likely to be true.” The results are the results, right? You could try to give them some slack and assume they meant, “whether the results of a study represent a true pattern in the general population” or something like that—but, even so, it’s not clear what is meant by “true.”

2. Even if you could somehow get some definition of “likely to be true,” that is not what statistical significance is about. It’s just not.

3. “Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance.” Ummm, this is close, if you replace “an effect” with “a difference at least as large as what was observed” and if you append “conditional on there being a zero underlying effect.” Of course in real life there are very few zero underlying effects (I hope the Agency for Healthcare Research and Quality mostly studies treatments with positive effects!), hence the irrelevance of statistical significance to relevant questions in this field.

4. “The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).” No no no no no. As has been often said, the p-value is a measure of sample size. And, even conditional on sample size, and conditional on measurement error and variation between people, the probability that the results are true (whatever exactly that means) depends strongly on what is being studied, what Tversky and Kahneman called the base rate.
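Point 4 is easy to demonstrate: fix a small true effect and watch the p-value collapse as the sample size grows (toy numbers of my own, not from the AHRQ document):

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided p-value for a z-test of a sample mean against zero."""
    z = abs(effect) / (sd / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))   # equals 2 * (1 - Phi(z))

# Same underlying effect (0.1 sd units); only the sample size changes:
for n in (10, 100, 1000, 10000):
    print(n, two_sided_p(0.1, 1.0, n))
```

The effect never changes, but the p-value goes from comfortably nonsignificant to astronomically small, which is the sense in which the p-value is a measure of sample size.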

5. As Mayo points out, it’s sloppy to use “likely” to talk about probability.

6. “Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).” Ummmm, yes, I guess that’s correct. Lots of ignorant researchers believe this. I suppose that, without this belief, Psychological Science would have difficulty filling its pages, and Science, Nature, and PPNAS would have no social science papers to publish and they’d have to go back to their traditional plan of publishing papers in the biological and physical sciences.

7. “The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.” Hahahahahaha. Funny. What’s really amusing is that they hyperlink “probability” so we can learn more technical stuff from them. OK, I’ll bite, I’ll follow the link:

Probability

Definition: The likelihood (or chance) that an event will occur. In a clinical research study, it is the number of times a condition or event occurs in a study group divided by the number of people being studied.

Example: For example, a group of adult men who had chest pain when they walked had diagnostic tests to find the cause of the pain. Eighty-five percent were found to have a type of heart disease known as coronary artery disease. The probability of coronary artery disease in men who have chest pain with walking is 85 percent.

Fuuuuuuuuuuuuuuuck. No no no no no. First, of course “likelihood” has a technical use which is not the same as what they say. Second, “the number of times a condition or event occurs in a study group divided by the number of people being studied” is a frequency, not a probability.
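To make the frequency-vs-probability distinction concrete: the observed frequency is an *estimate* of an underlying probability, and how good an estimate it is depends on the sample size, which the AHRQ definition never mentions. A toy calculation (the sample sizes are invented; the 85% is theirs):

```python
import math

def approx_95_interval(successes, n):
    """Crude normal-approximation 95% interval for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - 1.96 * se, p_hat + 1.96 * se

# An observed frequency of 85% pins down the underlying probability very
# differently depending on how many men were studied:
print(approx_95_interval(17, 20))      # 85% of 20: roughly (0.69, 1.01)
print(approx_95_interval(850, 1000))   # 85% of 1000: roughly (0.83, 0.87)
```

Same frequency, wildly different information about the probability; calling the two the same thing erases exactly the uncertainty that statistics is supposed to quantify. (That the first interval pokes above 1 is a known defect of this crude approximation with small n.)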

It’s refreshing to see these sorts of errors out in the open, though. If someone writing a tutorial makes these huge, huge errors, you can see how everyday researchers make these mistakes too.

For example:

A pair of researchers find that, for a certain group of women they are studying, three times as many are wearing red or pink shirts during days 6-14 of their monthly cycle (which the researchers, in their youthful ignorance, were led to believe were the most fertile days of the month). Therefore, the *probability* (see above definition) of wearing red or pink is three times higher during these days. And the result is *statistically significant* (see above definition), so the results are probably true. That pretty much covers it.

All snark aside, I’d never really had a sense of the reasoning by which people get to these sorts of ridiculous claims based on such shaky data. But now I see it. It’s the two steps: (a) the observed frequency is the probability, (b) if p less than .05 then the result is probably real. Plus, the intellectual incentive of having your pet theory confirmed, and the professional incentive of getting published in the tabloids. But underlying all this are the wrong definitions of “probability” and “statistical significance.”
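To see how badly step (b) can fail, here's a back-of-the-envelope Bayes calculation. All the numbers are hypothetical: 80% power, the conventional 0.05 threshold, and two guesses at the base rate of true effects.

```python
# Hypothetical illustration: the chance that a "significant" result is real
# depends on the base rate of true effects, not just on the p-value threshold.
def prob_true_given_significant(base_rate, power=0.8, alpha=0.05):
    """P(effect is real | result is significant), by Bayes' rule."""
    true_pos = base_rate * power           # real effects that come up significant
    false_pos = (1 - base_rate) * alpha    # null effects that come up significant
    return true_pos / (true_pos + false_pos)

# Same alpha = 0.05, wildly different answers depending on the base rate:
print(round(prob_true_given_significant(0.50), 2))  # 0.94: half of hypotheses are real
print(round(prob_true_given_significant(0.01), 2))  # 0.14: real effects are rare
```

Same threshold, very different conclusions: in a field where real effects are rare, most "significant" findings are false alarms.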

Who wrote these definitions in this U.S. government document, I wonder? I went all over the webpage and couldn’t find any list of authors. This relates to a recurring point made by Basbøll and myself: it’s hard to know what to do with a piece of writing if you don’t know where it came from. Basbøll and I wrote about this in the context of plagiarism (a statistical analogy would be the statement that it can be hard to effectively use a statistical method if the person who wrote it up doesn’t understand it himself), but really the point is more general. If this article on statistical significance had an author of record, we could examine the author’s qualifications, possibly contact him or her, see other things written by the same author, etc. Without this, we’re stuck.

Wikipedia articles typically don’t have named authors, but the authors do have online handles and they thus take responsibility for their words. Also Wikipedia requires sources. There are no sources given for these two paragraphs on statistical significance which are so full of errors.

**What, then?**

The question then arises: how *should* statistical significance be defined in one paragraph for the layperson? I think the solution is, if you’re not gonna be rigorous, don’t fake it.

Here’s my try.

Statistical Significance

Definition: A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result at least as strong as the one observed, under the null hypothesis (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.

I think that’s better than their definition. Of course, I’m an experienced author of statistics textbooks so I should be able to correctly and concisely define p-values and statistical significance. But . . . the government could’ve asked me to do this for them! I’d’ve done it. It only took me 10 minutes! Would I write the whole glossary for them? Maybe not. But at least they’d have a correct definition of statistical significance.
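For concreteness, here's a minimal sketch of the calculation behind that definition, using a hypothetical normal-theory estimate (the numbers are mine, not the government's):

```python
from statistics import NormalDist

# Sketch: the p-value is the probability, under the null hypothesis, of
# seeing a result at least as extreme as the one observed.
def two_sided_p_value(estimate, std_error):
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# An estimate two standard errors from zero:
p = two_sided_p_value(estimate=2.0, std_error=1.0)
print(round(p, 3))  # 0.046, just under the conventional 0.05
```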

I guess they can go back now and change it.

Just to be clear, I’m not trying to slag on whoever prepared this document. I’m sure they did the best they could; they just didn’t know any better. It would be as if someone asked me to write a glossary about medicine. The flaw lies with whoever commissioned the glossary and didn’t run it by an expert. Or maybe they could’ve just omitted the glossary entirely, as these topics are covered in standard textbooks.

P.S. And whassup with that ugly, ugly logo? It’s the U.S. government. We’re the greatest country on earth. Sure, our health-care system is famously crappy, but can’t we come up with a better logo than this? Christ.

P.P.S. Following Paul Alper’s suggestion, I made my definition more general by removing the phrase, “that the true underlying effect is zero.”

P.P.P.S. The bigger picture, though, is that I don’t think people should be making decisions based on statistical significance in any case. In my ideal world, we’d be defining statistical significance just as a legacy project, so that students can understand outdated reports that might be of historical interest. If you’re gonna define statistical significance, you should do it right, but really I think all this stuff is generally misguided.

The post A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Don’t put your whiteboard behind your projection screen appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>If you have any say in setting up your seminar rooms, don’t put your board behind your screen, please — I almost always want to use them both at the same time.

I also just got back from a DARPA workshop at the Embassy Suites in Portland, and there the problem was a podium in between two tiny screens, neither of which was easily visible from the back of the big ballroom. Nobody knows where to point when there are two screens. One big screen is way better.

At my summer school course in Sydney earlier this year, they had a neat setup where there were two screens, but one could be used with an overhead projection of a small desktop, so I could just write on paper and send it up to the second screen. And the screens were big enough that all 200+ students could see both. Yet another great feature of Australia.

The post Don’t put your whiteboard behind your projection screen appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Richard Feynman and the tyranny of measurement appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>This is all fine and it makes sense to me. Indeed, I recognize Feynman’s attitude myself: it can often take a lot of work to follow someone else’s paper if it has lots of technical material, and I typically prefer to read a paper shallowly, get the gist, and then focus on a mix of specific details (trying to understand one example) and big picture, without necessarily following all the arguments. This seems to be Feynman’s attitude too.

The place where I part from Hsu is in this judgment of his:

Feynman’s cognitive profile was probably a bit lopsided — he was stronger mathematically than verbally. . . . it was often easier for him to invent his own solution than to read through someone else’s lengthy paper.

I have a couple of problems with this. First, Feynman was obviously very strong verbally, given that he wrote a couple of classic books. Sure, he dictated these books, he didn’t actually write them (at least that’s my understanding of how the books were put together), but still, you need good verbal skills to put things the way he did. By comparison, consider Murray Gell-Mann, who prided himself on his cultured literacy but couldn’t write well for general audiences.

Anyway, sure, Feynman’s math skills were much better developed than his verbal skills. But compared to other top physicists (which is the relevant measure here)? That’s not so clear.

I’ll go with Hsu’s position that Feynman was better than others at coming up with original ideas while not being so willing to put in the effort to understand what others had written. But I’m guessing that this latter disinclination doesn’t have much to do with “verbal skills.”

Here’s where I think Hsu has fallen victim to the tyranny of measurement—that is, to the fallacy of treating concepts as more important if they are more accessible to measurement.

“Much stronger mathematically than verbally”—where does that come from?

College admissions tests are divided into math and verbal sections, so there’s that. But it’s a fallacy to divide cognitive abilities into these two parts, especially in a particular domain such as theoretical physics which requires very particular skills.

Let me put it another way. My math skills are much lower than Feynman’s and my verbal skills are comparable. I think we can all agree that my “imbalance”—the difference (however measured) between my math and verbal skills—is much lower than Feynman’s. Nonetheless, I too do my best to avoid reading highly technical work by others. Like Feynman (but of course at a much lower level), I prefer to come up with my own ideas rather than work to figure out what others are doing. And I typically evaluate others’ work using my personal basket of examples. Which can irritate the Judea Pearls of the world, as I just don’t always have the patience to figure out exactly *why* something that doesn’t work, doesn’t work. Like Feynman in that story, I can do it, but it takes work. Sometimes that work is worth it; for example, I’ve spent a lot of time trying to understand exactly what assumptions implicitly support regression discontinuity analysis, so that I could get a better sense of what happened in the notorious regression discontinuity FAIL pollution in China analysis, where the researchers in question seemingly followed all the rules but still went wrong.

Anyway, that’s a tangent. My real point is that we should be able to talk about different cognitive styles and abilities without the tyranny of measurement straitjacketing us into simple categories that happen to line up with college admissions tests. In many settings I imagine these dimensions are psychometrically relevant but I’m skeptical about applying them to theoretical physics.

The post Richard Feynman and the tyranny of measurement appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program

**Wed:** Ta-Nehisi Coates, David Brooks, and the “street code” of journalism

**Thurs:** Flamebait: “Mathiness” in economics and political science

**Fri:** 45 years ago in the sister blog

**Sat:** Ira Glass asks. We answer.

**Sun:** The 3 Stages of Busy

The post On deck for the rest of the summer and beginning of fall appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Richard Feynman and the tyranny of measurement

A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program

Ta-Nehisi Coates, David Brooks, and the “street code” of journalism

Flamebait: “Mathiness” in economics and political science

45 years ago in the sister blog

Ira Glass asks. We answer.

The 3 Stages of Busy

Ripped from the pages of a George Pelecanos novel

“We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

What do I say when I don’t have much to say?

“Women Respond to Nobel Laureate’s ‘Trouble With Girls’”

This sentence by Thomas Mallon would make Barry N. Malzberg spin in his grave, except that he’s still alive so it would just make him spin in his retirement

If you leave your datasets sitting out on the counter, they get moldy

Spam!

The plagiarist next door strikes back: Different standards of plagiarism in different communities

How to parameterize hyperpriors in hierarchical models?

How Hamiltonian Monte Carlo works

When does Bayes do the job?

Here’s a theoretical research project for you

Classifying causes of death using “verbal autopsies”

All hail Lord Spiegelhalter!

Dan Kahan doesn’t trust the Turk

Neither time nor stomach

He wants to teach himself some statistics

Hey—Don’t trust anything coming from the Tri-Valley Center for Human Potential!

Harry S. Truman, Jesus H. Christ, Roy G. Biv

Why couldn’t Breaking Bad find Mexican Mexicans?

Rockin the tabloids

A statistical approach to quadrature

Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.

0.05 is a joke

Jökull Snæbjarnarson writes . . .

Aahhhhh, young people!

Plaig! (non-Wegman edition)

We provide a service

“The belief was so strong that it trumped the evidence before them.”

“Can you change your Bayesian prior?”

How to analyze hierarchical survey data with post-stratification?

A political sociological course on statistics for high school students

Questions about data transplanted in kidney study

Performing design calculations (type M and type S errors) on a routine basis?

“Another bad chart for you to criticize”

Constructing an informative prior using meta-analysis

Stan attribution

Cannabis/IQ follow-up: Same old story

Defining conditional probability

In defense of endless arguments

Emails I never finished reading

BREAKING . . . Sepp Blatter accepted $2M payoff from Dennis Hastert

Comments on Imbens and Rubin causal inference book

“Dow 36,000″ guy offers an opinion on Tom Brady’s balls. The rest of us are supposed to listen?

Irwin Shaw: “I might mistrust intellectuals, but I’d mistrust nonintellectuals even more.”

Death of a statistician

Being polite vs. saying what we really think

Why is this double-y-axis graph not so bad?

“There are many studies showing . . .”

Even though it’s published in a top psychology journal, she still doesn’t believe it

He’s skeptical about Neuroskeptic’s skepticism

Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability

Medical decision making under uncertainty

Unreplicable

“The frequentist case against the significance test”

Erdos bio for kids

Have weak data. But need to make decision. What to do?

“I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.”

Optimistic or pessimistic priors

Draw your own graph!

Low-power pose

Annals of Spam

The Final Bug, or, Please please please please please work this time!

Enjoy.

The post On deck for the rest of the summer and beginning of fall appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “17 Baby Names You Didn’t Know Were Totally Made Up” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Want to drive the baby-naming public up the wall? Tell them you’re naming your daughter Renesmee. Author Stephenie Meyer invented the name for the half-vampire child in her wildly popular Twilight series. In the story it’s simply an homage to the child’s two grandmothers, Renee and Esmé. To the traditional-minded, though, Renesmee has become a symbol of everything wrong with modern baby naming: It’s not a “real name.” The author just made it up, then parents followed in imitation of pop culture.

All undeniably true, yet that history itself is surprisingly traditional. . . .

And here are the 17 classic, yet made-up, names:

Wendy

Cedric

Miranda

Vanessa

Coraline

Evangeline

Amanda

Gloria

Dorian

Clarinda

Cora

Pamela

Fiona

Jessica

Lucinda

Ronia

Imogen

The commenters express some disagreement regarding Coraline, but it seems that the others on the list really were just made up. One commenter also adds Stella and Norma to the made-up list. And “People who are not Shakespeare give us names like Nevaeh and Quvenzhane.”

P.S. Wattenberg adds:

Note for sticklers: Each of the writers below is credited with using the name inventively—as a coinage rather than a recycling of a familiar name—and with introducing the name to the broader culture. Scattered previous examples of usage may exist, since name creativity isn’t limited to writers.

The post “17 Baby Names You Didn’t Know Were Totally Made Up” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Lauryn’s back! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Really, no snark here. She’s got some excellent tracks on the new Nina Simone tribute album. The best part’s the sample from the classic Nina song. But that’s often the case. They wouldn’t sample something if it was no good.

P.S. Let me clarify: I prefer Lauryn’s version to Nina’s original. The best parts of Lauryn’s are the Nina samples, but I think in its entirety the new version works better, at least to my modern ears.

The post Lauryn’s back! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Annals of Spam appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Andrew,

Just finished http://andrewgelman.com/2010/12/24/foreign_languag/

This leads to the silliness of considering foreign language skills as a purely positional good or as a method for selecting students, while forgetting the direct benefits of being able to communicate in various ways with different cultures.

– Found this interesting..Since you covered a language-related topic, I thought you might be interested in our new infographic where we put the new Google translate iOS app to the test. We compared the app against our best human translators and found the results quite surprising.

Would you like me to send you the link?

Thanks,

Ashley Harris

Outreach Coordinator

“Ashley Harris,” indeed.

What I’m wondering is, can’t all these bots just communicate with each other and leave us humans out of the loop? Or maybe I should be afraid of this happening?

The post Annals of Spam appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Measurement is part of design appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>It was interesting that these topics, which are central to any modern discussion of observational studies, were not considered important by a leader in the field, and this suggests that our thinking has changed since 1972.

Today I’d like to make a similar argument, this time regarding the topic of *measurement*. This time I’ll consider Donald Rubin’s 2008 article, “For objective causal inference, design trumps analysis.”

All of Rubin’s article is worth reading—it’s all about the ways in which we can structure the design of observational studies to make inferences more believable—and the general point is important and, I think, underrated.

When people do experiments, they think about *design*, but when they do observational studies, they think about *identification strategies*, which is related to design but is different in that it’s all about finding and analyzing data and checking assumptions, not so much about systematic data collection. So Rubin makes valuable points in his article.

But today I want to focus on something that Rubin doesn’t really mention in his article: *measurement*, which is a topic we’ve been talking a lot about here lately.

Rubin talks about randomization, or the approximate equivalent in observational studies (the “assignment mechanism”), and about sample size (“traditional power calculations,” as his article was written before Type S and Type M errors were well known), and about the information available to the decision makers, and about balance between treatment and control groups.

Rubin does briefly mention the importance of measurement, but only in the context of being able to match or adjust for pre-treatment differences between treatment and control groups.

That’s fine, but here I’m concerned with something even more basic: the *validity* and *reliability* of the measurements of outcomes and treatments (or, more generally, comparison groups). I’m assuming Rubin was taking validity for granted—assuming that the x and y variables being measured were the treatment and outcome of interest—and, in a sense, the reliability question is included in the question about sample size. In practice, though, studies are often using sloppy measurements (days of peak fertility, fat arms, beauty, etc.), and if the measurements are bad enough, the problems go beyond sample size, partly because in such studies the sample size would have to creep into the zillions for anything to be detectable, and partly because the biases in measurement can easily be larger than the effects being studied.
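To give a sense of the stakes, here's a little sketch (illustrative numbers mine) of the classical attenuation result: random error in a predictor shrinks the estimated regression slope by the reliability ratio, a bias that no amount of additional data will remove.

```python
# Sketch with made-up numbers: classical measurement error in a predictor
# attenuates the regression slope toward zero by the reliability ratio,
# var(x) / (var(x) + var(noise)).
def attenuated_slope(true_slope, var_x, var_noise):
    reliability = var_x / (var_x + var_noise)
    return true_slope * reliability

# If the measurement noise is as large as the signal, half the effect is gone:
print(attenuated_slope(true_slope=1.0, var_x=1.0, var_noise=1.0))  # 0.5
# With very noisy measurements, a true slope of 1 looks like 0.2:
print(attenuated_slope(true_slope=1.0, var_x=1.0, var_noise=4.0))  # 0.2
```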

So, I’d just like to take Rubin’s excellent article and append a brief discussion of the importance of measurement.

**P.S.** I sent the above to Rubin, who replied:

In that article I was focusing on the design of observational studies, which I thought had been badly neglected by everyone in past years, including Cochran and me. Issues of good measurement, I think I did mention briefly (I’ll have to check—I do in my lectures, but maybe I skipped that point here), but having good measurements had been discussed by Cochran in his 1965 JRSS paper, so were an already emphasized point.

And I wanted to focus on the neglected point, not all relevant points for observational studies.

The post Measurement is part of design appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan is Turing complete appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan is Turing complete appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post New papers on LOO/WAIC and Stan appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Aki, Jonah, and I have released the much-discussed paper on LOO and WAIC in Stan: Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models.

We (that is, Aki) now recommend LOO rather than WAIC, especially now that we have an R function to quickly compute LOO using Pareto smoothed importance sampling. In either case, a key contribution of our paper is to show how LOO/WAIC can be computed in the regular workflow of model fitting.

We also compute the standard error of the difference between LOO (or WAIC) when comparing two models, and we demonstrate with the famous arsenic well-switching example.
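For readers who want the flavor of the computation, here's a bare-bones sketch of LOO from a pointwise log-likelihood matrix, using plain importance sampling (the paper's contribution, Pareto smoothing of the weights, is omitted here, and the toy numbers are mine):

```python
import math

# Given log p(y_i | theta^s) for S posterior draws and n data points,
# approximate each leave-one-out predictive density by importance sampling:
# elpd_loo_i = -log mean_s( 1 / p(y_i | theta^s) ).
def log_mean_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def elpd_loo(log_lik):  # log_lik[s][i]: draw s, data point i
    n = len(log_lik[0])
    return sum(-log_mean_exp([-row[i] for row in log_lik]) for i in range(n))

# Tiny fake log-likelihood matrix (2 draws, 3 data points), just to show the call:
fake = [[-1.0, -2.0, -1.5],
        [-1.2, -1.8, -1.4]]
print(round(elpd_loo(fake), 3))  # -4.461
```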

Also 2 new tutorial articles on Stan will be appearing:

in JEBS: Stan: A probabilistic programming language for Bayesian inference and optimization.

in JSS: Stan: A probabilistic programming language

The two articles have very similar titles but surprisingly little overlap. I guess it’s the difference between what Bob thinks is important to say, and what I think is important to say.

Enjoy!

**P.S.** Jonah writes:

For anyone interested in the the R package “loo” mentioned in the paper, please install from GitHub and not CRAN. There is a version on CRAN but it needs to be updated so please for now use the version here:

To get it running, you must first install the “devtools” package in R, then you can just install and load “loo” via:

library("devtools")
install_github("jgabry/loo")
library("loo")

Jonah will post an update when the new version is also on CRAN.

**P.P.S.** This P.P.S. is by Jonah. The latest version of the loo R package (0.1.2) is now up on CRAN and should be installable for most people by running

install.packages("loo")

although depending on various things (your operating system, R version, CRAN mirror, what you ate for breakfast, etc.) you might need

install.packages("loo", type = "source")

to get the new version. For bug reports, installation trouble, suggestions, etc., please use our GitHub issues page. The Stan users google group is also a fine place to ask questions about using the package with your models.

Finally, while we do recommend Stan of course, the R package isn’t only for Stan models. If you can compute a pointwise log-likelihood matrix then you can use the package.

The post New papers on LOO/WAIC and Stan appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Psych dept: “We are especially interested in candidates whose research program contributes to the development of new quantitative methods” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The Department of Psychology at the University of Michigan, Ann Arbor, invites applications for a tenure-track faculty position. The expected start date is September 1, 2016. The primary criterion for appointment is excellence in research and teaching. We are especially interested in candidates whose research program contributes to the development of new quantitative methods.

Although the specific research area is open, we are especially interested in applicants for whom quantitative theoretical modeling, which could include computational models, analytic models, statistical models or psychometric models, is an essential part of their research program. The successful candidate will participate in the teaching rotation of graduate-level statistics and methods. Quantitative psychologists from all areas of psychology and related disciplines are encouraged to apply. This is a university-year appointment.

Successful candidates must have a Ph.D. in a relevant discipline (e.g. Bio-statistics, Psychology) by the time the position starts, and a commitment to undergraduate and graduate teaching. New faculty hired at the Assistant Professor level will be expected to establish an independent research program. Please submit a letter of intent, curriculum vitae, a statement of current and future research plans, a statement of teaching philosophy and experience, and evidence of teaching excellence (if any).

Applicants should also request at least three letters of recommendation from referees. All materials should be uploaded by September 30, 2015 as a single PDF attachment to https://psychology-lsa.applicantstack.com/x/apply/a2s9hqlu3cgv. For inquiries about the positions please contact Richard Gonzalez. gonzo@umich.edu.

The University of Michigan is an equal opportunity/affirmative action employer. Qualified women and minority candidates are encouraged to apply. The University is supportive of the needs of dual-career couples.

The post Psych dept: “We are especially interested in candidates whose research program contributes to the development of new quantitative methods” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Prior *information*, not prior *belief* appeared first on Statistical Modeling, Causal Inference, and Social Science.

Consider this sort of question that a classically-trained statistician asked me the other day:

If two Bayesians are given the same data, they will come to two conclusions. What do you think about that? Does it bother you?

My response is that the statistician has nothing to do with it. I’d prefer to say that if two different analyses are done using different information, they will come to different conclusions. This different information can come in the prior distribution p(theta), it could come in the data model p(y|theta), it could come in the choice of how to set up the model and what data to include in the first place. I’ve listed these in roughly increasing order of importance.

Sure, we could refer to all statistical models as “beliefs”: we have a belief that certain measurements are statistically independent with a common mean, we have a belief that a response function is additive and linear, we have a belief that our measurements are unbiased, etc. Fine. But I don’t think this adds anything beyond just calling this a “model.” Indeed, referring to “belief” can be misleading. When I fit a regression model, I don’t typically believe in additivity or linearity at all, I’m just fitting a model, using available information and making assumptions, compromising the goal of including all available information because of the practical difficulties of fitting and understanding a huge model.

Same with the prior distribution. When putting together any part of a statistical model, we use some information without wanting to claim that this represents our beliefs about the world.

The post Prior *information*, not prior *belief* appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Awesomest media request of the year appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>RE: Donald Trump presidential candidacy

Hi,

Firstly, apologies for the group email but I wasn’t sure who would be best prized to answer this query as we’ve not had much luck so far.

I am a Dubai-based reporter for **.

Donald Trump recently announced his intension to run for the US presidency in 2016.

He currently has a lot of high profile commercial and business deals in Dubai and is actively in talks for more in the wider region. We have been trying to determine:

If a candidate succeeds in winning a nomination and goes on to win the election and reside in the White House do they have to give up their business interests as these would be seen as a conflict of interest? Can a US president serve in office and still have massive commercial business interests abroad? Basically, would Trump have to relinquish these relationships if he was successfully elected? Are there are existing rules specifically governing this? Is there any previous case studies to go on?

Lastly, what are his chances of winning a nomination or being elected? So far, from what we have read it seems highly unlikely?

Regards,

***

Executive Editor

***

The post Awesomest media request of the year appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Survey weighting and regression modeling appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We start by distinguishing two purposes of estimation: to estimate population descriptive statistics and to estimate causal effects. In the former type of research, weighting is called for when it is needed to make the analysis sample representative of the target population. In the latter type, the weighting issue is more nuanced. We discuss three distinct potential motives for weighting when estimating causal effects: (1) to achieve precise estimates by correcting for heteroskedasticity, (2) to achieve consistent estimates by correcting for endogenous sampling, and (3) to identify average partial effects in the presence of unmodeled heterogeneity of effects.

This is indeed an important and difficult topic and I’m glad to see economists becoming aware of it. I do not quite agree with their focus—in practice, heteroskedasticity never seems like much of a big deal to me, nor do I care much about so-called consistency of estimates—but there are many roads to Rome, and the first step is to move beyond a naive view of weighting as some sort of magic solution.

Solon et al. pretty much only refer to literature within the field of economics, which is too bad because they miss this twenty-year-old paper by Chris Winship and Larry Radbill, “Sampling Weights and Regression Analysis,” from Sociological Methods and Research, which begins:

Most major population surveys used by social scientists are based on complex sampling designs where sampling units have different probabilities of being selected. Although sampling weights must generally be used to derive unbiased estimates of univariate population characteristics, the decision about their use in regression analysis is more complicated. Where sampling weights are solely a function of independent variables included in the model, unweighted OLS estimates are preferred because they are unbiased, consistent, and have smaller standard errors than weighted OLS estimates. Where sampling weights are a function of the dependent variable (and thus of the error term), we recommend first attempting to respecify the model so that they are solely a function of the independent variables. If this can be accomplished, then unweighted OLS is again preferred. . . .

This topic also has close connections with multilevel regression and poststratification, as discussed in my 2007 article, “Struggles with survey weighting and regression modeling,” which is (somewhat) famous for its opening:

Survey weighting is a mess. It is not always clear how to use weights in estimating anything more complicated than a simple mean or ratios, and standard errors are tricky even with simple weighted means.

See also our response to the discussions.

I was unaware of Winship and Radbill’s work when writing my paper, so I accept blame for insularity as well.

In any case, it’s good to see broader interest in this important unsolved problem.

The post Survey weighting and regression modeling appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Don’t do the Wilcoxon appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The Wilcoxon test is a nonparametric rank-based test for comparing two groups. It’s a cool idea because, if data are continuous and there is no possibility of a tie, the reference distribution depends only on the sample size. There are no nuisance parameters, and the distribution can be tabulated. From a Bayesian point of view, however, this is no big deal, and I prefer to think of Wilcoxon as a procedure that throws away information (by reducing the data to ranks) to gain robustness.

Fine. But if you’re gonna do that, I’d recommend instead the following approach:

1. As in classical Wilcoxon, replace the data by their ranks: 1, 2, . . . N.

2. Translate these ranks into z-scores using the inverse-normal cdf applied to the values 1/(2*N), 3/(2*N), . . . (2*N - 1)/(2*N).

3. Fit a normal model.
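Here is what steps 1 and 2 look like in code. This is a Python sketch of the transformation step only (the function name is mine, and, like classical Wilcoxon, it assumes continuous data with no ties; with ties you would average the ranks):

```python
from statistics import NormalDist

def rank_to_z(data):
    # step 1: replace each value by its rank (1..N)
    # step 2: map rank r to the normal quantile at (2r - 1) / (2N)
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i])
    nd = NormalDist()
    z = [0.0] * n
    for r, i in enumerate(order, start=1):
        z[i] = nd.inv_cdf((2 * r - 1) / (2 * n))
    return z

print([round(v, 3) for v in rank_to_z([10, 3, 7, 1])])
# [1.15, -0.319, 0.319, -1.15]
```

Step 3 is then to feed the z-scores into whatever normal-theory model you like: a two-sample comparison, a regression, a multilevel model, and so on.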

In simple examples this should work just about the same as Wilcoxon as it is based on the same general principle, which is to discard the numerical information in the data and just keep the ranks. The advantage of this new approach is that, by using the normal distribution, it allows you to plug in all the standard methods that you’re familiar with: regression, analysis of variance, multilevel models, measurement-error models, and so on.

The trouble with Wilcoxon is that it’s a bit of a dead end: if you want to do anything more complicated than a simple comparison of two groups, you have to come up with new procedures and work out new reference distributions. With the transform-to-normal approach you can do pretty much anything you want.

The question arises: if my simple recommended approach indeed dominates Wilcoxon, how is it that Wilcoxon remains popular? I think much has to do with computation: the inverse-normal transformation is now trivial, but in the old days it would’ve added a lot of work to what, after all, is intended to be rapid and approximate.

**Take-home message**

I am not saying that the rank-then-inverse-normal-transform strategy is always or even often a good idea. What I’m saying is that, *if* you were planning to do a rank transformation before analyzing your data, I recommend this z-score approach rather than the classical Wilcoxon method.

The post Don’t do the Wilcoxon appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Survey weighting and regression modeling

**Wed:** Prior *information*, not prior *belief*

**Thurs:** Draw your own graph!

**Fri:** Measurement is part of design

**Sat:** Annals of Spam

**Sun:** “17 Baby Names You Didn’t Know Were Totally Made Up”

The post “Physical Models of Living Systems” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’d like to alert you that my new textbook, “Physical Models of Living Systems,” has just been published. Among other things, this book is my attempt to bring Bayesian inference to undergraduates in any science or engineering major, and the course I teach from it has been enthusiastically received.

The book is intended for intermediate-level undergraduates. The only prerequisite for the course is first-year physics or something similar. Advanced appendices to each chapter make the book useful also for PhD students. There is almost no overlap with my prior book Biological Physics.

Rather than attempting an encyclopedic survey of biophysics, my aim has been to develop skills and frameworks that are essential to the practice of almost any science or engineering field, in the context of some life-science case studies.

I have quantitative and qualitative data on Penn students’ assessment of the usefulness of the class in their later work. This appears in the Instructor section of the above Web site.

Many of my students come to the course with no computer background, so I have also written the short booklet Student’s Guide to Physical Modeling with Matlab (with Tom Dodson), which is available free via the above web site. A parallel book, Student’s Guide to Physical Modeling with Python (with Jesse M. Kinder) will also be available soon. These resources are not specifically about life science applications.

This sounds great. And let me again recommend the classic How Animals Work, by Knut Schmidt-Nielsen.

**P.S.** We last encountered Nelson a couple years ago when answering his question, “What are some situations in which the classical approach (or a naive implementation of it, based on cookbook recipes) gives worse results than a Bayesian approach, results that actually impeded the science?” The question was surprisingly easy to answer. You might also want to check out the comment section there, because some of the commenters had some misconceptions that I tried to clarify.

The post “Physical Models of Living Systems” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Inauthentic leadership? Development and validation of methods-based criticism appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I need some help with a critique of a paper that is part of the apparently growing retraction scandal in leadership studies. Here’s Retraction Watch.

The paper I want to look at is here: “Authentic Leadership: Development and Validation of a Theory-Based Measure,” by F. O. Walumbwa, B. J. Avolio, W. L. Gardner, T. S. Wernsing, and S. J. Peterson, Journal of Management, 2007.

I have a lot of issues with this paper that are right on the surface, and the one thing (again on the surface) that seems to justify its existence is the possibility that the quantitative stuff is, well, true. But that’s exactly what’s been questioned. And in pretty harsh terms. The critics are saying its defects are entirely too obvious to anyone who knows anything about structural equation modeling. The implication is that in leadership studies no one—not the authors, not the reviewers, not the editors, not the readers—actually understands SEM. It’s just a way of presenting their ideas about leadership as science—“research”, “evidence-based”, etc.

Hey, I can relate to that: I don’t understand structural equation modeling either! I keep meaning to sit down and figure it all out sometime. Meanwhile, it remains popular in much of social science, and my go-to way of understanding anything framed as a structural equation model is to think about it some other way.

For example, there was the recent discussion of the claim that subliminal smiley-faces have big effects on political attitudes. It turns out there was no strong evidence for such a claim, but there was some indirect argument based on structural equation models.

Anyway, Basbøll continues:

I’m way out of my depth on the technical issues, however. There’s some discussion of the statistical issues with the paper here.

There’s also a question about data fabrication, but I want to leave that on the side. I’m hoping there’s someone among your readers who might have some pretty quick way to see if the critics are right that the structural equation modeling they use is bogus.

The paper has been widely cited, and has won an award—for being the most cited paper in 2013.

The editors are not saying very much about the criticism.

Hey, I published a paper recently in Journal of Management. So maybe I could ask my editors there what they think of all this.

Basbøll continues:

In addition to doing a close reading of the argument (which is weird to me, like I say), I also want to track down all the people that have been citing it, to see whether the statistical analysis actually matters. I suspect it’s just taken as “reviewed therefore true”. If the critics are right, that would make the use of statistics here a great example of cargo-cult science, completely detached from reality.

You’ve talked about measurement recently, so I should say that I don’t think the thing they’re trying to measure can be measured, and is best talked about in other ways, but if their analysis itself is bad, however they got the data (whether by mis-measurement or by outright fabrication), then that point is somewhat moot.

What do all of you think? I’m prepared to agree with Basbøll on this because I agree with him on other things.

This sort of reasoning is sometimes called a “prior,” but I’d prefer to think of it as a model in which the quality of the article is an unknown predictive quantity and “Basbøll doesn’t like it” is the value of a predictor.

In any case, I have neither the energy nor the interest to actually read the damn article. But if any of you have any thoughts on it, feel free to chime in.

The post Inauthentic leadership? Development and validation of methods-based criticism appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Economists betting on replication appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>A bunch of folks are collaborating on a project to replicate 18 experimental studies published in prominent Econ journals (mostly American Economic Review, a few Quarterly Journal of Economics). This is already pretty exciting, but the really cool bit is they’re opening a market (with real money) to predict which studies will replicate. Unfortunately participation is restricted, but the market activity will be public. The market opens tomorrow, so it should be pretty exciting to watch.

There was some discussion about doing this with psychology papers, but the sense was that some people were so upset with the replication movement already that there would be backlash against the whole betting thing. I’m curious how the econ project goes.

The post Economists betting on replication appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Hey—guess what? There really is a hot hand! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>No, it’s not April 1, and yup, I’m serious. Josh Miller came into my office yesterday and convinced me that the hot hand is real.

Here’s the background. Last year we posted a discussion on streakiness in basketball shooting. Miller has a new paper out, with Adam Sanjurjo, which begins:

We find a subtle but substantial bias in a standard measure of the conditional dependence of present outcomes on streaks of past outcomes in sequential data. The mechanism is driven by a form of selection bias, which leads to an underestimate of the true conditional probability of a given outcome when conditioning on prior outcomes of the same kind. The biased measure has been used prominently in the literature that investigates incorrect beliefs in sequential decision making — most notably the Gambler’s Fallacy and the Hot Hand Fallacy. Upon correcting for the bias, the conclusions of some prominent studies in the literature are reversed. The bias also provides a structural explanation of why the belief in the law of small numbers persists, as repeated experience with finite sequences can only reinforce these beliefs, on average.

What’s this bias they’re talking about?

Jack takes a coin from his pocket and decides that he will flip it 4 times in a row, writing down the outcome of each flip on a scrap of paper. After he is done flipping, he will look at the flips that immediately followed an outcome of heads, and compute the relative frequency of heads on those flips. Because the coin is fair, Jack of course expects this conditional relative frequency to be equal to the probability of flipping a heads: 0.5. Shockingly, Jack is wrong. If he were to sample 1 million fair coins and flip each coin 4 times, observing the conditional relative frequency for each coin, on average the relative frequency would be approximately 0.4.

Really?? Let’s try it in R:

```r
rep <- 1e6
n <- 4
data <- array(sample(c(0,1), rep*n, replace=TRUE), c(rep,n))
prob <- rep(NA, rep)
for (i in 1:rep){
  heads1 <- data[i,1:(n-1)]==1
  heads2 <- data[i,2:n]==1
  prob[i] <- sum(heads1 & heads2)/sum(heads1)
}
```

OK, I've simulated, for each player, the conditional probability that he gets heads, given that he got heads on the previous flip.

What's the mean of these?

```
> print(mean(prob))
[1] NaN
```

Oh yeah, that's right: sometimes the first three flips are tails, so the probability is 0/0. So we'll toss these out. Then what do we get?

```
> print(mean(prob, na.rm=TRUE))
[1] 0.41
```

Hey! That's not 50%! Indeed, if you get this sort of data, it will look like people are anti-streaky (heads more likely to be followed by tails, and vice-versa), even though they're not.

With sequences of length 10, the average streakiness statistic (that is, for each person you compute the ~~conditional probability~~ proportion that he gets heads, conditional on him having just got heads on the previous flip, and then you average this across people), is .445. This is pretty far from .5, given that previous estimates of streak-shooting probability have been in the range of 2 percentage points.

And the bias is larger for comparisons such as the ~~probability~~ proportion of heads, conditional on following three straight heads, compared to the overall probability of heads. Which is one measure of streakiness, if "heads" is replaced by success of a basketball shot.
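Because the sequences are short, you don't even need simulation: the bias can be computed exactly by enumerating all 2^n equally likely sequences. A Python sketch (the function is my own illustration, not code from Miller and Sanjurjo; k is the length of the streak being conditioned on):

```python
from itertools import product

def mean_cond_prop(n, k=1):
    # average, over all 2^n equally likely sequences, of the proportion
    # of heads on flips that immediately follow k consecutive heads
    vals = []
    for seq in product((0, 1), repeat=n):
        num = den = 0
        for i in range(k, n):
            if all(seq[i - j - 1] == 1 for j in range(k)):
                den += 1
                num += seq[i]
        if den > 0:              # drop sequences where the condition never occurs
            vals.append(num / den)
    return sum(vals) / len(vals)

print(round(mean_cond_prop(4), 4))   # 0.4048, i.e. exactly 17/42
print(round(mean_cond_prop(10), 3))  # close to the .445 quoted above
```

Setting k = 3 gives the analogous exact value for the proportion of heads following three straight heads.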

So here's the deal. The classic 1985 paper by Gilovich, Vallone, and Tversky and various followups used these frequency comparisons, and as a result they all systematically underestimated streakiness, reporting no hot hand when, once the data are analyzed correctly, the evidence is there, as Miller and Sanjurjo report in the above-linked paper and also in another recent article which uses the example of the NBA three-point shooting contest.

Next step: fitting a model in Stan to estimate individual players' streakiness.

This is big news. Just to calibrate, here's what I wrote on the topic last year:

Consider the continuing controversy regarding the “hot hand” in basketball. Ever since the celebrated study of Gilovich, Vallone, and Tversky (1985) found no evidence of serial correlation in the successive shots of college and professional basketball players, people have been combing sports statistics to discover in what settings, if any, the hot hand might appear. Yaari (2012) points to some studies that have found time dependence in basketball, baseball, volleyball, and bowling, and this is sometimes presented as a debate: Does the hot hand exist or not?

A better framing is to start from the position that the effects are certainly not zero. Athletes are not machines, and anything that can affect their expectations (for example, success in previous tries) should affect their performance—one way or another. To put it another way, there is little debate that a “cold hand” can exist: It is no surprise that a player will be less successful if he or she is sick, or injured, or playing against excellent defense. Occasional periods of poor performance will manifest themselves as a small positive time correlation when data are aggregated.

However, the effects that have been seen are small, on the order of 2 percentage points (for example, the probability of a success in some sports task might be 45% if a player is “hot” and 43% otherwise). These small average differences exist amid a huge amount of variation, not just among players but also across different scenarios for a particular player. Sometimes if you succeed, you will stay relaxed and focused; other times you can succeed and get overconfident.

I don't think I said anything *wrong* there, exactly, but Miller and Sanjurjo's bias correction makes a difference. For example, they estimate the probability of success in the 3-point shooting contest as 6 percentage points higher after three straight successes. In comparison, the raw (biased) estimate is 4 percentage points. The difference between 4% and 6% isn't huge, but the overall impact of all this analysis is to show clear evidence for the hot hand. It will be interesting to see what Stan finds regarding the variation.

The post Hey—guess what? There really is a hot hand! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Evaluating the Millennium Villages Project appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Evaluating the Millennium Villages Project appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post An Excel add-in for regression analysis appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I know you are not particularly fond of Excel, but you might (I hope) be interested in a free Excel add-in for multivariate data analysis and linear regression that I am distributing here: http://regressit.com. I originally developed it for teaching an advanced MBA elective course on regression and time series analysis at Duke University, but it is intended for teaching data analysis at any level where students are familiar with Excel (and use it on PC’s), and it is also intended for serious applied work as a complement to other analytical software. It has been available to the public since May 2014, and a new version has just been released. If I do say so myself, its default regression output is more thoughtfully designed and includes higher quality graphics than what is provided by the best-known statistical programming languages or by commercial Excel add-ins such as Analyse-it, XLstat, or StatTools. It also has a number of unique features that are designed to facilitate data exploration and model testing and to support a disciplined and well-documented approach to analysis, with an emphasis on data visualization. My frustration with the stone-age graphics output of the leading regression software was the original motivation for its development, and I am now offering it for free as a public service. Please take it for a test drive and see for yourself. I’d welcome your feedback.

I don’t know Excel at all so I can’t take it for a test drive . . . but I bet that some of you can! Please share your thoughts. So many people use Excel that an improvement here could have a huge effect on good statistical practice. I don’t know if Reinhart and Rogoff read this blog but there must be some Excel users in the audience, right?

P.S. Nau wanted to share some further thoughts:

It may appear at first glance as though there is little that is new here: just another program that performs descriptive data analysis and plain old linear regression. The difference is in the details, and the details are many. Every design element in RegressIt has been chosen with a view toward helping the user to work efficiently and competently, to interactively share the results of the analysis with others, to enjoy the process, and to leave behind a clear trail of breadcrumbs. In this respect, RegressIt is a sort of “concept car” that illustrates features which would be nice to have in other analytical procedures besides regression if the software was designed from the ground up with the user in mind and did not carry a burden of backward compatibility with the way it looked a decade or two ago. Also, it tries to take advantage of things that Excel is good for while compensating for its lack of discipline. The design choices are based on my own experience in 30+ years of teaching as well as playing around with data for my own purposes. When a student or colleague or someone on the other side of the internet wants to discuss the results of an analysis that he or she has performed, which might or might not be for a problem whose solution I already know, I want to be able, with a few mouse clicks, to replicate their analysis and drill deeper or perform variations on it, and compare new results side-by-side with old ones, while having an armchair conversation. I might also want to do this on the spur of the moment in front of a class without worrying about my typing. When I am looking at one among many tables or charts, I often wonder: what model produced this, and what were the variables, what was the sample, when did the analysis take place, and by whom? What other models were tried before or afterward, and what was good or bad about this one? If a chart is just labeled “Residuals of Y” or “Residuals vs. Fitted Values”, that is not very helpful, particularly if it has been copied and pasted into a report where it takes on a life of its own. And when I look at the output of a model on the computer screen, I want to see as much of it at one time as possible. I want an efficient screen design—ideally one that would look good in an auditorium as well as on my desktop—and I want easy navigation within and across models. I would rather not scroll up and down through a linear log file that reminds me of line-printer days (which I do remember!) and makes it hard to distinguish the code from the results. I would like to see a presentation that by default is fairly complete in terms of including some well-chosen chart output that allows me to engage my visual cortex without saying “yuck”. And I want the same things if the original analyst is not a student or colleague but merely myself yesterday or last week or last year.

I hope you will give it a close look, kick the tires, and take it for a drive with some data of your own. And please read everything that is on the features and advice pages on the web site. Otherwise you may overlook some of what RegressIt is doing that is novel. And whatever you may think of it in the end, I would welcome your input on improvements or extensions that could be made. Is there any low-hanging fruit that could easily be added, or is there some deal-breaking omission that absolutely needs to be fixed? We can make changes in a hurry if we have to; there is no calendar of scheduled releases. We are two professors who work on this in our spare time. RegressIt’s feature set is limited at present, but our hope is that the features it does include will be useful in some circumstances to people who do most of their work in R or Stata as well as to people who do most of their work in Excel, and we plan to add more to it in the future. Thanks in advance for your input!

The post An Excel add-in for regression analysis appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Short course on Bayesian data analysis and Stan 19-21 July in NYC! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Bob Carpenter, Daniel Lee, and I are giving a 3-day short course in two weeks.

Before class everyone should install R, RStudio and RStan on their computers. If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.

Class structure and example topics for the three days:

Sunday, July 19: Introduction to Bayes and Stan

Morning:

Intro to Bayes

Intro to Stan

The statistical crisis in science

Afternoon:

Stan by example

Components of a Stan program

Little data: how traditional statistical ideas remain relevant in a big data world

Monday, July 20: Computation, Monte Carlo and Applied Modeling

Morning:

Computation with Monte Carlo Methods

Debugging in Stan

Generalizing from sample to population

Afternoon:

Multilevel regression and generalized linear models

Computation and Inference in Stan

Why we don’t (usually) have to worry about multiple comparisons

Tuesday, July 21: Advanced Stan and Big Data

Morning:

Vectors, matrices, and transformations

Mixture models and complex data structures in Stan

Hierarchical modeling and prior information

Afternoon:

Bayesian computation for big data

Advanced Stan programming

Open problems in Bayesian data analysis

Specific topics on Bayesian inference and computation include, but are not limited to:

Bayesian inference and prediction

Naive Bayes, supervised, and unsupervised classification

Overview of Monte Carlo methods

Convergence and effective sample size

Hamiltonian Monte Carlo and the no-U-turn sampler

Continuous and discrete-data regression models

Mixture models

Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:

Reproducible research

Probabilistic programming

Stan syntax and programming

Optimization

Warmup, adaptation, and convergence

Identifiability and problematic posteriors

Handling missing data

Ragged and sparse data structures

Gaussian processes

Again, information on the course is here.

The course is organized by Lander Analytics.

The post Short course on Bayesian data analysis and Stan 19-21 July in NYC! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Discreteland and Continuousland appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Discreteland and Continuousland appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** “There are many studies showing . . .”

**Wed:** An Excel add-in for regression analysis

**Thurs:** Unreplicable

**Fri:** Economists betting on replication

**Sat:** Inauthentic leadership? Development and validation of methods-based criticism

**Sun:** “Physical Models of Living Systems”

The post “Menstrual Cycle Phase Does Not Predict Political Conservatism” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Someone pointed me to this article by Isabel Scott and Nicholas Pound:

Recent authors have reported a relationship between women’s fertility status, as indexed by menstrual cycle phase, and conservatism in moral, social and political values. We conducted a survey to test for the existence of a relationship between menstrual cycle day and conservatism.

2213 women reporting regular menstrual cycles provided data about their political views. Of these women, 2208 provided information about their cycle date . . . We also recorded relationship status, which has been reported to interact with menstrual cycle phase in determining political preferences.

We found no evidence of a relationship between estimated cyclical fertility changes and conservatism, and no evidence of an interaction between relationship status and cyclical fertility in determining political attitudes. . . .

I have no problem with the authors’ substantive findings. And they get an extra bonus for not labeling day 6 as high conception risk:

Seeing this clearly-sourced graph makes me annoyed one more time at those psychology researchers who refused to acknowledge that, in a paper all about peak fertility, they’d used the wrong dates for peak fertility. So, good on Scott and Pound for getting this one right.

There’s one thing that does bother me about their paper, though, and that’s how they characterize the relation of their study to earlier work such as the notorious paper by Durante et al.

Scott and Pound write:

Our results are therefore difficult to reconcile with those of Durante et al, particularly since we attempted the analyses using a range of approaches and exclusion criteria, including tests similar to those used by Durante et al, and our results were similar under all of them.

Huh? Why “difficult to reconcile”? The reconciliation seems obvious to me: There’s no evidence of anything going on here. Durante et al. had a small noisy dataset and went all garden-of-forking-paths on it. And they found a statistically significant comparison in one of their interactions. No news here.

Scott and Pound continue:

Lack of statistical power does not seem a likely explanation for the discrepancy between our results and those reported in Durante et al, since even after the most restrictive exclusion criteria were applied, we retained a sample large enough to detect a moderate effect . . .

Again, I feel like I’m missing something. “Lack of statistical power” is exactly what was going on with Durante et al.; indeed, their example was the “Jerry West” of our “power = .06” graph.
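To see why power, not raw sample size, is the issue, here is a rough normal-approximation power calculation in Python. The specific numbers are illustrative assumptions (a true difference of 2 percentage points, success rates around 45%, and 100 respondents per group), not Durante et al.’s actual design:

```python
from statistics import NormalDist

def power_two_prop(p1, p2, n, alpha=0.05):
    # normal-approximation power for a two-sample comparison of
    # proportions, with n observations in each group
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = (p1 * (1 - p1) / n + p2 * (1 - p2) / n) ** 0.5
    shift = abs(p1 - p2) / se
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

print(round(power_two_prop(0.45, 0.43, 100), 2))  # 0.06
print(power_two_prop(0.45, 0.43, 10000) > 0.8)    # True: it takes thousands per group
```

With a tiny true effect and ordinary noise levels, a study of this size has roughly a 6% chance of reaching significance, which is the sense of the “power = .06” label.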

Scott and Pound continue:

One factor that may partially explain the discrepancy is our different approaches to measuring conservatism and how the relevant questions were framed. . . . However, these methodological differences seem unlikely to fully explain the discrepancy between our results . . . One further possibility is that differences in responses to our survey and the other surveys discussed here are attributable to variation in the samples surveyed. . . .

Sure, but aren’t you ignoring the elephant in the room? Why is there any discrepancy to explain? Why not at least raise the possibility that those earlier publications were just examples of the much-documented human ability to read patterns in noise?

I suspect that Scott and Pound have considered this explanation but felt it would be politic not to explicitly suggest it in their paper.

**P.S.** The above graph is a rare example of a double-y-axis plot that isn’t so bad. But the left axis should have a lower bound at 0: it’s not possible for conception risk to be negative!

The post “Menstrual Cycle Phase Does Not Predict Political Conservatism” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post July 4th appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post July 4th appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Why should anyone believe that? Why does it make sense to model a series of astronomical events as though they were spins of a roulette wheel in Vegas?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.

[The above quote is *not* by Senn; it’s a quote of something he disagrees with!]

Canonical and wrong. X and I discuss this problem in section 3 of our article on the history of anti-Bayesianism (see also rejoinder to discussion here). We write:

The big, big problem with the Pr(sunrise tomorrow | sunrise in the past) argument is not in the prior but in the likelihood, which assumes a constant probability and independent events. Why should anyone believe that? Why does it make sense to model a series of astronomical events as though they were spins of a roulette wheel in Vegas? Why does stationarity apply to this series? That’s not frequentist, it is not Bayesian, it’s just dumb. Or, to put it more charitably, it is a plain vanilla default model that we should use only if we are ready to abandon it on the slightest pretext.

Strain at the gnat that is the prior and swallow the ungainly camel that is the iid likelihood. Senn’s discussion is good in that he ~~keeps his eye on the ball~~ knits his row straight without getting distracted by stray bits of yarn.
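For reference, the marble-bag updating in the quoted passage is just Laplace’s rule of succession under exactly the iid Bernoulli model being criticized; a minimal sketch:

```python
from fractions import Fraction

def rule_of_succession(n_sunrises):
    """P(sun rises tomorrow) after n consecutive sunrises, under a
    uniform prior and an iid Bernoulli likelihood -- the assumption
    the critique above is targeting."""
    return Fraction(n_sunrises + 1, n_sunrises + 2)

# The newborn's sequence of beliefs: 1/2, 2/3, 3/4, 4/5, ...
beliefs = [rule_of_succession(n) for n in range(4)]
print(beliefs)  # [Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(4, 5)]
```

The “near-certainty” at the end of the quote is just (n+1)/(n+2) → 1 as n grows; nothing in the model checks whether sunrises are in fact exchangeable like roulette spins.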


The post Humility needed in decision-making appeared first on Statistical Modeling, Causal Inference, and Social Science.

Daniel Gilbert maintains that people generally make bad decisions on risk issues, and suggests that communication strategies and education programmes would help (Nature 474, 275–277; 2011). This version of the deficit model pervades policy-making and branches of the social sciences.

In this model, conflicts between expert and public perceptions of risk are put down to the difficulties that laypeople have in reasoning in the face of uncertainties rather than to deficits in knowledge per se.

Indeed, this is the “Nudge” story we hear a lot: the idea is that our well-known cognitive biases are messing us up, and policymakers should be accounting for this.

But MacGillivray and Pidgeon take a more Gigerenzian view:

There are three problems with this stance.

First, it relies on a selective reading of the literature. . . .

Second, it rests on some bold extrapolations. For example, it is not clear how the biases Gilbert identifies in the classic ‘trolley’ experiment play out in the real world. Many such reasoning ‘errors’ are mutually contradictory — for example, people have been accused of both excessive reliance on and neglect of generic ‘base-rate’ information to judge the probability of an event. This casts doubt on the idea that they reflect universal or hard-wired failings in cognition.

The third problem is the presentation of rational choice theory as the only way of deciding how to handle risk issues.

They conclude:

Given that many modern risk crises stem from science’s inability to foresee the dark side of technological progress, a little humility from the rationality project wouldn’t go amiss.


The post Recently in the sister blog appeared first on Statistical Modeling, Causal Inference, and Social Science.

When is the death penalty okay?

How much does advertising matter in presidential elections?

Bartenders are Democrats, beer wholesalers are Republicans

The ambiguity of racial categories

No, public opinion is not driven by ‘unreasoning bias and emotion’

Political science: Who is it for?

Modern campaigning has big effects on voter turnout

Political writing that sounds good but makes no sense

How much are Harry Potter’s glasses worth?


The post Where does Mister P draw the line? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Mr. P is pretty impressive, but I’m not sure how far to push him in particular and MLM [multilevel modeling] in general.

Mr. P and MLM certainly seem to do well with problems such as eight schools, radon, or the Xbox survey. In those cases, one can make a reasonable claim that the performances of the eight schools (or the houses, or the interviewees, conditional on modeling) are in some sense related.

Then there are totally unrelated settings. Say you’re estimating the effect of silicone spray on enabling your car to get you to work: fixing a squeaky door hinge, covering a bad check you paid against the car loan, and fixing a bald tire. There’s only one case where I can imagine any sort of causal or even correlative connection, and I’d likely need persuading to even consider trying to model the relationship between silicone spray and keeping the car from being repossessed.

If those two cases ring true, where does one draw the line between them? For a specific example, see “New drugs and clinical trial design in advanced sarcoma: have we made any progress?” (linked from here). The discussion covers rare but somewhat related diseases, and the challenge is to do clinical studies with sufficient power from the number of participants, both in aggregate and by disease subtype.

Do you know if people have successfully used MLM or Mr. P in such settings? I’ve done some searching and not found anything I recognized.

I suspect that the real issue is understanding potential causal mechanisms, but MLM and perhaps Mr. P. sound intriguing for such cases. I’m thinking of trying fake data to test the idea.

I have a few quick thoughts here:

– First, on the technical question about what happens if you try to fit a hierarchical model to unrelated topics: if the topics are *really* unrelated, there should be no reason to expect the true underlying parameter values to be similar, hence the group-level variance will be estimated to be huge, hence essentially no pooling. The example I sometimes give is: suppose you’re estimating 8 parameters: the effects of SAT coaching in 7 schools, and the speed of light. These will be so different that you’re just getting the unpooled estimate. The unpooled estimate is not the best—you’d rather pool the 7 schools together—but it’s the best you can do given your model and your available information.

– To continue this a bit, suppose you are estimating 8 parameters: the effects of a fancy SAT coaching program in 4 schools, and the effects of a crappy SAT coaching program in 4 other schools. Then what you’d want to do is partially pool each group of 4 or, essentially equivalently, to fit a multilevel regression at the school level with a predictor indicating the prior assessment of quality of the coaching program. Without that information, you’re in a tough situation.

– Now consider your silicone spray example. Here you’re estimating unrelated things so you won’t get anything useful from partial pooling. Bayesian inference can still be helpful here, though, in that you should be able to write down informative priors for all your effects of interest. In my books I was too quick to use noninformative priors.
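To make the first point concrete, here is a sketch of the standard normal-normal shrinkage calculation (the numbers are made up, not taken from the eight-schools data): when the group-level sd tau has to be huge to accommodate wildly dissimilar parameters, partial pooling collapses to the raw unpooled estimates.

```python
def partial_pool(estimate, se, mu, tau):
    """Posterior mean for one group in a normal-normal hierarchical
    model, given the group-level mean mu and sd tau: a precision-
    weighted average of the raw estimate and the group mean."""
    w = (1 / se**2) / (1 / se**2 + 1 / tau**2)
    return w * estimate + (1 - w) * mu

# Comparable groups, tau on the order of the standard errors:
# the noisy estimate is pulled substantially toward the group mean.
print(partial_pool(28.0, se=15.0, mu=8.0, tau=10.0))   # ~14.2

# "Groups" so unrelated that tau is effectively infinite:
# essentially no pooling; we just get the raw estimate back.
print(partial_pool(28.0, se=15.0, mu=8.0, tau=1e6))    # ~28.0
```

The speed-of-light example above corresponds to the second call: one absurdly discrepant “group” forces the estimated tau so high that every estimate is left essentially unshrunk.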


]]>The post Hey, this is what Michael Lacour should’ve done when they asked him for his data appeared first on Statistical Modeling, Causal Inference, and Social Science.

FOIA that, pal!


The post A note from John Lott appeared first on Statistical Modeling, Causal Inference, and Social Science.

It’s been nearly 20 years since the last time there was a high-profile report of a social science survey that turned out to be undocumented. I’m referring to the case of John Lott, who said he did a survey on gun use in 1997, but, in the words of Wikipedia, “was unable to produce the data, or any records showing that the survey had been undertaken.” Lott, like LaCour nearly two decades later, mounted an aggressive, if not particularly convincing, defense.

Lott disputes what is written on the Wikipedia page. Here’s what he wrote to me, first on his background:

You probably don’t care, but your commentary is quite wrong about my career and the survey. Since most of the points that you raise are dealt with in the post below, I will just mention that you have the trajectory of my career quite wrong. My politically incorrect work had basically ended my academic career in 2001. After having had positions at Wharton, University of Chicago, and Yale, I was unable to get an academic job in 2001 and spent 5 months being unemployed before ending up at a think tank, AEI. If you want an example of what had happened, you can see here. A similar story occurred at Yale where some US Senators complained about my research. My career actually improved after that, at least if you judge it by getting academic appointments. For a while universities didn’t want to touch someone who would get these types of complaints from high-profile politicians. I later re-entered academia, though eventually I got tired of all the political correctness and left academia.

Regarding the disputed survey, Lott points here and writes:

Your article gives no indication that the survey was replicated nor do you explain why the tax records and those who participated in the survey were not of value to you. Your comparison to Michael LaCour is also quite disingenuous. Compare our academic work. As I understand it, LaCour’s data went to the heart of his claim. In my case we are talking about one paragraph in my book and the survey data was biased against the claim that I was making (see the link above).

I have to admit I never know what to make of it when someone describes me as “disingenuous,” which, according to the dictionary, means “not candid or sincere, typically by pretending that one knows less about something than one really does.” I feel like responding, truly, that I *was* being candid and sincere! But of course once someone accuses you of being insincere, it won’t work to respond in that way. So I can’t really do anything with that one.

Anyway, Lott followed up with some specific responses to the Wikipedia entry:

The Wikipedia statement . . . is completely false (“was unable to produce the data, or any records showing that the survey had been undertaken”). You can contact tax law Professor Joe Olson who went through my tax records. There were also people who have come forward to state that they took the survey.

A number of academics and others have tried to correct the false claims on Wikipedia but they have continually been prevented from doing so, even on obviously false statements. Here are some posts that a computer science professor put up about his experience trying to correct the record at Wikipedia.

http://doubletap.cs.umd.edu/WikipediaStudy/namecalling.htm

http://doubletap.cs.umd.edu/WikipediaStudy/details.htm

http://doubletap.cs.umd.edu/WikipediaStudy/lambert.htm

http://doubletap.cs.umd.edu/WikipediaStudy/

I hope that you will correct the obviously false claim that I “was unable to produce the data, or any records showing that the survey had been undertaken.” Now possibly the people who wrote the Wikipedia post want to dismiss my tax records or the statements by those who say that they took the survey, but that is very different than them saying that I was unable to produce “any records.” As to the data, before the ruckus erupted over the data, I had already redone the survey and gotten similar results. There are statements from 10 academics who had contemporaneous knowledge of my hard disk crash, in which I lost the data for that and all my other projects, and from academics who worked with me to replace the various data sets that were lost.

I don’t really have anything to add here. With LaCour there was a pile of raw data and also a collaborator, Don Green, who recommended to the journal that their joint paper be withdrawn. The Lott case happened two decades ago, there’s no data file and no collaborator, so any evidence is indirect. In any case, I thought it only fair to share Lott’s words on the topic.


The post Introducing StataStan appeared first on Statistical Modeling, Causal Inference, and Social Science.

Thanks to Robert Grant, we now have a Stata interface! For more details, see:

- Robert Grant’s Blog:
**Introducing StataStan**

Jonah and Ben have already kicked the tires, and it works. We’ll be working on it more as time goes on as part of our Institute of Education Sciences grant (turns out education researchers use a lot of Stata).

We welcome feedback, either on the Stan users list or on Robert’s blog post. Please don’t leave comments about StataStan here — I don’t want to either close comments for this post or hijack Robert’s traffic.

Thanks, Robert!

P.S. Yes, we know that Stata released its own Bayesian analysis package, which even provides a way to program your own Bayesian models. Their language doesn’t look very flexible, and the MCMC sampler is based on Metropolis and Gibbs, so we’re not too worried about the competition on hard problems.


The post God is in every leaf of every probability puzzle appeared first on Statistical Modeling, Causal Inference, and Social Science.

A couple you’ve just met invite you over to dinner, saying “come by around 5pm, and we can talk for a while before our three kids come home from school at 6pm”.

You arrive at the appointed time, and are invited into the house. Walking down the hall, your host points to three closed doors and says, “those are the kids’ bedrooms”. You stumble a bit when passing one of these doors, and accidentally push the door open. There you see a dresser with a jewelry box, and a bed on which a dress has been laid out. “Ah”, you think to yourself, “I see that at least one of their three kids is a girl”.

Your hosts sit you down in the kitchen, and leave you there while they go off to get goodies from the stores in the basement. While they’re away, you notice a letter from the principal of the local school tacked up on the refrigerator. “Dear Parent”, it begins, “Each year at this time, I write to all parents, such as yourself, who have a boy or boys in the school, asking you to volunteer your time to help the boys’ hockey team…” “Umm”, you think, “I see that they have at least one boy as well”.

That, of course, leaves only two possibilities: Either they have two boys and one girl, or two girls and one boy. What are the probabilities of these two possibilities?

NOTE: This isn’t a trick puzzle. You should assume all things that it seems you’re meant to assume, and not assume things that you aren’t told to assume. If things can easily be imagined in either of two ways, you should assume that they are equally likely. For example, you may be able to imagine a reason that a family with two boys and a girl would be more likely to have invited you to dinner than one with two girls and a boy. If so, this would affect the probabilities of the two possibilities. But if your imagination is that good, you can probably imagine the opposite as well. You should assume that any such extra information not mentioned in the story is not available.

As a commenter pointed out, there’s something weird about how the puzzle is written, not just the charmingly retro sex roles but also various irrelevant details such as the time of the dinner. (Although I can see why Radford wrote it that way, as it was a way to reveal the number of kids in a natural context.)

The solution at first seems pretty obvious: As Radford says, the two possibilities are:

(a) 2 boys and 1 girl, or

(b) 1 boy and 2 girls.

If it’s possibility (a), the probability of the random bedroom being a girl’s is 1/3, and the probability of getting that note (“I write to all parents . . . who have a boy or boys at the school”) is 1, so the probability of the data is 1/3.

If it’s possibility (b), the probability of the random bedroom being a girl’s is 2/3, and the probability of getting the school note is still 1, so the probability of the data is 2/3.

The likelihood ratio is thus 2:1 in favor of possibility (b).

Case closed . . . but is it?

Two complications arise. First, as commenter J. Cross pointed out, if the kids go to multiple schools, it’s not clear what the probability is of getting that note, but a first guess would be that the probability of your seeing such a note on the fridge is proportional to the number of boys in the family. Actually, even if there’s only one school the kids go to, the note might be more likely to show up prominently on the fridge if there are 2 boys: presumably, the probability that at least one boy is interested in hockey is higher if there are two boys than if there’s only one.

The other complication is the prior odds. Pr(boy birth) is about .512, so the prior odds are .512/.488 in favor of 2 boys and 1 girl, rather than 2 girls and 1 boy.
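Combining the likelihood ratio with these prior odds is a one-line calculation; here’s a quick check (ignoring the hockey-note complication, as in the basic solution):

```python
# Likelihoods of the observations under each hypothesis; the school
# note has probability 1 either way, so only the bedroom matters.
lik_a = 1 / 3   # (a) 2 boys, 1 girl: random bedroom is a girl's
lik_b = 2 / 3   # (b) 1 boy, 2 girls

# Among {2 boys + 1 girl, 1 boy + 2 girls}, the ratio of prior
# probabilities is p/q = .512/.488 in favor of (a).
p, q = 0.512, 0.488
prior_odds_b = q / p                       # prior odds of (b) over (a)

posterior_odds_b = (lik_b / lik_a) * prior_odds_b
print(round(posterior_odds_b, 3))          # 1.906: a bit under 2:1 for (b)
```

So the unequal sex ratio shaves the clean 2:1 answer down slightly, to roughly 1.9:1.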

This is just to demonstrate that, as Feynman could’ve said in one of his mellower moments, God is in every leaf of every tree: Just about every problem is worth looking at carefully. It’s the fractal nature of reality.


**Tues:** Where does Mister P draw the line?

**Wed:** Recently in the sister blog

**Thurs:** Humility needed in decision-making

**Fri:** “Why should anyone believe that? Why does it make sense to model a series of astronomical events as though they were spins of a roulette wheel in Vegas?”

**Sat:** July 4th

**Sun:** “Menstrual Cycle Phase Does Not Predict Political Conservatism”

The post What’s So Fun About Fake Data? appeared first on Statistical Modeling, Causal Inference, and Social Science.


The post Interpreting posterior probabilities in the context of weakly informative priors appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’m an ecologist, and I typically work with small sample sizes from field experiments, which have highly variable data. I analyze almost all of my data now using hierarchical models, but I’ve been wondering about my interpretation of the posterior distributions. I’ve read your blog, several of your papers (Gelman and Weakliem, Gelman and Carlin), and your excellent BDA book, and I was wondering if I could ask your advice/opinion on my interpretation of posterior probabilities.

I’ve thought of 95% posterior credible intervals as a good way to estimate effect size, but I still see many researchers use them in something akin to null hypothesis testing: “The 95% interval included zero, and therefore the pattern was not significant”. I tend not to do that. Since I work with small sample sizes and variable data, it seems as though I’m unlikely to find a “significant effect” unless I’m vastly overestimating the true effect size (Type M error) or unless the true effect size is enormous (a rarity). More often than not, I find ‘suggestive’, but not ‘significant’ effects.

In such cases, I calculate one-tailed posterior probabilities that the effect is positive (or negative) and report that along with estimates of the effect size. For example, I might say something like

“Foliar damage tended to be slightly higher in ‘Ambient’ treatments, although the difference between treatments was small and variable (Pr(Ambient>Warmed) = 0.86, CI95 = 2.3% less – 6.9% more damage).”

By giving the probability of an effect as well as an estimate of the effect size, I find this to be more informative than simply saying ‘not significant’. This allows researchers to make their own judgements on importance, rather than defining importance for them by p < 0.05.

I know that such one-tailed probabilities can be inaccurate when using flat priors, but I place weakly informative priors ( N(0,1) or N(0,2) ) on all parameters in an attempt to avoid such overestimates unless strongly supported by my small sample sizes. I was wondering if you agree with this philosophy of data reporting and interpretation, or if I’m misusing the posterior probabilities. I’ve done some research on this, but I can’t find anyone that’s offered a solid opinion on it.

Based on my reading and the few interactions I’ve had with others, it seems that the strength of posterior probabilities compared to p-values is that they allow for such fluid interpretation (what’s the probability the effect is positive? what’s the probability the effect > 5? etc.), whereas p-values simply tell you “if the null hypothesis is true, there’s a 70 or 80% chance I could observe an effect as strong as mine by chance alone”. I prefer to give the probability of an effect bounded by the CI of the effect to give the most transparent interpretation possible.

My reply:

My short answer is that this is addressed in this post:

*If* you believe your prior, then yes, it makes sense to report posterior probabilities as you do. Typically, though, we use flat priors even though we have pretty strong knowledge that parameters are close to 0 (this is consistent with the fact that we see lots of estimates that are 1 or 2 se’s from 0, but very few that are 4 or 6 se’s from 0). So, really, if you want to make such a statement I think you’d want a more informative prior that shrinks to 0. If, for whatever reason, you *don’t* want to assign such a prior, then you have to be a bit more careful about interpreting those posterior probabilities.

Since in your case you’re using weakly informative priors such as N(0,1), this is less of a concern. Ultimately I guess the way to go is to embed any problem in a hierarchical meta-analysis so that the prior makes sense in the context of the problem. But, yeah, I’ve been using N(0,1) a lot myself lately.
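As a sketch of that shrinkage effect (hypothetical numbers, and a simple conjugate normal-normal update rather than a full hierarchical fit): with a normal likelihood for the estimate and an N(0,1) prior on a standardized effect, the one-tailed posterior probability that the effect is positive can be computed directly, and it is more modest than the flat-prior version.

```python
from math import erf, sqrt

def posterior_pr_positive(est, se, prior_sd=1.0):
    """Pr(effect > 0) given a normal likelihood est +/- se and a
    normal(0, prior_sd) prior: conjugate normal-normal update."""
    post_var = 1 / (1 / se**2 + 1 / prior_sd**2)
    post_mean = post_var * (est / se**2)        # prior mean is 0
    z = post_mean / sqrt(post_var)
    return 0.5 * (1 + erf(z / sqrt(2)))         # standard normal CDF

# A noisy, "suggestive" estimate on a standardized scale:
print(round(posterior_pr_positive(0.8, 0.6), 2))                 # 0.87
print(round(posterior_pr_positive(0.8, 0.6, prior_sd=100), 2))   # 0.91, ~flat prior
```

The gap between the two numbers is exactly the point of the reply above: with a weakly informative prior the reported Pr(effect > 0) is already pulled toward 1/2, so it is safer to interpret at face value.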

