
There’s an idea in philosophy called the Australia principle—I don’t know the origin of this idea but here’s an example that turned up in a Google search—that posits that Australia doesn’t exist; instead, they just build the parts that are needed when you visit: a little mock-up of the airport, a cityscape with a model of the Sydney Opera House in the background, some kangaroos, a bunch of desert in case you go into the outback, etc. The idea is that it would be ridiculously inefficient to build an entire continent and that it makes much more sense for them to just construct a sort of stage set for the few places you’ll ever go.

And this is the principle underlying the article, The prior can often only be understood in the context of the likelihood, by Dan Simpson, Mike Betancourt, and myself. The idea is that, for any given problem, for places in parameter space where the likelihood is strong, relative to the questions you’re asking, you won’t need to worry much about the prior; something vague will do. And in places where the likelihood is weak, relative to the questions you’re asking, you’ll need to construct more of a prior to make up the difference.

This implies:

1. The prior can often only be understood in the context of the likelihood.

2. What prior is needed can depend on the question being asked.

To follow up on item 2, consider a survey of 3000 people, each of whom gives a binary survey response, and suppose this survey is a simple random sample of the general population. If this is a public opinion poll, N = 3000 is more than enough: the standard error of the sample proportion is something like 0.5/sqrt(3000) = 0.01; you can estimate a proportion to an accuracy of about 1 percentage point, which is fine for all practical purposes, especially considering that, realistically, nonsampling error will likely be more than that anyway. On the other hand, if the question on this survey of 3000 people is whether your baby is a boy or a girl, and if the goal is to compare sex ratios of beautiful and ugly parents, then N = 3000 is way way too small to tell you anything (see, for example, the discussion on page 645 here), and if you want any kind of reasonable posterior distribution for the difference in sex ratios you’ll need a strong prior. You need to supply the relevant scenery yourself, as it’s not coming from the likelihood.
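A quick back-of-the-envelope check of the arithmetic above, in plain Python. The 0.001 effect scale in the comment is a hypothetical round number standing in for the small differences discussed in the cited reference, not a figure from this post:

```python
from math import sqrt

n = 3000
se = 0.5 / sqrt(n)   # standard error of a sample proportion, worst case p = 0.5
# se is about 0.009: roughly one-percentage-point accuracy, fine for a poll.

# For the sex-ratio comparison, split the sample into two groups of n/2 and
# compare proportions; the standard error of the difference is much larger:
se_diff = sqrt(0.25 / (n / 2) + 0.25 / (n / 2))
# se_diff is about 0.018, while plausible sex-ratio differences are on the
# order of 0.001 (hypothetical scale): the likelihood alone says almost
# nothing, and the prior has to carry the answer.
```

The same N that is ample for one question is hopeless for the other, which is the whole point of item 2.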

The same principle—that the prior you need depends on the other information you have and the question you’re asking—also applies to assumptions within the data model (which in turn determines the likelihood). But for simplicity here we’re following the usual convention and pretending that the likelihood is known exactly ahead of time so that all the modeling choices arise in the prior.

**P.S.** The funny thing is, Dan Simpson is from Australia himself. Just a coincidence, I’m sure.

The post Prior distributions and the Australia principle appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Prior distributions and the Australia principle appeared first on All About Statistics.


This came up in comments recently so I thought I’d clarify the point.

Mister P is MRP, multilevel regression and poststratification. The idea goes like this:

1. You want to adjust for differences between sample and population. Let y be your outcome of interest and X be your demographic and geographic variables you’d like to adjust for. Assume X is discrete so you can define a set of *poststratification cells*, j=1,…,J (for example, if you’re poststratifying on 4 age categories, 5 education categories, 4 ethnicity categories, and 50 states, then J = 4×5×4×50 = 4000, and the cells might go from 18-29-year-old no-high-school-education whites in Alabama, to over-65-year-old, post-graduate-education latinos in Wyoming). Each cell j has a population N_j from the census.

2. You fit a regression model y | X to the data, to get a predicted average response for each person in the population, conditional on their demographic and geographic variables. You’re thus estimating theta_j, for j=1,…,J. The *regression* part of MRP comes in because you need to make these predictions.

3. Given point estimates of theta, you can estimate the population average as sum_j (N_j*theta_j) / sum_j (N_j). Or you can estimate various intermediate-level averages (for example, state-level results) using partial sums over the relevant subsets of the poststratification cells.

4. In the Bayesian version (for example, using Stan), you get a matrix of posterior simulations, with each row of the matrix representing one simulation draw of the vector theta; this then propagates to uncertainties in any poststrat averages.

5. The *multilevel* part of MRP comes in because you want to adjust for lots of cells j in your poststrat, so you’ll need to estimate lots of parameters theta_j in your regression, and multilevel regression is one way to get stable estimates with good predictive accuracy.
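As a toy numerical sketch of steps 3 and 4, with made-up numbers and plain Python rather than Stan, just to show the weighting:

```python
# Hypothetical example with J = 2 poststratification cells.
theta_hat = [0.40, 0.60]   # point estimates of the mean response in each cell
N = [3000, 1000]           # census population counts N_j

# Step 3: population average = sum_j N_j * theta_j / sum_j N_j
pop_avg = sum(n * t for n, t in zip(N, theta_hat)) / sum(N)
# (0.40 * 3000 + 0.60 * 1000) / 4000 = 0.45

# Step 4: apply the same weighting to each posterior simulation draw of
# theta, so that uncertainty propagates to the poststratified average.
draws = [[0.39, 0.58],
         [0.41, 0.62]]     # two simulation draws of the vector theta
pop_avg_draws = [sum(n * t for n, t in zip(N, d)) / sum(N) for d in draws]
# weighted averages of the two draws: about 0.4375 and 0.4625
```

Intermediate-level averages (say, state-level results) are the same computation restricted to the relevant subset of cells.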

OK, fine. The point is: poststratification is key. It’s all about (a) adjusting for many ways in which your sample isn’t representative of the population, and (b) getting estimates for population subgroups of interest.

But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any *regularized prediction* method that gives reasonable and stable estimates while including a potentially large number of predictors.

Hence, **regularized prediction and poststratification**. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.

The post Regularized Prediction and Poststratification (the generalization of Mister P) appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Regularized Prediction and Poststratification (the generalization of Mister P) appeared first on All About Statistics.


Yesterday, NaytaData posted a nice graph on reddit of bicycle traffic and mean air temperature in Helsinki, Finland.

I found that graph interesting, so I asked for the data (NaytaData kindly sent them to me tonight).

```
df=read.csv("cyclistsTempHKI.csv")
library(ggplot2)
ggplot(df, aes(meanTemp, cyclists)) +
  geom_point() +
  geom_smooth(span = 0.3)
```

But as mentioned by someone on Twitter, the interpretation is somewhat trivial: people get out on their bikes when the weather is nice; the hotter it is, the more cyclists there are. And this gets interpreted in a causal way…

But actually, we can also visualize the data as follows, as suggested by Antoine Chambert-Loir

```
ggplot(df, aes(cyclists, meanTemp)) +
  geom_point() +
  geom_smooth(span = 0.3)
```

The interpretation would be, somehow, that the more cyclists there are on the road, the hotter it is. Why not consider a causal interpretation here? Like cyclists go so fast, or sweat so much, that they increase the temperature…

Of course, this is the standard (recurrent) “correlation is not causality” discussion, but in regression models we like to tell a story, to pretend that we have some sort of causal story. Yet we do not prove it. Here, we know that the first story is more credible than the second one, but how do we know that? To go further, how can we use machine learning techniques to prove causal relationships? How could a machine choose between the first and the second story?
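One small technical point behind the two plots: correlation is symmetric, but the two fitted regressions are not. A quick sketch (made-up toy numbers, not NaytaData’s data, and plain least squares rather than loess) shows that the slope of y on x is not the reciprocal of the slope of x on y unless the points are perfectly collinear:

```python
# Toy data, loosely in the spirit of the plot (not the actual dataset).
temp     = [-5, 0, 5, 10, 15, 20]
cyclists = [200, 800, 2500, 5000, 7000, 8000]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

# Least-squares slope of cyclists ~ temp, and of temp ~ cyclists.
slope_y_on_x = cov(temp, cyclists) / cov(temp, temp)
slope_x_on_y = cov(temp, cyclists) / cov(cyclists, cyclists)

# The product of the two slopes is r^2 < 1: each regression shrinks its
# prediction toward the mean, each tells a different predictive story,
# and neither fitted line is evidence of a causal direction.
```

So the symmetry of the data offers no help in choosing between the two causal stories; that choice comes from outside the regression.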

**Please comment on the article here:** **Statistics – Freakonometrics**

The post On the interpretation of a regression model appeared first on All About Statistics.


When I was visiting the University of Washington the other day, Ariel Rokem showed me this cool data visualization and exploration tool produced by Jason Yeatman, Adam Richie-Halford, Josh Smith, and himself. The above image gives a sense of the dashboard but the real thing is much more impressive because it’s interactive. You can rotate that brain image.

And here’s a research paper describing what they did.

The post Awesome data visualization tool for brain research appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Awesome data visualization tool for brain research appeared first on All About Statistics.


A combinatoric Le Monde mathematical puzzle of limited size:

When the only allowed move is to switch two balls from adjacent boxes, what is the minimal number of moves to return all balls in the above picture to their respective boxes? Same question with six boxes and 12 balls.

The question is rather interesting to code as I decided to use recursion (as usual!) but wanted to gain time by storing the number of steps needed by any configuration to reach its ordered recombination. Meaning I had to update an external vector within the recursive function for each new configuration I met. Julien Stoehr helped by presenting me with the following code, a simplification of a common R function,

```
v.assign <- function(i, value, ...) {
  temp <- get(i, pos = 1)
  temp[...] <- value
  assign(i, temp, pos = 1)
}
```

which assigns one or several entries to the external vector i. I thus used this trick in the following R code, where cosz is a vector of size 5¹⁰, much larger than the fewer than 10! values I need, but easier to index, at least while n≤5.

```
n = 5; tn = 2 * n
baz = n^(0:(tn - 1))
cosz = rep(-1, n^tn)
swee <- function(balz){
  indz <- sum((balz - 1) * baz)
  if (cosz[indz] == -1){
    if (min(diff(balz)) == 0){ # ordered
      v.assign("cosz", indz, value = 1)
    } else {
      val <- n^tn
      for (i in 2:n)
        for (j in (2 * i - 1):(2 * i))
          for (k in (2 * i - 3):(2 * i - 2)){
            calz <- balz
            calz[k] <- balz[j]; calz[j] <- balz[k]
            if (max(balz[k] - calz[k], calz[j] - balz[j]) > 0)
              val <- min(val, 1 + swee(calz))
          }
      v.assign("cosz", indz, value = val)
    }
  }
  return(cosz[indz])
}
```
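For what it’s worth, the external-vector trick is doing the job of a memoization cache: each configuration’s value is computed once and then looked up. In Python the same pattern is a one-line decorator, illustrated here on a toy recursion rather than on the puzzle itself:

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # caches results by argument, like the cosz vector
def fib(n):
    # naive double recursion, but each n is evaluated only once thanks to the cache
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(200)  # fast, where the uncached version would take astronomically long
```

The cache requires hashable arguments, which is why a configuration would be stored as a tuple rather than a list.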

which returns 2 for n=2, 6 for n=3, 11 for n=4, 15 for n=5. In the case n=6, I need a much better coding of the permutations of interest. Which is akin to ranking all words within a dictionary with letters (1,1,…,6,6). After some thinking (!) and searching, I came up with a new version, defining

```
parclass = rep(2, n)
rankum = function(confg){
  n = length(confg); permdex = 1
  for (i in 1:(n - 1)){
    x = confg[i]
    if (x > 1){
      for (j in 1:(x - 1)){
        if (parclass[j] > 0){
          parclass[j] = parclass[j] - 1
          permdex = permdex + ritpermz(n - i, parclass)
          parclass[j] = parclass[j] + 1
        }
      }
    }
    parclass[x] = parclass[x] - 1
  }
  return(permdex)
}
ritpermz = function(n, parclass){
  return(factorial(n) / prod(factorial(parclass)))
}
```

for finding the index of a given permutation, between 1 and (2n)!/(2!)ⁿ, and then calling the initial swee(p) with this modified allocation. The R code is still running…
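The ranking idea, counting how many words built from the same multiset of letters come earlier in dictionary order, can also be sketched in Python. This is my rendering of the same scheme, with the signatures simplified (ritpermz takes the count table directly rather than (n, parclass) as in the R version):

```python
from math import factorial

def ritpermz(counts):
    """Number of distinct words formed from a multiset with these letter counts."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total

def rankum(word):
    """1-based dictionary-order rank of `word` among all distinct
    rearrangements of its letters (for the puzzle: letters 1,1,...,n,n)."""
    counts = {}
    for x in word:
        counts[x] = counts.get(x, 0) + 1
    rank = 1
    for x in word:
        # every word starting with a strictly smaller available letter comes first
        for y in sorted(counts):
            if y >= x:
                break
            if counts[y] > 0:
                counts[y] -= 1
                rank += ritpermz(counts)
                counts[y] += 1
        counts[x] -= 1
    return rank
```

For instance, among the six rearrangements of (1,1,2,2), the word (1,1,2,2) ranks first and (2,2,1,1) ranks last.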

**Please comment on the article here:** **R – Xi'an's Og**

The post Le Monde puzzle [#1051] appeared first on All About Statistics.


Some people pointed me to this article, “Issues with data and analyses: Errors, underlying themes, and potential solutions,” by Andrew Brown, Kathryn Kaiser, and David Allison. They discuss “why focusing on errors [in science] is important,” “underlying themes of errors and their contributing factors,” “the prevalence and consequences of errors,” and “how to improve conditions and quality,” and I like much of what they write. I also appreciate the efforts that Allison and his colleagues have made to point the spotlight on scientific errors in nutrition research, and I share his frustration when researchers refuse to admit errors in their published work; see for example here and here.

But there are a couple things in their paper that bother me.

First, they criticize Jordan Anaya, a prominent critic of Brian “pizzagate” Wansink, in what seems to be an unfair way. Brown, Kaiser, and Allison write:

The recent case of the criticisms inveighed against a prominent researcher’s work (82) offers some stark examples of individuals going beyond commenting on the work itself to criticizing the person in extreme terms (e.g., ref. 83).

Reference 83 is this:

Anaya J (2017) The Donald Trump of food research. Medium.com. Available at https://medium.com/@OmnesRes/the-donald-trump-of-food-research-49e2bc7daa41. Accessed September 21, 2017.

Sure, referring to Wansink as the “Donald Trump of food research” might be taken to be harsh. But if you read the post, I don’t think it’s accurate to say that Anaya is “criticizing the person in extreme terms.” First, I don’t think that analogizing someone to Trump is, in itself, extreme. Second, Anaya is talking facts. He indeed has good reasons for comparing Wansink to Trump. (“Actually, the parallels with Trump are striking. Just as Trump has the best words and huge ideas, Wansink has ‘cool data’ that is ‘tremendously proprietary’. Trump’s inauguration was the most viewed in history period, and Wansink doesn’t p-hack, he performs ‘deep data dives’. . . . Trump doesn’t let facts or data get in his way, and neither does Wansink. When Plan A fails, he just moves on to Plan B, Plan C…”).

You might call this hyperbole, and you might call it rude, but I don’t see it as “criticizing the person in extreme terms”; I think of it as criticizing Wansink’s *public actions* and *public statements* in negative terms.

Again, I don’t see Anaya’s statements as “going beyond commenting on the work”; rather, I see them as a vivid way of commenting on the work, and associated publicity statements issued by Wansink, very directly.

One reason that the brief statement in the article bothered me is that it’s easy to isolate someone like Anaya and say something like, We’re the reasonable, even-keeled critics, and dissociate us from bomb-throwers. But I don’t think that’s right at all. Anaya and his colleagues put in huge amounts of effort to reveal a long and consistent pattern of misrepresentation of data and research methods by a prominent researcher, someone who’d received millions of dollars of government grants, someone who received funding from major corporations, held a government post, appeared frequently on television, and was considered an Ivy League expert. And then he writes a post with a not-so-farfetched analogy to a politician, and all of a sudden this is considered a “stark example” of extreme criticism. I don’t see it. I think we need people such as Anaya who care enough to track down the truth.

Here’s some further commentary on the Brown, Kaiser, and Allison article, by Darren Dahly. What Dahly writes seems basically reasonable, except for the part that calls them “disingenuous.” I hate when people call other people disingenuous. Calling someone disingenuous is like calling them a liar. I think it would be better for Dahly to have just said he thinks their interpretation of Anaya’s blog post is wrong. Anyway, I agree with Dahly on most of what he writes. In particular I agree with him that the phrase “trial by blog” is ridiculous. A blog is a very open way of providing information and allowing comment. When Anaya or I or anyone else posted on Wansink, anyone—including Wansink!—was free to respond in comments. And, for that matter, when Wansink blogged, anyone was free to comment there too (until he took that blog down). In contrast, closed journals and the elite news media (the preferred mode of communication of many practitioners of junk science) rarely allow open discussion. “Trial by blog” is, very simply, a slogan that makes blogging sound bad, even though blogging is far more open-ended than the other forms of communications that are available to us.

In their article, Brown, Kaiser, and Allison write, “Postpublication discussion platforms such as PubPeer, PubMed Commons, and journal comment sections have led to useful conversations that deepen readers’ understanding of papers by bringing to the fore important disagreements in the field.” Sure—but this is highly misleading. Why *not* mention blogs in this list? Blogs have led to lots of useful conversations that deepen readers’ understanding of papers by bringing to the fore important disagreements in the field. And the best thing about blogs is that they are not part of institutional structures.

They also write, “Professional decorum and due process are minimum requirements for a functional peer review system.” But peer review does *not* always follow these rules; see for example this story.

In short, the good things that can be done in official journals such as PNAS can also be done in blogs; also, the bad things that can be done in blogs can also be done in journals. I think it’s good to have multiple channels of communication and I think it’s misleading to associate due process with official journals and to associate abuses with informal channels of communication such as blogs. In the case of Wansink, I’d say the official journals largely did a terrible job, as did Cornell University, whereas bloggers were pretty much uniformly open, fair, and accurate.

Finally, I think their statement, “Individuals engaging in ad hominem attacks in scientific discourse should be subject to censure,” would be made stronger if it were to directly refer to the ad hominem attacks made by Susan Fiske and others in the scientific establishment. I don’t think Jordan Anaya should be subject to censure just because he analogized Brian Wansink to Donald Trump in the context of a detailed and careful discussion of Wansink’s published work.

The larger problem, I think, is that the discussion of tone is being used strategically by purveyors of bad science to maintain their power. (See Chris Chambers here and James Heathers here and here.) I’d say the whole thing is a big joke, except that I am being personally attacked in scientific journals, the popular press, and, apparently, presentations being given by public figures. I don’t think these people care about me personally: they’re just demonstrating their power, using me as an example to scare off others, and attacking me personally as a distraction from the ideas they are promoting and that they don’t want criticized. In short, these people whose research practices are being questioned are engaging in ad hominem attacks in scientific discourse, and I do think that’s a problem.

That all said, one thing I appreciate about Brown, Kaiser, and Allison is that they do engage the real problems in science, unlike the pure status quo defenders who go around calling people terrorists and saying ridiculous things such as that the replication rate is “statistically indistinguishable from 100%.” They wrote some things I agree with, and they wrote some things I disagree with, and we can have an open discussion, and that’s great. On the science, they’re open to the idea that published work can be wrong. They’re not staking their reputation on ovulation and voting, ESP, himmicanes, and the like.

**Andrew Brown, Kathryn Kaiser, and David Allison say . . .**

I told David Allison that I’d be posting something on his article, and he and his colleagues prepared a three-page response which is here.

**Jordan Anaya says . . .**

In addition, Jordan Anaya sent me an email elaborating on some of the things that bothered him about the Brown, Kaiser, and Allison article:

I’m not mad with Allison, he can say whatever he wants about me in whatever media he chooses, as long as it’s his opinion or accurate. I’m not completely oblivious, I knew the title of my blog post would upset people, but that’s kind of the point. I felt the people it would preferentially offend are the old boys’ club at Harvard, so it seemed worth it. Concurrently, I felt that over time the title of my post would age well, so anyone who was critical of the post initially would eventually look silly. The first whistle blower to claim a major scandal is always seen as a little crazy, so I don’t necessarily blame people who were initially critical of my post, but after seeing two decades worth of misconduct from Wansink it would be hard for me to take people seriously now if they think the title is inappropriate. Wansink is clearly a con artist, just like Donald Trump.

I only learned of Allison’s talk because a journalist contacted me. Of course I’m honored whenever my work is talked about at a conference, but if it is misrepresented to such an extent that a journalist has to contact me to get my side of the story that’s a problem. I sort of thought that Allison would regret his statements in the talk given how many additional problems we found with Wansink’s work, but to my surprise he then said the same thing in a publication. I mean, one time is a mistake, but twice is something else.

So I have three issues with Allison’s comments in his talk and paper. First, I don’t agree with his general argument about ad hominem attacks. Second, I don’t think he is being honest in his portrayal of our investigation. Third, I find the whole thing hypocritical.

Going to Wikipedia, there’s a pyramid where ad hominem is defined as “attacks the characteristics or authority of the writer without addressing the substance of the argument”. Ad hominem is not the same as name-calling.

If I call someone a “bad researcher”, maybe that’s name-calling. But if I say “X is a bad researcher because of A, B, and C” I don’t know that I would consider that name-calling, since it was backed up by evidence. Even something like “X is a bad tipper, and he committed the following QRPs” is not ad hominem. Sure, being a bad tipper has nothing to do with the argument, but evidence is still presented. And as I think you’ve discussed, at some point it’s impossible to separate a person from their work. If a basketball player makes a bad play, I would just say he made a bad play. But if he consistently makes bad plays, at some point he is a bad player.

Allison doesn’t specifically say my blog post is an example of an ad hominem attack, but it is in a paragraph with other examples of ad hominems. The title of my post could be seen as name-calling, but throughout the post I provide evidence for the title, so I’m not even sure if it’s name-calling. And besides, I’m not sure why being compared to the President of the United States, whom he likely voted for, would be seen as name-calling.

But let’s say Allison is right and my post is extremely inappropriate due to it being unsubstantiated criticism. I think it’s interesting to look at the opposite case, where someone gets unsubstantiated acclaim, a type of reverse ad hominem if you will. Wansink was celebrated as the “Sherlock Holmes of food”. I would classify this as reverse name-calling/ad hominem. If you are concerned about someone’s reputation being unfairly harmed by name-calling, surely you must be similarly concerned about someone gaining an unwarranted amount of fame and power by reverse name-calling.

This might sound silly, but here’s a very applicable example. Dan Gilbert and Steve Pinker both shared Sabeti’s Boston Globe essay on Twitter saying it was one of the best things they’ve ever read, an unsubstantiated reverse ad hominem. Why does this matter? Well the essay was filled with errors and attacked you. So by blindly throwing their support behind the essay (a reverse ad hominem), they are essentially blindly criticizing you.

So if we are going to be deeply concerned about (perceived) unwarranted criticism, then we need to be equally concerned about unwarranted praise since that can result in someone getting millions of dollars in funding and best-selling books based on pseudoscience.

Lastly, what is inappropriate is subjective, so it’s impossible to police. I agree with Chris Chambers here. Sure, if someone calls me an idiot I probably won’t throw a party, but if they then point out problems with my work I’d be happy. I’d rather that than someone say my work is great when it is actually filled with errors. Allison says the only things that matter are the data and methods and “everything else is a distraction”. I agree, so if someone happens to mention something else feel free to ignore it, whether it be positive or negative.

The next problem with his talk/paper is I’m not sure our investigation is being accurately presented. In Allison’s talk, (timestamp 36:30), he says he feels what’s happening to Wansink is a “trial by media” and a “character assassination”. Yes, I’ll admit we used our media contacts, but that’s only because we were unsatisfied with how Wansink was handling the situation. We felt we needed to turn up the pressure, and my blog post was part of that. I know turning to the media is not on Allison’s flowchart of how to deal with scientific misconduct, but if you look at our results I would say it is extremely effective and he may want to update his opinions.

He goes on to use the example of a student cheating on a test and having the professor call them out in front of the class. This analogy doesn’t work for various reasons. First, student exams are confidential, while Wansink’s work and errors and blog are in the public domain. Secondly, when a student does well on a test that is also confidential–the teacher doesn’t announce to the class you got an A. Conversely, Wansink has received uncritical positive media attention throughout his career, so I don’t see any issue with a little negative media attention. We’re back to the ad hominem/reverse ad hominem example. If you have no problems with reverse ad hominems and positive media attention I don’t see how you can be against ad hominems and negative media attention.

Third, this whole thing is filled with irony and hypocrisy. It’s funny to note that the whole thing was indeed started by a blog post, but it was Wansink’s blog post. And in that blog post he threw a postdoc under the bus and praised a grad student. The grad student was identified by name, and it was easy to figure out who the postdoc was (she didn’t want to comment when we contacted her). So if you’re going to be mad about a blog post how about you start there? Wansink not only ruined the Turkish student’s career, he provided information about the postdoc she probably wishes wasn’t made public. If you dislike my blog posts fine, but then you better really really hate Wansink’s post. At the end of his talk he reads a quote from my blog and says that’s not the type of culture he wants to foster. Given his lack of criticism of anything in Wansink’s blog, I guess he prefers a culture where senior academics complain about how their employees are lazy and won’t p-hack enough.

Allison is famous for bemoaning how hard it is to get work corrected/retracted, which is exactly what we faced with his good friend (and coauthor) Wansink. We just happened to then use nontraditional routes of criticism, which were extremely effective. You’d think if someone is writing an article about getting errors corrected, they might want to mention one of the most successful investigations. He might not agree with our methods, but I don’t see how he can ignore the results, and it seems wrong to not mention that this is a possible route which can work.

And as I’ve mentioned before, his article in PNAS is basically a blog post, so it’s funny for him to complain about blogs. And I can’t help but wonder whether he would have singled out my blog post if he wasn’t a friend and coauthor of Wansink. Presumably if he didn’t know Wansink he would have described the case in detail given its scale.

**One more time**

Again, I’m encouraged by the openness of all the people involved in this discussion. I get so frustrated with people who attack and then hide. Much better to get disagreements out in the open. And of course anyone can add their comments below.

The post How to think about research, and research criticism, and research criticism criticism, and research criticism criticism criticism? appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post How to think about research, and research criticism, and research criticism criticism, and research criticism criticism criticism? appeared first on All About Statistics.

The post Awesome postdoc opportunities in computational genomics at JHU appeared first on All About Statistics.

Johns Hopkins is a pretty amazing place to do computational genomics right now. My colleagues are really impressive; for example, five of our faculty are part of the Chan Zuckerberg Initiative, and we have faculty across a range of departments including Biostatistics, Computer Science, Biology, Biomedical Engineering, and Human Genetics. A number of my colleagues are actively looking for postdocs, and in an effort to make the postdoc job market a little less opaque I’m listing this non-comprehensive set of opportunities I know about here.

**Job:** Postdoc(s) in methods development for precision medicine in oncology and immunotherapy (single cell RNA-seq, TCR sequencing).

**Employer:** Elana Fertig and Elizabeth Jaffee

**To apply:** email Elana

**Deadline:** Review ongoing

===============

**Job:** Timp lab is hiring postdocs! We work on sequencing technology and analysis development for direct RNA seq, epigenetics, cancer genomics, infectious disease diagnosis, non-model organism genomics, protein sequencing and all of the above!

**Employer:** Winston Timp

**To apply:** email Winston

**Deadline:** Until position filled

===============

**Job:** The Goff lab (www.gofflab.org) has 2 open postdoc positions for a variety of projects at the intersection of computational biology and neurodevelopment, degeneration, and disease. We develop and utilize single cell analysis techniques, including single cell RNA-Seq, to explore cell state transitions across continuous biological processes. Current funded projects include cross-model analysis of fALS, Parkinson Disease, and Kabuki Syndrome, as well as examination of cell-type-specific neuronal plasticity response, and developmental fate specification in the CNS.

**Employer:** Loyal Goff and Solange Brown

**To apply:** email Loyal

**Deadline:** 12/31/2018 or until position filled

===============

**Job:** The Hansen lab (www.hansenlab.org) is hiring postdocs in computational biology. We are funded to continue development of the recount2 (preprint) project, which involves joint processing, normalization, and analysis of all publicly available human RNA-seq samples. This project combines genomics with analysis of extremely large-scale RNA-seq data. The work is in collaboration with Jeff Leek, Ben Langmead and Alexis Battle at JHU.

**Employer:** Kasper D. Hansen

**To apply:** Email Kasper.

**Deadline:** Review ongoing

===============

**Job:** The Jaffe lab at the Lieber Institute for Brain Development (affiliated with JHMI) is hiring at several different levels (research assistant, research associate, postdoctoral fellow)! We work at the intersection of genomics, biostatistics, and computational biology, leveraging large human brain datasets to better understand how genomic signatures associate with brain development and subsequent dysregulation in mental illness.

**Employer:** Andrew Jaffe

**To apply:** LIBD Careers

**Deadline:** Until position filled

===============

**Job:** The Battle lab (battlelab.jhu.edu) is hiring postdocs in genomics and machine learning/probabilistic modeling. Projects include rare genetic variation, single cell genomics, time series genomics, complex human disease, large scale integrative transcriptomic/eQTL studies, and more.

**Employer:** Alexis Battle

**To apply:** email Alexis

**Deadline:** Until positions filled

**Please comment on the article here:** **Simply Statistics**

The post Awesome postdoc opportunities in computational genomics at JHU appeared first on All About Statistics.


The riddle from the Riddler for the coming weeks is extremely simple to express in mathematical terms, as it summarises into characterising the distribution of

when the n-sample is made of iid Normal variates. I however had a hard time finding a result connected with this quantity, since most available characterisations are for either Uniform or Exponential variates. I eventually found a 2017 arXival by Nagaraya et al. covering the issue. Since the Normal distribution belongs to the Gumbel domain of attraction, the extreme spacings, that is the spacings between the most extreme order statistics [rescaled by nφ(Φ⁻¹{1-n⁻¹})], are asymptotically independent and asymptotically distributed as (Theorem 5, p.15, after correcting a typo):

where the ξ’s are Exp(1) variates. A crude approximation is thus to consider that the above Δ is distributed as the maximum of two standard and independent exponential distributions, modulo the rescaling by nφ(Φ⁻¹{1-n⁻¹})… But a more adequate result was pointed out to me by Gérard Biau, namely a 1986 Annals of Probability paper by Paul Deheuvels, my former head at ISUP, Université Pierre et Marie Curie. In this paper, Paul Deheuvels establishes that the largest spacing in a normal sample, M¹, satisfies

from which a conservative upper bound on the value of n required for a given bound x⁰ can be derived. The simulation below compares the limiting cdf (in red) with the empirical cdf of the above Δ based on 10⁴ samples of size n = 10³. The limiting cdf is the cdf of the maximum of an infinite sequence of independent exponentials with scales 1, ½, …. Which connects with the above result, *in fine*. For a practical application, the 99% quantile of this distribution is 4.71. To achieve a maximum spacing of, say 0.1, with probability 0.99, one would need 2 log(n) > 5.29²/0.1², i.e., log(n) > 1402, which is a pretty large number…
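The original simulation code is not reproduced here, but the flavour of the experiment can be sketched in plain Python. This is a minimal sketch under stated assumptions: the rescaling factor nφ(Φ⁻¹{1-n⁻¹}) is taken from the text above, the maximal spacing is computed over all consecutive order statistics, and the sample size and number of replications are reduced for speed (the post uses 10⁴ replications of size 10³).

```python
import random
from statistics import NormalDist

def max_spacing(sample):
    """Largest gap between consecutive order statistics of a sample."""
    s = sorted(sample)
    return max(b - a for a, b in zip(s, s[1:]))

nd = NormalDist()
n = 1000      # sample size (reduced-scale stand-in for the post's n = 10^3)
reps = 500    # Monte Carlo replications (the post uses 10^4)

# rescaling factor n * phi(Phi^{-1}(1 - 1/n)) quoted in the text
scale = n * nd.pdf(nd.inv_cdf(1 - 1 / n))

random.seed(1)
deltas = [scale * max_spacing([random.gauss(0.0, 1.0) for _ in range(n)])
          for _ in range(reps)]

# empirical cdf of the rescaled maximal spacing at a few reference points,
# to be compared against the limiting cdf discussed above
for x in (2.0, 3.0, 5.0):
    print(f"F_hat({x}) =", sum(d <= x for d in deltas) / reps)
```

The empirical cdf values printed at the reference points can then be plotted against the limiting cdf of the maximum of independent exponentials with scales 1, ½, … to reproduce the comparison described in the post.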

**Please comment on the article here:** **R – Xi'an's Og**

The post maximal spacing around order statistics appeared first on All About Statistics.
