My thinking about public version control for research projects began with Zach Jones' discussion of the idea in a recent *Political Methodologist*. Teaching the linear models class here at UB last semester solidified its importance in my mind. We used Dropbox for version control and sharing, but Git and GitHub are better.

Several recent articles and posts outline why researchers (as opposed to programmers) might use Git and GitHub. Here's a brief summary:

If I've missed something, please let me know.

There are a lot of good reasons to do this:

**History.** We all use version control, and most of us do it poorly. Using Git/GitHub, I'm learning to do it better. Git/GitHub offers a formal way to easily maintain a complete history of a project. In general, it's good to avoid filenames like new_final_carlisle_v3c_updated.docx. A recent comic makes this point clear. We need a method of updating files while keeping track of the old versions so that we can go back if needed. But the approach of giving different filenames to different versions is inefficient at best, and my approach of keeping "old files" in a designated folder is no better. Git/GitHub solves these issues. Second, Git allows you to tag a project at certain stages, such as "Initial submission to AJPS." After getting an invitation to revise and resubmit and making the required changes, I can compare the current version of the project to the (now several months old) version I initially submitted. This makes writing response memos much easier.

**Transparency.** Zach Jones most clearly makes the point that Git/GitHub increases transparency in the context of political science research. Git/GitHub essentially allows others to actively monitor the progress of a project or study its past development. Related to using GitHub in an open manner is the idea of an "open notebook." Carl Boettiger is one of the best-known proponents of open notebooks. This kind of openness provides a wonderful opportunity to receive comments and suggestions from a wide audience. It allows others to catch errors that might otherwise go unnoticed, and it gives readers a formal avenue to make suggestions, not to mention keeping a complete history of the suggestions and any subsequent discussion. GitHub allows the public to open issues, which is a wonderful way to receive and organize feedback on a paper.

**Accessibility.** Christopher Gandrud makes the point clearly in a recent edition of *The Political Methodologist*, though he discusses accessibility purely in the context of building data sets. Similar arguments can be made for code. I recently had a graduate student express interest in some MRP estimates of state-level opinion on the Affordable Care Act. I told her that I had spent some time collecting surveys and writing code to produce the estimates. I noted that, ideally, she would not duplicate my work but, if possible, build on it. I was able to point her to the GitHub repository for the project, which I hope she'll find useful as a starting point for her own work. In my experience supervising replication projects in graduate methods classes, and in my own experience with replication data, the clean, final versions of the data that researchers typically post publicly do not allow future researchers to easily build on previous work. If authors posted the raw data and all the (possibly long and messy) code for cleaning and recoding, it would be much easier for future researchers to build on past contributions. Indeed, research shows that making data and code freely available lowers the barriers to reuse and increases citations.

These are the commonly cited reasons for using Git and GitHub. In my own practice, though, I've found another reason, perhaps more important than those above.

One thing that I first noticed in my students, but now see that I'm just as guilty of, is "the race to a regression." That is, I devote the absolute minimum required effort (or less) to everything leading up to the regression. My attitude is usually that I'll go back later and clean everything up, double-checking along the way, if the line of investigation "proves useful" (i.e., provides stars). I rarely go back later. I find that the script let_me_just_try_this_really_quickly.R quickly becomes a part of analysis.R. This is bad practice and careless.

Instead of a race to the regression, Git encourages me to develop projects a little more carefully, thinking about projects in tiny steps, each to be made public, and each done right and summarized in a nice commit message. The care in my research has noticeably improved. I think about how to do something better, do it better, and explain it in a commit message that I can refer to later.

In my view, project development in Git/GitHub works best when users make small, discrete changes to a project. This takes some thought and discipline, but it is the best way to go. I'm guilty of coming back from a conference and making dozens of small changes to an existing project, incorporating all the suggestions in a single update. I just did it after the State Politics and Policy Conference. It is a poor way to go about developing a project. It is a poor way to keep track of things. It is a poor strategy, but I'm learning.

The editors of the *APSR* have been discussing this issue for some time. In many ways this was prompted by several recent exchanges we had with a scholar who had obtained the replication data from the authors of a manuscript that had appeared in an earlier issue of the *Review* (in 2010, prior to the University of North Texas team taking the reins of the journal). After obtaining the replication data from the authors of the original piece (with the editors' help), they proceeded to attempt to replicate the results, but were unable to do so. The authors notified us and asked where to publish such a replication study. Our policy at the *APSR* (which was also the policy of all of our predecessors and the policy of most major journals in the social sciences as well) is not to publish works that are only replication studies, because they do not represent the kind of original work we publish in the *Review*.

There are very good reasons for the *APSR*'s policy, and we strongly believe in continuing it. We do believe, however, that a very good point was made. A venue for the publication of replication studies is necessary, especially if the discipline aspires to raise the degree of scientific rigor in the field. However, as editors of the *APSR* we are reluctant to publish such studies in the *Review*, because this would open up a "cheap" way for authors to have their work published in the *APSR*, and every Tom, Dick, and Harriet (pardon the expression) could potentially seek to replicate some study, just to get published in the *Review*. Most all other major journals in the field, we believe, also do not publish solely replication studies (certainly this is true of the *APSR*, *AJPS*, and *JOP*, as well as the major international relations journals).

I feel that all journals (including the *APSR*) should evaluate articles on the strength of the contribution, not the time spent working on the paper (I guess this is what John means by "cheap"). In my mind, if an article makes a claim that is deemed a substantial contribution worthy of the *APSR*, then an article that refutes this claim has made a similar contribution.

Further, I don't think that most replication studies start out as witch hunts with "Tom, Dick, and Harriet" looking for a "cheap" publication in the *APSR*. Every replication that I've been a part of, and that my students have worked on, started as an *extension*, an attempt to *build on* the original research.

I can certainly speak for myself. I'm never "out to get" the original authors. I never start out trying to destroy someone else's work. I replicate studies for two reasons. First, I'm trying to build on their work. Second, I'm trying to make a methodological point to help strengthen future work. Along the way, I've found many "mistakes" (or at least things that people citing/believing the key findings should be aware of). In one case, the results changed dramatically when I slightly changed the model specification--the specification the author reports seemed to be the only one with the appropriate stars. In another example, a single case drove all of the authors' key findings. (In both cases, I swept the mistakes under the rug and proceeded as though everything was hunky-dory.) In both cases, a summary of the findings deserves publication in good journals, but I don't expect either to ever see daylight.

Finally, I don't think it would be such a bad thing if the *APSR* took responsibility for its publications and published papers by "every Tom, Dick, and Harriet" informing us that some of the major results we're all citing in the *APSR* are (or might be) wrong. It might be "cheap," but perhaps that makes it a good investment.

First, the background. BDE published a paper back in 2010 that examines whether researchers need to include a product term in order to argue for interaction. This first paper examines the situation in which the researcher expects interaction due to compression (when the researcher expects changes in predicted probabilities to be smaller as the probabilities approach zero and one). BDE argue that the logit model with no product term is able to capture this type of interaction and that, therefore, no product term is needed in this situation. I've previously discussed this paper here and mentioned that I have a paper arguing that one should include a product term even when interaction is expected on the basis of compression alone. While I disagree with their recommendation in this particular situation, the rest of the paper is fantastic. In particular, I've found their advice about hypothesizing interaction in terms of \(Pr(Y)\) or \(Y^*\) especially valuable.

In the forthcoming paper, BDE extend their analysis to the situation in which researchers expect effects (i.e., changes in predicted probabilities) to vary, but do not have a theoretically motivated specification. They refer to this as "specification ambiguity." In this situation, I was delighted to read, BDE recommend always including a product term. They find that excluding the product term biases the researcher toward finding interaction. This is the same reason I disagree with their recommendation to exclude the product term in the situation of strong theory. With the publication of this new paper, the literature is almost where I'd like it to be, with the exception of the tiny point I mentioned above.

The first step is to simulate an interactive data set.

```r
# simulate a fake data set
set.seed(39740)
n <- 25
x <- rexp(n)
z <- rexp(n)
y <- x + z + x*z + rnorm(n)
```

We can plot the data to get a sense of the pattern using compactr.

```r
# plot the simulated data
library(compactr)
eplot(xlim = mm(x), ylim = mm(y))
points(x, y)
```

The next step is to estimate a model with a product term and pull out the estimated coefficients (needed to calculate the marginal effects) and the variance-covariance matrix (needed to calculate the standard errors). This is easily done using the R code below.

```r
# estimate the model with a product term
m <- lm(y ~ x + z + x*z)
# pull out coefficient estimates
beta.hat <- coef(m)
# pull out the covariance matrix
cov <- vcov(m)
```

And a quick look at the estimates shows that our product term is statistically significant (p < 0.05), indicating that we can reject the (null) hypothesis of constant effects.

```r
> library(arm)
> display(m, detail = TRUE)
lm(formula = y ~ x + z + x * z)
            coef.est coef.se t value Pr(>|t|)
(Intercept) 0.06     0.40    0.16    0.87
x           0.58     0.31    1.86    0.08
z           1.17     0.31    3.83    0.00
x:z         1.39     0.46    3.02    0.01
---
n = 25, k = 4
residual sd = 1.05, R-Squared = 0.71
```

The next step is to choose a set of values for *z* over which to compute the marginal effect of *x*. This is necessary because the marginal effect of *x* depends on *z*.

```r
# a set of values of z at which to compute the
# (instantaneous) effect of x
z0 <- seq(min(z), max(z), length.out = 1000)
```

In particular, \(\dfrac{\partial E(y)}{\partial x} = \beta_x + \beta_{xz}z\). We can use this fact to get the point estimates of the marginal effect of *x*.

```r
# calculate the instantaneous effect of x as z varies
dy.dx <- beta.hat["x"] + beta.hat["x:z"]*z0
```

Getting the standard errors is a little tedious, but we can use the nice table on Matt's page to avoid having to work it out ourselves. Referring to that table, we see that the variance of the marginal effect is given by \(\sigma^2_{me} = Var(\beta_x) + z^2Var(\beta_{xz}) + 2zCov(\beta_x, \beta_{xz})\). The standard error is simply the square root of this quantity, which we can easily compute in R.

```r
# calculate the standard error of each estimated effect
se.dy.dx <- sqrt(cov["x", "x"] + z0^2*cov["x:z", "x:z"] +
                   2*z0*cov["x", "x:z"])
```

To get the 90% confidence interval for each estimated marginal effect (one for each *z* in the R object z0), we simply go up and down 1.64 standard errors from the estimate.

```r
# calculate upper and lower bounds of a 90% CI
upr <- dy.dx + 1.64*se.dy.dx
lwr <- dy.dx - 1.64*se.dy.dx
```

All that is left is to plot the estimates and confidence intervals. I prefer to do this with my R package compactr, but you can use whatever you like.

```r
# plot the ME using compactr
library(compactr)
eplot(xlim = mm(z0), ylim = mm(c(upr, lwr)),
      xlab = expression(z),
      ylab = expression(frac(partialdiff*y, partialdiff*x)))
lines(z0, dy.dx, lwd = 3)
lines(z0, lwr, lty = 3)
lines(z0, upr, lty = 3)
```

Here's the entire bit of code in one frame for your convenience.

```r
# simulate a fake data set
set.seed(39740)
n <- 25
x <- rexp(n)
z <- rexp(n)
y <- x + z + x*z + rnorm(n)

# plot the simulated data
library(compactr)
eplot(xlim = mm(x), ylim = mm(y))
points(x, y)

# estimate the model with a product term
m <- lm(y ~ x + z + x*z)
# pull out coefficient estimates
beta.hat <- coef(m)
# pull out the covariance matrix
cov <- vcov(m)

# a set of values of z at which to compute the
# (instantaneous) effect of x
z0 <- seq(min(z), max(z), length.out = 1000)

# calculate the instantaneous effect of x as z varies
dy.dx <- beta.hat["x"] + beta.hat["x:z"]*z0

# calculate the standard error of each estimated effect
se.dy.dx <- sqrt(cov["x", "x"] + z0^2*cov["x:z", "x:z"] +
                   2*z0*cov["x", "x:z"])

# calculate upper and lower bounds of a 90% CI
upr <- dy.dx + 1.64*se.dy.dx
lwr <- dy.dx - 1.64*se.dy.dx

# plot the ME using compactr
eplot(xlim = mm(z0), ylim = mm(c(upr, lwr)),
      xlab = expression(z),
      ylab = expression(frac(partialdiff*y, partialdiff*x)))
lines(z0, dy.dx, lwd = 3)
lines(z0, lwr, lty = 3)
lines(z0, upr, lty = 3)
```

I've written an R package called compactr that helps create nice-looking plots in R, and it is now up on CRAN.

You can get it by typing

```r
install.packages("compactr")
library("compactr")
```

directly into the command line in R. See how it works by typing example(eplot) or by reading the details here. Below I describe the basic structure and functions.

The basic idea is to draw an empty plot with more compact axis notation using the eplot() function and then add points, lines, and text as needed using the points(), lines(), and text() functions from base graphics. It has always been my habit to get the axes of an empty plot looking good before focusing in on the important information. compactr just facilitates that workflow.

Just to see what compactr can do, compare this matrix of plots created by the plot() function from base graphics...

to the matrix of plots created by the eplot() function.

You can get the code to make both of these plots (as well as others) here.

Political scientists often theorize that explanatory variables should have "no effect" and support these claims by demonstrating that the estimated effects are not statistically significant. These empirical arguments are not particularly compelling, but I introduce applied researchers to simple, powerful tools that can strengthen their arguments for these hypotheses. With several supporting examples, I illustrate that researchers can use 90% confidence intervals to argue against meaningful effects and provide persuasive evidence for their hypotheses.

As far as I can tell, the debate started in political science when Wolfinger and Rosenstone published their book *Who Votes?*, in which they study the interaction between voting laws and education using a logit model without a product term. Instead, they simply calculate confidence intervals around the second differences.

Jonathan Nagler, however, rejects their finding in his 1991 article "The Effect of Registration Laws and Education on U.S. Voter Turnout," claiming that “examining predicted probabilities generated by non-linear models such as [logit models] may produce spurious results when used to determine interactive effects between two [explanatory] variables.” Instead, Nagler prefers to include a product term and argue for interaction by testing its statistical significance. When he includes the product of education and registration restrictions in a model of Wolfinger and Rosenstone’s data, the product term is statistically significant, but has the incorrect sign.

More recently, Berry, DeMeritt, and Esarey, in their 2010 article "Testing for Interaction in Binary Logit and Probit Models: Is a Product Term Essential?," challenge Nagler’s approach, arguing that researchers do not need to include product terms in logistic regression models to draw compelling substantive conclusions about interaction, if they expect interaction due to “compression.” They explain:

[I]n any logit or probit model, the marginal effect of \(X\) on \(Pr(Y)\) depends on the values of all [explanatory] variables; this marginal effect is greatest when \(Pr(Y)\) is 0.5 and declines when a change in either variable pushes \(Pr(Y)\) toward 0 or 1. We refer to the phenomenon ... as compression, because deviations of \(Pr(Y)\) away from [0.5] compress further possible change in \(Pr(Y)\) to ever-smaller ranges (p. 251).

Berry, DeMeritt, and Esarey continue:

[A] researcher...may base his hypothesis that independent variables interact in influencing \(Pr(Y)\) strictly on an expectation of compression. In this case, there is no need to include a product term in the model. Put differently, no product term is required if the analyst’s only reason for positing interaction between \(X\) and \(Z\) in influencing \(Pr(Y)\) is an expectation that when the value of \(X\) is extreme, \(Pr(Y)\) is near its limit of zero or one and thus there is little room for \(Pr(Y)\) to change as \(Z\) changes. In any event, the analyst should use the parameter estimates for the model (with no product term) to generate estimates of the effects of variables on \(Pr(Y)\) to test his hypothesis; the same tools used for models with a product term–second differences or marginal effect plots–can be employed (261).
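A toy example in R makes the compression idea concrete. This is my own illustration, not BDE's code, and the coefficient values are invented: even with no product term in the model, the first difference in \(Pr(Y)\) for a change in \(X\) shrinks as \(Z\) pushes \(Pr(Y)\) toward one.

```r
# a logit with NO product term: Pr(Y) = logit^-1(b0 + bx*x + bz*z)
# (coefficient values below are invented for illustration)
b0 <- 0; bx <- 1; bz <- 2

# first difference in Pr(Y) when x moves from 0 to 1, at a given z
fd_at_z <- function(z) {
  plogis(b0 + bx*1 + bz*z) - plogis(b0 + bx*0 + bz*z)
}
fd_at_z(0)  # about 0.23: x matters a lot when Pr(Y) is near 0.5
fd_at_z(2)  # about 0.01: little room left once Pr(Y) is near 1
```

The effect of \(X\) "interacts" with \(Z\) purely because the logistic curve flattens near zero and one, which is exactly the compression BDE describe.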

I really like Berry, DeMeritt, and Esarey's 2010 paper. It makes a lot of good points in a clear manner and offers good suggestions to applied researchers. I also like the fact that they view theory as driving the choices about the empirical argument. Some analyses seem somewhat rote, but these authors are promoting thoughtful analysis, which is great. For example, they point out that it matters whether we theorize about the so-called latent variable \(y^*\) or \(Pr(Y)\).

I don't, however, agree with their final conclusion, and I've written a short paper in response. They suggest that when the analyst theorizes about \(Pr(Y)\) and expects interaction due to compression, no product term is needed. I disagree.

They are correct--logit models can represent interaction due to compression. But we need more from statistical models. We need models that can represent the theory under consideration and plausible alternatives. In this case, a plausible alternative is a non-interactive model. Unfortunately, logit models don't do a good job of representing non-interactive relationships.

I've found, using simulations, that simply including a product term allows the model to represent non-interactive relationships surprisingly well. This leads to a counter-intuitive conclusion.

When the analyst theorizes about \(Pr(Y)\) and expects interaction due to compression, she must use a product term in her statistical model, not to allow the model to represent interactive relationships, but to allow the model to represent non-interactive relationships. If she fails to include the product term, then she is biasing herself toward finding interaction.

In other words, if an analyst fails to include a product term, then the size of the test (the probability of rejecting the null given that it's true) is much too large. To illustrate, I performed a large simulation study. Each iteration of the study generates a different "true" data-generating process. These are highly varied, but all are non-interactive. For each of these processes, I ask:

- What would happen if I theorized interaction due to compression and excluded the product term? In particular, what would be the size of my test for interaction?
- What would happen if I included a product term? What would the size of my test be?
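The second question is easy to sketch in a few lines of R. This is my own toy illustration, not the full simulation study: with a non-interactive data-generating process on the latent scale, the usual test on an included product term rejects at about its nominal rate.

```r
# sketch: size of the product-term test under a NON-interactive DGP
# (a toy illustration, not the code from the full simulation study)
set.seed(1234)
n.sims <- 500
n <- 1000
reject <- logical(n.sims)
for (i in 1:n.sims) {
  x <- rnorm(n)
  z <- rnorm(n)
  ystar <- x + z + rnorm(n)       # latent outcome; no interaction anywhere
  y <- as.numeric(ystar > 0)
  m <- glm(y ~ x + z + x:z, family = binomial(link = "probit"))
  # reject if the product term is "significant" at the 0.05 level
  reject[i] <- summary(m)$coefficients["x:z", 4] < 0.05
}
mean(reject)  # near the nominal size of 0.05
```

Because the fitted model includes the (truly zero) product term, the test has approximately the correct size.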

I plotted the size of the test for interaction using each approach. The result is below.

This figure shows that, for highly varied data-generating processes, the size of the test of interaction is very near 0.05 (where it should be) when the product term is included. The size falls below 0.1 for 95% of the data-generating processes considered.

However, when the product term is excluded, the size of the test can be very large. In fact, the size of the test fell below 0.1 for only 40% of the data-generating processes that I considered.

This simulation illustrates that analysts are biasing themselves toward finding interaction when they fail to include a product term.

Based on the analytical arguments and simulation studies, applied researchers can take several steps to ensure their claims of interaction are empirically meaningful.

- Researchers must have a clear idea of what relationships are consistent and inconsistent with their theory. They must use empirical models that can represent both relationships.
- Researchers must include product terms in order to draw compelling substantive conclusions about interaction from logistic regression models. Otherwise, they are simply pulling assumptions through data and then treating them as conclusions.
- If researchers do not include a product term, they should avoid making claims about interaction. The interaction is assumed and should be understood and clearly identified as such.

He makes two points:

- A small magnitude but statistically significant result contains virtually no important information.
- Even a large magnitude, statistically significant result is not especially convincing on its own.

In this post, I explore how we can set up the problem in a reasonable way and have the hypothesis test be informative. In particular, I'd like to argue that statistical significance can contain a lot of information about the hypothesis.

I am especially interested in altering the prior distribution Justin uses. He relies heavily on a prior that places a mass at zero, arguing that a reasonable, skeptical prior places a probability of 0.9 at zero. These priors strike me as unrealistic (though they are commonly used), because I feel that the probability of any true effect being zero is zero (on any problem I've ever worked on, anyway). There is always an effect. It might be positive or negative, and it might be tiny, but there is always an effect. Instead of figuring out whether the effect is zero or non-zero, I see the goal of inference as figuring out the direction of the effect or the size of the effect.

Before my original post even made it up, Justin posted some further comments that address some of my reaction, but not all of my questions, so I thought I'd elaborate a little more.

Rather than calculate \(Pr(\beta = 0 | \text{Sig.})\) as Justin does, I want to calculate \(Pr(\beta \leq 0 | \text{Pos. and Sig.})\), moving from a point null (two-sided test) to an interval null (one-sided test). I imagine a situation in which the researcher hypothesizes a positive effect, but of course the conclusions apply to the situation in which the researcher hypothesizes a negative effect.

Also, rather than use a prior that places a mass at zero, I want to use a continuous prior with a small variance that is centered at zero.

So I'm setting up the analysis slightly differently than Justin.

- I'm using a one-sided test. I already know the effect is not zero. I just want to know if I have evidence that it's not negative.
- I'm using a prior that reflects my hypotheses. Rather than placing a mass at zero, I have a skeptical prior centered at zero.

In my previous post, I speculated that approaching the problem in this manner would change the results significantly.

Analytics can get us a little way toward a solution. In the notation below, let significant mean \(p < 0.05\) for a one-sided test. We can use Bayes' Rule to see that

\( Pr(\beta \leq 0 | \text{Sig.}) = \dfrac{Pr(\text{Sig.} | \beta \leq 0)Pr(\beta \leq 0)}{Pr(\text{Sig.} | \beta \leq 0)Pr(\beta \leq 0) + Pr(\text{Sig.} | \beta > 0)Pr(\beta > 0)}\).

Now we need to say something about the prior for \(\beta\). I'm going to use a normal distribution with mean zero and a standard deviation that varies. We can use some simple analytics to study the limiting cases of \(\sigma = 0\) and \(\sigma = \infty\).

As \(\sigma\) goes to zero, the probability of obtaining a statistically significant result goes to 0.05, both when \(\beta \leq 0\) and when \(\beta > 0\). This occurs because we are shrinking the prior standard deviation toward zero, so the true effect is essentially zero, and a test at the 0.05 level rejects in 5% of repeated trials. So then we can easily see that

\(Pr(\text{Sig.} | \beta \leq 0) = Pr(\text{Sig.} | \beta > 0) = 0.05\).

Because we're using a normal distribution centered at zero as our prior for \(\beta\), \(Pr(\beta \leq 0) = Pr(\beta > 0) = .5\). Plugging this information in yields

\( Pr(\beta \leq 0 | \text{Sig.}) = \dfrac{.05 \times 0.5}{0.05 \times 0.5 + 0.05 \times 0.5} = 0.5\).

This means that, for a normal prior with a very small standard deviation, the posterior probability that the null hypothesis is true (that the effect is not positive) is about 0.5, regardless of the data.
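The \(\sigma \to 0\) limiting case is also easy to check by simulation. In this quick sketch (the sample size and simulation settings are arbitrary choices of mine), the one-sided test rejects about 5% of the time when the true effect is essentially zero.

```r
# check the sigma -> 0 limiting case: with the true effect essentially
# zero, a one-sided test at the 0.05 level rejects about 5% of the time
# (sample size and simulation settings are arbitrary choices)
set.seed(1977)
n <- 1000
n.sims <- 2000
x <- runif(n)
sig <- logical(n.sims)
for (i in 1:n.sims) {
  beta <- rnorm(1, 0, 1e-6)                # prior sd very near zero
  y <- rnorm(n, beta*x, 1)
  t.stat <- summary(lm(y ~ x))$coefficients[2, 3]
  sig[i] <- t.stat > qt(0.95, df = n - 2)  # one-sided, alpha = 0.05
}
mean(sig)  # close to 0.05
```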

The second situation in which analytics seem helpful is a prior with a very large standard deviation. In this situation, \(Pr(\text{Sig.} | \beta \leq 0) = 0\), since almost all of the parameter space with \(\beta \leq 0\) yields negative effects so large that they will never be mistaken for positive effects. Similarly, \(Pr(\text{Sig.} | \beta > 0) = 1\), since, for almost all of the parameter space with \(\beta > 0\), the effects are so large that they will always be statistically significant. Substituting these values in yields a zero in the numerator, so the posterior probability of a negative relationship is zero. Now we see that if one has a non-informative or flat prior, then statistical significance contains a lot of information--it suggests the probability of the null is zero.

But what about other priors? For the nonlimiting cases, such as \(\sigma = 0.1\), figuring out the posterior probability requires some integration or simulation. I worked up a quick program in R, varying the prior standard deviation and the sample size. The results are presented graphically below.

The true effect in these simulations is a random draw from the prior distribution, so the prior distribution is exactly right. But notice that when we can reject the null hypothesis that \(\beta \leq 0\), the probability that \(\beta < 0\) goes to zero. For the less informative priors (e.g., \(\sigma = 0.3\) or \(0.5\)), it goes to zero fairly quickly. In other words, even when we think the effect is near zero (and we're right), statistical significance is informative about the sign of \(\beta\).

In Justin's analysis, the posterior probability of the null hypothesis was bounded below by about 0.3--suggesting that statistical significance doesn't contain much information. When the problem is set up with a continuous prior on \(\beta\) and a one-tailed test, then even highly skeptical priors can lead to confident conclusions for large enough samples. Indeed, for any symmetric prior centered at zero with non-zero variance, the posterior probability of the null hypothesis, conditional on significance, gets very near zero for a large enough sample size. The simulations above show that even with very skeptical priors (e.g., \(\sigma = 0.1\)) and moderate sample sizes (e.g., 2,000), the posterior probability of the null can be nearly zero.

So what should you take from all this?

- Priors matter. I'm usually suspicious of priors that place a mass on zero. In this case, the choice of prior has a big impact on the posterior. The correct prior varies across applications, but I can't think of a social science problem where a prior that places a mass at zero would be appropriate. I think similar skepticism is warranted in the case of Bayesian model averaging. That said, I've done BMA and find it especially useful for forecasting.
- Statistical significance is a useful idea. It does contain a good deal of information. I've defended statistical significance as a useful way to quickly characterize results, and I still think that's the case.

The R code below replicates the figure above.

```r
Simulate <- function(n.sims, samp.size, prior.sd) {
  x <- runif(samp.size)
  res <- matrix(NA, nrow = n.sims, ncol = 2)
  for (i in 1:n.sims) {
    beta <- rnorm(1, 0, prior.sd)
    y <- rnorm(samp.size, beta*x, 1)
    m <- lm(y ~ x)
    p <- summary(m)[[4]][2, 4]
    pos <- coef(m)[2] > 0
    sig <- pos == 1 & p < .1
    res[i, ] <- c(beta, sig)
  }
  p1 <- sum(res[res[, 1] < 0, 2])/length(res[res[, 1] < 0, 2])
  p2 <- sum(res[res[, 1] > 0, 2])/length(res[res[, 1] > 0, 2])
  P <- p1*.5/(p1*.5 + p2*.5)
  return(P)
}

sd <- c(.01, .1, .3, .5)
n <- seq(100, 10000, length.out = 6)
iter <- 0
P <- matrix(NA, nrow = length(sd), ncol = length(n))
for (i in 1:length(sd)) {
  for (j in 1:length(n)) {
    iter <- iter + 1
    print(paste("Iteration: ", iter, "/", length(sd)*length(n),
                "; (", i, ", ", j, "); (sigma = ", round(sd[i], 2),
                ", n = ", n[j], ")", sep = ""))
    P[i, j] <- Simulate(10000, n[j], sd[i])
  }
}

par(oma = c(0,0,0,0), mar = c(3,5,2,1), family = "serif")
plot(NULL, xlim = c(0, max(n)), ylim = c(0, .5), axes = FALSE,
     xlab = NA, ylab = NA)
box()
axis(side = 1, tck = -.015, labels = NA)
axis(side = 2, tck = -.015, labels = NA)
axis(side = 1, lwd = 0, line = -.8)
axis(side = 2, lwd = 0, line = -.6, las = 1)
loc <- par("usr")
text(loc[1]*1.02, loc[4]*1.04,
     expression(paste("Pr(", beta < 0, " | Sig.)")),
     xpd = NA, adj = c(1, 0), cex = .9)
mtext(side = 1, "Sample Size", line = 1)
for (i in 1:length(sd)) {
  points(n, P[i, ], type = "b", pch = 21, col = i)
  text(n[1], P[i, 1], bquote(sigma == .(round(sd[i], 3))),
       pos = 3, cex = .6)
}
```

My pet-peeve *almost* made the list at number 39.

Lack of statistical significance indicates that there is not enough evidence to reject the null hypothesis of no effect, so inferences based on the direction of nonstatistically significant coefficients should be offered cautiously or not at all.

I would elaborate the point and say that researchers should not interpret insignificant coefficients as "no effect" either. I've talked about that plenty, so I'll just point non-regular readers to this post and this paper.

Point 17 seems a little weird.

Hypotheses should be numbered consecutively, such as H1, H2, and H3, instead of being named with number-letter combinations, such as H1a, H1b, and H2, because number-letter combinations foster confusion: H2 is the third hypothesis in the aforementioned example.

I can't imagine that the number-letter combinations always foster more confusion than simple numbering. (Though I did once read a paper with an H1, H1a, and H2, but no H1b. That strategy left me scratching my head.) The reader won't remember numbers or number-letter combinations, so it seems like we should just get rid of numbering hypotheses altogether. If a hypothesis isn't important enough to have a descriptive name, perhaps it's not worth the reader's time.

Point 57 really stepped on my toes.

The manuscript should not contain errors in grammar, spelling, or punctuation; these errors indicate that the writing of the manuscript was not conducted carefully and suggest that the reported research might not have been conducted carefully, either.

I find typos in papers that I've edited dozens of times. It never ends.

In particular, my friend had a hypothesis that the variance of the latent outcome (commonly called "y-star") should increase with an explanatory variable of interest. He was using the heteroskedastic probit model, which looks something like \(Pr(y_i = 1) = \Phi(X_i\beta, e^{Z_i\gamma})\), where \(\Phi()\) is the cumulative normal with mean \(X_i\beta\) and standard deviation \(e^{Z_i\gamma}\).
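In code, the model's success probability is just a rescaled normal CDF. Here is a minimal sketch; the function name and coefficient values are mine, not part of the original post:

```r
# heteroskedastic probit: the latent y* ~ N(x*beta, sd = exp(z*gamma)), so
# Pr(y = 1) = Pr(y* > 0) = pnorm(x*beta / exp(z*gamma))
het_probit_prob <- function(x, z, beta, gamma) {
  pnorm(x*beta / exp(z*gamma))
}

# with gamma = 0 the latent sd is 1 and this reduces to a plain probit
het_probit_prob(x = 1, z = 1, beta = 0.5, gamma = 0)  # equals pnorm(0.5)

# a positive gamma inflates the latent sd, pulling the probability toward .5
het_probit_prob(x = 1, z = 1, beta = 0.5, gamma = 1)
```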

He wanted to argue that his explanatory variable increased both the mean function (\(X\beta\)) and the variance function (\(e^{Z\gamma}\)). To do this, he included his variable in both the \(X\) and \(Z\) matrices and tested the statistical significance of the associated coefficients. He found that both were significant. It would seem that his variable increases the mean and the variance of the latent outcome. He wanted to know if this was good evidence for his theory.

I replied that I did not think so, because a binary outcome variable doesn't contain any direct information about a non-constant variance. Indeed, the variance of a Bernoulli random variable is tied directly to the probability of success. This implies that any inference about changes in \(e^{Z\gamma}\) must come from observed changes in the probability of success (i.e., changes in the mean function). But because we've assumed a specific (i.e., linear) functional form for the mean function, any deviations from that form get attributed to the variance function. The results, in other words, are driven entirely by our assumption of a linear mean function. Indeed, it would not be hard to find a plausible non-linear mean function (e.g., a quadratic specification) that makes the \(\gamma\) parameter no longer significant.
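That tie between the mean and the variance is easy to verify: for a Bernoulli outcome, \(Var(y) = p(1 - p)\), so once the success probability is pinned down, so is the variance. A quick check in R:

```r
# the variance of a Bernoulli variable is a function of its mean alone:
# Var(y) = p * (1 - p), leaving no separate variance to learn from the data
set.seed(1)
p <- 0.3
y <- rbinom(10^6, 1, p)
mean(y)   # close to 0.3
var(y)    # close to 0.3 * 0.7 = 0.21
```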

I thought a good way to illustrate this claim would be to show that for a large but plausible sample size of one million, the heteroskedastic probit will suggest a non-constant variance when the relationship is simply a logit.

To see an illustration of this, start by simulating data from a simple logit model. Then estimate a regular probit and a heteroskedastic probit.

```r
# simulate from a simple logit model (constant latent variance)
n <- 10^6
x <- runif(n)
y <- rbinom(n, 1, plogis(-4 + 3*x))

# regular probit
r1 <- glm(y ~ x, family = binomial(link = "probit"))

# heteroskedastic probit
library(glmx)
h1 <- hetglm(y ~ x)
```

Now if the coefficient for x is significant in the model of the scale, then we should conclude there is heteroskedasticity, right? No, because we already know that the latent variance is constant. However, we've *barely* mis-specified the link function (we're using a probit, but the true model is a logit). This slight mis-specification causes the results to point toward non-constant variance.

```
Coefficients (binomial model with probit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.090026   0.007269  -287.5   <2e-16 ***
x            1.589261   0.007418   214.2   <2e-16 ***

Latent scale model coefficients (with log link):
  Estimate Std. Error z value Pr(>|z|)    
x -0.20780    0.01475  -14.09   <2e-16 ***
```

Here is a plot of the predicted probabilities from the true, probit, and heteroskedastic probit models. Notice that in the range of the data, the heteroskedastic probit does a great job of representing the relationship. However, that's not because the variance is non-constant as the heteroskedastic probit would suggest. It's because the link function is slightly mis-specified.
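The plot itself doesn't survive in this copy of the post. Here is a self-contained sketch of the comparison it describes, rerunning a smaller version of the simulation; I'm assuming the `predict` method for `hetglm` fits accepts `type = "response"`:

```r
# rerun a smaller version of the simulation and compare the three curves
set.seed(42)
n <- 10^5
x <- runif(n)
y <- rbinom(n, 1, plogis(-4 + 3*x))
r1 <- glm(y ~ x, family = binomial(link = "probit"))

x.grid <- seq(0, 1, length.out = 100)
plot(x.grid, plogis(-4 + 3*x.grid), type = "l",
     xlab = "x", ylab = "Pr(y = 1)")                              # true logit
lines(x.grid, pnorm(coef(r1)[1] + coef(r1)[2]*x.grid), lty = 2)   # probit fit
if (requireNamespace("glmx", quietly = TRUE)) {
  h1 <- glmx::hetglm(y ~ x)
  lines(x.grid, predict(h1, newdata = data.frame(x = x.grid),
                        type = "response"), lty = 3)              # het. probit
}
legend("topleft", lty = 1:3, legend = c("true (logit)", "probit", "het. probit"))
```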

I think the logit example makes the point powerfully, but let's look at a second example just for kicks. This time, let's say that we believe there's heteroskedasticity that can be accounted for by x, so we estimate a heteroskedastic probit and include x in the mean and variance function. However, again we're wrong. Actually the true model has a constant variance, but a non-linear mean function (\(\beta_0 + \beta_1x^2\)).

If we simulate the data and estimate the model, we see again that our mis-specified mean function leads us to conclude that the variance is non-constant. In this case, though, the mis-specification is severe enough that you'll find significant results with a much smaller sample. If we draw conclusions about non-constant variance from the statistical significance of coefficients in the model of the variance, we will be led astray.
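The post doesn't include the code for this second example, so here is a minimal sketch; the sample size and the true coefficients (a probit with mean \(-2 + 3x^2\)) are my own assumptions, not the author's:

```r
# sketch of the second example (assumed values, not the author's):
# constant latent variance, but a quadratic (non-linear) mean function
set.seed(1234)
n <- 1000                                # a "much smaller sample" than 10^6
x <- runif(n)
y <- rbinom(n, 1, pnorm(-2 + 3*x^2))     # true model: probit, quadratic mean

# the analyst wrongly assumes a linear mean plus a non-constant variance
if (requireNamespace("glmx", quietly = TRUE)) {
  h2 <- glmx::hetglm(y ~ x | x)          # x in both mean and scale models
  print(summary(h2))  # the scale coefficient on x tends to come out "significant"
}
```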

```
Coefficients (binomial model with probit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.6122     0.3726  -9.694   <2e-16 ***
x             3.1493     0.2809  11.210   <2e-16 ***

Latent scale model coefficients (with log link):
  Estimate Std. Error z value Pr(>|z|)    
x  -0.8194     0.2055  -3.988 6.66e-05 ***
```

Again, the plot shows that the heteroskedastic probit does a good job at adjusting for the mis-specified mean function (working much like a non-parametric model).

I think that researchers who have a theory that allows them to speculate about the mean and variance of a latent variable should go ahead and estimate a statistical model that maps cleanly onto their theory (like the heteroskedastic probit). However, these researchers should realize that this model does not allow them to distinguish between non-constant variance and a mis-specified mean function.
