Psychological Statistics

Effect size measures for simple mediation

2026-01-25T13:50:00.002+00:00

A claim I see quite often in mediation is that there is full (or complete) mediation if the indirect effect (ab) is statistically significant and the direct effect (c') is not. That never really made any sense to me as this is essentially a claim about the size of the mediation effect. For instance the overall effect c could be 10, ab could be 5.01 and c' could be 4.99 and the proportion of the total effect accounted for by ab and by c' would be about equal at 0.5 or 50%. (If using common approaches like bootstrapping it's feasible, albeit unlikely, for c' to be bigger than ab and yet for the latter, not the former, to be statistically significant.)

In my view, claiming full mediation requires that the ab effect is large and accounts for most and perhaps nearly all of the total effect. There are some effect size metrics proposed to determine if mediation is full or substantial. For example, one proposal includes the proportion of the total effect accounted for by ab being > 0.80 as part of the condition. My personal view is that it's more helpful to report a mediation effect size that could go from 0 to 1 (or 0 to 100%) and just interpret that. In practice full or complete mediation expressed as 100% of the effect is likely to be rare and it's more useful to know how substantial the effect is. You could also set a threshold of say 90% or 95% in advance if you pre-registered the study (and if being able to declare full mediation really mattered).
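To make the arithmetic concrete, here is a minimal sketch of the simple proportion measure. It is written in Python rather than R purely for illustration, the function name is hypothetical, and it assumes a linear simple mediation model in which the total effect decomposes as c = ab + c'.

```python
def proportion_mediated(ab, c_prime):
    """Share of the total effect c = ab + c' carried by the indirect path ab.

    Assumes a linear simple mediation model, so that c = ab + c'.
    """
    c = ab + c_prime
    if c == 0:
        raise ValueError("total effect is zero; the proportion is undefined")
    return ab / c


# The example from the text: c = 10, ab = 5.01, c' = 4.99
pm = proportion_mediated(ab=5.01, c_prime=4.99)
print(round(pm, 3))  # 0.501

# A pre-registered threshold for declaring 'full' mediation (value is illustrative)
threshold = 0.90
print(pm >= threshold)  # False
```

With these numbers the indirect path carries about half of the total effect, nowhere near a 90% threshold, regardless of which of ab or c' happens to be statistically significant.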

Having determined the need for an effect size estimate, what should you use? Until recently I think the choice wasn't obvious. I always recommended the simple proportion measure above because it's simple and easy to interpret and I value that more than other statistical properties. However, a 2026 paper by Yuan, Wang and Liu reviews a range of different metrics and proposes a new R-square measure. Unlike many other R-square type measures it looks solely at the variance accounted for by the total effect of X on Y (c). So it takes the value 1 if the variance accounted for by ab is 100% of this value (and 0 if it explains none). I am usually wary of R-square measures but this seems reasonable (and in fact it behaves similarly to the proportion of effect measure described above). So one can see it as a refinement of this simpler approach to use variance rather than the magnitude of effect.

All of the above holds for consistent mediation, but not inconsistent mediation. Yuan et al. do describe a version of their measure for the inconsistent case, but while interesting I think it isn't useful in practice. I think it makes more sense to generalise the proportion measure to the inconsistent case.
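A small sketch shows why the simple proportion breaks down in the inconsistent case, where ab and c' have opposite signs. The numbers are made up, the function names are hypothetical, and the absolute-value share shown here is only one possible generalisation. It is not necessarily the one used in my function or in Yuan et al.

```python
def is_inconsistent(ab, c_prime):
    """Inconsistent mediation: indirect and direct effects have opposite signs."""
    return ab * c_prime < 0


def simple_proportion(ab, c_prime):
    """ab / c with c = ab + c'; only bounded by 0 and 1 for consistent mediation."""
    return ab / (ab + c_prime)


def absolute_share(ab, c_prime):
    """One possible generalisation (an assumption): |ab| / (|ab| + |c'|)."""
    denom = abs(ab) + abs(c_prime)
    if denom == 0:
        raise ValueError("both effects are zero; the share is undefined")
    return abs(ab) / denom


# Hypothetical inconsistent case: ab = 6, c' = -2, so c = 4
print(is_inconsistent(6, -2))    # True
print(simple_proportion(6, -2))  # 1.5 (outside the 0 to 1 range)
print(absolute_share(6, -2))     # 0.75
```

When ab and c' share a sign the two functions agree, so the absolute-value version reduces to the simple proportion in the consistent case.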

I think the paper is well worth a read as it looks at many different measures and also includes R code to calculate most of the measures. I ended up adapting their code to make it easier to use and adding to the range of statistics provided. I also include my attempt to generalise a proportion measure to the inconsistent case. You can find my function, examples of how to use it and a longer note on my thoughts about effect size for simple mediation here.

Yuan, K.-H., Wang, Y., & Liu, H. (2026). Effect size measures in mediation analysis: New and Old, What is Good? Methods in Psychology, 14, 100224. https://doi.org/10.1016/j.metip.2025.100224

Postscript: I have had some informal communication from someone working in the area. One challenge is that the new proposed measure is not very stable. This is fairly common (e.g., the proportion measures aren't either). In general there is no perfect effect size measure for any context and one should consider the properties you need in any given situation against available measures (and often contrasting two or three can give you useful insights, whereas reliance on a single measure could be problematic). For more general thoughts on effect size and what makes a good measure I recommend looking at Kelley and Preacher (2012).

A quick overview of the egocentric relation event model (EREM) with linked examples in R: part 1

2021-09-08T22:59:00.000+01:00

This post is based on an interdisciplinary collaboration (between anthropologists and psychologists) with Kate Ellis-Davies and Sheina Lew-Levy (who initiated the project), Eleanor Fleming and Adam Boyette. The work was recently published in Field Methods and is available here: Demonstrating the Utility of Egocentric Relational Event Modeling Using Focal Follow Data from Congolese BaYaka Children and Adolescents Engaging in Work and Play. I led on the statistical modeling of the data including adapting the implementation for focal follow data (which I'll explain more about in part 2) with the other authors contributing knowledge of the theory and literature (as we were keen to ensure that we demonstrated the approach using real data addressing a substantive research question). My only regret about the project is that, as of writing, I've only been able to meet one of my co-authors in person.

The purpose of this blog post is to give a bit more context to the approach and make it easier to learn about the egocentric relational event model (EREM) by providing access to the R code and data. A relational event is a discrete event involving an actor and one or more targets. For example you might observe children in a play group and code their interactions as discrete events in time such as child 1 approaching child 2, child 2 offering child 1 a toy and so on. Relational event models are one approach to modeling such data (and fall under the broader umbrella of network methods). However, not all relational event data involves such complete and detailed network data. In some cases you are only interested in or have access to data involving one actor at a time in relation to their environment (which may include interactions with other actors). Such data lend themselves to analysis using the egocentric relational event model. Although this sort of data might seem limiting it is often very rich. In particular it lends itself to analysing patterns of sequences among discrete non-overlapping events.

The model comes in two flavours: ordinal and interval. In the ordinal version you only have access to information about the order of events. In the interval version you have (or can infer) start and finish times for each event. This means that you can model not only patterns in the sequences of events but their duration. For example, does a particular type of event increase the frequency or duration of another event?

The Marcum and Butts example uses the American Time Use Survey, which "measures the amount of time people spend doing various activities, such as paid work, childcare, volunteering, and socializing". Using these data we can answer questions such as how often sleep gets interrupted by other activities and which activities cause more sleep interruptions (or more prosaic questions such as how much time people spend doing a particular activity). We can also look at how covariates impact the frequency or duration of events. For example, are some sleep interruptions more common for men than women?
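To give a feel for the kind of data involved, here is a toy sketch of one person's event sequence (in Python rather than R, purely for illustration). The activities and times are invented, and these are simple descriptive tallies rather than an EREM fit. The order of events alone is what the ordinal version works with, while the start and end times add the duration information available in the interval version.

```python
from collections import Counter

# Hypothetical diary for one person: (activity, start_minute, end_minute).
# Events are discrete and non-overlapping, as the egocentric model assumes.
events = [
    ("sleep", 0, 180),
    ("childcare", 180, 200),
    ("sleep", 200, 420),
    ("eating", 420, 450),
    ("work", 450, 930),
    ("sleep", 930, 960),
    ("phone", 960, 965),
    ("sleep", 965, 1000),
]

# Ordinal information: only the order of events is used.
order = [activity for activity, _, _ in events]

# Count activities that interrupt sleep, i.e. sleep -> activity -> sleep.
interruptions = Counter(
    middle
    for before, middle, after in zip(order, order[1:], order[2:])
    if before == "sleep" and after == "sleep" and middle != "sleep"
)

# Interval information: start and end times also give total time per activity.
durations = Counter()
for activity, start, end in events:
    durations[activity] += end - start

print(dict(interruptions))  # {'childcare': 1, 'phone': 1}
print(durations["sleep"])   # 465
```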

Please note that the following examples assume you have some familiarity with linear regression models (and ideally generalised linear models for discrete data such as logistic or Poisson regression). You'll also need a working knowledge of R (or at least similar statistical software programming environments). If R is new to you I'd suggest finding a tutorial on using R with RStudio first. (There are a lot of these online - including many videos.)

In this first part I'm actually going to focus not on our data but on a paper by Marcum and Butts (2015). This introduces the egocentric relational event model and is a fantastic resource for a lot of the technical details of the model. You can run an EREM with the relevent package in R, but setting up and running EREMs is more than a little bit fiddly. Marcum and Butts have made this easier with the informR package in R. This is essentially a helper package to make running EREM models easier. They include a really useful tutorial with R code in the paper. So if you want to learn about the EREM I'd suggest working through the relevant parts of the paper (no pun intended). To make this easier (I hope) I've added some commentary and made some minor tweaks to their example.

You can access the Marcum and Butts (2015) worked example here.

In part 2 I'll focus on our Field Methods paper.

References

Ellis-Davies, K., Lew-Levy, S., Fleming, E., Boyette, A. H., & Baguley, T. (2021). Demonstrating the Utility of Egocentric Relational Event Modeling Using Focal Follow Data from Congolese BaYaka Children and Adolescents Engaging in Work and Play. Field Methods. 33(3):287-304.

Marcum, C. S., & Butts, C. T. (2015). Constructing and Modifying Sequence Statistics for relevent Using informR in R. Journal of Statistical Software, 64(5).

I Will Not Ever, NEVER Run a MANOVA

2021-08-17T15:20:00.000+01:00

I have been thinking of writing a paper about MANOVA (and in particular why it should be avoided) for some time, but never got round to it. However, I recently discovered an excellent article by Francis Huang that pretty much sums up most of what I'd cover. In this blog post I'll just run through the main issues and refer you to Francis' paper for a more in-depth critique or the section on MANOVA in Serious Stats (Baguley, 2012).

I have three main issues with MANOVA:

1) It doesn't do what people think it does

2) It doesn't offer Type I error protection for subsequent univariate tests (even though many text books say it does)

3) There are generally better approaches available if you really are interested in multivariate research questions

Let's start with the first point. People think MANOVA analyses multiple outcome variables (DVs). This isn't really correct. It creates a composite DV by combining the outcome variables in an atheoretical way. Then analysis proceeds on the composite DV. The composite is in a sense 'optimal' because weights are selected to maximise the variance explained from the set of predictors in the model. However, this optimisation will capitalise on chance. Furthermore it will be unique to your sample – invalidating (or at least making difficult) comparisons between studies. It will also be hard to interpret. This has knock-on implications for things like standardised effect sizes, as effect size metrics for MANOVA generally relate to the composite DV rather than the original outcome variables. For further discussion see Grayson (2004).

In relation to the second point the issue is one that is fairly well known in other contexts. In ANOVA one can use an omnibus test of a factor to decide whether to proceed with post hoc pairwise comparisons. This is the logic behind the Fisher LSD test and it is well known that this test doesn't protect Type I error very well if there are more than 3 means being compared – specifically it protects against the complete null hypothesis and not the partial null hypothesis (see Serious Stats p. 495-501). For adequate Type I error protection it would be better to use something like the Holm or Hochberg correction (the latter having greater statistical power if the univariate test statistics are correlated – which they generally are if MANOVA is being considered). That said if you do just want a test of the omnibus null hypothesis – that there are no effects on any of the DVs – MANOVA may be a convenient way to summarise a large set of univariate tests that are non-significant.
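As an illustration of the two corrections, here is a minimal sketch of Holm and Hochberg adjusted p-values (in Python rather than R, where p.adjust offers both methods). The p-values are made up. Note that the Hochberg procedure is only guaranteed to control Type I error when the tests are independent or positively dependent.

```python
def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def hochberg(pvalues):
    """Hochberg step-up adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # largest first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (rank + 1) * pvalues[idx])
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[idx] = running_min
    return adjusted


# Hypothetical univariate p-values for three DVs
p = [0.030, 0.040, 0.045]
print([round(x, 3) for x in holm(p)])      # [0.09, 0.09, 0.09]
print([round(x, 3) for x in hochberg(p)])  # [0.045, 0.045, 0.045]
```

In this example all three tests survive the Hochberg correction at the .05 level but none survive Holm, which shows where the extra power comes from.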

Last but not least, there exist multivariate regression (and other) approaches that are more appropriate for multivariate research questions (see also Huang, 2020). However, I've rarely seen MANOVA used for multivariate research questions. In fact, I've rarely if ever seen a MANOVA reported that actually aided interpretation of the data.

References

Baguley, T. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences. Palgrave Macmillan. (see pages 647-650)

Grayson, D. (2004). Some Myths and Legends in Quantitative Psychology. Understanding Statistics, 3(2), 101–134. https://doi.org/10.1207/s15328031us0302_3

Huang, F. L. (2020). MANOVA: A Procedure Whose Time Has Passed? Gifted Child Quarterly, 64(1), 56–60. https://doi.org/10.1177/0016986219887200

Huberty, C. J., & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analyses. Psychological Bulletin, 105(2), 302–308. https://doi.org/10.1037/0033-2909.105.2.302

A brief introduction to logistic regression

2021-01-19T14:34:00.002+00:00

I wrote a brief introduction to logistic regression aimed at psychology students. You can take a look at the pdf here:

A more comprehensive introduction in terms of the generalised linear model can be found in my book:

Baguley, T. (2012). Serious stats: a guide to advanced statistics for the behavioral sciences. Palgrave Macmillan.

Serious Stats: Obtaining CIs for Spearman's rho or Kendall's tau

2020-05-18T20:29:00.000+01:00

I wrote a short blog (with R Code) on how to calculate corrected CIs for rho and tau using the Fisher z transformation.
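For a flavour of the calculation, here is a minimal sketch (in Python rather than the R used in the linked post). It assumes the variance approximations of Fieller, Hartley and Pearson (1957) for the Fisher z transform, 1.06/(n - 3) for rho and 0.437/(n - 4) for tau. The linked post may use a different or refined correction, and the function name is hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_z_ci(r, n, kind="spearman", level=0.95):
    """Approximate CI for Spearman's rho or Kendall's tau via the Fisher z transform.

    The standard errors are the Fieller, Hartley and Pearson (1957)
    approximations (an assumption about which correction to use).
    """
    if kind == "spearman":
        se = sqrt(1.06 / (n - 3))
    elif kind == "kendall":
        se = sqrt(0.437 / (n - 4))
    else:
        raise ValueError("kind must be 'spearman' or 'kendall'")
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # e.g. about 1.96 for 95%
    z = atanh(r)
    # Back-transform the interval from the z scale to the correlation scale
    return tanh(z - crit * se), tanh(z + crit * se)


# Hypothetical example: rho = .50 and tau = .35 from n = 30 observations
print(fisher_z_ci(0.50, 30, kind="spearman"))
print(fisher_z_ci(0.35, 30, kind="kendall"))
```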

Serious Stats blog post on CIs for rho and tau

Serious stats: Type II versus Type III Sums of Squares

2020-05-13T22:28:00.002+01:00

I have written a short article on Type II versus Type III SS in ANOVA-like models on my Serious Stats blog:

https://seriousstats.wordpress.com/2020/05/13/type-ii-and-type-iii-sums-of-squares-what-should-i-choose/

Egon Pearson correction for Chi-Square

2019-09-05T16:37:00.003+01:00

I have just published a short blog on the Egon Pearson correction for the chi-square test. This includes links to an R function to run the corrected test (and also provides residual analyses for contingency tables).
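As a rough illustration for a 2 x 2 table, here is a minimal sketch in Python (not the linked R function). It assumes the correction is the usual 'N - 1' adjustment, which multiplies the Pearson chi-square statistic by (N - 1)/N. The cell counts are made up and the function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson and 'N - 1' corrected chi-square for the 2 x 2 table [[a, b], [c, d]].

    Returns (pearson, corrected, p_value), with the p-value for the corrected
    statistic on 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    pearson = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    corrected = pearson * (n - 1) / n
    # Upper tail of a chi-square distribution with 1 df: erfc(sqrt(x / 2))
    p_value = erfc(sqrt(corrected / 2))
    return pearson, corrected, p_value


# Hypothetical 2 x 2 table with N = 30
pearson, corrected, p_value = chi_square_2x2(10, 5, 4, 11)
print(round(pearson, 3), round(corrected, 3))  # 4.821 4.661
```

The correction matters most in small samples, because (N - 1)/N approaches 1 as N grows.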

The blog is here and the R function here.

Provisional programme: ESRC funded conference: Bayesian Data Analysis in the Social Sciences Curriculum (Nottingham, UK 29th Sept 2017)

2017-09-15T14:50:00.000+01:00

Bayesian Data Analysis in the Social Sciences Curriculum

Supported by the ESRC’s Advanced Training Initiative

Venue: Bowden Room Nottingham Conference Centre

Burton Street, Nottingham, NG1 4BU

Booking information online

Provisional schedule:

*Time*	*Speaker*	*Title*
9.30		Registration (and coffee!)
9.50	Thom Baguley	Introduction And Welcome
10.00	Mark Andrews, Thom Baguley	Teaching Bayesian Data Analysis To Social Scientists
10.50	Zoltan Dienes	Principles For Teaching And Using Bayes Factors
11.40		Coffee
12.00	Colin Foster	Bayes Factors Show Equivalence Between Two Contrasting Approaches To Developing School Pupils’ Mathematical Fluency
12.20	Helen Hodges	Towards A Bayesian Approach In Criminology: A Case Study Of Risk Assessment In Youth Justice
12.40		Lunch
1.40	Jayne Pickering, Matthew Inglis, Nina Attridge	Does Pain Affect Performance On The Attentional Networking Task?
2.00	Oliver Clark	First Steps Towards A Bayesian Model Of Video Game Avatar Influence
2.20		Coffee
2.40	Richard Morey	The Fallacy Of Placing Confidence In Confidence Intervals
3.30	Daniel Lakens	Learning Bayes As A Frequentist: A Personal Tragedy In Three Parts
4.20		Close and farewell

Organizers:

Thom Baguley twitter: @seriousstats

Mark Andrews twitter: @xmjandrews

Announcement: ESRC funded conference: Bayesian Data Analysis in the Social Sciences Curriculum (29th Sept 2017)

2017-07-27T20:36:00.003+01:00

I am pleased to announce booking is now open for the end-of-grant Prior Exposure conference on Bayesian Data Analysis in the Social Sciences Curriculum on 29th September. We are still finalising the programme but have confirmed contributions from Richard Morey (University of Cardiff), Zoltan Dienes (University of Sussex) and Daniel Lakens (Eindhoven University of Technology) as well as presentations from Mark Andrews and myself.

We are also inviting submissions from PhD students and others on using and teaching Bayes. Booking for the final round of workshops will also open shortly.

https://www.ntu.ac.uk/about-us/events/events/2017/09/esrc-conference-bayesian-data-analysis-in-the-social-sciences-curriculum

You can book via this link:

http://onlinestore.ntu.ac.uk/conferences-events/school-of-social-sciences/events/bayesian-data-analysis-in-the-social-sciences-curriculum

Delegate Registration: Registration is £20 (Early Bird fee of £10 if booked before Friday 1 September 2017). The cost includes lunch and coffee in the Nottingham Conference Centre (Newton building, Nottingham Trent University).

STOP PRESS Introductory Bayesian data analysis workshops for social scientists (June 2017 Nottingham UK)

2017-06-13T15:50:00.003+01:00

The third and (possibly) final round of our introductory workshops was overbooked in April, but we have managed to arrange some additional dates in June.

There are still places left on these. More details at: http://www.priorexposure.org.uk/

As with the last round we are planning a free R workshop beforehand (recommended if you need a refresher or have never used R before). Unfortunately we can't offer bursaries for these additional workshops (as this wasn't part of the original ESRC funding).

They are primarily (but not exclusively) aimed at UK social science PhD students (so not just Psychology or Neuroscience, but very much also Sociology, Criminology, Politics and other social science disciplines). We hope the workshops will also appeal to early career researchers and others doing quantitative social science research (but with little or no Bayesian experience).

The registration cost for each workshop is £20 (for postgrads) and £30 (for others).

Serious Stats blog: CI for differences in independent R square coefficients

2017-05-25T22:30:00.001+01:00

In my Serious Stats blog I have a new post on providing CIs for a difference between independent R square coefficients.

You can find the post there or go direct to the function hosted on RPubs. I have been experimenting with knitr but can't yet get the html from R Markdown to work with my blogger or wordpress blogs.

ESRC funded Bayesian data analysis workshops for social scientists

2017-01-24T21:16:00.004+00:00

The third and (possibly) final round of the workshops is open for booking. As with the last round we are planning a free R workshop beforehand (recommended if you need a refresher or have never used R before), but can't offer bursaries for this.

More details at: http://www.priorexposure.org.uk/

This is part of the ESRC Advanced Training Initiative.

The first two workshops are available for booking now (though places are filling up quite fast). They are primarily (but not exclusively) aimed at UK social science PhD students (so not just Psychology, but very much also Sociology, Criminology, Politics and other social science disciplines). We hope the workshops will also appeal to early career researchers and others doing quantitative social science research (but with little or no Bayesian experience).

The ESRC is supporting us with bursary funding for travel and subsistence (see web site for details). These are available to all UK social science PhD students (not just those with ESRC funding), but such funded places are limited.

If the demand is sufficient we may try and put on additional workshops this year (though maybe I'm being too optimistic!). We ran extra workshops last June for this reason.

The registration cost for each workshop is £10 (for postgrads) and £20 (for others) - the information is buried in the booking link but we'll try and make that clearer ... The workshops are non-profit so this fee is to cover basic running costs (e.g., lunch etc.).

ESRC Prior Exposure workshops: advanced Bayesian data analysis

2016-09-02T15:17:00.002+01:00

There are still a few places left on our September Bayesian Data analysis workshops held at Nottingham Trent University on September 15 and 16, 2016.

These are part of the ESRC's Advanced Training Initiative and are aimed at PhD students and researchers (postdocs, lecturers, etc.) in social sciences.

The fees are £20 per workshop (£10 for PhD students). A limited number of bursaries to cover travel expenses for students are also available.

Full details about the workshops, as well as the online booking system can be found here.

Announcements and news about these workshops are also made using our twitter account: @priorexposure

Apologies for the delay in announcing these here - owing to building work over the Summer and the usual holiday absences I wasn't able to post details earlier.

Stop Press: Additional dates for the 2016 Prior Exposure Bayesian Data Analysis workshops

2016-01-29T15:59:00.002+00:00

Places on the Easter Prior Exposure (introductory) workshops filled up very quickly and we had to turn away quite a few people. In response we’ve managed to arrange another set of events on 16 and 17 June (again with an optional R bootcamp on June 15th). Booking is now open:

http://www.priorexposure.org.uk/schedule

(Details of the R bootcamp are here)

Unfortunately these extra dates aren't covered by the ESRC funding so we are not able to offer bursaries and have had to raise the booking fee slightly (£15 for students and £25 for others). Workshops 3 and 4 (the more advanced topics) will run later in the year (September) and will offer some bursaries (for UK doctoral students). We are also running everything again in 2017 (our final year before funding runs out).

PLS think twice about partial least squares

2015-07-23T22:29:00.001+01:00

One of the great things about writing a statistics book was finding an excuse to read about dozens of topics that I knew a little about but hadn't got around to studying in depth. Even so, there were a number of topics I ended up missing out on completely (apparently once the book gets to over 900 pages or so they make you leave stuff out). One of those topics is partial least squares (PLS).

I knew a bit about the technique (but it turns out even less than I thought). I recently came across an excellent paper on partial least squares by Mikko Rönkkö, Cameron McIntosh and John Antonakis. The main thrust of the paper is simple - partial least squares is a widely used technique outside psychology, and it has been suggested that it should be more widely used within psychology. Rönkkö et al., however, argue that this is probably a bad idea. A very bad idea. Their case rests on two main arguments. First, that partial least squares is equivalent to a regression model using indicator variables to create weighted composite predictors. Second, that the benefits of partial least squares have been greatly overstated. In particular the claim that PLS can deal with measurement error seems simply to be false (as just creating composites from indicator variables can't do this). Worryingly, some implementations of PLS seem to have dangerous properties (notably one with a 100% false positive rate) and PLS generally seems to inflate Type I error for small effects. The latter property may give the impression of attenuating measurement error (but merely provides a bias that may sometimes counteract attenuation arising from measurement error).

The Rönkkö et al. paper is, I think, a model of clarity and implies that PLS is going to be of limited value to psychologists. I found the paper particularly interesting because I have mostly seen PLS advocated as a way of dealing with multicollinearity. This makes sense as multicollinearity can reasonably be handled by replacing predictors with composites. The main drawback of PLS, however, is that the composites are derived automatically by the PLS algorithm. This sort of 'black box' solution produces good prediction but can overcapitalise on quirks in the sample and thus may not generalise (especially for small samples). More importantly, the composites may well be uninterpretable. For most psychological applications I'd rather use an interpretable but 'non-optimal' composite (e.g., a simple average of highly correlated predictors) than go down this route.

For the same reason I'd generally rather not use MANOVA (which finds an optimum linear combination of DVs in your sample). Of common analytic methods MANOVA is one of the least well understood techniques in psychology (and I have rarely seen a published application of MANOVA that wouldn't be enhanced using a different, often simpler, technique).

Prior exposure workshops 3 and 4 (Bayesian data analysis for social scientists)

2015-07-17T09:17:00.001+01:00

Booking is now open for workshops three and four of our Prior Exposure Bayesian data analysis training (all taking place in Nottingham). The dates are 22 and 23 September 2015.

These follow on from the first two workshops but if you have some training in regression (especially multilevel regression) and familiarity with Bayesian statistics this is roughly where workshop three will start.

Workshop 3: Introduction to advanced Bayesian data analysis. This workshop focuses on advanced probabilistic modeling in Bayesian data analysis, and in particular, Bayesian data analysis using multilevel regression models.

Workshop 4: Nonlinear and latent variable models. This final workshop focuses on Bayesian latent variable modeling, particularly using mixture models.

Further details can be found here:

http://www.priorexposure.org.uk

Fees are £20 per workshop (£10 for PhD students) and some ESRC bursary funding is available for UK social sciences PhD students.

Thom

Prior exposure: Bayesian data analysis workshops (ESRC Advanced Training Initiative)

2015-02-09T13:41:00.001+00:00

Mark Andrews and I have just launched the web site for our Prior Exposure Bayesian Data Analysis workshop series. This is part of the ESRC Advanced Training Initiative.

Further details are available here.

The first two workshops are available for booking now (though places are filling up quite fast). They are primarily (but not exclusively) aimed at UK social science PhD students (so not just Psychology, but very much also Sociology, Criminology, Politics and other social science disciplines). We hope the workshops will also appeal to early career researchers and others doing quantitative social science research (but with little or no Bayesian experience).

The ESRC is supporting us with bursary funding for travel and subsistence (see web site for details). These are available to all UK social science PhD students (not just those with ESRC funding), but such funded places are limited.

We will run similar workshops next year and if the demand is sufficient we may try and put on additional workshops this year (though maybe I'm being too optimistic!).

Update: As of writing the registration cost for each workshop is £10 (for postgrads) and £20 (for others) - the information is buried in the booking link but we'll try and make that clearer ... The workshops are non-profit so this fee is to cover basic running costs (e.g., lunch etc.) and we will try and keep these low costs for subsequent workshops.

Guest post: PNAS, facebook and the ethics of online experimentation

2014-07-04T22:38:00.000+01:00

This is a guest blog post by Gerry Markopoulos. I'm posting it because I think it is an important topic that deserves wider discussion.

Recently, an article was published in the prestigious journal ‘Proceedings of the National Academy of Sciences’ (PNAS), titled ‘Experimental evidence of massive-scale emotional contagion through social networks’. The article was published online on the 2nd of June, 2014, and it is available here.

I would like to argue that the article needs to be retracted on the basis of violating fundamental ethical principles, it should not have been considered for publication in the first place (on the basis of the journal’s stated principles), and that it could damage the reputation of psychology on an international level. The scientific community’s disapproval needs to be made explicit in order to safeguard the public’s trust in its work and procedures especially when involving human participants.

I am very happy to report that the BPS has responded to the publication in a very timely and unambiguous fashion via a letter to The Guardian. The letter makes clear what ethical principles were violated, and how. It would have been perhaps more practical to list the principles that were not violated. (It might have been a far shorter list!) Understandably, I presume the BPS cannot go any further than merely condemn the article, considering the authors are based in the USA, and PNAS is a US journal. I will not pretend to know anything about the legal aspect of the situation, but legality is largely irrelevant. When psychologists and other scientists propose projects to ethics committees, they are not looking for legal loopholes. They are looking to protect their participants from any harm whether or not there is a legal provision for it. That is partly the role of ethics committees, to anticipate and try to predict how research could violate wide ethical principles such as ‘maximising benefits and minimising harm’, especially where innovative research is concerned.

The only mention of ethical issues in the PNAS article is the following:

“…it was consistent with Facebook’s Data Use Policy, to which all users agree prior to creating an account on Facebook, constituting informed consent for this research.”

One can easily see that this constitutes consent, but certainly not informed consent. It is my understanding that participants were not aware they were taking part in a psychological study, they had not been informed of the nature of the study, they had not been informed of their right to withdraw their data, and they were not debriefed. Furthermore, no steps were taken to ensure the continuing wellbeing of the participants considering the sensitive nature of the experimental manipulation – according to the article, the manipulation led to a successful induction of negative emotions. There were no exclusion criteria protecting vulnerable populations (such as depressive or emotionally unstable participants). Such issues would be raised by any informed ethics committee. Why weren’t these issues raised? One can only assume that no ethics committee approved the project. This is the only logical explanation available. I personally contacted the first author on the 29th of June requesting clarifications, but – perhaps not surprisingly – I never received a response. At this point, I should concede that fully informed consent could compromise the outcome of the study, but such a (perhaps) necessary omission ought to be counteracted with an extensive and carefully-worded debrief minimising the risk of potential harm to participants. Having said that, in the quote above, the authors claim that accepting the terms and conditions constitutes informed consent, which it certainly does not.

The issue of the ethics committee approval (or lack thereof) leads me to what we can do as individuals to protect psychology and the reputation of the scientific community. According to the PNAS website when research with human participants is involved:

“Authors must include in the Methods section a brief statement identifying the institutional and/or licensing committee approving the experiments. For experiments involving human participants, authors must also include a statement confirming that informed consent was obtained from all participants.”

In this case, PNAS appears to have ignored its own rules. On this basis, I contacted PNAS through their contact page firmly requesting the retraction of the article.

A few days after my email, I received a response from PNAS directing me to an editorial piece on this issue. The editorial confirmed earlier suspicions that no committee had scrutinized the research proposal. There had never been a research proposal to begin with. When the article was submitted for publication, the authors stated:

“Because this experiment was conducted by Facebook, Inc. for internal purposes, the Cornell University IRB [Institutional Review Board] determined that the project did not fall under Cornell’s Human Research Protection Program”.

To summarise the editorial response, it says:

“We were aware that no ethics committee had approved the project or the data collection method, we were aware that participants were not given the opportunity to opt out, but the company that collected the data is not obligated to adhere to such rules. Therefore, we published the data. However, we are concerned”.

At this stage, I would like to reiterate this is not an issue of legality. It is an issue of ethics where loopholes have no place. Now more than ever it is obvious that the article needs to be retracted.

Unfortunately, we cannot undo the harm that potentially has been caused by this research. Considering the sample size (over 600,000) and the reported significant effect of the experimental manipulation, it is possible that vulnerable participants were harmed. What we can do is demonstrate to the public that this type of research is not representative of what we do, and that we are as indignant as they are. Can we stop private companies from conducting research in secret? I would think this is unlikely. Secret research cannot be overseen by definition. However, scientists and scientific journals should actively stay away from data collected under questionable circumstances. Publication means condoning the research process from the design stage to the write-up stage. The condoning of unethical data collection methods (through publication) only encourages such practices. This is where a difference can be made, and that is why retraction of the specific article is essential.

Multicollinearity and collinearity (in multiple regression) - a tutorial

2013-11-09T14:06:00.000+00:00

This blog post was written for undergraduate research methods teaching. I have therefore tried to keep everything relatively simple and equation-free. The content is based loosely on more detailed material in my book Serious stats.

What are collinearity and multicollinearity?

Collinearity occurs when two predictor variables (e.g., x₁ and x₂) in a multiple regression have a non-zero correlation. Multicollinearity occurs when more than two predictor variables (e.g., x₁, x₂ and x₃) are inter-correlated.

How common is collinearity or multicollinearity?

If you collect observational data or data from a non-experimental or quasi-experimental study, collinearity or multicollinearity will nearly always be present. The only studies where it won’t tend to occur (unless you are very, very lucky) are certain designed experiments – notably fully balanced designs such as a factorial ANOVA with equal n per cell. Thus the most important issue is not whether multicollinearity or collinearity is present but what impact it has on your analysis.

Do collinearity or multicollinearity matter?

I find it helpful to break down collinearity and multicollinearity into three situations, of which only the third is common.

[Note: From here on I’ll just use the terms collinearity and multicollinearity more or less interchangeably for convenience. This is also common practice in the literature]

1. Perfect collinearity. If two or more predictors are perfectly collinear (you can perfectly predict one from some combination of the others) then your multiple regression software will either not run (e.g., return an error) or it will drop one or more predictors (and possibly also return an error). Perfect collinearity happens a lot by accident (e.g., if you enter two versions of an identical variable such as the mean score and total score of a scale, or dummy codes for every category of a categorical predictor).
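
To make the mean score and total score example concrete, here is a minimal sketch in Python with NumPy. The 5-item scale and its responses are made up for illustration. It shows that a design matrix containing both the total and the mean has fewer independent columns than it appears to have, which is why software either stops or drops a predictor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 50 respondents answering a 5-item scale scored 1 to 5
items = rng.integers(1, 6, size=(50, 5)).astype(float)
total = items.sum(axis=1)
mean = items.mean(axis=1)  # exactly total / 5, so it carries no new information

# Design matrix with an intercept, the total score and the mean score
X = np.column_stack([np.ones(len(total)), total, mean])

# Three columns, but only two of them are linearly independent
rank = np.linalg.matrix_rank(X)
print(f"columns = {X.shape[1]}, rank = {rank}")
```

Because the rank is lower than the number of columns, there is no unique set of regression coefficients: any weight moved from the total to the mean (suitably rescaled) fits equally well.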

2. Almost perfect collinearity. If the correlation between predictors isn’t quite perfect (but is very close to r = 1) then this can sometimes cause “estimation problems” (meaning that the software you are using might not be able to run the analysis or might generate incomplete output). Most modern software can cope with this situation just fine – but even then estimates will be hard to interpret (e.g., be implausibly high or low and have very large standard errors). If your software can cope with this situation then technically the estimates will be correct and you just have an extreme form of situation 3 below.

3. Multicollinearity. As already noted, most situations in which you would use regression (apart from certain designed experiments) involve a degree of multicollinearity. So if a method section ever claims that “multicollinearity is not present”, generally this will be untrue. A better statement to make is something along the lines of “there were no problems with multicollinearity”. However, generally this will also be untrue. For this to be true, the degree of multicollinearity needs to be very small or the sample size very large (or both). Neither is common in psychological research.

To understand why, it is necessary to consider what impact multicollinearity has on:

i) the overall regression model,
ii) estimates of the effects of individual predictors.

The good news

Multicollinearity has no impact on the overall regression model and associated statistics such as R², F ratios and p values. It also should not generally have an impact on predictions made using the overall model. (The latter might not be true if the predictor correlations in the sample don’t reflect the correlations in the situation you are making predictions for – but that isn’t really a multicollinearity issue, but a consequence of having an unrepresentative sample).

The bad news

Multicollinearity is a problem if you are interested in the effects of individual predictors. This turns out to be a major issue in psychology because this is probably the main reason that psychologists use multiple regression: to tease apart the effects of different predictors. There are two main (albeit related) issues here: the first is a philosophical problem and the second is a statistical one.

The philosophical issue. If two or more predictors are correlated then it is inherently difficult to tease apart their effects. For instance, imagine a study that looks at the effect of happiness and depression on alcohol consumption. If happiness is highly correlated with depression (e.g., r = -.90) then regression commands in packages such as SPSS or R will come up with estimates of the unique effect of happiness on alcohol consumption (by holding depression constant). This estimate is an estimate of the effect of happiness that ignores their shared variance. However, depression isn’t generally constant if happiness varies; they tend to vary together.

The philosophical issue is this: is it meaningful to interpret the unique impact of happiness if happiness and depression are intimately related. Although this philosophical issue is potentially important, researchers often tend to ignore it. The main advice here is to think carefully before trying to interpret individual effects if there is a high level of multicollinearity in your model.

The statistical issue. The underlying statistical issue with multicollinearity is fairly simple. The unique effects of individual predictors are estimated by holding all other predictors constant and thus ignoring any shared variance between predictors. A regression model uses information about the variation between predictors and the associated variation in the outcome (y variable) to calculate estimates. As n (the number of participants or cases, i.e., the sample size) increases, you have more information and the statistical power of the analysis is greater. You also get more information from cases or participants that are more variable relative to each other. So a participant who is more extreme on a predictor has a bigger impact on the analysis than one that is less extreme. If multicollinearity is present then each data point tends to contribute less information to the estimate of individual effects than it does to the overall analysis. (Holding the effects of other predictors constant effectively reduces the variability of a predictor and thus reduces its influence).

Multicollinearity therefore reduces the effective amount of information available to assess the unique effects of a predictor. You can also think of it as reducing the effective sample size of the analysis. For instance, in the happiness and depression example (where r = -.90) the two predictors share (-.90)² = .81 or 81% of their variance. Thus the tests of their unique effects use only (100 – 81) = 19% of the information (about a fifth) in the overall model and thus the effective sample size is over 5 times smaller.
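
The arithmetic in this example can be checked with a few lines of Python. The correlation of -.90 comes from the example above; the sample size of 100 is a made-up figure used only to show what "over 5 times smaller" means in practice.

```python
r = -0.90                       # correlation between happiness and depression
shared = r ** 2                 # proportion of variance the two predictors share
unique = 1 - shared             # proportion left for estimating unique effects
shrink_factor = 1 / unique      # how many times smaller the effective sample is

n = 100                         # hypothetical number of participants
effective_n = n * unique        # rough effective sample size for unique effects

print(f"shared variance = {shared:.2f}")
print(f"unique information = {unique:.2f}")
print(f"effective sample is {shrink_factor:.2f} times smaller")
print(f"n = {n} behaves roughly like n = {effective_n:.0f}")
```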

Thus the fundamental statistical impact of multicollinearity is to reduce effective sample size and thus statistical power for estimates of individual predictors. It is worth looking at each of the main statistics in turn:

b (the unstandardized slope) – this parameter estimate remains unbiased, but is estimated less accurately when multicollinearity is present (i.e., its standard error is larger)

β (the standardized slope) – this parameter estimate remains unbiased, but is estimated less accurately when multicollinearity is present (i.e., its standard error is larger)

t (the t test statistic) – this is the ratio of the estimate to its standard error and thus will be smaller (and further from statistical significance)

95% CI (the 95% confidence interval) – this is the estimate plus or minus approximately two standard errors, thus the CI will be wider (reflecting greater uncertainty in the estimate)
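
A small simulation can illustrate the first of these points: the slope stays unbiased but is estimated less precisely. This is only a sketch under assumed values (true slopes of 0.5, n = 100, normal errors, predictor correlations of 0 and .90), none of which come from a real study. It fits the same two-predictor model many times and compares how much the estimate of the first slope varies from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(2013)

def simulate_b1(rho, n=100, reps=2000, b1=0.5, b2=0.5):
    """Fit y ~ x1 + x2 on many simulated samples; return the b1 estimates."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = np.empty(reps)
    for i in range(reps):
        # Two standard normal predictors with population correlation rho
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = b1 * x[:, 0] + b2 * x[:, 1] + rng.normal(size=n)
        design = np.column_stack([np.ones(n), x])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates[i] = coef[1]  # slope for x1
    return estimates

uncorrelated = simulate_b1(rho=0.0)
correlated = simulate_b1(rho=0.9)

# Both sets of estimates should centre on the true value of 0.5 ...
print("mean b1, r = 0:  ", round(uncorrelated.mean(), 3))
print("mean b1, r = .90:", round(correlated.mean(), 3))

# ... but the spread (the empirical standard error) is larger when r = .90
sd_ratio = correlated.std() / uncorrelated.std()
print("ratio of standard errors:", round(sd_ratio, 2))
```

With r = .90 only 19% of the predictor variance is unique, so the ratio of standard errors should come out somewhere near the square root of 1/.19 (a little under 2.3).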

Stability of estimates. Many textbooks refer to problems with the stability of estimates when multicollinearity is present. What this means is that estimates will jump around a lot if you add or drop predictors, or if you fit the same model to different data sets. This isn’t really a separate issue – just a logical consequence of having a smaller effective sample size. Any estimate based on a small effective sample size will be unstable in this sense. Statistics from small (effective) samples tend to be less similar to the population values than statistics from large samples.

Detecting problems with multicollinearity

Two predictors. A natural starting point is to look at the simple correlations between predictors. If you have only two predictors this is sufficient to detect any problems with collinearity: if the simple correlation between the two predictors is zero then there is no problem. If the correlation is low then collinearity is probably just a minor nuisance – but will still reduce statistical power (meaning that you are less likely to detect an effect and the effect will be measured less accurately). A larger correlation indicates a more serious problem. Working out how severe the problem is isn’t that easy and it is generally a good idea to use a collinearity diagnostic such as tolerance or VIF for this purpose.

More than two predictors. With more than two predictors the simple correlations between predictors can be misleading. Even if they are all very low (and unless they are exactly zero) they could conceal important multicollinearity problems. This will happen if the predictors’ correlations don’t overlap – and thus they have a cumulative effect (e.g., if the correlation between x₁ and x₂ explains a different bit of the variance in the outcome y than the correlation between x₁ and x₃).

Fortunately there are a number of multicollinearity diagnostics that can help detect problems. I will focus on perhaps the simplest of these: tolerance and VIF.

Tolerance. One way to think of tolerance is that it is the proportion of unique information that a predictor provides in the regression analysis. To calculate the tolerance you first obtain the proportion of predictor variance that overlaps with the other predictors. You then subtract this number from 1. For example, if the other predictors explain 60% of the variance in x₁ then the tolerance of x₁ (in a model with those predictors) is 1 – .6 = .4. Tolerance of 1 indicates no multicollinearity (for that predictor) and tolerance values approaching 0 indicate a severe multicollinearity problem.

Tolerance indicates how much information multicollinearity has cost the analysis. Thus tolerance of .4 indicates that parameter estimates, confidence intervals and significance tests for a predictor are only using 40% of the available information.

VIF. The VIF statistic of a predictor in a model is merely the reciprocal of its tolerance (i.e., VIF = 1/tolerance). So if tolerance is .4 then the VIF is 1/.4 = 2.5. VIF stands for variance inflation factor. This number indicates how much larger the error variance for the unique effect of a predictor is (relative to a situation where there is no multicollinearity). The VIF can also be thought of as the factor by which your sample size needs to be increased to match the efficiency of an analysis with no multicollinearity. So a VIF of 2.5 implies that you’d need a sample size 2.5 times larger than the one you actually have to overcome the degree of multicollinearity in your analysis.
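
The definitions of tolerance and VIF translate fairly directly into code. The sketch below uses Python with NumPy; the function name tolerance_and_vif and the simulated predictors are illustrative assumptions rather than part of any package. Each predictor is regressed on all the others, and the proportion of its variance left unexplained is its tolerance. The made-up data also illustrate the earlier point about more than two predictors: the ninth predictor is built from the other eight, so no single pairwise correlation is large, yet its tolerance is very low.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return (tolerance, VIF) arrays with one entry per column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        # Tolerance = 1 - R squared = unexplained share of the predictor's variance
        tol[j] = resid.var() / target.var()
    return tol, 1 / tol

# Hypothetical data: eight unrelated predictors plus a ninth built from their sum
rng = np.random.default_rng(42)
n = 500
base = rng.normal(size=(n, 8))
last = base.sum(axis=1) + 0.5 * rng.normal(size=n)
X = np.column_stack([base, last])

pairwise = np.corrcoef(X, rowvar=False)
largest_r = np.abs(pairwise[np.triu_indices(X.shape[1], k=1)]).max()
tol, vif = tolerance_and_vif(X)

print("largest pairwise |r|:", round(largest_r, 2))
print("tolerance:", np.round(tol, 2))
print("VIF:", np.round(vif, 1))
```

For real analyses it is probably safer to rely on the diagnostics built into your statistics package; the point here is only to show what those numbers mean.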

Remedies

The best remedy for multicollinearity is either: i) to design a study to avoid it (e.g., using an appropriate experimental design), or ii) increase your sample size to make your estimates sufficiently accurate. If these are not feasible there are other options that may be helpful (but which can also be harmful).

Dropping a predictor. Generally this is a bad option (though many textbooks recommend it). The reason it is usually a bad idea is that it hides the problem rather than solving it. For instance, if x₁ and x₂ are moderately correlated it is quite possible that each of them significantly predicts y on its own but neither unique effect is statistically significant when both are in the model. Dropping x₁ will thus make it look as though x₂ is predicting y on its own (and vice versa). The true state of affairs is that they are jointly predicting y and that their precise individual contribution to this joint prediction is unknown.

Worse still, dropping a predictor can be actively misleading (e.g., if you select the predictor you drop so that the final model supports your favoured hypothesis or theory). Sometimes dropping a predictor is a somewhat reasonable thing to do. One situation is when two variables are measuring more or less the same thing. For instance, if you have two measures of trait anxiety and they are highly correlated, it may well be reasonable to drop one of them (though in this case there are still other, better options). Another situation is when you believe one variable is just a proxy for the other. For instance, age and school year are highly correlated and both are predictors of arithmetic ability. In this case age may just be a proxy for school year (on the assumption that arithmetic is taught rather than acquired spontaneously as you age).

Combining or transforming predictors. If you have highly correlated predictors it is usually better to combine them in some way rather than drop them from the analysis. There are many ways that predictors could be combined (and statistical procedures such as factor analysis exist that are designed to do exactly this). However, even crude methods such as adding predictors together (or averaging them) can be surprisingly effective (though it may be necessary to rescale them if they are not on the same scale). Other options may also suggest themselves (e.g., using the difference between predictors or some weighted combination) depending on the theory motivating your model.

Do nothing. Sometimes the best thing to do is nothing. You may just wish to honestly report that a set of predictors jointly predicts some outcome and that more data are required to tease their individual effects apart. Alternatively, you may not care that some of your predictors are highly correlated. For instance if you have some predictors of theoretical interest and some that are not (e.g., because they are potential confounding variables), as long as the predictors you are interested in have high tolerance it won’t matter if the other predictors have low tolerance. Such predictors are sometimes called nuisance variables – and what matters is that you have dealt with them in some way (not whether you have estimated them accurately).

Conclusions

There are four main conclusions to take from this tutorial:

1) Multicollinearity is nearly always a problem in multiple regression models

2) Even small degrees of multicollinearity can cause serious problems for an analysis if you are interested in the effects of individual predictors

3) Small samples are particularly vulnerable to multicollinearity problems because multicollinearity reduces your effective sample size for the effects of individual predictors

4) There are no ‘easy’ solutions (e.g., dropping predictors is generally a bad idea)

Update

I have a short note on my book blog about getting multicollinearity diagnostics in R.

Cronbach to the future

2013-08-06T19:58:00.001+01:00

One fascinating thing about working in the area of psychological statistics is how hard it is to move people away from reliance on bad, inefficient or otherwise problematic methods. My own view - informed to some extent by the literature, by experience and by anecdote - is that it isn't sufficient merely to establish that the standard approach is wrong. It isn't even sufficient to provide an obviously superior alternative. You also need to do three other things: i) get the message out to the people using the method, ii) reduce barriers to implementing the method (provide user-friendly software, easy to understand tutorials and so forth), and iii) get the new method taught at undergraduate or masters level. A good illustration is the need to provide confidence intervals (CIs) as well as point estimates of statistics. This has been advocated for decades and has only relatively gradually trickled through to standard practice. In addition, CIs are commonly reported only where popular software such as SPSS reports them by default. For instance, few psychology papers report a CI for the correlation coefficient r (probably because it isn't in many introductory texts and isn't part of the default SPSS output).

A case in point is the problem of internal reliability estimation. There are dozens of papers in the psychometrics literature that have shown that the most popular internal consistency reliability measure, coefficient alpha (or Cronbach's alpha) is seriously flawed. A number of alternative approaches or measures have been proposed that are relatively easy to estimate and have good properties when applied to scales in psychology. However, these measures rarely get used in practice. The main barriers here are probably awareness of the problem and availability of appropriate software. My guess is that once these barriers are reduced then alternatives to alpha will also get into text books and be more widely taught.

Tom Dunn (a former PhD student) has just written a paper (co-authored with myself and Viv Brunsden) aiming to change people's attitude to coefficient alpha. This has just been accepted in the British Journal of Psychology. In it we try to summarize with as little jargon as possible the criticisms of coefficient alpha and recommend a simple alternative: McDonald's coefficient omega (McDonald, 1999). Crucially we also provide a mini-tutorial on calculating omega using R. We chose R mainly because it is free, open source and runs on Mac, PC and Linux systems. A further, major advantage is that the MBESS package will estimate a bootstrap CI for omega. A reliability estimate (of any kind) is pretty useless if presented as a point estimate because it could be measured very imprecisely. In many cases the lower bound of the 95% CI is a more useful guide to whether a test is reliable. The lower bound will usually be conservative but it is better to be safe than sorry in most cases.

A pre-print of the paper (links to the online version will be added as soon as they are available) can be found here. The R script that runs the example in the paper can be accessed here. The data sets (in a zipped folder called "omega example") can be downloaded here. Unzip this folder and put it on your desktop. (If you move it elsewhere you need to specify the path in the R code or change the R working directory to the folder where the data files are located.) You can also download the .csv formatted data file directly from here.

References

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

Dunn, T., Baguley, T., & Brunsden, V. (2013, in press). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology.

McDonald, R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Lawrence Erlbaum Associates.

Why faking data is bad ...

2013-06-21T16:32:00.003+01:00

It never occurred to me until today to write a post about why faking data is bad. However, I noticed an interesting exchange on Andrew Gelman's blog (see the comments on this post about Marc Hauser). One commenter argued that it was not clear that Hauser had faked his data (though I don't think that plausible given the results of the investigations and Hauser's dismissal from Harvard), and - more interestingly - that any data fraud was not serious because his supposedly fraudulent work has been replicated. This argument is in my opinion deeply flawed.

Andrew Gelman's response was:

To a statistician, the data are substance, not form.

I would generalize that to all of science. We'd certainly be better off thinking about data collection and analysis as integral to doing science rather than merely a necessary step in publishing papers, getting tenure or generating outputs for government research assessments.

Replication in this context just means getting the direction of effect correct. When you fake data you mess up the scientific record in multiple ways. A replication doesn't solve this or remove the distortion. For instance a common problem in meta-analysis is that people republish the same data two or more times (e.g., writing it up for different journals or through publishing interim analyses or salami slicing). This can be very hard to spot through accidental or deliberate obscuring of data sources. The upshot is that any biases or quirks of the data are magnified in the meta-analysis. Publishing fake data is worse than this because the biases, quirks, effect sizes, moderator variables are made up. Even publishing an incorrect effect size could be hugely damaging. In fact, most problems with medical and other applied research are related to effect size rather than presence of an effect.

Furthermore, the replication defense (albeit flawed in practice) has additional problems. One is that the replication probably isn't independent of the fake result. It is hard to publish failed replications - and researchers will be more lenient in their criteria to decide that they have replicated an established effect (e.g., using a one tailed test or re-running a failed replication on the assumption that they were unlucky). The most obvious problem is that you can't be sure in advance that the effect is real unless you run the experiment in the first place. I have run several experiments that have failed to show an effect or have gone in the opposite direction from what I believe.

Faking data is a bad idea - even if you are remarkably insightful (and undoubtedly Hauser was clever) - because the real data are a necessary part of the scientific process. Making up data distorts the scientific record.

Serious stats: using multilevel models to get accurate inferences for repeated measures ANOVA

2013-06-13T12:55:00.001+01:00

This article from my other blog may be of interest to readers of this blog: http://seriousstats.wordpress.com/2013/04/18/using-multilevel-models-to-get-accurate-inferences-for-repeated-measures-anova-designs/

Neuroscience, statistical power and how to increase it

2013-04-21T20:50:00.000+01:00

There has been quite a bit of buzz recently about the Button et al. Nature Reviews Neuroscience paper on statistical power. Several similar reviews have been published in psychology and other disciplines and come to broadly the same conclusion - that most studies are underpowered. The main difference with the Button et al. study is that they don't just find that typical studies are underpowered to detect the average size of effect in a field, but they find extremely low power in neuroscience research (around 20%, and below 10% for some subfields). Contrast this with a typical review from psychology and related disciplines. Sedlmeier and Gigerenzer (1989, Table) report power to detect a medium effect size ranging from 37% to 89%. David Clark-Carter (1997) reviewed papers in the British Journal of Psychology and found power to detect a median effect of 59%. Thus the power of typical research in psychology is not that high, but (if we make fairly reasonable assumptions about the size of typical effects in the discipline) estimates appear to be around 60% rather than the 20% found in the Button et al. paper. What caught my interest, however, was some of the responses to the publication in blogs and blog comments. For example one of the comments on Ed Yong's piece stated:

Another argument for parallel recording. Traditional, one-neuron-at-a-time neurophysiological papers study 10s of neurons. Multi-electrode studies have 100s or 1000s of neurons. Enough power? Maybe not, but way more power than single neuron recording.

A similar sentiment arises in Matt Wall's piece:

MRI scanners have significantly improved in the last ten years, with 32 or even 64-channel head-coils becoming common, faster gradient switching, shorter TRs, higher field strength, and better field/data stability all meaning that the signal-to-noise has improved considerably. This serves to cut down one source of noise in fMRI data – intra-subject variance. The inter-subject variance of course remains the same as it always was, but that’s something that can’t really be mitigated against, and may even be of interest in some (between-group) studies. On the analysis side, new multivariate methods are much more sensitive to detecting differences than the standard mass-univariate approach.

Matt's piece is thoughtful and I would agree with much of what he writes, but the idea that increasing observations within a person will do much to resolve the problem is probably not correct (and for reasons that Matt mentions). To understand why, consider the typical nature of the experimental designs being used. As I understand it there are essentially two main types of design: a nested repeated measures design or a factorial design with fully crossed random effects. There are many variants (e.g., additional layers of nesting, additional fully crossed random factors), but the aforementioned characteristics capture most of the designs I'm familiar with in cognitive neuroscience (and possibly in many other areas of neuroscience).

In a nested repeated measures design there are m multiple measurements within each of n persons. The multiple measurements are correlated in some way so - in general - the design has an effective sample size that is less than N (where N = n * m). It turns out that for most such designs the limiting factor in power and precision is n and not m or N.
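
One way to see why n rather than m is the limiting factor is to look at the standard error of a grand mean in the simplest nested design. The sketch below assumes a basic random-intercept model in which each person has their own true mean (between-person SD) and each measurement adds independent noise (within-person SD). The function name and the SD values are made up for illustration.

```python
import math

def se_of_mean(n, m, sd_between=1.0, sd_within=3.0):
    """Standard error of the grand mean with n persons and m measurements each.

    Assumes a simple random-intercept model: the between-person variance is
    divided by n only, whereas the within-person variance is divided by n * m.
    """
    return math.sqrt(sd_between ** 2 / n + sd_within ** 2 / (n * m))

# Adding measurements per person (m) runs into a floor set by n ...
for m in (10, 100, 1000, 10000):
    print(f"n = 20, m = {m:>5}: SE = {se_of_mean(20, m):.4f}")

floor = 1.0 / math.sqrt(20)  # sd_between / sqrt(n): the limit as m grows
print(f"floor for n = 20: {floor:.4f}")

# ... whereas adding persons (n) keeps helping
print(f"n = 40, m =    10: SE = {se_of_mean(40, 10):.4f}")
```

Under these assumed values, 40 people with 10 measurements each give a smaller standard error than 20 people with 1000 measurements each, even though the second design has fifty times as many observations.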

This isn't always true, but generally experimental designs get refined quite quickly to reduce the impact of sources of error in the repeated measurements. This could be increasing the number of trials or tightening up the experimental procedures (e.g., instructions, quality of materials) or by technical advances that reduce measurement error for each measurement occasion. Once you get measurement error per trial moderately low, improving measurement error further has very little impact on power. That's because the error at each measurement occasion includes transient error that can't really be eliminated (many behaviours are just inherently variable from occasion to occasion) and because as you reduce these errors the other sources of error in the study become the main limiting factors on power.

For example, when I was a PhD student many reaction time experiments used computers with dodgy clocks that couldn't time more accurately than 1/60th of a second or around 17 ms (and perhaps many still do). If you are looking for a priming effect of say 30 milliseconds this would seem like a major problem. However, you can get pretty accurate inferences without much bias or loss of power as long as the variability of the RTs is fairly large - which it generally is (Ulrich & Giray, 1989). For most neuroscience work involving humans the limiting factors in power (once you are dealing with a reasonably refined experimental set-up) are therefore related to n. A further consideration is that top level n generally needs to be in the 30-50 range or (preferably) greater just to get vaguely reasonable estimates of the variances and covariances if you are dealing with data sampled from approximately normal distributions. Smaller samples also make the study more vulnerable to an atypical 'outlier' at the person level (e.g., a participant using a weird strategy or responding randomly) or to selective bias by the experimenters (dropping a 'noisy' participant because they go against the hypothesis). Having small n at the top level may also make focus on statistical significance rather than interval estimates of effects more attractive (because it reduces precision of measurement). In other words it encourages studies that find 'evidence' of an effect and discourages focus on accurate estimates of the size of an effect.

For fully crossed random factor designs the situation is worse. In these designs you sample both people and stimuli (e.g., faces, words, etc.) from a large (conservatively assumed to be infinite) population. The limiting factor on power now probably depends not on n1 (the number of people) or n2 (the number of stimuli) but on the smaller of n1 and n2 (assuming you want to make inferences that generalise to people and stimuli not in your experiment). Thus having 1000 people has little effect on power if your study uses only two faces (and you want to make general inferences about face perception rather than perception of those two faces). This is a slight oversimplification - as it assumes that the stimuli and people are equally variable in terms of what you measure - however it is a good rule of thumb unless variability in either people or stimuli is large enough to swamp the other source.

There is also an important caveat here - I'm assuming that you do the statistics correctly. Many, many studies still analyse fully crossed random factor designs as if they are nested, resulting in spuriously high power (see here for an earlier blog post on this).

This analysis should hold whenever: i) the basic experimental procedure is fairly well-refined, ii) variability between people (or stimuli in appropriate designs) on the measures of interest is non-negligible. Thus it should hold more often than not in psychology and related areas of neuroscience. There are undoubtedly subfields in which it won't hold (e.g., some areas of vision research where n = 2 studies are common because individual differences on the crucial effects are low).

Postscript

One objection to my conclusion is that if neuroscience power is limited by number of participants and number of stimuli, why do small samples persist? This is a good question. I offer three main answers: i) As with psychology (where power is also generally low, remember) you can have low power for each test if you have multiple tests. Maxwell (2004) pointed out that a typical 2 x 2 factorial design might only have 50% power per test but that means 87.5% chance of at least one significant result. Thus low power generally produces something statistically significant (though it also predicts that replications will generally fail to show consistent patterns of statistical significance), ii) researcher degrees of freedom (see Simmons et al., 2011), and iii) many research teams run many small studies (e.g., undergraduate and masters projects) so (in some cases) there are many unreported studies with null results.
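
The 87.5% figure can be reproduced in a couple of lines. The sketch assumes that a 2 x 2 factorial design yields three tests (two main effects and an interaction), that each has 50% power, and that the tests are independent. The independence assumption is a simplification made here for illustration.

```python
power_per_test = 0.5   # assumed power of each individual test
n_tests = 3            # two main effects plus the interaction in a 2 x 2 design

# Probability that every test misses, and its complement
p_all_miss = (1 - power_per_test) ** n_tests
p_at_least_one = 1 - p_all_miss

print(f"P(at least one significant result) = {p_at_least_one:.3f}")
```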

References

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.

Ulrich, R., & Giray, M. (1989). Time resolution of clocks: Effects on reaction time measurement - Good news for bad clocks. British Journal of Mathematical & Statistical Psychology, 42, 1–12.

Reflecting on the end of history illusion illusion

2013-04-10T21:38:00.002+01:00

A while back Jon Sutton at The Psychologist asked my opinion on the end of history illusion. This was sparked by an interesting Science paper by Quoidbach, Gilbert and Wilson. Blogger and mathematician Jordan Ellenberg had written a blog post arguing that the paper makes a mistake: "a somewhat subtle mistake, but a bad mistake, and one which kills a big chunk of the paper".

Jon wanted a second opinion, and after a bit of reading I replied that Ellenberg's criticisms were valid. I meant to blog about it at the time but got caught up in other things. Consequently I missed the BPS Research Digest piece on it.

I am writing this blog post because the flaw that Ellenberg spotted is quite interesting in its own right and because neither the description by Ellenberg nor the description in the Research Digest article explains it clearly enough for some readers to appreciate. Ellenberg's piece is (I hasten to add) crystal clear but relies on a reader being comfortable with the formal, mathematical approach he takes (which many psychologists won't be). The Research Digest description just gives the brief gist (with a link to Ellenberg for the full picture). Here is my belated attempt at a psychologist-friendly interpretation with no formal notation - and as little maths as possible.

According to the end of history illusion people underestimate how much they will change in the future. For example, someone asked to predict how their personality would change in the next ten years would come up with a prediction closer to their current position than their actual future position turns out to be. Quoidbach et al. tested this mainly by asking people to predict future values on some psychological variable (e.g., a personality test score) and then showing that actual change is much greater than the difference between the original and predicted scores. This seems highly plausible, but Ellenberg pointed out that the difference between the predicted and original scores is a different quantity from the expected (absolute) change in scores.

Why is this? Perhaps the easiest way to understand is to work through a simple example. Imagine that my extraversion score is 50 on a scale that goes from 0 (extremely introverted) to 100 (extremely extraverted). A researcher then asks me to predict my extraversion score in 10 years' time. I, being a keen observer of human nature (bear with me on this if you know me - it is just an example), am aware that personality is not fixed and judge that I am likely to change quite a bit - say 15 points - on the scale. However, I might get more extraverted or I might get more introverted (depending on how life treats me over the next ten years). Given that I'm in the middle of the scale, I could end up with a score of 35 or a score of 65. Thus I predict that my extraversion score after 10 years will be (35 + 65)/2 = 50. It looks as though I've predicted zero change, when what I've done is give the best prediction I can (one that minimizes my prediction error). Had I instead been asked to give the absolute change I expected, my answer would have been different. It would have been (15 + 15)/2 = 15 (not zero).
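The same example can be written out as a few lines of Python. This is only a sketch of the arithmetic above, assuming the two outcomes (35 and 65) are equally likely; the variable names are mine.

```python
from statistics import mean

current_score = 50
possible_future_scores = [35, 65]  # assumed equally likely, as in the example

# Best single prediction of the future score (the average outcome)
predicted_score = mean(possible_future_scores)

# What the original analysis measures: predicted score minus current score
change_in_prediction = abs(predicted_score - current_score)

# What I actually expect: the average size of the change, ignoring direction
expected_absolute_change = mean(
    abs(score - current_score) for score in possible_future_scores
)

print(change_in_prediction)      # 0
print(expected_absolute_change)  # 15
```

The two quantities only coincide when I am sure about the direction of change (e.g., if both possible outcomes were above 50).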

Although the example is simple it captures the essence of the problem. Commenters on Ellenberg's blog looked again at the raw data that Quoidbach et al. provided. According to their analyses the end of history illusion largely disappears when analyzed correctly (though only some of the data sets support such a reanalysis). Thus if the end of history illusion effect exists (and the basic premise seems highly plausible) it is quite probably a much smaller and more fragile effect than originally thought. That makes sense to me - because I'm not sure that such a bias could be both pervasive and large in the face of the counter-evidence available to people about past change in themselves and change in others.

My continued interest in the effect is slightly different. There seems to be a cognitive illusion at work here - one that makes the difference between the original score and predicted score appear to be a good measure of an entirely different quantity - the expected absolute change in score ...

The growth of Bayesian methods in psychology

2013-01-28T21:27:00.001+00:00

The British Journal of Mathematical and Statistical Psychology has published a target article (with commentaries and reply) by Andrew Gelman and Cosma Shalizi on philosophy and the practice of Bayesian statistics.

Mark Andrews and I introduce the target article with an editorial aimed at providing some background to psychologists who are interested in Bayesian statistics but need a little back story. Our main aim was to try and indicate that the debate about Bayesian statistics has moved on from the frequentist vs. Bayesian argument and on to more interesting territory - illustrated both by the target article and the commentaries.

Also I believe that as of writing access is free to the target article and commentary ...

Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian methods in psychology. British Journal of Mathematical and Statistical Psychology, 66, 1–7. doi:10.1111/bmsp.12004

Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38. doi:10.1111/j.2044-8317.2011.02037.x
