The post R-Squared for Mixed Effects Models appeared first on The Analysis Factor.

*By Kim Love*

When learning about linear models—that is, regression, ANOVA, and similar techniques—we are taught to calculate an R^{2}. The R^{2} has the following useful properties:

- The range is limited to [0,1], so we can easily judge how relatively large it is.
- It is standardized, meaning its value does not depend on the scale of the variables involved in the analysis.
- The interpretation is pretty clear: It is the proportion of variability in the outcome that can be explained by the independent variables in the model.

The calculation of the R^{2} is also intuitive, once you understand the concepts of variance and prediction. One way to write the formula for R^{2} from a GLM is

R^{2} = Σ_{i}(ŷ_{i} − ȳ)^{2} / Σ_{i}(y_{i} − ȳ)^{2}

where y_{i} is an actual individual outcome, ŷ_{i} is the model-predicted outcome that goes with it, and ȳ is the average of all the outcomes.

In this formula, the denominator measures all of the variability in the outcome without considering the model. Each term in the numerator represents how much closer the model’s predicted value gets us to the actual outcome than the mean does. Therefore, the fraction is the proportion of all of the variability in the outcome that is explained by the model.
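To make the formula concrete, here is a small numerical sketch. The outcome and prediction values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical outcomes and model predictions (made-up values)
y = np.array([4.0, 5.5, 6.1, 7.2, 8.0])       # actual outcomes y_i
y_hat = np.array([4.3, 5.2, 6.0, 7.5, 7.8])   # model-predicted outcomes
y_bar = y.mean()                               # overall mean of the outcomes

# Proportion of variability in the outcome explained by the model
r_squared = np.sum((y_hat - y_bar) ** 2) / np.sum((y - y_bar) ** 2)
print(round(r_squared, 3))  # → 0.929
```

For predictions from an actual least-squares fit, this ratio lands between 0 and 1, matching the first property listed above.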

The key to the neatness of this formula is that there are only two sources of variability in a linear model: the fixed effects (“explainable”) and the rest of it, which we often call error (“unexplainable”).

When we try to move to more complicated models, however, defining and agreeing on an R^{2} becomes more difficult. That is especially true with mixed effects models, where there is more than one source of variability (one or more random effects, plus residuals).

These issues, and a solution that many analysts now refer to, are presented in the 2012 article *A general and simple method for obtaining R^{2} from generalized linear mixed-effects models* by Nakagawa and Schielzeth (see https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/j.2041-210x.2012.00261.x). These authors present two different options for calculating a mixed-effects R^{2}.

Before describing these formulas, let’s borrow an example study from The Analysis Factor’s workshop “Introduction to Generalized Linear Mixed Models.” Suppose we are trying to predict the weight of a chick, based on its diet and number of days since hatching. Each chick in the data was weighed on multiple days, producing multiple outcomes for a single chick.

This analysis needs to account for the following sources of variability: the “fixed” effects of diet and time, the differences across the chicks (which we would call “random” because the chicks are randomly selected), and the prediction errors that occur when we try to use the model to predict a chick’s exact weight based on its diet and days since hatching.

The marginal R^{2} for LMMs described by Nakagawa and Schielzeth is calculated by

R^{2}_{marginal} = σ^{2}_{f} / (σ^{2}_{f} + σ^{2}_{α} + σ^{2}_{ε})

where σ^{2}_{f} is the variance of the fixed effects, σ^{2}_{α} is the variance of the random effect, and σ^{2}_{ε} is the variance of the model residuals. In the context of the chick example, σ^{2}_{f} is the variability explained by diet and days since hatching, σ^{2}_{α} is the variance attributed to differences across chicks, and σ^{2}_{ε} is the variability of the errors in individual weight predictions. Together, these three sources of variability add up to the total variability (the denominator of the marginal R^{2} equation). Dividing the variance of the fixed effects alone by this total variability provides us with a measure of the proportion of variability explained by the fixed effects.

However, this leads to a question: is the fixed effects part of the model the only part that is “explained?” Or is the variation across the chicks, which we have been calling “random,” now also “explained?” For those who would claim that random variability is explained, because it has been separated from residual variability, we calculate the conditional R^{2} for LMMs:

R^{2}_{conditional} = (σ^{2}_{f} + σ^{2}_{α}) / (σ^{2}_{f} + σ^{2}_{α} + σ^{2}_{ε})

The conditional R^{2} is the proportion of total variance explained through both fixed *and* random effects.
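As a quick sketch of how the two measures relate, suppose (with entirely made-up variance components for the chick example) that the fixed effects, the chick random effect, and the residuals have variances 4, 2, and 2:

```python
# Made-up variance components for illustration only
var_fixed = 4.0      # variance explained by diet and days since hatching
var_random = 2.0     # variance attributed to differences across chicks
var_residual = 2.0   # residual variance of individual weight predictions

total = var_fixed + var_random + var_residual

r2_marginal = var_fixed / total                    # fixed effects only
r2_conditional = (var_fixed + var_random) / total  # fixed + random effects

print(r2_marginal, r2_conditional)  # → 0.5 0.75
```

The conditional R^{2} is always at least as large as the marginal R^{2}, since it credits the random-effect variance as “explained.”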

The article by Nakagawa and Schielzeth goes on to expand these formulas to situations with more than one random effect, and also to the generalized linear mixed effects model (GLMM).

The GLMM versions should be interpreted with the same caution we use with a pseudo R^{2} from a more basic generalized linear model. Concepts like “residual variability” do not have the same meaning in GLMMs. The article also discusses the advantages and limitations of each of these formulas, and compares their usefulness to other earlier versions of mixed effects R^{2} calculations.

Note that these versions of R^{2} are becoming more common, but are not entirely agreed upon or standard. You will not be able to calculate them directly in standard software. Instead, you need to calculate the components and program the calculation. Importantly, if you choose to report one or both of them, you should not only identify which one you are using, but provide some brief interpretation and a citation of the article.

The post How Confident Are You About Confidence Intervals? appeared first on The Analysis Factor.

The results of any statistical analysis should include the confidence intervals for estimated parameters.

How confident are you that you can explain what they mean? Even those of us who have a solid understanding of confidence intervals can get tripped up by the wording.

Let’s look at an example.

The average person’s IQ is 100. A new miracle drug was tested on an experimental group. It was found to improve the average IQ 10 points, from 100 to 110. The 95 percent confidence interval of the experimental group’s mean was 105 to 115 points.

Which, if any, of the following are true?

1. If you conducted the same experiment 100 times, the mean for each sample would fall within the range of this confidence interval, 105 to 115, 95 times.

2. The lower confidence limit for 5 of the samples would be less than 105.

3. If you conducted the experiment 100 times, 95 times the confidence interval would contain the population’s true mean.

4. 95% of the observations of the population fall within the 105 to 115 confidence interval.

5. There is a 95% probability that the 105 to 115 confidence interval contains the population’s true mean.

Not sure? To help you visualize what a confidence interval is, we will generate a random population of 10,000 observations. The population’s mean is 110 with a standard deviation of 25.

From this population we will randomly draw a sample of 100 observations. This is the mean, standard deviation and confidence interval of the sample’s mean.

We seldom draw more than one sample from a population when conducting a study. To help us visualize how confidence intervals change as the sample changes we will randomly draw 99 more samples. We will graph each sample’s mean and confidence interval.
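A simulation along these lines can be sketched in a few lines of code. The seed and the normal-approximation interval below are choices of this sketch, not of the original demonstration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A population of 10,000 observations with mean 110 and SD 25
population = rng.normal(loc=110, scale=25, size=10_000)

# Draw 100 samples of 100 observations each and count how many 95%
# confidence intervals for the sample mean contain the population mean
z = 1.96  # normal approximation for a 95% interval
contains = 0
for _ in range(100):
    sample = rng.choice(population, size=100, replace=False)
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    lower = sample.mean() - z * se
    upper = sample.mean() + z * se
    if lower <= population.mean() <= upper:
        contains += 1

print(contains)  # typically somewhere in the 90s
```

Re-running with different seeds will move the count around, but it will hover near 95, which is exactly the behavior the graph below illustrates.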

The horizontal red line at 110 is the population mean. The red dots represent the sample’s mean.

What can we observe from this graph?

1. The lower and upper confidence limits are seldom if ever the same.

2. Some confidence intervals have a narrower range than others.

Keep in mind that all samples came from the same population.

Let’s look now at the multiple choices to the quiz, starting with the first choice.

Does the mean of each sample fall within any one sample’s confidence interval 95 out of 100 times?

That would depend upon which sample we chose. If we chose a sample whose mean is near 110 that might be true. If our sample was similar to one from either edge of the graph it would not be true. The problem is, we never know where our sample is in relation to other possible samples.

Response 1 is incorrect.

With regard to response 2, we can see that the lower confidence limit is below 105 for a substantial number of samples.

Response 2 is incorrect.

How about response 4, that 95% of the population falls within the confidence interval? It was given that the population mean was 110 and the standard deviation was 25. As a result, 95% of the observations of the population are between 60 and 160.

Response 4 cannot be true.

Response 3 is correct. Approximately 95 out of 100 of the confidence intervals contain the population mean. The graph shows 91 out of 100 contain the true mean. If a different seed had been used to draw the samples, the results could have been more than 95 out of 100 confidence intervals containing the true mean. But on average, 95 out of 100 confidence intervals will contain the true mean.

Response 5 is correct as well. Responses 3 and 5 imply the same thing but say it differently. This is a common theme in statistics.

A very important point to remember: expect a sample’s 95% confidence interval to fail to contain the population mean 5% of the time.

The post August 2019 Member Training: Elements of Experimental Design appeared first on The Analysis Factor.

Whether or not you run experiments, there are elements of experimental design that affect how you need to analyze many types of studies.

The most fundamental of these are replication, randomization, and blocking. These key design elements come up in studies under all sorts of names: trials, replicates, multi-level nesting, repeated measures. Any data set that requires mixed or multilevel models has some of these design elements.

In this webinar you’ll learn:

- What these fundamental elements really mean and how to recognize them
- How they differ from and work with crossing and nesting
- How they come together to inform the analysis and the inferences you can make
- How simple changes in the design can have big impacts on the complexity of the analysis
- The use of these elements in common designs such as randomized blocks, Latin squares, cross-overs, and multilevel designs.

**Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.**

**Thursday, August 15, 2019**

**3pm – 4:30pm (US EDT)**

Karen Grace-Martin helps statistics practitioners gain an intuitive understanding of how statistics is applied to real data in research studies.

She has guided and trained researchers through their statistical analysis for over 15 years as a statistical consultant at Cornell University and through The Analysis Factor. She has master’s degrees in both applied statistics and social psychology and is an expert in SPSS and SAS.

It’s never too early (or late) to set yourself up for successful analysis with support and training from expert statisticians.

Just head over and sign up for Statistically Speaking.

You’ll get exclusive access to this training webinar, plus live Q&A sessions, a private stats forum, 75+ other stats trainings, and more.

The post Linear Regression for an Outcome Variable with Boundaries appeared first on The Analysis Factor.

The following statement might surprise you, but it’s true.

To run a linear model, you don’t need an outcome variable Y that’s normally distributed. Instead, you need a dependent variable that is:

- Continuous
- Unbounded
- Measured on an interval or ratio scale

The normality assumption is about the errors in the model, which have the same distribution as Y|X. It’s absolutely possible to have a skewed distribution of Y and a normal distribution of errors because of the effect of X.

This issue came up recently in a free webinar I conducted in our The Craft of Statistical Analysis program about Binary, Ordinal, and Nominal Logistic Regression.

The first thing we did in that webinar was a (very brief) review of linear regression so that we could compare and contrast logistic to linear models.

We had over 1200 people sign up and while I answered a lot of questions, I didn’t get through all of them. So I’m answering some here on the site.

They’re grouped by topic, and you will probably get more out of it if you watch the webinar recording. It’s free.

Today’s group of questions were all about this aspect of linear models and how they relate to the example I used in the webinar.

In my (fake data) example, the DV was GPA and one of the IVs was SAT Math score. Here is the equation we fit:

E(College GPA) = -.03 + .20*HSGPA + .003*SATV + **.002*SATM** - .15*Sports - .26*Male
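To see how this equation is used, here is a quick calculation plugging in hypothetical values for one student (a female non-athlete with a 3.5 high school GPA, 600 SAT verbal, and 650 SAT math; these inputs are invented for illustration):

```python
# Hypothetical student values plugged into the fitted equation above
hs_gpa, satv, satm, sports, male = 3.5, 600, 650, 0, 0

expected_gpa = (-0.03 + 0.20 * hs_gpa + 0.003 * satv
                + 0.002 * satm - 0.15 * sports - 0.26 * male)
print(round(expected_gpa, 2))  # → 3.77
```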

And this is the bivariate relationship between SATM scores and College GPA.

And my astute listeners asked versions of this question:

**Q: On slide 4 it says the linear model dependent variable needs to be unbounded, but SAT scores and GPA scores are bounded. I’m confused. **

And my multi-part answer:

1. Oops.

Yes, it’s true. A better example would have used a dependent variable that truly had no bounds. GPA isn’t one—it’s bounded at 4.0 at the top and 0 at the bottom. See #3 for why I got away with it anyway.

2. SAT math score, a predictor, is also bounded (between 200 and 800), but that’s irrelevant.

As it turns out, the distributions of the predictors don’t generally matter. It’s the outcome variable whose distribution matters.

(Notice I threw that “generally” in there. If a huge proportion of the SAT math scores were at the boundary, that might cause influence problems. But it’s unlikely to create problems with the normality of errors assumption that we’re concerned with here.)

3. I managed to get away with it here simply because although there *are* theoretical bounds, only one data value actually hit one of them.

Here’s the histogram of GPA scores. One person in this data set had a 4.0 GPA. No one got close to 0.

Yes, I absolutely may get predicted values from this model that are above 4.0 or below 0.0 and they won’t make sense.

But I can live with that. It’s really an issue of censoring. Maybe some students should have GPAs greater than 4.0 or below 0.0. But we can only measure GPAs in this 0-4 range.

In other words, if there were a lot of students with 4.0 GPAs, perhaps they really shouldn’t all have the same GPA. Perhaps some actually did better in their classes than others, but the top grade anyone could get was a 4.0.

The problem for linear regression is that if we have this ceiling or floor effect, we’ll have a lot of values against the bounds and we’ll have a lot of trouble meeting that assumption of normal errors.

So while I have theoretical bounds, I’m not hitting them with the data. There are a lot of variables that technically have boundaries that no observations hit.

When that happens, you can still trust your coefficients, standard errors, and p-values. You will, of course, have to check your assumptions.

The post How to Reduce the Number of Variables to Analyze appeared first on The Analysis Factor.

by Christos Giannoulis

Many data sets contain well over a thousand variables. Such complexity, the speed of contemporary desktop computers, and the ease of use of statistical analysis packages can encourage ill-directed analysis.

It is easy to generate a vast array of poor ‘results’ by throwing everything into your software and waiting to see what turns up.

The thoughtless analysis of data is a problem for a number of reasons. It is easy to plunge into a data analysis without even thinking about what the intended endpoints are.

Analysis without thinking will almost certainly produce biased results.

Powerful multi-variable techniques, such as multiple regression, make it easy to include a very large number of predictor variables in the hope of maximizing the explanatory power of the model.

A similar problem occurs with factor analysis. There is nothing stopping us from factor-analyzing a random set of variables.

Factor analysis will nearly always produce a ‘solution’. However, it may well be a nonsense solution.

Factor analysis is designed to identify sets of variables that are tapping the same underlying phenomenon. It does this by examining the patterns of correlations among a set of variables.

The assumption of factor analysis is that the variables that are identified as belonging to a factor are really measuring the same thing. The factor itself is driving the responses on the individual variables. Therefore, they should not be causally related to each other.

Unfortunately, factor analysis cannot distinguish between variables that are causally related and those that are non-causally related.

This can result in variables being grouped together when they should not be. So it’s up to you, the data analyst, to think about the possible types of relationships among the variables and not just let the software make the decisions.

The selection of independent and dependent variables should be a function of the research question to which the data analysis is directed.

Unless a clear research question is formulated, you will find no answers. It’s as simple as that.

One approach I usually follow is to draw diagrams of the model I plan to evaluate before I begin to analyze the data.

First, I state what my dependent variable is. Then I specify the independent variable and the likely mechanisms by which the independent and dependent variables might be related.

As simple as it sounds, it is of paramount importance as it helps me make sense and guide the selection of variables for further analysis.

When undertaking factor analysis, think about the variables involved. Before subjecting a set of variables to factor analyses you should have some idea of what they might have in common.

You should make some attempt to include variables that make sense together.

You should also avoid including variables where any correlation is more likely due to causal relationships than to the variables having something in common at the conceptual level.

The post Member Training: Writing Up Statistical Results: Basic Concepts and Best Practices appeared first on The Analysis Factor.

Many of us love performing statistical analyses but hate writing them up in the Results section of the manuscript. We struggle with big-picture issues (What should I include? In what order?) as well as minutia (Do tables have to be double-spaced?).

Join us as Larry Hatcher provides a straightforward strategy for organizing your findings and reporting them in text, tables, and figures. This training focuses on APA style but will be useful to students and researchers across a variety of disciplines. You’ll learn about today’s best practices and will hear answers to common—and not-so-common—questions:

- What are the big three results that should just about always be reported?
- How much detail do I need to include to provide a “sufficient” set of statistics?
- What rule of thumb can help me decide whether to provide results in a figure instead of a table? In a table instead of a paragraph?
- What’s the difference between reporting treatment fidelity versus manipulation checks?
- When reporting significance tests in a table, should I always provide precise p values? Is it ever okay to just flag the significant results with asterisks?

**Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.**

Larry Hatcher, Ph.D. is Professor of Psychology at Saginaw Valley State University in Michigan. He is author of *APA Style for Papers, Presentations, and Statistical Results: The Complete Guide* and *Advanced Statistics in Research: Reading, Understanding, and Writing Up Data Analysis Results*.

He is also author or co-author of five books that show how to perform data analysis with the SAS® and JMP® applications, including the widely-cited *A Step-by-Step Approach to Using SAS® for Factor Analysis and Structural Equation Modeling, Second Edition*.

Larry has taught elementary and advanced statistics since 1984. He loves teaching statistics because students typically show up feeling scared and confused, and end the semester feeling *I can do this*. He earned his Ph.D. in industrial and organizational psychology from Bowling Green State University in Ohio in 1983.

It’s never too early (or late) to set yourself up for successful analysis with support and training from expert statisticians.

Just head over and sign up for Statistically Speaking.

You’ll get exclusive access to this training webinar, plus live Q&A sessions, a private stats forum, 75+ other stats trainings, and more.

The post What is a Confounding Variable? appeared first on The Analysis Factor.

By Karen Grace-Martin

*Confounding variable* is one of those statistical terms that confuses a lot of people. Not because it represents a confusing concept, but because of how it’s used. (Well, it’s a bit of a confusing concept, but that’s not the worst part).

First, it has slightly different meanings to different types of researchers. The definition is essentially the same, but the research context can have specific implications for how that definition plays out.

If the person you’re talking to has a different understanding of what it means, you’re going to have a confusing conversation.

Let’s take a look at some examples to unpack this.

In experimental fields, like agriculture and psychology, a confounder is a variable whose effect is indistinguishable from an independent variable’s effect.

An example:

You’re running a memory experiment and want to see whether people can better remember a list of easy to pronounce words or difficult to pronounce words. So you give one group of people a list of easy-to-pronounce words and another a list of difficult-to-pronounce words.

The variable you care about is the Word Pronounceability.

But it turns out that the difficult to pronounce words are also longer. If people remember fewer words from that list, you won’t know if ultimately it’s because they were harder to pronounce or simply because they were longer.

So Word Length is a confounding variable for Word Pronounceability for those lists of words.

To truly be able to conclude that any memory difference is due to Pronounceability, you need to make sure the two lists of words are otherwise comparable in every other way. You either need both long and short words in both lists or you simply need both lists to only have, say, 5-letter words.

This is part of good experimental design.

Now, sometimes the issue is not bad design. Sometimes it’s truly impossible to separate out two variables that always co-occur.

For example, perhaps the confounding variable is not word length, but word frequency. People have an easier time pronouncing common words and a harder time pronouncing uncommon words.

So while your intention is to compare easy-to-pronounce words and hard-to-pronounce words, it’s possible there just aren’t any uncommon easy-to-pronounce words or any common hard-to-pronounce words to put on your list.

In other words, pronounceability and frequency of words are so associated, you can’t separate them out. We don’t know if words are more common because they’re easier to pronounce or if they’re easier to pronounce simply because they’re so common. We just know we can’t separate them so we don’t know which one is really at the heart of the relationship.

The other definition I’ve seen of a confounding variable is more specific and I’ve heard this from people in fields like epidemiology where the variables are not manipulated, but measured.

In this situation, a confounding variable is considered one that is not only related to the independent variable, but is causing it.

So, for example, consider a study that is predicting infant birth weight from maternal weight gain during pregnancy.

And consider that there is a positive relationship—the more a mother gains during pregnancy, the more her baby weighs, on average.

But a potential confounder here is length of gestation. The longer the pregnancy lasts, the more time the mother and the baby have to gain weight.

Now, in a data set that included only full-term infants, this may be only a minor issue. There may be little variance in maternal weight gain that came from length of the pregnancy.

But if the data set contains a lot of pre-term infants, then a lot of the variance in mother’s weight gain will come simply from how long her pregnancy was.

In this example, length of pregnancy is a confounder for weight gain. Another variable that’s related to weight gain, but not causing it, like mother’s age, is not considered a confounder.

It’s a great practice to define your terms. It’s an essential practice when you’re communicating with people not in your field.

Since statistics is used across so many fields with so many data and design issues, it’s easy for the definitions of terms to become a bit insular. Everyone in your field may think of a confounder by one of these definitions, but your statistician or collaborators from other fields may have slightly different understandings. Make sure you’re using the same glossary.

The post Correlated Errors in Confirmatory Factor Analysis appeared first on The Analysis Factor.

Latent constructs, such as liberalism or conservatism, are theoretical and cannot be measured directly.

But we can use a set of questions on a scale, called indicators, to represent the construct together by combining them into a latent factor.

Often prior research has determined which indicators represent the latent construct. Prudent researchers will run a confirmatory factor analysis (CFA) to ensure the same indicators work in their sample.

You can run a CFA using either the statistical software’s “factor analysis” command or a structural equation model (SEM). There are several advantages to using SEM over the “factor analysis” command. The advantage we will look at is the ability to correlate error terms.

The first step in a CFA is to verify that the indicators have some commonality and are a good representation of the latent construct. We cannot simply add the scores of each indicator to create the latent factor.

Why? The indicators do not equally represent the latent construct. Each has its own “strength,” represented by its loading onto the latent factor. The loading is the variance of the indicator that is shared with the latent factor.

Not all the variance of an indicator is shared. This leftover, unshared variance is known as the unique variance. We consider unique variance as measurement error since it does not help in the measurement of the latent construct.

It is possible that part of the measurement error of one indicator is partially correlated with the measurement error of another indicator. This correlation can be due to pure randomness or it can be a result of something that influences both indicators.

If there is a legitimate reason for indicators’ error terms being related, the error terms can be “correlated” within a structural equation model. What are legitimate reasons?

According to Timothy Brown in his book “Confirmatory Factor Analysis for Applied Research”, some of the nonrandom measurement error that should be correlated can be a result of:

- Acquiescent response: a response bias in which a person agrees with attitude statements regardless of the content of the question
- Assessment methods: questionnaire, observer ratings
- Reversed or similarly worded test items
- Personal traits: reading disability or cognitive biases such as groupthink, which affect a respondent’s ability to answer a questionnaire truthfully

To determine which indicators’ error terms have a high correlation, we generate a modification index. The index produces a measurement of how much the model’s goodness of fit will improve if any two specific error terms are correlated.

Please note, you must be able to justify that the error terms can be correlated. For example, if the latent construct is derived from a survey, a review of the questions must be done to determine if the two questions are related to the same topic.

What is the advantage of correlating the error terms?

Correlating indicator error terms can improve the reliability of the latent construct’s scale, which can be measured via goodness of fit statistics. Unfortunately, correlating error terms is not possible when using a software’s “factor” command and can only be done within a structural equation model.

The post Member Training: A Predictive Modeling Primer: Regression and Beyond appeared first on The Analysis Factor.

Predicting future outcomes, the next steps in a process, or the best choice(s) from an array of possibilities are all essential needs in many fields. The predictive model is used as a decision making tool in advertising and marketing, meteorology, economics, insurance, health care, engineering, and would probably be useful in your work too!

Join Elaine Eisenbeisz as she presents the rationale and risks of predictive modeling via *supervised learning* techniques. Elaine will also provide an overview of some of the many available modeling techniques including:

- Linear regression
- Logistic regression
- Linear discriminant analysis
- K-Nearest Neighbors
- Resampling methods (Cross-Validation, Bootstrap)
- Subset selection
- Shrinkage methods (Ridge regression, Lasso regression)
- Tree-Based methods (Decision trees, Bagging, Random Forests, Boosting)

**Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.**

Elaine Eisenbeisz is a private practice statistician and owner of Omega Statistics, a statistical consulting firm based in Southern California. Elaine has over 30 years of experience in creating data and information solutions. She designs methodology and analyzes data for studies in the clinical, and biotechnology fields. Additionally, Elaine and Omega Statistics are the go-to resource for ABD students who require assistance with dissertation methodology and analysis.

Throughout her tenure as a private practice statistician, Elaine has published work with researchers and colleagues in peer-reviewed journals. Fitting of her eclectic tastes, her current interests include statistical genetics and psychometric survey development.

Elaine earned her B.S. in Statistics at UC Riverside and her Master’s Certification in Applied Statistics from Texas A&M. She is currently finishing her graduate studies at Rochester Institute of Technology. Elaine is a member in good standing with the American Statistical Association and a member of the Mensa High IQ Society.

It’s never too early (or late) to set yourself up for successful analysis with support and training from expert statisticians.

Just head over and sign up for Statistically Speaking.

You’ll get exclusive access to this training webinar, plus live Q&A sessions, a private stats forum, 75+ other stats trainings, and more.

The post The Importance of Including an Exposure Variable in Count Models appeared first on The Analysis Factor.

When our research question is focused on the frequency of occurrence of an event, we will typically use a count model to analyze the results. There are numerous count models. A few examples are: Poisson, negative binomial, zero-inflated Poisson and truncated negative binomial.

There are specific requirements as to which count model to use. The models are not interchangeable. But regardless of the model we use, there is a very important prerequisite that they all share.

We must identify the period of time or area of space in which the counts were generated.

The term used for modeling the period of time or area of space is exposure. The exposure variable basically modifies each observation so that the count outcome is weighted based on the period of time or area.

For example, if you were to count birds at various locations, you would need to know the area of space in which you are doing the count. Ten birds counted within 100 square feet represents a much higher density than 10 birds counted within 625 square feet.

Counting the number of births during the month of February (28 days) represents a different length of time as compared to the number of births during the month of January (31 days).

If we don’t take into account the different exposures for the observations within our data, we will have biased results due to some observations having higher or lower non-normalized counts.

Let’s look at a model where the outcome is the number of deaths.

The predictors in the model are whether the deceased smoked and what age bracket they were in. The coefficients of the model represent the incidence rate ratio (IRR) of the stated category as compared to the reference category of that categorical variable.

The results tell us that smokers have a rate of death 6.24 times greater than non-smokers when controlling for their age bracket.

We also find out that people who are in the 55- to 64-year-old age bracket have a rate of death that is 6.88 times more than those in the 35- to 44-year-old bracket.

Interestingly enough, the results show that the rate of death for those in the 75- to 84-year-old bracket is lower than for those in the 55- to 64-year-old bracket when controlling for smoking. That doesn’t make sense.

Question: Over what period or area were the outcomes measured? Were they measured over the same period of time and over the same size population?

It turns out they were not.

Each observation measures the number of deaths by person-years. The data in this analysis were collected from English counties. The number of smokers and non-smokers in five age categories living within each county, as well as the number of deaths, was counted over a specific period of time.

As you can imagine, the number of people living in county A is going to be different than the number in county B. In addition, not every county was measured for the same number of years.

Including an exposure variable for the total number of people observed, such as person-years, allows the counts of deaths to be comparable. We don’t want to be predicting more deaths just because there are more people in a county or because it was measured for a longer period of time.

After including person-years in our model as the exposure variable, we get very different results.

The incidence rate ratio drops from 6.24 to 1.43 when comparing smokers to non-smokers. In addition, as age increases, the incidence rate ratio (as compared to the base category) increases. This intuitively makes sense.

Note: Some statistical software requires the analyst to include the “offset” variable rather than the “exposure” variable. If that is the case with your software, you will need to take the natural log of the variable in order to include it in the model.
