The post The Secret to Importing Excel Spreadsheets into SAS appeared first on The Analysis Factor.
My poor colleague was pulling her hair out in frustration today.
You know when you’re trying to do something quickly, and it’s supposed to be easy, only it’s not? And you try every solution you can think of and it still doesn’t work?
And even in the great age of the Internet, which is supposed to know all the things you don’t, you still can’t find the answer anywhere?
Cue hair-pulling.
Here’s what happened: She was trying to import an Excel spreadsheet into SAS, and it didn’t work.
Instead, she got an error message.
Look familiar? If you’re like my colleague, you’re wondering, what the blank is going on here?
Well, if you have SAS 64-bit and Office 32-bit (or even Office 64-bit), you’ll find that the 64-bit version of SAS does not have the interface to communicate with Office and therefore cannot import spreadsheets.
Yep, you read that right: it can’t do it through the wizard and it can’t do it through Proc Import.
So here’s what you have to do.
Save the Excel spreadsheet as a .csv file, and then import it. (You can only have 1 worksheet in the .csv file, but other than that, you shouldn’t see any differences from an Excel spreadsheet.)
It should look like this:
PROC IMPORT OUT=WORK.QA        /* desired dataset name */
    DATAFILE="C:\file to import"
    DBMS=CSV REPLACE;
    GUESSINGROWS=1000;
    GETNAMES=YES;
    DATAROW=2;
RUN;
BTW, the guessingrows option is very useful: it tells SAS to read through the specified number of lines (the limit is 2,147,483,647 and the default is 20) to determine the length of variables.
Without this option, if the first 20 values are 2 characters long and the 21st is 3 characters, your values will be truncated. Specifying guessingrows (you can just use guessingrows=MAX) ensures that SAS looks at all the values in your dataset (i.e., all rows) before it sets the length of each variable.
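The truncation problem guessingrows guards against is easy to see in a short sketch (plain Python standing in for SAS's column scan):

```python
# Why scanning only the first rows can truncate values:
# SAS's default is to scan 20 rows; guessingrows tells it to scan more.
rows = ["ab"] * 20 + ["abc"]  # first 20 values are 2 chars, the 21st is 3

width_default = max(len(v) for v in rows[:20])  # what a 20-row scan infers
width_full = max(len(v) for v in rows)          # what a full scan infers

print(width_default)  # 2 -> "abc" would be truncated to "ab"
print(width_full)     # 3 -> no truncation
```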
So spread the word. And save your hair.
Life will be easier if everyone starts using .csv files instead of Excel files.
The post Using Predicted Means to Understand Our Models appeared first on The Analysis Factor.
The expression “can’t see the forest for the trees” often comes to mind when reviewing a statistical analysis. We get so involved in reporting “statistically significant” and p-values that we fail to explore the grand picture of our results.
It’s understandable that this can happen. We have a hypothesis to test. We go through a multi-step process to create the best model fit possible. Too often the next and last step is to report which predictors are statistically significant and include their effect sizes.
I suggest one additional step: take the time to absorb and think about the information you can extract from your model with predicted means.
I use the term “information” because we will not focus on p-values and significance levels. Nor are we summarizing a predictor’s effect simply with a coefficient, but digging deeper into what that coefficient tells us.
In the model below, we are determining which predictors are associated with the number of times someone visits a doctor over a two-week period.
Illness — number of days ill
Actdays — number of days not active
Prescrib — number of prescriptions used
Medical_advice — number of times sought medical advice in the past two weeks
Insurance — the type of medical insurance each person has
Note that virtually every coefficient is significant. We will report the coefficients, p-values and confidence intervals in the final write up. But the coefficients table doesn’t communicate well what the real effects are. Let’s investigate a bit with some predicted values.
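To make the computation concrete, here is a sketch of how predicted means fall out of a Poisson-type model's equation: exponentiate the linear predictor at the covariate values of interest. The coefficients below are invented for illustration; they are not the fitted values from the model discussed here.

```python
import math

# Hypothetical coefficients, for illustration only -- NOT the fitted model.
b0, b_prescrib, b_advice = -1.8, 0.25, 0.35
b_insurance = {"medicaid": 0.10, "medicare": 0.30, "private": 0.45}

def predicted_visits(prescrib, advice, insurance):
    """Predicted mean number of doctor visits: exp(linear predictor)."""
    eta = b0 + b_prescrib * prescrib + b_advice * advice + b_insurance[insurance]
    return math.exp(eta)

# Predicted means at 2 vs 4 prescriptions, average (0.52) medical advice
for ins in ("medicaid", "medicare", "private"):
    print(ins,
          round(predicted_visits(2, 0.52, ins), 2),
          round(predicted_visits(4, 0.52, ins), 2))
```

The same function, evaluated at different covariate combinations, produces exactly the kind of table of predicted means discussed below.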
Do people on Medicaid with two and four prescriptions have the same predicted number of trips to the doctor’s office? How does that compare to someone who is on private insurance or Medicare? Do these comparisons differ for people who seek medical advice an average (0.52) or a high (4) number of times?
We will start with the left half of the table, for people who seek medical advice an average number of times. The predicted number of visits for people with Medicaid and private insurance about doubles when the number of prescriptions they are taking increases from two to four.
People on Medicare have no increase. However, the predicted total number of visits is still less than 0.5 in all situations.
What happens if we change the number of times someone sought medical advice from the mean of 0.52 to 4? Let’s look at the right side.
We find that the predicted number of doctor visits increases substantially overall. The minimum number of doctor visits is now 2.13. Of people taking two prescriptions, those on Medicaid have the fewest expected visits while those on private insurance have the most.
Interestingly, the predicted number of doctor visits for those on Medicaid more than doubles, while the number for those on Medicare increases only slightly, when the number of prescriptions increases from two to four.
The predicted number of doctor visits by people with Medicaid now surpasses those with Medicare. People with full private insurance, who most likely have easier access to the doctor, remain the group with the greatest expected number of doctor visits.
Now we have interesting information to give our audience beyond confusing coefficients and p-values. We have brought our numbers and data to life so non-statisticians can learn from our work.
The post The Difference Between Random Factors and Random Effects appeared first on The Analysis Factor.
Mixed models are hard.
They’re abstract, they’re a little weird, and there is not a common vocabulary or notation for them.
But they’re also extremely important to understand because many data sets require their use.
Repeated measures ANOVA has too many limitations. It just doesn’t cut it any more.
One of the most difficult parts of fitting mixed models is figuring out which random effects to include in a model. And that’s hard to do if you don’t really understand what a random effect is or how it differs from a fixed effect.
I have found one issue particularly pervasive in making this even more confusing than it has to be. People in the know use the terms “random effects” and “random factors” interchangeably.
But they’re different.
This difference is probably not something you’ve thought about. But it’s impossible to really understand random effects if you can’t separate out these two concepts.
Here’s the basic idea:
· A factor is a variable.
· An effect is the variable’s coefficient.
Let’s unpack that so it’s meaningful.
When we’re talking about fixed factors and their effects, this doesn’t usually come up. We’re able to see easily the difference between the variables themselves and those variables’ effects.
Here’s an example of a Linear Mixed Model that is predicting an outcome Y (Number of Jobs, in Thousands) over Time (5 decades, coded 0 to 4) for a set of counties. Each county is either Rural or Non-rural and is measured across the 5 decades.
To make it easy to see, the fixed part of the model is in blue and the random part of the model is in orange.
It’s very clear that Time and Rural are both fixed predictor variables in this model and that β_{1} and β_{2} are their coefficients.
Just like in any regression model, those coefficients are called slopes and that is how we measure the effect of each predictor. We have one additional fixed effect in the model, the intercept β_{0}. The intercept simply reports the mean of Y when all predictors are 0.
So just to be clear, in the fixed part of the model, we have:
· three fixed effects: β_{0}, β_{1}, β_{2}
· two fixed variables: Time and Rural
One of these variables, Rural, is a factor because it’s categorical. The other, Time, is a covariate because it’s numerical. (Some people use the term covariate to mean a control variable, not a numerical predictor. That’s not how I’m using it here).
This part is also simple because of the way we specify it in the software. Regardless of which software we use, all we have to do is specify which predictors we want in the fixed part of the model and the software will automatically estimate their coefficients.
If we wanted also to add in, say an interaction term between Rural and Time, we also just add that to the model and the software estimates a coefficient for that too.
But what about the random part of the model, in orange?
This part is a little harder, partly because of the notation, partly because of the way we specify it in the software, and partly because of the wording we use.
In the random part of the model, there is one random factor, two random effects, and the residual.
I suspect you’re familiar with residuals from linear models. Let’s focus instead on the two random terms.
Just like each fixed term in the model, each random term is made up of a random factor and a random effect. The random effects aren’t hard to see: those are μ_{0}, the random intercept, and μ_{1}, the random slope over Time.
There is also a random factor here: County. It doesn’t look like it’s here, but it is.
We use the term “random factor” and not “random variable” because random variables in a mixed model MUST be categorical. They are never covariates.
County is denoted in the model by the subscript i. You’ll notice that all the random terms in the model have an i subscript but none of the fixed terms do.
That’s because the fixed terms average over all the counties, but the random terms are per county.
We could rewrite the random terms like this: μ_{0}County and μ_{1}Time*County.
That random intercept term, μ_{0i} has both an intercept coefficient and a factor: County.
In statistical software, you have to specify both, but it doesn’t look like it. You’ll specify that you want a random intercept, but County is specified as the “subject.”
Likewise, μ_{1i}Time is a slope coefficient across Time for county i. Time itself is NOT a random factor. County is.
So again, when you specify it in the software, County is specified as the subject and Time is the only “variable” you’re putting in as a random effect.
It makes it look like Time is a random factor, but it’s not. You’re fitting a slope across Time for each county. This is equivalent to fitting a Time*County interaction, and μ_{1} is the interaction effect.
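To make the i subscript concrete, here is a small simulation of the model equation, with made-up values for the fixed effects and the random-effect variances (purely illustrative, not the county data described above). Note that the fixed coefficients are shared by everyone, while each county gets its own draw of μ_{0i} and μ_{1i}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed effects (invented values, for illustration only)
b0, b_time, b_rural = 50.0, 2.0, -5.0

n_counties, n_decades = 4, 5
time = np.arange(n_decades)        # decades coded 0..4
rural = np.array([1, 1, 0, 0])     # two rural, two non-rural counties

# One random intercept u0[i] and one random slope u1[i] PER COUNTY --
# this is where the i subscript lives. County is the random factor;
# u0 and u1 are the random effects.
u0 = rng.normal(0, 3.0, size=n_counties)
u1 = rng.normal(0, 0.5, size=n_counties)

for i in range(n_counties):
    y_i = (b0 + u0[i]) + (b_time + u1[i]) * time + b_rural * rural[i]
    print(f"county {i}: intercept={b0 + u0[i]:.2f}, slope={b_time + u1[i]:.2f}")
```

Every county shares the same fixed coefficients but has its own intercept and slope deviation, which is exactly what "the random terms are per county" means.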
So again, to summarize, in the random part of the model, we have:
· Two random effects: μ_{0} and μ_{1} for Time
· And one random factor: County
Calling County or Time a random effect is not just technically incorrect, but it makes it much harder to conceptualize what each of the real random effects is actually measuring.
The post January 2019 Member Webinar: Model Building Approaches appeared first on The Analysis Factor.
There is a bit of art and experience to model building. You need to build a model to answer your research question, but how do you build a statistical model when there are no instructions in the box?
This webinar will also cover:
Note: This webinar is an exclusive benefit to members of the Statistically Speaking Membership Program.
Wednesday, January 23, 2019
1pm – 2:30pm (US EST) (In a different time zone?)
Audrey Schnell is a professional statistical consultant with a Master’s Degree in Clinical Psychology and a PhD in Epidemiology and Biostatistics.
She moved into the emerging field of genetic epidemiology and statistical genetics, and worked on a wide variety of common diseases believed to have a strong genetic component including hypertension, diabetes and psychiatric disorders. She helped develop software to analyze genetic data and taught classes in the US and Europe. Audrey has also worked for a number of institutions, including Case Western Reserve University, Cedars-Sinai, University of California at San Francisco and Johns Hopkins.
It’s never too early (or late) to set yourself up for successful analysis with support and training from expert statisticians.
Just head over and sign up for Statistically Speaking.
You’ll get exclusive access to this month’s webinar, plus live Q&A sessions, a private stats forum, 70+ video recordings of member webinars, and more.
The post How to Understand a Risk Ratio of Less than 1 appeared first on The Analysis Factor.
When a model has a binary outcome, one common effect size is a risk ratio. As a reminder, a risk ratio is simply a ratio of two probabilities. (The risk ratio is also called relative risk.)
Risk ratios are a bit trickier to interpret when they are less than one.
A predictor variable with a risk ratio of less than one is often labeled a “protective factor” (at least in Epidemiology). This can be confusing because in our typical understanding of those terms, it makes no sense that a risk be protective.
So how can a risk be protective? By indicating lower risk.
For example, let’s say you’re running a model where the outcome is Conviction of a Felony (yes/no) and among your predictors are Previous Criminal Activity (yes/no) and Graduation from High School (yes/no).
We would expect that a Yes on Previous Criminal Activity is related to an increase in the risk of committing a felony. Likewise, we would expect that a Yes on Graduation from High School is related to a decrease in the risk of committing a felony.
In other words, Previous Criminal Activity would be a risk factor and Graduation from High School would be a protective factor. Yet the effect of both factors would be measured with a risk ratio.
The risk ratio is always defined as the ratio of the comparison category’s probability to the reference category’s probability.
A risk ratio greater than one means the comparison category indicates increased risk.
A risk ratio less than one means the comparison category is protective (i.e., decreased risk).
Say we have the following data for a group of defendants:
                 Felony Conviction
Graduation       Yes      No       Total
No               300      100      400
Yes              225      175      400
Total            525      275      800
From this table, we can calculate the probability that either a graduate or a dropout is convicted of a felony.
P(Felony conviction|Dropout) = 300/400 = .75
P(Felony conviction|Graduate) = 225/400 = .5625
And from those, we can calculate the risk ratio for graduates compared to dropouts.
RR: Graduates/Dropouts = .5625/.75 = .75
As you can see, the probability of a felony conviction is lower for graduates (.5625) than it is for dropouts (.75). Likewise, the risk ratio of felony convictions for graduates compared to dropouts is less than one (.75).
So one interpretation is that graduation is protective — it is associated with a lower risk of conviction.
How much lower? By a factor of .75, or 25% lower risk.
Now if we reversed this comparison, we could say that dropping out of high school increases risk and therefore is a risk factor. We would do this by swapping the comparison and recalculating the risk ratio:
RR Dropouts/Graduates = .75/.5625 = 1.33
Here we conclude that dropouts are 33% more likely than graduates to be convicted of a felony.
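The arithmetic above is easy to reproduce in a few lines of plain Python, directly from the cell counts in the table:

```python
# Counts from the table above
dropout_conv, dropout_total = 300, 400
grad_conv, grad_total = 225, 400

p_dropout = dropout_conv / dropout_total   # P(conviction | dropout) = 0.75
p_grad = grad_conv / grad_total            # P(conviction | graduate) = 0.5625

rr_grad_vs_dropout = p_grad / p_dropout    # < 1: graduation is protective
rr_dropout_vs_grad = p_dropout / p_grad    # > 1: dropping out is a risk factor

print(rr_grad_vs_dropout)            # 0.75
print(round(rr_dropout_vs_grad, 2))  # 1.33
```

Swapping numerator and denominator is all that separates the "protective" framing from the "risk factor" framing.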
Some references will advise re-coding the data so that the relative risk is always greater than 1. However, it is important to take into consideration the message you want to deliver. In the example above, it may make sense to drive home the message that graduates are 25% less likely to be convicted.
If, after your initial analysis, you find the risk ratios counterintuitive, you can recode the reference group so that the interpretation makes sense.
The post Removing the Intercept from a Regression Model When X Is Continuous appeared first on The Analysis Factor.
In a recent article, we reviewed the impact of removing the intercept from a regression model when the predictor variable is categorical. This month we’re going to talk about removing the intercept when the predictor variable is continuous.
Spoiler alert: You should never remove the intercept when a predictor variable is continuous.
Here’s why.
Let’s go back to the cars we talked about earlier. Using the same data, if we regress weight on the continuous variable length (in inches) and include the intercept (labeled _cons), we get the following results:
If we exclude the intercept, we get this:
Notice the slope of length has dropped from 33 to 16. It’s the estimate of the relationship between length and weight of the cars we’re interested in and we’ve biased it. Not good.
There’s another problem: residuals.
The table below is a summary of the residuals with the intercept (labeled wc) and without it (labeled nc).
The standard deviation of the residuals from the without-intercept model will never be as low as those from the with-intercept model. Remember, residual variance is unexplained and we want to minimize it.
Additionally, the mean of the residuals will not equal zero, which is a requirement for an OLS model.
Why all these problems?
When you eliminate an intercept from a regression model, it doesn’t go away. All lines have intercepts. Sure, it’s not on your output. But it still exists.
Instead you’re telling your software that rather than estimate it from the data, assign it a value of 0.
Let’s just repeat that for emphasis:
When you remove an intercept from a regression model, you’re setting it equal to 0 rather than estimating it from the data.
The graph below shows what happens.
The fitted line of the model with the estimated intercept passes through most of the actual data, while the fitted line for the model with the unestimated intercept does not.
Forcing the intercept to equal 0 forces the line through the origin, which will never fit as well as a line whose intercept is estimated from the data.
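Both problems are easy to verify numerically. A sketch with made-up data (using numpy's least-squares solver; the specific numbers are invented for illustration):

```python
import numpy as np

# Made-up data generated with a clearly nonzero intercept (10) and slope 2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10.0 + 2.0 * x + np.array([0.1, -0.2, 0.05, 0.1, -0.05])

# With intercept: design matrix includes a column of ones
X = np.column_stack([np.ones_like(x), x])
b_with, *_ = np.linalg.lstsq(X, y, rcond=None)

# Without intercept: the line is forced through the origin
b_wo, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

resid_with = y - X @ b_with
resid_wo = y - x * b_wo[0]

print(round(b_with[1], 2))          # slope near the true value of 2
print(round(b_wo[0], 2))            # slope biased: it absorbs the intercept
print(round(resid_with.mean(), 6))  # essentially zero
print(round(resid_wo.mean(), 2))    # clearly nonzero
```

The no-intercept slope is badly biased, and its residuals no longer average to zero, mirroring the problems described above.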
The post Rescaling Sets of Variables to Be on the Same Scale appeared first on The Analysis Factor.
by Christos Giannoulis, PhD
Attributes are often measured using multiple variables with different upper and lower limits. For example, we may have five measures of political orientation, each with a different range of values.
Each variable is measured in a different way. The measures have a different number of categories and the low and high scores on each measure are different.
Attribute | Low Score | High Score | Variable Name
Trust in Government | 1 | 10 (high trust) | TRUST
Political Efficacy | 0 | 4 (high efficacy) | POLEF
Feeling towards President | 0 | 100 (positive) | LEAD
Alienation from politics | 8 | 24 (not alienated) | ALIEN
Frequency of reading about politics per week | 0 | 20 (frequent) | POLR
The different scales of the variables present two important problems.
It is very difficult to compare across these variables. The usual way of comparing across variables is to calculate the mean for each variable and to compare the means.
However, since each of the variables is measured on a different scale these means will be extremely difficult to compare.
For example, the mean on trust in government might be 4. The mean for alienation might be 10. But because the scales have different lengths and different starting points, comparisons of means are not meaningful.
Since each question is measured in different units, it is like comparing apples with oranges.
When creating multi-item scales, items that have different lower and upper points will contribute differently to the final multi-item scale score if used in their raw form.
This means that some items will count for more in the computation of a final score. This is usually not what we want.
It is similar to having three pieces of assessment to arrive at a final mark for a subject at a university. One piece of work might be marked out of 80, another out of 10 and another out of 20.
If we simply added up raw scores, then the piece of work marked out of 80 would count for much more in the final mark.
If this is what was desired then all is well, but if each piece of work was meant to contribute equally to the final mark then we would need to adjust the items to equalize the contribution.
The solution to these problems is to convert the scales into a common measurement scale so that they can be compared. This can be achieved in two ways:
You are probably familiar with the latter option, so let me describe the former.
This solution involves adjusting the scale on each variable, “stretching” some measures and “squeezing” others. For any numerical scale the conversion is achieved using this formula:

Y = ((X − X_{min}) / X_{range}) × n

where Y is the adjusted variable, X is the original variable, X_{min} is the minimum potential value on the original variable, X_{range} is the difference between the maximum potential score and the minimum potential score on the original variable, and n is the upper limit of the rescaled variable.
This conversion can easily be accomplished with a variable transformation in any statistical software.
For example, let’s suppose we want all variables converted to a scale of 0-10. Let us convert POLEF (table above). From the table with see that the minimum potential score was 0, and the maximum observed score was 4. The range therefore is 4. Our formula is thus:
An individual with a score 4 on POLEF would score (4/4) x 10 = 10 on POLEFADJ; a score of 0 on POLEF would convert to (0/4) x 10 = 0 on POLEFADJ, while a score of 2 on POLEF would become a POLEFADJ score on (2/4) x 10 = 5.
Having converted all five variables to a range of 0-10, it becomes much easier to compare scores and averages across them… It may be possible to compare apples and oranges after all!
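The conversion translates directly into a one-line helper; a minimal sketch (the function computes the range from the minimum and maximum potential scores):

```python
def rescale(x, x_min, x_max, n=10):
    """Rescale x from the range [x_min, x_max] onto [0, n]."""
    return (x - x_min) / (x_max - x_min) * n

# POLEF runs 0-4; convert to a 0-10 scale
print(rescale(4, 0, 4))  # 10.0
print(rescale(2, 0, 4))  # 5.0
print(rescale(0, 0, 4))  # 0.0

# ALIEN runs 8-24; the same function handles the nonzero minimum
print(rescale(16, 8, 24))  # 5.0
```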
The post December 2018 Member Webinar: Those Darn Ratios! appeared first on The Analysis Factor.
Ratios are everywhere in statistics—coefficient of variation, hazard ratio, odds ratio, the list goes on. You see them reported in the literature and in your output.
You comment on them in your reports. You even (kinda) understand them. Or, maybe, not quite?
Please join Elaine Eisenbeisz as she presents an overview of the how and why of various ratios we use often in statistical practice.
Which ones, you ask? How about:
Note: This webinar is an exclusive benefit to members of the Statistically Speaking Membership Program.
Wednesday, December 12, 2018
12pm – 1:30pm (US EST) (In a different time zone?)
Elaine Eisenbeisz is a private practice statistician and owner of Omega Statistics, a statistical consulting firm based in Southern California. Elaine has over 30 years of experience in creating data and information solutions. She designs methodology and analyzes data for studies in the clinical, and biotechnology fields. Additionally, Elaine and Omega Statistics are the go-to resource for ABD students who require assistance with dissertation methodology and analysis.
Throughout her tenure as a private practice statistician, Elaine has published work with researchers and colleagues in peer-reviewed journals. Fitting of her eclectic tastes, her current interests include statistical genetics and psychometric survey development.
Elaine earned her B.S. in Statistics at UC Riverside and her Master’s Certification in Applied Statistics from Texas A&M. She is currently finishing her graduate studies at Rochester Institute of Technology. Elaine is a member in good standing with the American Statistical Association and a member of the Mensa High IQ Society.
It’s never too early (or late) to set yourself up for successful analysis with support and training from expert statisticians.
Just head over and sign up for Statistically Speaking.
You’ll get exclusive access to this month’s webinar, plus live Q&A sessions, a private stats forum, 60+ video recordings of member webinars, and more.
The post Statistical Models for Truncated and Censored Data appeared first on The Analysis Factor.
by Jeff Meyer
As mentioned in a previous post, there is a significant difference between truncated and censored data.
Truncated data eliminates observations from an analysis based on a maximum and/or minimum value for a variable.
Censored data has limits on the maximum and/or minimum value for a variable but includes all observations in the analysis.
As a result, the models for analysis of these data are different.
For censored data the correct model to use is the tobit regression.
The economist John Tobin created this model, which was originally known as the “Tobin probit” model. It combines components of the binomial probit model and an OLS regression model.
A potential drawback of the Tobit model is you have to use the same variables for both the probit component and the regression component.
Fortunately, James Heckman created a model that takes into account the selection bias noted previously and allows the use of different variables in the two-step model created by Tobin.
The command in Stata is heckman; in SAS, use PROC QLIM and specify HECKIT. The model can also be run in R but not in SPSS.
For continuous data where you want to use a subset of the data based on a lower or upper boundary, a truncated regression model should be used.
In a truncated regression model you are running the analysis using the full data set but telling the model at what value(s) to truncate. The reported sample size used in the model will be the truncated group. But the results can be used to make inferences about the population.
The command in Stata and R is truncreg; in SAS, use PROC QLIM with the truncated option. For SPSS, one needs to obtain the Essentials for R plug-in.
To model zero-truncated count data the procedure requires several steps to determine which probability distribution function (pdf) fits the data best.
Some of the choices for the optimal pdf are Poisson, Poisson-Gamma Mixture, Poisson-Inverse Gaussian Mixture, Generalized Poisson, negative binomial, and three-parameter negative binomial (Famoye).
Stata’s command is trncregress, SAS uses PROC NLMIXED and R uses VGAM.
The post Your Questions Answered from the Interpreting Regression Coefficients Webinar appeared first on The Analysis Factor.
Last week I had the pleasure of teaching a webinar on Interpreting Regression Coefficients. We walked through the output of a somewhat tricky regression model—it included two dummy-coded categorical variables, a covariate, and a few interactions.
As always seems to happen, our audience asked an amazing number of great questions. (Seriously, I’ve had multiple guest instructors compliment me on our audience and their thoughtful questions.)
We had so many that although I spent about 40 minutes answering questions, we still only got through half! Since my voice was starting to go out at that point, I announced I would follow up here to answer the unanswered questions.
If you were not on the webinar, these will make a lot more sense if you watch the recording first and grab the slides handout. Watch Interpreting Linear Regression Coefficients: A Walk through Output here.
I will sometimes refer you to a slide number in the answer.
A number of people asked about this. And it gets very confusing because of the way SPSS reports it.
In the original variable, Male=0 and Female=1. (Slide 7)
When SPSS’s General Linear Model procedure dummy codes this variable (which it does because I specified it as categorical), it automatically makes the value that comes last alphabetically of my Gender_N variable equal to 0.
So when we look at the regression output table (Slide 14), you can see that it calls the variable “Gender_N=0”. That is a different variable than Gender_N. This new variable, Gender_N=0 has a value of 1 for Males. When the output gives (Gender_N=0) a coefficient, we know that it used a different internal coding of the Gender_N=0 variable.
So when I interpret the Gender_N=0 variable, I interpret that as Males, compared to Females.
It is very confusing until you get used to it, but it’s worthwhile to pay attention to what your software is doing. In my experience, not every procedure does it the same way, even within the same software.
No, they’re not, at least not at α=.05. In fact only a few are. The p-values are available on Slide 13 if you want to check them out.
Even so, yes, you will do the algebra the same way. A non-significant coefficient may not be significantly different from 0, but that doesn’t mean it actually equals 0. If you leave them out of the equation as you do the algebra, you will be setting them to 0, and that can throw everything off.
The only thing I would do differently is to come to different conclusions about these coefficients, but the math is the same either way.
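A quick sketch of the point. The gender and interaction coefficients below echo the -0.843 and -0.425 quoted later; the intercept and age coefficient are invented for illustration:

```python
# Coefficients for illustration; not the webinar's full fitted model.
b = {"intercept": 3.2, "gender": -0.843, "age_c": 0.05, "gender_x_age": -0.425}

def predicted(male, age_c, drop_nonsig=False):
    """Predicted satisfaction; optionally zero out a non-significant term."""
    coefs = dict(b)
    if drop_nonsig:
        coefs["age_c"] = 0.0  # pretending a non-significant coefficient "is" 0
    return (coefs["intercept"] + coefs["gender"] * male
            + coefs["age_c"] * age_c + coefs["gender_x_age"] * male * age_c)

print(round(predicted(1, 10), 3))                    # full equation
print(round(predicted(1, 10, drop_nonsig=True), 3))  # prediction shifts
```

Dropping the term shifts every prediction that depends on it, which is why the algebra should keep all the coefficients regardless of significance.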
A: I think what you’re looking for is an eta-square statistic, which will tell you the percent of variance in Y that is accounted for by each predictor in the model.
A: (Start on Slide 35). It’s true that there isn’t a graph, but I can tell you that it looks pretty much like the one on Slide 27.
On Slide 27 we see the regression lines for Male and Female students in the control group. The difference in their intercepts is -.843 and the difference in their slopes is -.425.
What we see in Slide 35 is that the Group 3 Male students have an intercept .07 lower than Group 4 Male students. The same is true for Female students. Group 3 is .07 lower.
But the difference in intercepts between Males and Females in Group 3 is exactly the same as the difference in intercepts between Males and Females in Group 4—just what we saw in Slide 27. That difference is -.843, the coefficient for males (Gender_N=0).
So we can interpret that -.843 not just as the difference in mean satisfaction between male and female students in the control group, but in each group.
You’ll see the same pattern for the slopes. In both Group 3 and Group 4 (and any other group), the difference in slopes is -.425.
It’s just a difference on which coefficients you focus on.
What we know here is we have an interaction between age and group. I’ve been talking about the effect of age—the slope.
But we do also get from the table the effect of group at one specific age: the mean. These group effects are the differences in the mean satisfaction for each of the treatment groups compared to the control (at the mean age).
Because of the interaction, there is not just one mean difference across groups. Those mean differences change, depending on the age. You can use the marginal means to pick specific ages at which you want to make those comparisons. Or you can center Age at different values in order to make the regression coefficients reflect those differences.
I consider the first option much simpler and you can read more about it in this article I wrote about Spotlight Analysis.
See Slide 27. I wouldn’t honestly put much into those as the graph is ignoring the other effects in the model.
That said, those R^{2}s are from the simple regression models between Age and Satisfaction for each sex. So if you fit a simple regression just for Males, we’d say that 19.4% of the variance in satisfaction scores could be explained by Age. And for females, .4% of the variance in satisfaction scores could be explained by Age (so basically none).
That test is found in the ANOVA table on Slide 12. The F statistic for Group is .014. This tests whether the four intercepts have any differences. (Conclusion: no evidence they do, p=.998).
The F statistic for Group*Age tests whether the four slopes are the same. F=3.795, p=.011. So I would say we have evidence they are different. We don’t know which ones are different until we look at the coefficients.
Generally, yes. See this article: When NOT to Center a Predictor Variable in Regression
Yes and no. It won’t change the overall F tests that indicate if there are any differences among groups (See Q7, above).
But sometimes an F test indicates there is at least one difference somewhere, yet you don't see it in the coefficients. For example, our coefficients only compare each treatment condition to the control. If the only real difference were between two treatment groups, and neither mean was far enough from the control's to be significant, then changing the reference group could reveal a significant result you didn't see before.
But be careful here. This usually happens when either the control’s mean is between the two treatment means or the control group just has a much smaller sample size than the treatment groups do. You do not want to be switching around reference groups just searching for a significant effect if it’s not a scientifically important comparison. And if it’s about sample size, keep that in mind and think carefully about the difference between statistical significance and scientifically meaningful effect sizes.
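In a model like this one (a categorical predictor, evaluated at the mean age), the dummy coefficients are just differences from the reference group's mean, so switching the reference changes the coefficients and their tests but not the fitted means. A quick sketch with made-up group means:

```python
# Hypothetical group means (satisfaction), not the webinar's values.
means = {"control": 4.0, "treat1": 4.25, "treat2": 5.0}

def dummy_coefs(means, reference):
    """Intercept = reference group's mean; each dummy = difference from it."""
    intercept = means[reference]
    return intercept, {g: m - intercept for g, m in means.items() if g != reference}

b0_a, coefs_a = dummy_coefs(means, "control")
b0_b, coefs_b = dummy_coefs(means, "treat1")

# Different coefficients, same fitted means for every group:
fitted_a = {g: b0_a + coefs_a.get(g, 0.0) for g in means}
fitted_b = {g: b0_b + coefs_b.get(g, 0.0) for g in means}
print(coefs_a)  # {'treat1': 0.25, 'treat2': 1.0}
print(fitted_a == fitted_b)  # True
```

Which differences get a p-value depends on the reference you pick, which is exactly why hunting through reference groups for significance is a bad habit.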
A: See Slide 27. Not much at all. That’s just the age at which the mean satisfaction scores are equal for men and women. I don’t think that is really important in this study, but it could certainly be of interest in other studies.
A: Yes, I would still center Age. Even if this isn’t a key predictor, it’s still useful to be able to interpret its results and it makes the intercept more meaningful.
A: No. See The Difference Between Interaction and Association. And in case this is a little confusing, substitute the term Correlation for Association.
If we measured age at a more granular level, then no, the slope wouldn't change. It might change slightly due to rounding, but that's it. If you changed the unit of measurement to months, however, so that 25 years 6 months = 306 months, then yes, the slope would change as the units change.
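The rescaling is purely mechanical: measuring Age in months multiplies every x value by 12, which divides the least-squares slope by 12 while leaving the predictions unchanged. A quick check in plain Python with toy data:

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

age_years = [20, 24, 28, 32, 40]
satisfaction = [3.1, 3.5, 3.2, 3.9, 4.4]  # toy values

b_years = slope(age_years, satisfaction)
b_months = slope([a * 12 for a in age_years], satisfaction)

# Same relationship, different units: per-month slope is 1/12 the per-year slope.
print(b_years, b_months * 12)
```

So a "smaller" slope in months describes exactly the same effect, which is worth remembering when comparing coefficients across studies that used different units.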
A: No. It's true they might be outliers, and in a full analysis I would run some influence statistics to see how much they're changing the results. That could lead me to investigate whether they were actually errors. If they weren't, I wouldn't remove them. I might do something like a quantile regression instead, but unless they're really problematic, I leave outliers in.
A: Yes, absolutely. Before you make any true conclusions, you will want to look at p-values. I was just describing what the relationships look like.
I also wouldn’t use “influenced” in this study. For all we know, it’s not really about Age. Maybe it’s really about some other variable that happens to be correlated with Age, like previous experience with online learning.
A: Yes. Because they’re using different reference groups, we have different hypothesis tests and therefore different p-values.
Females.
In this example, yes, it was easy to choose the reference group because there was a control. If there isn’t, there may or may not be a clear best reference group. See this article: Strategies for Choosing the Reference Category in Dummy Coding
A: It doesn't really matter whether gender is a control or a key predictor. In both cases, it's an observed predictor variable, meaning you didn't randomly assign people to these groups; you just observed which group they were already in.
So whether this observed variable is central to your research question or just a control variable, you’re still going to interpret its coefficient as the difference in means for males vs. females. If it’s central, you’ll probably discuss this difference a lot in your discussion and if it’s a control, you won’t.
I assume you mean the p-value. It is testing the null hypothesis that the intercept = 0. As you see on Slide 17, the intercept is the mean satisfaction for females in the control group at the mean age.
That’s not really a hypothesis test we’re usually interested in, whether one specific group’s mean = 0. But it could be in some studies.
Ah, no. This is one of those confusing naming things in statistics. I just ran one general linear model. Every linear model, whether we call it regression or ANOVA, will give you an ANOVA table. It's the table of Sums of Squares, Mean Squares, and F tests.
Some regression procedures don't give much info in that table: they just give you these statistics for the model as a whole and the error term. But just because they only print out a few for you doesn't mean that each predictor in the model doesn't have its own SS, MS, and F.
So the GLM procedure prints out the full ANOVA table by default, then optionally (I had to ask for it) prints out the regression coefficients table.
You can get this regression coefficients table if you’re running an ANOVA model as well. We just usually don’t because it’s not very helpful.
Yes, given the same sample size, standard deviation, and alpha level. See: 5 Ways to Increase Power in a Study.
No. It's true that the linear group terms were small, but they weren't exactly 0. If you take them out, you force them to equal 0. That would force each of those four regression lines to pass through the same point right at the mean age. Sure, they crossed close to that age, but not exactly at it. We get a more accurate regression line if we let those terms be estimated from the data rather than forcing them to equal 0.
A: It’s because of the interactions. Because Age is involved in two interactions (Age*Gender and Age*Group) the slope of Age is different for each Gender and each Group. If you’re comparing all other lines to the purple line on Slide 33, you’ll get a different slope difference than if you compare all lines to the green line.
Yes, the graph and the equations are slightly different. The intercepts and slopes in the model equations are just for females. The graph is combining males and females together. So they’re in the same order, but they’re not exactly the same.
The post Your Questions Answered from the Interpreting Regression Coefficients Webinar appeared first on The Analysis Factor.