The post Recoding a Variable from a Survey Question to Use in a Statistical Model appeared first on The Analysis Factor.

Survey questions are often structured without regard for ease of use within a statistical model.

Take for example a survey done by the Centers for Disease Control and Prevention (CDC) regarding births in the U.S. One of the variables in the data set is “interval since last pregnancy.” Here is a histogram of the results.

Reading the survey’s user guide, it turns out the variable contains both nominal and continuous data. The continuous data has values from 4 to 300 months. The nominal data is “no previous pregnancy,” coded as 888, and “plural delivery,” coded as 3.

It’s not possible to use a variable in a statistical analysis if it is a combination of nominal and continuous data. This data could be an important predictor in an explanatory analysis on birth weight. How can we use it?

One strategy is to create a categorical predictor.

Two categories are easy to determine. One category would represent first pregnancy (no previous pregnancy) and the second would represent multiple births (plural delivery).

How should the balance of the data be categorized? One strategy is to aggregate data based on theory.

Research conducted by Dr. Agustin Conde-Agudelo and colleagues found that the risk of low birth weight increased by 3.3% for each month the interval fell short of 18 months. For each month between pregnancies beyond five years (60 months), the risk of low birth weight increased by 0.6% to 0.9%.

Using the information from this study there is a theoretical reason to split the data into five categories.

- Multiple births
- First pregnancy
- Less than 18 months since last pregnancy
- Between 18 and 60 months since last pregnancy
- Greater than 60 months since last pregnancy
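If you're working in Python, the recode is a few lines. This is a sketch only: the function and label names are ours, and the survey's special codes are checked before the month ranges so they aren't misread as small intervals (whether 18 itself belongs in the middle bin is a judgment call).

```python
def recode_interval(value):
    """Recode the CDC 'interval since last pregnancy' variable into the
    five categories above. Special codes come first so they are not
    mistaken for month counts: 3 = plural delivery, 888 = no previous
    pregnancy. Remaining values are months since the last pregnancy."""
    if value == 3:
        return "multiple birth"
    if value == 888:
        return "first pregnancy"
    if value < 18:
        return "< 18 months"
    if value <= 60:
        return "18-60 months"
    return "> 60 months"

# A few raw survey values and their recodes
print([recode_interval(v) for v in [888, 3, 10, 24, 120]])
```

The resulting variable is purely categorical and can enter the regression as a factor.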

Regressing birth weight in pounds on the new categorical variable using a 3,000-observation random subsample of the data produced the following results.

Running a pairwise comparison of adjacent categories produces results that concur with the research done by Dr. Conde-Agudelo.

The bottom line: when a variable is a combination of continuous and nominal data, use theory to determine the aggregation of the data when creating a new categorical predictor.

The post How to Decide Between Multinomial and Ordinal Logistic Regression Models appeared first on The Analysis Factor.

A great tool to have in your statistical tool belt is logistic regression.

It comes in many varieties and many of us are familiar with the variety for binary outcomes.

But multinomial and ordinal varieties of logistic regression are also incredibly useful and worth knowing.

They can be tricky to decide between in practice, however. In some — but not all — situations you could use either.

So let’s look at how they differ, when you might want to use one or the other, and how to decide.

Both multinomial and ordinal models are used for categorical outcomes with more than two categories.

The simplest decision criterion is whether that outcome is nominal (i.e., no ordering to the categories) or ordinal (i.e., the categories have an order).

It *should* be that simple.

Here’s why it isn’t:

**1.** While there is only one logistic regression model appropriate for nominal outcomes, there are quite a few for ordinal outcomes.

These models account for the ordering of the outcome categories in different ways. Most software, however, offers you only one model for nominal and one for ordinal outcomes.

**2.** The most common of these models for ordinal outcomes is the proportional odds model. It has a strong assumption with two names — the proportional odds assumption or parallel lines assumption.

It essentially means that the predictors have the same effect on the odds of moving to a higher-order category everywhere along the scale.

The problem?

This assumption is rarely met in real data, yet is a requirement for the only ordinal model available in most software.

**3.** If you have a nominal outcome variable, it never makes sense to choose an ordinal model. Your results would be gibberish and you’d be violating assumptions all over the place.

(That makes one choice simple!)

In contrast, you *can* run a nominal model for an ordinal variable and not violate any assumptions. But you may not be answering the research question you’re really interested in, if that question incorporates the ordering.

**4.** The names. Most software refers to a model for an *ordinal variable* as an *ordinal logistic regression* (which makes sense, but isn’t specific enough).

In contrast, they will call a model for a *nominal variable* a *multinomial logistic regression* (wait – what?).

It gets better.

Some software procedures require you to specify the distribution for the outcome and the link function, not the type of model you want to run for that outcome. Both ordinal and nominal variables, as it turns out, have multinomial distributions.

What differentiates them is the version of the logit link function they use. So if you don’t specify that part correctly, you may not realize you’re actually running a model that assumes an ordinal outcome on a nominal outcome. Not good.

A link function with a name like “mlogit,” “multinomial logit,” or “generalized logit” assumes no ordering.

A link function with a name like “clogit” or “cumulative logit” assumes ordering, so only use this if your outcome really is ordinal.
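To see what those two link functions actually do, here is a small pure-Python sketch with made-up coefficients (no fitted model is implied; the function names are ours). The ordinal version shares one slope across all cutpoints, which is exactly the proportional odds assumption; the multinomial version gives every non-reference category its own intercept and slope:

```python
import math

def ordinal_probs(x, cutpoints, beta):
    """Cumulative ('clogit') link: P(Y <= k) = logistic(cutpoint_k - beta*x).
    One slope is shared across all cutpoints: the proportional odds assumption."""
    logistic = lambda z: 1.0 / (1.0 + math.exp(-z))
    cum = [logistic(c - beta * x) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

def multinomial_probs(x, intercepts, betas):
    """Generalized ('mlogit') link: each non-reference category gets its own
    intercept and slope relative to a reference category; no ordering assumed."""
    scores = [0.0] + [a + b * x for a, b in zip(intercepts, betas)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients for a 3-category outcome
p_ord = ordinal_probs(x=1.0, cutpoints=[-0.5, 1.0], beta=0.8)
p_mnl = multinomial_probs(x=1.0, intercepts=[0.2, -0.3], betas=[0.5, 1.1])
print(round(sum(p_ord), 6), round(sum(p_mnl), 6))  # both sets of probabilities sum to 1
```

Either way you get valid probabilities; what differs is how many parameters describe the predictor's effect, and whether the ordering of categories is baked in.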

Confusing, right?

If you have a nominal outcome, make sure you’re not running an ordinal model.

If you have an ordinal outcome and the proportional odds assumption is met, you can run the cumulative logit version of ordinal logistic regression.

If you have an ordinal outcome and your proportional odds assumption isn’t met, you can:

1. Run a different ordinal model

2. Run a nominal model as long as it still answers your research question

The post March 2019 Member Webinar: Determining Levels of Measurement: What Lies Beneath the Surface appeared first on The Analysis Factor.

You probably learned about the four levels of measurement in your very first statistics class: nominal, ordinal, interval, and ratio.

Knowing the level of measurement of a variable is crucial when working out how to analyze the variable. Failing to correctly match the statistical method to a variable’s level of measurement leads either to nonsense or to misleading results.

But the simple framework of the four levels is too simplistic in most real-world data analysis situations.

**In this webinar we will examine:**

- Different ways of describing and thinking about level of measurement
- The statistical and other practical considerations in determining a variable’s level
- When and how we can decide to change a variable’s level

**Note: This training webinar is an exclusive benefit to members of the Statistically Speaking Membership Program.**

**Wednesday, March 27, 2019**

**5pm – 6:30pm (US EDT)** (In a different time zone?)

Dr. Christos Giannoulis has over 14 years of experience with advanced exploratory, predictive, and prescriptive analyses, using a combination of graphical-user-interface software (e.g., SPSS Modeler, JMP, Statistica, and Stata) as well as programming languages (e.g., R and Python) on desktop and cloud environments. He has authored and co-authored 5 reports, published more than 30 papers using statistical analysis, and presented at conferences in North America and Europe.

He strives to advance statistical analyses from correlation to causation analyses using Frequentist and Bayesian methods.

Christos’s background in education, statistics, assessment, and evaluation—along with his practical experience in applied data science—has given him an excellent foundation as a statistical consultant and mentor. His goal is to support the efficiency and effectiveness of your projects through exceptional statistical analysis and usable results.

It’s never too early (or late) to set yourself up for successful analysis with support and training from expert statisticians.

Just head over and sign up for Statistically Speaking.

You’ll get exclusive access to this training webinar, plus live Q&A sessions, a private stats forum, 70+ other stats trainings, and more.

The post Eight Ways to Detect Multicollinearity appeared first on The Analysis Factor.

Multicollinearity can affect any regression model with more than one predictor. It occurs when two or more predictor variables overlap so much in what they measure that their effects are indistinguishable.

When the model tries to estimate their unique effects, it goes wonky (yes, that’s a technical term).

So for example, you may be interested in understanding the separate effects of altitude and temperature on the growth of a certain species of mountain tree.

Altitude and temperature are distinct concepts, but the mean temperature is so correlated with the altitude at which the tree is growing that there is no way to separate out their effects.

But it’s not always easy to tell that the wonkiness in your model comes from multicollinearity.

One popular detection method is based on the bivariate correlation between two predictor variables. If it’s above .8 (or .7 or .9 or some other high number), the rule of thumb says you have multicollinearity.

And it is certainly true that a high correlation between two predictors is an indicator of multicollinearity. But there are two problems with treating this rule of thumb as a rule.

First, how high that correlation has to be before you’re finding inflated variances depends on the sample size. There is no single cutoff number that works in every case.

Second, it’s possible that while no two variables are highly correlated, three or more together are multicollinear. Weird idea, I know. But it happens.

You’ll completely miss the multicollinearity in that situation if you’re just looking at bivariate correlations.

So like a lot of things in statistics, when you’re checking for multicollinearity, you have to check multiple indicators and look for patterns among them. Sometimes just one is all it takes and sometimes you need to see patterns among a few.

Here are seven more indicators of multicollinearity.

**1. Very high standard errors for regression coefficients**

When standard errors are orders of magnitude higher than their coefficients, that’s an indicator.

**2. The overall model is significant, but none of the coefficients are**

Remember that a p-value for a coefficient tests whether the unique effect of that predictor on Y is zero. If all predictors overlap in what they measure, there is little unique effect, even if the predictors as a group have an effect on Y.

**3. Large changes in coefficients when adding predictors**

If the predictors are completely independent of each other, their coefficients won’t change at all when you add or remove one. But the more they overlap, the more drastically their coefficients will change.

**4. Coefficients have signs opposite what you’d expect from theory**

Be careful here as you don’t want to disregard an unexpected finding as problematic. Not all effects opposite theory indicate a problem with the model. That said, it could be multicollinearity and warrants taking a second look at other indicators.

**5. Coefficients on different samples are wildly different**

If you have a large enough sample, split the sample in half and run the model separately on each half. Wildly different coefficients in the two models could be a sign of multicollinearity.

**6. High Variance Inflation Factor (VIF) and Low Tolerance**

These two useful statistics are reciprocals of each other. So either a high VIF or a low tolerance is indicative of multicollinearity. VIF is a direct measure of how much the variance of the coefficient (i.e., its standard error) is being inflated due to multicollinearity.
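To make that concrete, VIF can be computed straight from its definition by regressing each predictor on all the others. The sketch below uses simulated data (variable names are ours) in which no single pair of predictors need be extremely correlated, yet all three VIFs blow up because the third is nearly a linear combination of the first two:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the remaining columns
    (plus an intercept) and compute 1 / (1 - R^2). Tolerance is 1/VIF."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Simulated predictors: x3 is nearly a linear combination of x1 and x2,
# so the VIFs are huge even without any single extreme pairwise correlation
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])  # all three far above the common cutoff of 10
```

This is the same quantity most regression packages report; computing it by hand just makes clear that it is about each predictor's overlap with *all* the others, not any one bivariate correlation.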

**7. High Condition Indices**

Condition indices are a bit strange. The basic idea is to run a Principal Components Analysis on all predictors. If they have a lot of shared information, the first principal component will account for far more variance than the last. Their ratio, the condition index, will be high if multicollinearity is present.
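A sketch of the computation on simulated data (variable names are ours; the cutoff of 30 used below is a common rule of thumb, not a hard rule):

```python
import numpy as np

def condition_index(X):
    """Condition index of a predictor matrix: the ratio of the largest to the
    smallest singular value of the column-standardized predictors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    s = np.linalg.svd(Z, compute_uv=False)
    return s[0] / s[-1]

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
independent = np.column_stack([a, b])                              # unrelated predictors
collinear = np.column_stack([a, a + rng.normal(scale=0.01, size=100)])  # near-duplicates
print(condition_index(independent) < 30 < condition_index(collinear))  # True
```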

The post A Strategy for Converting a Continuous to a Categorical Predictor appeared first on The Analysis Factor.

At times it is necessary to convert a continuous predictor into a categorical predictor. For example, income per household is shown below.

This data is censored: all household income above $155,000 is recorded as $155,000. A further explanation of censored and truncated data can be found here. It would be incorrect to use this variable as a continuous predictor due to its censoring.

This does not mean this data cannot be used as a predictor. The data can be converted into a categorical variable. How can we determine the number of categories and the increments of income that are in each category?

The first question to ask: is the analysis theory-driven or exploratory? If it is theory-driven, the break points should take into consideration findings from the literature.

If the analysis is exploratory, the following is one method that can be used for determining break points.

The number and the increments of the categories are determined by the dependent variable. We want groups that are statistically different from each other in relationship to the dependent variable. We can begin this process by graphing.

Below is a scatter plot of household income on the X axis and the dependent variable, Occupational Prestige Score, on the Y axis. Included is a LOWESS curve.

LOWESS is an acronym for “locally weighted scatterplot smoothing.” A LOWESS curve is produced by a series of locally weighted polynomial regressions, giving a smoothed moving average through the data.

Notice that it is a curve, not a straight line, which allows us to look for parts of the curve that are consistent across a specific range. For example, the curve from $100k to $155k is flat. The curve from $0 to $15k is similarly consistent.

We can then graph the different increments to see how consistent the data is across its range. This time we are going to use the linear best fit line rather than the LOWESS curve. If the best fit line is flat we can deduce that the data is consistent across the range.

Here is the graph for family income greater than $100k.

It might be possible to extend the range from $80k to $155k.

Here is an example of the range from $25k to $35k.

After determining our “best guess” for the various categories, we can run pairwise comparison tests of adjacent categories to determine whether the categories are significantly different. If two adjacent categories are not statistically different, it is advisable to combine them.
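That pairwise check can be sketched in a few lines of Python. The prestige scores below are made up for illustration; with |t| well under 2, the two adjacent income categories are not distinguishable and could be combined:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two groups."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical prestige scores for two adjacent income categories
low = [40.0 + (i % 5 - 2) for i in range(100)]   # e.g. the $25k-$35k group, mean 40.0
mid = [40.2 + (i % 5 - 2) for i in range(100)]   # e.g. the $35k-$45k group, mean 40.2
t = welch_t(low, mid)
print(abs(t) < 2)  # True: no evidence the adjacent categories differ, so combine them
```

In practice you would use your software's pairwise comparison procedure, which also handles the p-values and any multiple-comparison adjustment; the point here is only what is being compared.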

The post A Useful Graph for Interpreting Interactions between Continuous Variables appeared first on The Analysis Factor.

What’s a good method for interpreting the results of a model with two continuous predictors and their interaction?

Let’s start by looking at a model without an interaction. In the model below, we regress a subject’s hip size on their weight and height. Height and weight are centered at their means.

Imagine you have a group of people with the same height, but varying weights. Logically we would expect people’s hip sizes to be larger if they are heavier. In other words, “keeping height constant, as weight increases, hip size increases”.

Likewise, imagine a group of people with the same weight, but varying heights. Logically we would expect hip sizes to be smaller if they are taller (remember, they’re all the same weight. For a taller person to be the same weight, they are going to be thinner). This interpretation can be, “keeping weight constant, as height increases, hip size decreases.”

Indeed, we see these logical results. We see from the output that keeping weight constant, for every one unit increase in height, hip size decreases by 0.48 inches.

In addition, keeping height constant, increasing weight by one pound increases hip size by 0.13 inches.

But this model requires that the effect of increasing weight on hip size is constant at each height. Maybe it’s not. Is the change in hip size per one pound change in weight the same for people that are 60 inches tall as it is for people that are 72 inches tall? We need to add an interaction to determine that.

Indeed, we have a small but significant interaction. This tells us that the effect of an additional pound of weight on hip size is not the same for each height.

As height increases each inch, the effect of an additional pound of weight on hip size decreases by .01 inches. But what does that mean?
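Arithmetic makes this concrete. With the coefficients reported above (0.13 for weight, -0.01 for the interaction), the per-pound effect on hip size is itself a linear function of centered height. The 66-inch mean height below is an assumption for illustration:

```python
# Fitted model described above (variables mean-centered):
# hip = b0 + 0.13*weight_c - 0.48*height_c - 0.01*(weight_c * height_c)
def weight_effect(height_centered, b_weight=0.13, b_interaction=-0.01):
    """Change in predicted hip size (inches) per extra pound, at a given
    mean-centered height. The interaction makes this effect height-specific."""
    return b_weight + b_interaction * height_centered

# Assuming a mean height of 66": a 60" person is -6 from the mean, a 72" person is +6
print(round(weight_effect(-6), 2))  # 0.19 inches per pound for a 60" person
print(round(weight_effect(6), 2))   # 0.07 inches per pound for a 72" person
```

So an extra pound moves a short person's predicted hip size nearly three times as much as a tall person's, which is exactly what the contour graph below shows visually.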

This is a situation where a graph can be worth a thousand words. Let’s graph these results using a contour graph.

The graph below shows the predicted hip size for the 403 subjects in the data set. Ranges of predicted hip size are color coded to help interpret the relationship of weight and height with hip size. The color-coded scale is shown on the right side of the graph.

Each color represents a 6” range of predicted values. Blue represents the smallest hip sizes and red represents the largest.

We can see from the graph that the rate of change from one range of hip size to the next is lower for taller people as compared to shorter people.

For example, people who are 60” tall have quickly increasing hip sizes as weight increases a pound. Start at the left side of the graph along the solid line and move to the right as weight increases. Hip sizes quickly go from the blue range (30-36”) to green, and on to red (60” plus). So a one-pound increase in weight has a large impact on the hip size of short people.

That’s not true for tall people. For example, people who are 72” tall (six feet), move more slowly from one hip size to the next. Start at the left side of the graph along the dashed grey line and move to the right as weight increases. The color bands are wider because a one-pound increase in weight doesn’t affect their hip size as much. In fact, people who are taller than 70 inches have a maximum hip size of 54 inches.

Explaining interactions between continuous variables to those not well versed on the subject or understanding them yourself can be a cause for sweaty palms and increased heart rate. But, if you use this visual method for interpretation of your results, your only known side effects will be “a sense of calm and tranquility.”

The post February 2019 Member Webinar: What’s the Best Statistical Package for You? appeared first on The Analysis Factor.

Choosing statistical software is part of The Fundamentals of Statistical Skill, and is a necessary step toward learning a second software package (something we recommend to anyone progressing from Stage 2 to Stage 3 and beyond).

You have many choices for software to analyze your data: R, SAS, SPSS, and Stata, among others. They are all quite good, but each has its own unique strengths and weaknesses.

In this webinar, you’ll see the areas where these four packages shine and where they fall short, decide which ones are best for your particular needs, and review their historical development.

This training also covers the statistical software JMP, MPlus, AMOS, S, S-PLUS, and PASW, and mentions the non-statistical tools Python, git/GitHub, Jupyter, Linux, and SQL.

You’ll also learn why you should keep your data very, very, far away from Microsoft Excel, and what non-statistical programming skills you should develop.

This webinar will teach you how to choose a statistical software package that is best suited to your research.

**Note: This training webinar is an exclusive benefit to members of the Statistically Speaking Membership Program.**

**Wednesday, February 13, 2019**

**3pm – 4:30pm (US EST)** (In a different time zone?)

Steve Simon works as an independent statistical consultant and as a part-time faculty member in the Department of Biomedical and Health Informatics at the University of Missouri-Kansas City. He has previously worked at Children’s Mercy Hospital, the National Institute for Occupational Safety and Health, and Bowling Green State University.

Steve has over 90 peer-reviewed publications, four of which have won major awards. He has written one book, Statistical Evidence in Medical Trials, and is the author of a major website about Statistics, Research Design, and Evidence Based Medicine, www.pmean.com. One of his current areas of interest is using Bayesian models to forecast patient accrual in clinical trials. Steve received a Ph.D. in Statistics from the University of Iowa in 1982.

It’s never too early (or late) to set yourself up for successful analysis with support and training from expert statisticians.

Just head over and sign up for Statistically Speaking.

You’ll get exclusive access to this training webinar, plus live Q&A sessions, a private stats forum, 70+ other stats trainings, and more.

The post Descriptives Before Model Building appeared first on The Analysis Factor.

One approach to model building is to use all predictors that make theoretical sense in the first model. For example, a first model for determining birth weight could include mother’s age, education, marital status, race, weight gain during pregnancy and gestation period.

The main effects of this model show that a mother’s education level and marital status are insignificant.

Dropping marital status and running a new model we find that the mother’s education is now significant. We also see an improvement in the significance of the mother’s age from p=0.033 to p=0.004.

What if mother’s education level was dropped instead of marital status?

Marital status is now significant. Mother’s age has a lower p-value as well when compared to the model using both education and marital status. Why is there a conflict between the three models? Why the improvement with the statistical significance of mother’s age?

Examining a cross tabulation shows us the two predictors are not duplicates. If they were duplicates we would have zero counts in the bottom left and upper right cells. If one variable was the reverse code of the other variable the cells in the upper left and bottom right would have zero counts.
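The duplicate/reverse-code check is easy to automate. A minimal sketch with made-up 0/1 codings (variable names are ours): nonzero counts in both off-diagonal cells rule out exact duplication and exact reverse coding:

```python
from collections import Counter

def crosstab(a, b):
    """Cell counts for two categorical variables, keyed by (a, b) pairs."""
    return Counter(zip(a, b))

# Hypothetical codings: 1 = married / more than high school, 0 otherwise
married = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
educ    = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
table = crosstab(married, educ)
# Nonzero counts in both off-diagonal cells show the predictors are
# neither duplicates nor reverse-coded copies of each other
print(table[(1, 0)] > 0 and table[(0, 1)] > 0)  # True
```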

Perhaps the two predictors are providing the same information about birth weight. The tables below examine the mean birth weight per category for each predictor. The mean birth weights are almost identical for the two predictors.

The mean age for the two predictors is almost identical as well.

The summary statistics for this data show that the mean birth weight for babies born to mothers with more than a high school education is the same as the mean birth weight for married mothers. The same is true when comparing unmarried mothers to mothers with no post-high school education.

Does theory suggest these relationships exist? If they run counter to theory the researcher should explain the conflicts found within the data.

Running a series of descriptive statistics before running models can help identify issues such as these shown here. Otherwise, we might reach inaccurate conclusions due to an unusual sample.

The post The Secret to Importing Excel Spreadsheets into SAS appeared first on The Analysis Factor.

My poor colleague was pulling her hair out in frustration today.

You know when you’re trying to do something quickly, and it’s supposed to be easy, only it’s not? And you try every solution you can think of and it *still* doesn’t work?

And even in the great age of the Internet, which is supposed to know all the things you don’t, you *still* can’t find the answer anywhere?

Cue hair-pulling.

Here’s what happened: She was trying to import an Excel spreadsheet into SAS, and it didn’t work.

Instead she got:

Look familiar? If you’re like my colleague, you’re wondering, what the *blank* is going on here?

Well, if you have SAS 64-bit and Office 32-bit (or even Office 64-bit), you’ll find that the 64-bit version of SAS does not have the interface to communicate with Office and therefore cannot import spreadsheets.

Yep, you read that right: it can’t do it through the wizard and it can’t do it through Proc Import.

So here’s what you have to do.

Save the Excel spreadsheet as a .csv file, and *then* import it. (You can only have 1 worksheet in the .csv file, but other than that, you shouldn’t see any differences from an Excel spreadsheet.)

It should look like this:

    PROC IMPORT OUT=WORK.QA   /* your desired data set name */
        DATAFILE="C:\file to import"
        DBMS=CSV REPLACE;
        GUESSINGROWS=1000;
        GETNAMES=YES;
        DATAROW=2;
    RUN;

BTW, the guessingrows option is very useful: it tells SAS to read through the specified number of lines (the limit is 2,147,483,647 and the default is 20) to determine the length of variables.

Without this option, if the first 20 values are 2 characters and the 21st is 3 characters, your values will be truncated. Specifying the guessingrows (you can just use guessingrows=Max) ensures that SAS looks at all the values in your dataset (i.e., all rows) before it sets the length of variables.

So spread the word. And save your hair.

Life will be easier if everyone starts using .csv files instead of Excel files.

The post Using Predicted Means to Understand Our Models appeared first on The Analysis Factor.

The expression “can’t see the forest for the trees” often comes to mind when reviewing a statistical analysis. We get so involved in reporting “statistically significant” and p-values that we fail to explore the grand picture of our results.

It’s understandable that this can happen. We have a hypothesis to test. We go through a multi-step process to create the best model fit possible. Too often the next and last step is to report which predictors are statistically significant and include their effect sizes.

I suggest one additional step: take the time to absorb and think about the information you can extract from your model with predicted means.

I use the term “information” because we will not focus on p-values and significance levels. Nor are we summarizing a predictor’s effect simply with a coefficient, but digging deeper into what that coefficient tells us.

In the model below, we are determining which predictors are associated with the number of times someone visits a doctor over a two-week period.

- Illness — number of days ill
- Actdays — number of days not active
- Prescrib — number of prescriptions used
- Medical_advice — number of times sought medical advice in the past two weeks
- The type of medical insurance each person has

Note that virtually every coefficient is significant. We will report the coefficients, p-values and confidence intervals in the final write up. But the coefficients table doesn’t communicate well what the real effects are. Let’s investigate a bit with some predicted values.

Do people on Medicaid with two and four prescriptions have the same predicted number of trips to the doctor’s office? How does that compare to someone on private insurance or Medicare? Do these comparisons differ for people who seek medical advice an average (0.52) or a high (4) number of times?
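Predicted means fall straight out of the fitted equation. The coefficients below are invented for illustration (they are not the output shown in this post); for a Poisson-type count model, the predicted mean is the exponentiated linear predictor:

```python
import math

# Hypothetical coefficients for a count model of doctor visits;
# Medicare is treated as the reference insurance category
coefs = {"intercept": -1.6, "prescrib": 0.25, "medical_advice": 0.45,
         "medicaid": -0.15, "private": 0.20}

def predicted_visits(prescrib, medical_advice, insurance):
    """Predicted number of doctor visits: exp of the linear predictor."""
    eta = (coefs["intercept"] + coefs["prescrib"] * prescrib
           + coefs["medical_advice"] * medical_advice
           + coefs.get(insurance, 0.0))  # 0 for the reference category
    return math.exp(eta)

# A Medicaid patient at the average (0.52) level of seeking medical advice,
# comparing two vs. four prescriptions
two = predicted_visits(2, 0.52, "medicaid")
four = predicted_visits(4, 0.52, "medicaid")
print(round(four / two, 2))  # exp(0.25 * 2) = 1.65: predicted visits rise ~65%
```

Tables of predicted means like the ones discussed below are just this calculation repeated over a grid of predictor values.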

We will start with the left half of the table, for people who seek medical advice an average number of times. The predicted number of visits for people with Medicaid and private insurance about doubles when the number of prescriptions they are taking increases from two to four.

People on Medicare have no increase. However, the predicted total number of visits is still less than 0.5 in all situations.

What happens if we change the number of times someone sought medical advice from the mean of 0.52 to 4 times? Let’s look at the right side.

We find that the predicted number of doctor visits increases substantially overall. The minimum predicted number of doctor visits is now 2.13. Of people taking two prescriptions, those on Medicaid have the fewest expected visits while those on private insurance have the most.

Interestingly, when the number of prescriptions increases from two to four, the predicted number of doctor visits for those on Medicaid more than doubles, while that for those with Medicare increases only slightly.

The predicted number of doctor visits by people with Medicaid now surpasses those with Medicare. People with full private insurance, who most likely have easier access to see the doctor, remain as the group with the greatest expected number of doctor visits.

Now we have interesting information to give our audience beyond confusing coefficients and p-values. We have brought our numbers and data to life so non-statisticians can learn from our work.
