The post 3 ways to obtain the Hessian at the MLE solution for a regression model appeared first on The DO Loop.

- For some SAS regression procedures, you can store the model and use the SHOW HESSIAN statement in PROC PLM to display the Hessian.
- Some regression procedures support the COVB option ("covariance of the betas") on the MODEL statement. You can compute the Hessian as the inverse of that covariance matrix.
- The NLMIXED procedure can solve general regression problems by using MLE. You can use the HESS option on the PROC NLMIXED statement to display the Hessian.

The next section discusses the relationship between the Hessian and the estimate of the covariance of the regression parameters. Briefly, they are inverses of each other. You can download the complete SAS program for this blog post.

The Hessian at the optimal MLE value is related to the covariance of the parameters. The literature that discusses this fact can be confusing because the objective function in MLE can be defined in two ways. You can maximize the log-likelihood function, or you can minimize the NEGATIVE log-likelihood.

In statistics, the inverse of the Hessian matrix at the optimum is related to the covariance matrix of the parameters. A full-rank covariance matrix is always positive definite. If you maximize the log-likelihood, then the Hessian and its inverse are both negative definite. Therefore, statistical software often minimizes the negative log-likelihood function instead. Then the Hessian at the minimum is positive definite, and so is its inverse, which is an estimate of the covariance matrix of the parameters. Unfortunately, not every reference uses this convention.
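You can verify the inverse relationship numerically on a model that has a closed-form MLE. The following Python sketch (an illustration only, not SAS) uses the Bernoulli likelihood, for which the inverse of the Hessian of the negative log-likelihood at the MLE equals the familiar variance estimate of a sample proportion:

```python
import numpy as np

# Negative log-likelihood of a Bernoulli sample: k successes in n trials
def negloglik(p, k, n):
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

k, n = 30, 100
p_hat = k / n                 # MLE of the success probability

# Numerical second derivative of the NEGATIVE log-likelihood at the MLE
h = 1e-4
hess = (negloglik(p_hat + h, k, n) - 2*negloglik(p_hat, k, n)
        + negloglik(p_hat - h, k, n)) / h**2

# At the minimum, the Hessian is positive, and its inverse is the
# familiar variance estimate p*(1-p)/n for a sample proportion
print(hess)        # close to n / (p_hat*(1-p_hat)) = 476.19
print(1 / hess)    # close to p_hat*(1-p_hat) / n   = 0.0021
```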

For details about the MLE process and how the Hessian at the solution relates to the covariance of the parameters, see the PROC GENMOD documentation. For a more theoretical treatment and some MLE examples, see the Iowa State course notes for Statistics 580.

I previously discussed how to use the STORE statement to save a generalized linear model to an item store, and how to use PROC PLM to display information about the model. Some procedures, such as PROC LOGISTIC, save the Hessian in the item store. For these procedures, you can use the SHOW HESSIAN statement to display the Hessian. The following call to PROC PLM continues the PROC LOGISTIC example from the previous post. (Download the example.) The call displays the Hessian matrix at the optimal value of the log-likelihood. It also saves the "covariance of the betas" matrix in a SAS data set, which is used in the next section.

```sas
/* PROC PLM provides the Hessian matrix evaluated at the optimal MLE */
proc plm restore=PainModel;
   show Hessian CovB;
   ods output Cov=CovB;
run;
```

Not every SAS procedure stores the Hessian matrix when you use the STORE statement. If you request a statistic from PROC PLM that is not available, you will get a message such as the following:
```
NOTE: The item store WORK.MYMODEL does not contain a
      Hessian matrix. The option in the SHOW statement is
      ignored.
```

Many SAS regression procedures support the COVB option on the MODEL statement. As indicated in the previous section, you can use the SHOW COVB statement in PROC PLM to display the covariance matrix. A full-rank covariance matrix is positive definite, so the inverse matrix will also be positive definite. Therefore, the inverse matrix represents the Hessian at the minimum of the NEGATIVE log-likelihood function. The following SAS/IML program reads in the covariance matrix and uses the INV function to compute the Hessian matrix for the logistic regression model:

```sas
proc iml;
use CovB nobs p;                       /* open data; read number of obs (p) */
cols = "Col1":("Col"+strip(char(p)));  /* variable names are Col1 - Colp */
read all var cols into Cov;            /* read COVB matrix */
read all var "Parameter";              /* read names of parameters */
close;
Hessian = inv(Cov);        /* Hessian and covariance matrices are inverses */
print Hessian[r=Parameter c=Cols F=BestD8.4];
quit;
```

You can see that the inverse of the COVB matrix is the same matrix that was displayed by using SHOW HESSIAN in PROC PLM. Be aware that the parameter estimates and the covariance matrix depend on the parameterization of the classification variables. The LOGISTIC procedure uses the EFFECT parameterization by default. However, if you instead use the REFERENCE parameterization, you will get different results. If you use a singular parameterization, such as the GLM parameterization, some rows and columns of the covariance matrix will contain missing values.
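The same inversion is easy to illustrate outside of SAS. The following Python sketch (the matrix values are made up for illustration) inverts a positive definite "covariance of the betas" matrix and confirms that the resulting Hessian is also positive definite:

```python
import numpy as np

# A made-up positive definite "covariance of the betas" matrix (2 parameters)
CovB = np.array([[0.50, 0.10],
                 [0.10, 0.25]])

# Hessian of the negative log-likelihood at the MLE
Hessian = np.linalg.inv(CovB)

# Both matrices are symmetric positive definite: all eigenvalues are positive
print(np.linalg.eigvalsh(CovB))      # all > 0
print(np.linalg.eigvalsh(Hessian))   # all > 0

# Inverting again recovers the covariance matrix
print(np.allclose(np.linalg.inv(Hessian), CovB))   # True
```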

SAS provides procedures for solving common generalized linear regression models, but you might need to use MLE to solve a nonlinear regression model. You can use the NLMIXED procedure to define and solve general maximum likelihood problems. The PROC NLMIXED statement supports the HESS and COV options, which display the Hessian and covariance of the parameters, respectively.

To illustrate how you can get the covariance and Hessian matrices from PROC NLMIXED, let's define a logistic model and see if we get results that are similar to PROC LOGISTIC. We shouldn't expect to get *exactly* the same values unless we use exactly the same optimization method, convergence options, and initial guesses for the parameters. But if the model fits the data well, we expect that the NLMIXED solution will be close to the LOGISTIC solution.

The NLMIXED procedure does not support a CLASS statement, but you can use another SAS procedure to generate the design matrix for the desired parameterization. The following program uses the OUTDESIGN= option in PROC LOGISTIC to generate the design matrix. Because PROC NLMIXED requires a numeric response variable, a short DATA step encodes the response into a binary numeric variable. The call to PROC NLMIXED then defines the logistic regression model in terms of a binary log-likelihood function:

```sas
/* output design matrix and EFFECT parameterization */
proc logistic data=Neuralgia outdesign=Design outdesignonly;
   class Pain Sex Treatment;
   model Pain(Event='Yes') = Sex Age Duration Treatment;  /* OUTDESIGNONLY: create the design matrix without fitting */
run;

/* PROC NLMIXED requires a numeric response */
data Design;
   set Design;
   PainY = (Pain='Yes');
run;

ods exclude IterHistory;
proc nlmixed data=Design COV HESS;
   parms b0 -18 bSexF bAge bDuration bTreatmentA bTreatmentB 0;
   eta = b0 + bSexF*SexF + bAge*Age + bDuration*Duration +
         bTreatmentA*TreatmentA + bTreatmentB*TreatmentB;
   p = logistic(eta);        /* or 1-p to predict the other category */
   model PainY ~ binary(p);
run;
```

Success! The parameter estimates and the Hessian matrix are very close to those that are computed by PROC LOGISTIC. The covariance matrix of the parameters, which requires taking an inverse of the Hessian matrix, is also close, although there are small differences from the LOGISTIC output.
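For a binary logistic model, the Hessian of the negative log-likelihood has the well-known closed form X'WX, where W is the diagonal matrix of the variances p(1-p). The following Python sketch (simulated data and made-up coefficients, for illustration only) verifies the closed form against a finite-difference Hessian:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta = np.array([-0.5, 1.2])                           # hypothetical coefficients

p = 1 / (1 + np.exp(-(X @ beta)))                      # predicted probabilities

# Analytic Hessian of the binary negative log-likelihood: X' diag(p(1-p)) X
H = X.T @ ((p * (1 - p))[:, None] * X)

# Check against a finite-difference Hessian; for the logistic model,
# the Hessian does not depend on the observed responses y
y = rng.binomial(1, p)
def negloglik(b):
    eta = X @ b
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

h = 1e-4
H_num = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        def f(si, sj):
            b = beta.copy(); b[i] += si * h; b[j] += sj * h
            return negloglik(b)
        H_num[i, j] = (f(1, 1) - f(1, -1) - f(-1, 1) + f(-1, -1)) / (4 * h * h)

print(np.allclose(H, H_num, rtol=1e-3))   # True
```

Statistical software inverts this Hessian to obtain the estimated covariance of the parameters, which is why the NLMIXED and LOGISTIC covariance matrices agree so closely.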

In summary, this article shows three ways to obtain the Hessian matrix at the MLE solution of a regression model. For some SAS procedures, you can store the model and use PROC PLM to obtain the Hessian. For procedures that support the COVB option, you can use PROC IML to invert the covariance matrix. Finally, if you can define the log-likelihood equation, you can use PROC NLMIXED to solve for the regression estimates and output the Hessian at the MLE solution.


The post 4 reasons to use PROC PLM for linear regression models in SAS appeared first on The DO Loop.

PROC PLM enables you to analyze a generalized linear model (or a generalized linear mixed model) long after you quit the SAS/STAT procedure that fits the model. PROC PLM was released with SAS 9.22 in 2010. This article emphasizes four features of PROC PLM:

- You can use the SCORE statement to score the model on new data.
- You can use the EFFECTPLOT statement to visualize the model.
- You can use the ESTIMATE, LSMEANS, SLICE, and TEST statements to estimate parameters and perform hypothesis tests.
- You can use the SHOW statement to display statistical tables such as parameter estimates and fit statistics.

For an introduction to PROC PLM, see "Introducing PROC PLM and Postfitting Analysis for Very General Linear Models" (Tobias and Cai, 2010). The documentation for the PLM procedure includes more information and examples.

To use PROC PLM you must first use the STORE statement in a regression procedure to create an item store that summarizes the model. The following procedures support the STORE statement: GEE, GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOGISTIC, MIXED, ORTHOREG, PHREG, PROBIT, SURVEYLOGISTIC, SURVEYPHREG, and SURVEYREG.

The example in this article uses PROC LOGISTIC to analyze data about pain management in elderly patients who have neuralgia. In the example in the PROC LOGISTIC documentation, the procedure fits the model and performs all the post-fitting analyses and visualization. In the following program, PROC LOGISTIC fits the model and stores it to an item store named `PainModel`. In practice, you might want to store the model to a permanent libref (rather than WORK) so that you can access the model days or weeks later.

```sas
data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68  1 No   B M 74 16 No   P F 67 30 No
P M 66 26 Yes  B F 67 28 No   B F 77 16 No
A F 71 12 No   B F 72 50 No   B F 76  9 Yes
A M 71 17 Yes  A F 63 27 No   A F 69 18 Yes
B F 66 12 No   A M 62 42 No   P F 64  1 Yes
A F 64 17 No   P M 74  4 No   A F 72 25 No
P M 70  1 Yes  B M 66 19 No   B M 59 29 No
A F 64 30 No   A M 70 28 No   A M 69  1 No
B F 78  1 No   P M 83  1 Yes  B F 69 42 No
B M 75 30 Yes  P M 77 29 Yes  P F 79 20 Yes
A M 70 12 No   A F 69 12 No   B F 65 14 No
B M 70  1 No   B M 67 23 No   A M 76 25 Yes
P M 78 12 Yes  B M 77  1 Yes  B F 69 24 No
P M 66  4 Yes  P F 65 29 No   P M 60 26 Yes
A M 78 15 Yes  B M 75 21 Yes  A F 67 11 No
P F 72 27 No   P F 70 13 Yes  A M 75  6 Yes
B F 65  7 No   P F 68 27 Yes  P M 68 11 Yes
P M 67 17 Yes  B M 70 22 No   A M 65 15 No
P F 67  1 Yes  A M 67 10 No   P F 72 11 Yes
A F 74  1 No   B M 80 21 Yes  A F 69  3 No
;

title 'Logistic Model on Neuralgia';
proc logistic data=Neuralgia;
   class Sex Treatment;
   model Pain(Event='Yes') = Sex Age Duration Treatment;
   store PainModel / label='Neuralgia Study';  /* or use mylib.PainModel for permanent storage */
run;
```

The LOGISTIC procedure models the presence of pain based on a patient's medication (Drug A, Drug B, or placebo), gender, age, and duration of pain. After you fit the model and store it, you can use PROC PLM to perform all sorts of additional analyses, as shown in the subsequent sections.

An important application of regression models is to predict the response variable for new data. The following DATA step defines three new patients. The first two are females who are taking Drug B. The third is a male who is taking Drug A:

```sas
/* 1. Use PROC PLM to score future observations */
data NewPatients;
   input Treatment $ Sex $ Age Duration;
   datalines;
B F 63  5
B F 79 16
A M 74 12
;

proc plm restore=PainModel;
   score data=NewPatients out=NewScore predicted LCLM UCLM / ilink;  /* ILINK gives probabilities */
run;

proc print data=NewScore;
run;
```

The output shows the predicted probability of pain for the three patients. The younger woman is predicted to have a low probability (0.01) of pain. The model predicts a moderate probability of pain (0.38) for the older woman. The model predicts a 64% chance that the man will experience pain.

Notice that the PROC PLM statement does not use the original data. In fact, the procedure does not support a DATA= option but instead uses the RESTORE= option to read the item store. The PLM procedure cannot create plots or perform calculations that require the data because the data are not part of the item store.

I've previously written about how to use the EFFECTPLOT statement to visualize regression models. The EFFECTPLOT statement has many options. However, because PROC PLM does not have access to the original data, the EFFECTPLOT statement in PROC PLM cannot add observations to the graphs.

Although the EFFECTPLOT statement is supported natively in the LOGISTIC and GENMOD procedure, it is not directly supported in other procedures such as GLM, MIXED, GLIMMIX, PHREG, or the SURVEY procedures. Nevertheless, because these procedures support the STORE statement, you can use the EFFECTPLOT statement in PROC PLM to visualize the models for these procedures. The following statement uses the EFFECTPLOT statement to visualize the probability of pain for female and male patients that are taking each drug treatment:

```sas
/* 2. Use PROC PLM to create an effect plot */
proc plm restore=PainModel;
   effectplot slicefit(x=Age sliceby=Treatment plotby=Sex);
run;
```

The graphs summarize the model. For both men and women, the probability of pain increases with age. At a given age, the probability of pain is lower for the non-placebo treatments, and the probability is slightly lower for the patients who use Drug B as compared to Drug A. These plots are shown at the mean value of the Duration variable.

One of the main purposes of PROC PLM is to perform postfit estimates and hypothesis tests. The simplest is a pairwise comparison that estimates the difference between two levels of a classification variable. For example, in the previous graph the probability curves for the Drug A and Drug B patients are close to each other. Is there a significant difference between the two effects? The following ESTIMATE statement estimates the (B vs A) effect. The EXP option exponentiates the estimate so that you can interpret the 'Exponentiated' column as the odds ratio between the drug treatments. The CL option adds confidence limits for the estimate of the odds ratio. The confidence interval for the odds ratio contains 1, so you cannot conclude that Drug B is significantly more effective than Drug A at reducing pain.

```sas
/* 3. Use PROC PLM to create contrasts and estimates */
proc plm restore=PainModel;
   /* 'Exponentiated' column is the odds ratio between treatments */
   estimate 'Pairwise A vs B' Treatment 1 -1 / exp CL;
run;
```

One of the more useful features of PROC PLM is that you can use the SHOW statement to display tables of statistics from the original analysis. If you want to see the ParameterEstimates table again, you can do that (SHOW PARAMETERS). You can even display statistics that you did not compute originally, such as an estimate of the covariance of the parameters (SHOW COVB). Lastly, if you have the item store but have forgotten what program you used to generate the model, you can display the program (SHOW PROGRAM). The following statements demonstrate the SHOW statement. The results are not shown.

```sas
/* 4. Use PROC PLM to show statistics or the original program */
proc plm restore=PainModel;
   show Parameters COVB Program;
run;
```

In summary, the STORE statement in many SAS/STAT procedures enables you to store various regression models into an item store. You can use PROC PLM to perform additional postfit analyses on the model, including scoring new data, visualizing the model, hypothesis testing, and (re)displaying additional statistics. This technique is especially useful for long-running models, but it is also useful for confidential data because the data are not needed for the postfit analyses.


The post Feature generation and correlations among features in machine learning appeared first on The DO Loop.

Feature generation is not new. Classical regression often uses transformations of the original variables. In the past, I have written about applying a logarithmic transformation when a variable's values span several orders of magnitude. Statisticians generate spline effects from explanatory variables to handle general nonlinear relationships. Polynomial effects can model quadratic dependence and interactions between variables. Other classical transformations in statistics include the square-root and inverse transformations.

SAS/STAT procedures provide several ways to generate new features from existing variables, including the EFFECT statement and the "stars and bars" notation.
However, an undisciplined approach to feature generation can lead to a combinatorial explosion of features. For example, if you generate all pairwise quadratic interactions of *N* continuous variables, you obtain "*N* choose 2" = *N*(*N*-1)/2 new features. For *N*=100 variables, this leads to 4950 pairwise quadratic effects!
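The quadratic growth in the number of pairwise features is easy to check; for example, in Python:

```python
from math import comb

N = 100
pairwise = comb(N, 2)    # "N choose 2" = N*(N-1)/2
print(pairwise)          # 4950
```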

In addition to the sheer number of features that you can generate, another problem with generating features willy-nilly is that some of the generated effects might be highly correlated with each other. This can lead to difficulties if you use automated model-building methods to select the "important" features from among thousands of candidates.

I was reminded of this fact recently when I wrote an article about model building with PROC GLMSELECT in SAS. The data were simulated: X from a uniform distribution on [-3, 3] and Y from a cubic function of X (plus random noise). I generated the polynomial effects x, x^2, ..., x^7, and the procedure had to build a regression model from these candidates. The stepwise selection method added the x^7 effect first (after the intercept). Later it added the x^5 effect. Of course, the polynomials x^7 and x^5 have a similar shape to x^3, but I was surprised that those effects entered the model before x^3 because the data were simulated from a cubic formula.

After thinking about it, I realized that the odd-degree polynomial effects are highly correlated with each other and have high correlations with the target (response) variable. The same is true for the even-degree polynomial effects. Here is the DATA step that generates 1000 observations from a cubic regression model, along with the correlations between the effects (x1-x7) and the target variable (y):

```sas
%let d = 7;
%let xMean = 0;
data Poly;
   call streaminit(54321);
   array x[&d];
   do i = 1 to 1000;
      x[1] = rand("Normal", &xMean);   /* x1 ~ N(&xMean, 1) */
      do j = 2 to &d;
         x[j] = x[j-1] * x[1];         /* x[j] = x1**j, j = 2..7 */
      end;
      y = 2 - 1.105*x1 - 0.2*x2 + 0.5*x3 + rand("Normal");  /* response is a cubic function of x1 */
      output;
   end;
   drop i j;
run;

proc corr data=Poly nosimple noprob;
   var y;
   with x:;
run;
```

You can see from the output that the x^5 and x^7 effects have the highest correlations with the response variable. Because the squared correlations are the R-square values for the regression of Y onto each effect, it makes intuitive sense that these effects are added to the model early in the process. Towards the end of the model-building process, the x^3 effect enters the model and the x^5 and x^7 effects are removed. To the model-building algorithm, these effects have similar predictive power because they are highly correlated with each other, as shown in the following correlation matrix:

```sas
proc corr data=Poly nosimple noprob plots(MAXPOINTS=NONE)=matrix(NVAR=ALL);
   var x:;
run;
```

I've highlighted certain cells in the lower triangular correlation matrix to emphasize the large correlations. Notice that the correlations between the even- and odd-degree effects are close to zero and are not highlighted. The table is a little hard to read, but you can use PROC IML to generate a heat map in which cells are shaded according to the pairwise correlations:

This checkerboard pattern shows the large correlations that occur between the polynomial effects in this problem. This image shows that many of the generated features do not add much new information to the set of explanatory variables. This can also happen with other transformations, so think carefully before you generate thousands of new features. Are you producing new effects or redundant ones?
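You can reproduce the essence of the checkerboard pattern outside of SAS. The following Python sketch (an illustration only) simulates a standard normal variable and computes the correlations among its monomial effects. Odd powers correlate with odd powers (for a standard normal variable, the theoretical value of corr(x, x^3) is 3/sqrt(15) ≈ 0.77), even powers correlate with even powers, and odd-even pairs are nearly uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(54321)
x = rng.normal(size=100_000)     # standard normal variable

# Monomial effects x, x^2, ..., x^5
M = np.column_stack([x**k for k in range(1, 6)])
R = np.corrcoef(M, rowvar=False)

# Odd-odd and even-even pairs have large correlations; odd-even pairs
# are near zero, which produces the checkerboard pattern in the heat map
print(np.round(R, 2))
```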

Notice that the generated effects are not statistically independent. They are all generated from the same X variable, so they are functionally dependent. In fact, a scatter plot matrix of the data shows that the pairwise relationships are restricted to parametric curves in the plane. The graph of (x, x^2) is quadratic, the graph of (x, x^3) is cubic, the graph of (x^2, x^3) is an algebraic cusp, and so forth. The PROC CORR statement in the previous section created a scatter plot matrix, which is shown below. Again, I have highlighted cells that contain highly correlated variables.

I like this scatter plot matrix for two reasons. First, it is soooo pretty! Second, it visually demonstrates that two variables that have low correlation are not necessarily independent.

Feature generation is important in many areas of machine learning. It is easy to create thousands of effects by applying transformations and generating interactions between all variables. However, the example in this article demonstrates that you should be a little cautious of using a brute-force approach. When possible, you should use domain knowledge to guide your feature generation. Every feature you generate adds complexity during model fitting, so try to avoid adding a bunch of redundant highly-correlated features.

For more on this topic, see the article "Automate your feature engineering" by my colleague, Funda Gunes. She writes: "The process [of generating new features] involves thinking about structures in the data, the underlying form of the problem, and how best to expose these features to predictive modeling algorithms. The success of this tedious human-driven process depends heavily on the domain and statistical expertise of the data scientist." I agree completely. Funda goes on to discuss SAS tools that can help data scientists with this process, such as Model Studio in SAS Visual Data Mining and Machine Learning. She shows an example of feature generation in her article and in a companion article about feature generation. These tools can help you generate features in a thoughtful, principled, and problem-driven manner, rather than relying on a brute-force approach.


The post Model selection with PROC GLMSELECT appeared first on The DO Loop.

I previously discussed how you can use validation data to choose between a set of competing regression models. In that article, I manually evaluated seven models for a continuous response on the training data and manually chose the model that gave the best predictions for the validation data. Fortunately, SAS software provides ways to automate this process! This article describes how PROC GLMSELECT builds models on training data and uses validation data to choose a final model. The animated GIF to the right visualizes the sequence of models that are built.

You can download the complete SAS program that creates the results in this blog post.

The GLMSELECT procedure in SAS/STAT is a workhorse procedure that implements many variable-selection methods, including least angle regression (LAR), LASSO, and elastic nets. Even though PROC GLMSELECT was introduced in SAS 9.1 (Cohen, 2006), many of its options remain relatively unknown to many SAS data analysts.

Some statisticians argue against automated model building and selection.
Both Cohen (2006) and the PROC GLMSELECT documentation contain a discussion of the dangers and controversies regarding automated model building. Nevertheless, machine learning techniques routinely use this sort of process to build accurate predictive models from among hundreds of variables (or *features*, as they are known in the ML community).

The following statements create the analyses and graphs in this article. The (x, y) interval variables are simulated from the cubic polynomial y = 2 - 1.105*x - 0.2*x^2 + 0.5*x^3 + N(0,1).

```sas
title "Select Model from 8 Effects";
proc glmselect data=Have seed=1 plots(StartStep=1)=(ASEPlot Coefficients);
   effect poly = polynomial(x / degree=7);  /* generate monomial effects: x, x^2, ..., x^7 */
   partition fraction(validate=0.4);        /* use 40% of data for validation */
   model y = poly / selection=stepwise(select=SBC choose=validate)
                    details=steps(Candidates ParameterEstimates);  /* OPTIONAL */
run;
```

The analysis uses the following options:

- The PLOTS= option requests two plots. The ASEPlot is a plot of the average square error (ASE) of each model on the validation data. The STARTSTEP= option displays the ASE beginning with the first step. (The zeroth step is usually the intercept-only model, which typically has a very large ASE.) The Coefficients plot shows how various effects enter or leave the model at each step of the model-building process. By default, a panel is created, but the lower part of the panel duplicates part of the ASE plot so I've unpacked the panel into separate plots.
- The EFFECT statement generates the monomial effects {x, x^2, ..., x^7} from the variable x.
- The PARTITION statement randomly divides the input data into two subsets. The validation set contains 40% of the data and the training set contains the other 60%. The SEED= option on the PROC GLMSELECT statement specifies the seed value for the random split.
- The SELECTION= option specifies the algorithm that builds a model from the effects. This example uses the stepwise selection algorithm because it is easy to understand. At each step in the model-building process, the stepwise algorithm builds a new model by modifying the model from the previous step. The new model will either contain a new effect that was not in the previous model or will remove an effect from the previous model.
- The SELECT=SBC option specifies that the procedure will use the SBC criterion to assess the candidate effects and determine which effect should be added to (or removed from) the previous model.
- The CHOOSE=VALIDATE option specifies that the models are scored on the validation data. The ASE values for the models are used to choose the most predictive model from among the models that were built.
- The DETAILS= option displays information about the models that are built at each step. If you only care about the final model, you do not need this option.
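The logic behind the CHOOSE=VALIDATE option — fit each candidate model on the training data, then keep the one with the smallest validation ASE — can be sketched in a few lines of Python. (This is an illustration, not the GLMSELECT algorithm: it simulates data similar to the article's and uses `np.polyfit` for each least squares fit instead of stepwise selection.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=250)
y = 2 - 1.105*x - 0.2*x**2 + 0.5*x**3 + rng.normal(size=250)

train = slice(0, 150)     # 60% training
valid = slice(150, 250)   # 40% validation

# Fit polynomial models of increasing degree on the training data, then
# score each candidate on the validation data (ASE = mean squared residual)
ase = {}
for degree in range(1, 8):
    coefs = np.polyfit(x[train], y[train], degree)
    resid = y[valid] - np.polyval(coefs, x[valid])
    ase[degree] = np.mean(resid**2)

best = min(ase, key=ase.get)
print(best, round(ase[best], 3))  # a cubic (degree 3) fit typically has the smallest ASE
```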

The chosen model is a cubic polynomial. The parameter estimates (below) are very close to the parameter values that are used to simulate the cubic data, so the procedure "discovered" the correct model.

How did PROC GLMSELECT obtain that model? The output of the DETAILS=STEPS option shows that the GLMSELECT procedure built eight models. It then used the validation data to decide that the four-parameter cubic model was the model that best balances parsimony (simple versus complex models) and prediction accuracy (underfitting versus overfitting). I won't display all the output, but you can summarize the details by using the ASE plot and the Coefficients plot.

The ASE plot (shown to the right) visualizes the prediction accuracy of the models. The initial model (zeroth step) is the intercept-only model. The horizontal axis of the ASE plot shows how the models are formed from the previous model. The label for the first tick mark is "1+x^5", which means "the model at Step=1 adds the x^5 term to the previous model." The label for the second tick mark is "2+x^2", which means "the model at Step=2 adds the x^2 term to the previous model." A minus sign means that an effect is removed. For example, the label "6-x^7" means "the model at Step=6 removes the x^7 effect from the previous model." The vertical axis tracks the change of the ASE for each successive model. The model-building process stops when it can no longer decrease the ASE on the validation data. For this example, that happens at Step=7.

If you prefer a table, the SelectionSummary table summarizes the models that are built. The columns labeled ASE and Validation ASE contain the precise values in the ASE plot.

The models are least squares estimates for the included effects on the training data. The DETAILS=STEPS option displays the parameter estimates for each model. For these models, which are all polynomial effects for a single continuous variable, you can graph the eight models and overlay the fitted curves on the data. You can do this in eight separate plots or you can use PROC SGPLOT in SAS to create an animated gif that animates the sequence of models. The animation is shown at the top of this article.

In general, you can't visualize models with many effects, but the Coefficients plot displays the values of the estimates for each model. The plot (shown to the right; click to enlarge) labels only the effects in the final model, but I have manually added labels for the x^5 and x^7 effects to help explain the plot.

Because the magnitudes of the parameter estimates can vary greatly, the vertical axis of the plot shows standardized estimates. As before, the horizontal axis represents the various models. Each line visualizes the evolution of values for a particular effect. For example, the brown line dominates the upper left portion of the graph. This line shows the progression of the standardized coefficient of the x^5 term. In models 1–4, the coefficient of the x^5 effect is large and positive. In models 5 and 6, the standardized coefficient of x^5 is small and negative. In the last model, the coefficient of x^5 is set to zero because the effect is removed. Similarly, the magenta line displays the standardized coefficients for x^7. The coefficient is negative for Steps 3 and 4, positive for Step 5, and is zero for the last two models, which do not include the x^7 effect. By using this plot, you can visually discern which effects are included in each model and study the relative change of the coefficients between models.

This article shows how to use PROC GLMSELECT in SAS to build a sequence of models for a continuous response on training data. From among the models, a final model is chosen that best predicts a validation data set. The example uses the stepwise selection technique because it is easy to understand, but the GLMSELECT procedure supports other model selection algorithms.

You can visualize the model selection process by using the ASE plot. You can visualize the progression of candidate models by using the coefficient plot. For this univariate regression model, you can also visualize each candidate model by overlaying curves or by using an animation.

For more information about the model selection procedures in SAS, see the SAS/STAT documentation or the following articles:

- Cohen, R. (2006), "Introducing the GLMSELECT Procedure for Model Selection," *Proceedings of the Thirty-first Annual SAS Users Group International Conference*.
- Cohen, R. (2009), "Applications of the GLMSELECT Procedure for Megamodel Selection," *Proceedings of the SAS Global Forum 2009 Conference*.
- SAS/STAT software supports other procedures that also build models, including HPGENSELECT and (HP)QUANTSELECT. For an overview, see "Statistical model building and the SELECT procedures in SAS."


The post Model assessment and selection in machine learning appeared first on The DO Loop.

The idea for this example is from the excellent book, *A First Course in Machine Learning* by Simon Rogers and Mark Girolami (Second Edition, 2016, p. 31-36 and 86), although I have seen similar examples in many places.
The data (training and validation) are pairs (x, y): The independent variable X is randomly sampled in [-3,3] and the response Y is a cubic polynomial in X to which normally distributed noise is added. The training set contains 50 observations; the validation set contains 200 observations. The goal of the example is to demonstrate how machine learning uses validation data to discover that a cubic model fits the data better than polynomials of other degrees.

This article is part of a series that introduces concepts in machine learning to SAS statistical programmers. Machine learning is accompanied by a lot of hype, but at its core it combines many concepts that are already familiar to statisticians and data analysts, including modeling, optimization, linear algebra, probability, and statistics. If you are familiar with those topics, you can master machine learning.

The first step of the example is to simulate data from a cubic regression model. The following SAS DATA step simulates the data. PROC SGPLOT visualizes the training and validation data.

```sas
data Have;
length Type $10.;
call streaminit(54321);
do i = 1 to 250;
   if i <= 50 then Type = "Train";   /* 50 training observations */
   else Type = "Validate";           /* 200 validation observations */
   x = rand("uniform", -3, 3);
   /* 2 - 1.105 x - 0.2 x^2 + 0.5 x^3 */
   y = 2 + 0.5*x*(x+1.3)*(x-1.7) + rand("Normal"); /* Y = b0 + b1*X + b2*X**2 + b3*X**3 + N(0,1) */
   output;
end;
run;

title "Training and Validation Data";
title2 "True Model Is Cubic Polynomial";
proc sgplot data=Have;
   scatter x=x y=y / group=Type grouporder=data;
   xaxis grid; yaxis grid;
run;
```

In a classical regression analysis, you would fit a series of polynomial models to all 250 observations and use a fit statistic to assess the goodness of fit. Recall that as you increase the degree of a polynomial model, you have more parameters (more degrees of freedom) with which to fit the data. A high-degree polynomial can produce a small sum of squared errors (SSE) but probably overfits the data and is unlikely to predict future data well. To discourage overfitting, statistics such as the AIC and SBC include two terms: one that rewards small values of the SSE and one that penalizes models that have a large number of parameters. These classical fit statistics are computed at the end of this article.
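To make the reward/penalty structure concrete, the following sketch uses the common least-squares forms AIC = n·log(SSE/n) + 2k and SBC = n·log(SSE/n) + k·log(n), where k is the number of parameters. (This is an illustrative Python computation with made-up SSE values, not output from the SAS example; the exact formulas vary slightly by procedure.)

```python
import math

def aic(sse, n, k):
    """Akaike criterion: rewards a small SSE, adds penalty 2k for k parameters."""
    return n * math.log(sse / n) + 2 * k

def sbc(sse, n, k):
    """Schwarz criterion (BIC): the penalty k*log(n) grows with the sample size."""
    return n * math.log(sse / n) + k * math.log(n)

# A degree-7 model that reduces the SSE only slightly is penalized for its
# four extra parameters, so its AIC is worse (larger) than the cubic's.
print(aic(sse=100, n=50, k=4))   # cubic polynomial: 4 parameters
print(aic(sse=99,  n=50, k=8))   # degree-7 polynomial: 8 parameters
```

Notice that the SBC penalty exceeds the AIC penalty whenever log(n) > 2, so the SBC favors smaller models for all but tiny samples.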

The machine learning approach is different: use some data to train the model (fit the parameters) and use different data to validate the model. A popular validation statistic is the average square error (ASE), which is formed by scoring the model on the validation data and then computing the average of the squared residuals. The GLMSELECT procedure supports the PARTITION statement, which enables you to fit the model on training data and assess the fit on validation data. The GLMSELECT procedure also supports the EFFECT statement, which enables you to form a POLYNOMIAL effect to model high-order polynomials. For example, the following statements create a third-degree polynomial model, fit the model parameters on the training data, and evaluate the ASE on the validation data:

```sas
%let Degree = 3;
proc glmselect data=Have;
   effect poly = polynomial(x / degree=&Degree);              /* model is polynomial of specified degree */
   partition rolevar=Type(train="Train" validate="Validate"); /* specify training/validation observations */
   model y = poly / selection=NONE;                           /* fit model on training data */
   ods select FitStatistics ParameterEstimates;
run;
```

The first table includes the classical statistics for the model, evaluated only on the training data. At the bottom of the table is the "ASE (Validate)" statistic, which is the value of the ASE for the validation data. The second table shows that the parameter estimates are very close to the parameters in the simulation.
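The ASE statistic itself is simple to compute by hand: score the model on the validation data and average the squared residuals. Here is a small Python sketch (the line `y = 2 + 3x` and its data are hypothetical, not from the SAS example):

```python
import numpy as np

def ase(y_actual, y_pred):
    """Average square error: mean of the squared residuals on the scored data."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_actual - y_pred) ** 2))

# hypothetical validation sample from a known line y = 2 + 3x plus N(0,1) noise
rng = np.random.default_rng(54321)
x = rng.uniform(-3, 3, size=200)
y = 2 + 3 * x + rng.normal(0, 1, size=200)

# scoring the true model: the ASE estimates the noise variance (about 1)
print(ase(y, 2 + 3 * x))
```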

You can repeat the regression analysis for other polynomial models. It is straightforward to write a macro loop that repeats the analysis as the degree of the polynomial ranges from 1 to 7. For each model, the training and validation data are the same. If you plot the ASE value for each model (on both the training and validation data) against the degree of each polynomial, you obtain the following graph. The vertical axis is plotted on a logarithmic scale because the ASE values range over an order of magnitude.
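The macro loop is not shown here, but its logic is easy to sketch in Python, with numpy's `polyfit` standing in for PROC GLMSELECT (illustrative only; the seed and the exact ASE values differ from the SAS run):

```python
import numpy as np

rng = np.random.default_rng(123)

def simulate(n):
    """Cubic model with unit Gaussian noise, as in the SAS DATA step."""
    x = rng.uniform(-3, 3, size=n)
    y = 2 + 0.5 * x * (x + 1.3) * (x - 1.7) + rng.normal(0, 1, size=n)
    return x, y

x_train, y_train = simulate(50)      # training data
x_valid, y_valid = simulate(200)     # validation data

ase_valid = {}
for degree in range(1, 8):
    coefs = np.polyfit(x_train, y_train, degree)    # fit on training data only
    resid = y_valid - np.polyval(coefs, x_valid)    # score on validation data
    ase_valid[degree] = float(np.mean(resid ** 2))

# the validation ASE drops sharply at the true degree (3) and is roughly flat after
best = min(ase_valid, key=ase_valid.get)
print(best, ase_valid[best])
```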

The graph shows that when you score the linear and quadratic models on the validation data, the fit is not very good, as indicated by the relatively large values of the average square error for Degree = 1 and 2. For the third-degree model, the ASE drops dramatically. Higher-degree polynomials do not substantially change the ASE. As you would expect, the ASE on the training data continues to decrease as the polynomial degree increases. However, the high-degree models, which overfit the training data, are less effective at predicting the validation data. Consequently, the ASE on the validation data actually increases when the degree is greater than 3.

Notice that the minimum value of the ASE on the validation data corresponds to the correct model. By choosing the model that minimizes the validation ASE, we have "discovered" the correct model! Of course, real life is not so simple. In real life, the data almost certainly does not follow a simple algebraic equation. Nevertheless, the graph summarizes the essence of the machine learning approach to model selection: train each model on a subset of the data and select the one that gives the smallest prediction error for the validation sample.

Validation methods are quite effective, but classical statistics are powerful, too. The graph to the right shows three popular fit statistics evaluated on the training data. In each case, the fit statistic reaches a minimum value for a cubic polynomial. Consequently, the classical fit statistics would also choose the cubic polynomial as being the best model for these data.

However, the validation ASE is a simple technique that is applicable to any model that has a continuous response. Whether you use a tree model, a nonparametric model, or even an ensemble of models, the ASE on the validation data is easy to compute and easy to understand. All you need is a way to score the model on the validation sample. In contrast, it can be difficult to construct the classical statistics because you must estimate the number of parameters (degrees of freedom) in the model, which can be a challenge for nonparametric models.

In summary, a fundamental difference between classical statistics and machine learning is how each discipline assesses and compares models. Machine learning tends to fit models on a subset of the data and assess the goodness of fit by using a second subset. The average square error (ASE) on the validation data is easy to compute and to explain for models that have a continuous response. When you fit and compare several models, you can use the ASE to determine which model is best at predicting new data.

The simple example in this article also illustrates some of the fundamental principles that are used in a model selection procedure such as PROC GLMSELECT in SAS. It turns out that PROC GLMSELECT can automate many of the steps in this article. The next article shows how to use PROC GLMSELECT for model selection.

You can download the SAS program for this example.

The post Model assessment and selection in machine learning appeared first on The DO Loop.

This discussion and SAS program are based on Chapter 11 in *Simulating Data with SAS* (Wicklin, 2013, p. 208-209). The main steps in the simulation are as follows. The comments in the SAS DATA step program indicate each step:

- Macro variables are used to define the number of continuous and categorical regressors. Another macro variable is used to specify the number of levels for each categorical variable. This program simulates eight continuous regressors (x1-x8) and four categorical regressors (c1-c4). Each categorical regressor in this simulation has three levels.
- The continuous regressors are simulated from a normal distribution, but you can use any distribution you want. The categorical levels are 1, 2, and 3, which are generated uniformly at random by using the "Integer" distribution. The discrete "Integer" distribution was introduced in SAS 9.4M5; for older versions of SAS, use the %RandBetween macro as shown in the article "How to generate random integers in SAS." You can also generate the levels non-uniformly by using the "Table" distribution.
- The response variable, Y, is defined as a linear combination of some explanatory variables. In this simulation, the response depends linearly on the x1 and x8 continuous variables and on the levels of the C1 and C4 categorical variables. Noise is added to the model by using a normally distributed error term.

```sas
/* Program based on Simulating Data with SAS, Chapter 11 (Wicklin, 2013, p. 208-209) */
%let N = 10000;              /* 1a. number of observations in simulated data */
%let numCont = 8;            /* number of continuous explanatory variables */
%let numClass = 4;           /* number of categorical explanatory variables */
%let numLevels = 3;          /* (optional) number of levels for each categorical variable */

data SimGLM;
   call streaminit(12345);
   /* 1b. Use macros to define arrays of variables */
   array x[&numCont] x1-x&numCont;     /* continuous variables named x1, x2, x3, ... */
   array c[&numClass] c1-c&numClass;   /* CLASS variables named c1, c2, ... */
   /* the following statement initializes an array that contains the number of levels
      for each CLASS variable. You can hard-code different values such as (2, 3, 3, 2, 5) */
   array numLevels[&numClass] _temporary_ (&numClass * &numLevels);

   do k = 1 to &N;                 /* for each observation ... */
      /* 2. Simulate value for each explanatory variable */
      do i = 1 to &numCont;        /* simulate independent continuous variables */
         x[i] = round(rand("Normal"), 0.001);
      end;
      do i = 1 to &numClass;       /* simulate values 1, 2, ..., &numLevels with equal prob */
         c[i] = rand("Integer", numLevels[i]);   /* the "Integer" distribution requires SAS 9.4M5 */
      end;
      /* 3. Simulate response as a function of certain explanatory variables */
      y = 4 - 3*x[1] - 2*x[&numCont] +              /* define coefficients for continuous effects */
          -3*(c[1]=1) - 4*(c[1]=2) + 5*c[&numClass] /* define coefficients for categorical effects */
          + rand("Normal", 0, 3);                   /* normal error term */
      output;
   end;
   drop i k;
run;

proc glm data=SimGLM;
   class c1-c&numClass;
   model y = x1-x&numCont c1-c&numClass / SS3 solution;
   ods select ModelANOVA ParameterEstimates;
quit;
```

The ModelANOVA table from PROC GLM (not shown) displays the Type 3 sums of squares and indicates that the significant terms in the model are x1, x8, c1, and c4.

The parameter estimates from PROC GLM are shown to the right. You can see that each categorical variable has three levels, and you can use PROC FREQ to verify that the levels are uniformly distributed. I have highlighted parameter estimates for the significant effects in the model:

- The estimates for the significant continuous effects are highlighted in red. The estimates for the coefficients of x1 and x8 are about -3 and -2, respectively, which are the parameter values that are specified in the simulation.
- The estimates for the levels of the C1 CLASS variable are highlighted in green. They are close to (-3, -4, 0), which are the parameter values that are specified in the simulation.
- The estimates for the Intercept and the C4 CLASS variable are highlighted in magenta. Notice that they seem to differ from the parameters in the simulation. As discussed previously, the simulation and PROC GLM use different parameterizations of the C4 effect. The simulation assigns Intercept = 4. The contribution of the first level of C4 is 5*1, the contribution for the second level is 5*2, and the contribution for the third level is 5*3. As explained in the previous article, the GLM parameterization reparameterizes the C4 effect as (4 + 15) + (5 - 15)*(C4=1) + (10 - 15)*(C4=2). The estimates are very close to these parameter values.

Although this program simulates a linear regression model, you can modify the program and simulate from a generalized linear model such as the logistic model. You just need to compute the linear predictor, eta (η), and then use the link function and the RAND function to generate the response variable, as shown in a previous article about how to simulate data from a logistic model.
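The same idea works outside of SAS. The following Python sketch (hypothetical coefficients, not from the program above) computes the linear predictor eta, applies the inverse logit link, and draws a Bernoulli response:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)               # one continuous regressor

eta = -1.0 + 2.0 * x                 # linear predictor (hypothetical coefficients)
p = 1.0 / (1.0 + np.exp(-eta))       # inverse logit link gives P(Y=1)
y = rng.binomial(1, p)               # Bernoulli response for a logistic model

print(y.mean())                      # proportion of events in the simulated sample
```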

In summary, this article shows how to simulate data for a linear regression model in the SAS DATA step when the model includes both categorical and continuous regressors. The program simulates arbitrarily many continuous and categorical variables. You can define a response variable in terms of the explanatory variables and their interactions.

The post Simulate data for a regression model with categorical and continuous variables appeared first on The DO Loop.

An example makes this clearer. The following SAS DATA step simulates 300 observations for a categorical variable C with levels 'C1', 'C2', and 'C3' in equal proportions. The simulation creates a response variable, Y, based on the levels of the variable C. The GLM procedure estimates the parameters from the simulated data:

```sas
data Have;
call streaminit(1);
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In simulation, parameters are Intercept=10, C1=8, C2=6, and C3=1.
         This is NOT the GLM parameterization. */
      Y = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;  /* C='C1' is a 0/1 binary variable */
      output;
   end;
end;
keep C Y;
run;

proc glm data=Have plots=none;
   class C;
   model Y = C / SS3 solution;
   ods select ParameterEstimates;
quit;
```

The output from PROC GLM shows that the parameter estimates are very close to the following values: Intercept=11, C1=7, C2=5, and C3=0. Although these are not the parameter values that were specified in the simulation, these estimates make sense if you remember the following:

- When C='C3', the expected response is the sum of the intercept and the 'C3' parameter, which is 10 + 1 = 11. To convert to the GLM parameterization, you need to incorporate the reference level (C3) into the Intercept term. Thus, the Intercept term for the GLM parameterization is 11, not 10.
- In the GLM parameterization, "the parameter estimates of main effects estimate the *difference* in the effects of each level compared to the last level (in alphabetical order)." To obtain the difference, subtract 1 (the C3 coefficient) from the C1 and C2 coefficients to obtain 7 (8 – 1) and 5 (6 – 1) for the GLM parameters.

In other words, you can use the parameter values in the simulation to convert to the corresponding parameters for the GLM parameterization. In the following DATA step, the Y and Y2 variables contain exactly the same values, even though the formulas look different. The Y2 variable is simulated by using a GLM parameterization of the C variable:

```sas
data Have;
call streaminit(1);
refEffect = 1;
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In simulation, parameters are Intercept=10, C1=8, C2=6, and C3=1 */
      Y = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;
      /* GLM parameterization for the same response: Intercept=11, C1=7, C2=5, C3=0 */
      Y2 = (10 + refEffect) + (8-refEffect)*(C='C1') + (6-refEffect)*(C='C2') + eps;
      diff = Y - Y2;     /* Diff = 0 when Y=Y2 */
      output;
   end;
end;
keep C Y Y2 diff;
run;

proc means data=Have;
   var Y Y2 diff;
run;
```

The output from PROC MEANS shows that the Y and Y2 variables are exactly equal. The coefficients for the Y2 variable are 11 (the intercept), 7, 5, and 0, which are the parameter values that are estimated by PROC GLM.

Of course, other parameterizations are possible. For example, you can create the simulation by using other parameterizations such as the EFFECT coding. (The EFFECT coding is the default coding in PROC LOGISTIC.) For the effect coding, parameter estimates of main effects indicate the difference of each level as compared to the average effect over all levels. The following statements show the effect coding for the variable Y3. The values of the Y3 variable are exactly the same as Y and Y2:

```sas
avgEffect = 5;        /* average effect for C is (8 + 6 + 1)/3 = 15/3 = 5 */
...
/* EFFECT parameterization: Intercept=15, C1=3, C2=1, C3=-4 */
Y3 = 10 + avgEffect + (8-avgEffect)*(C='C1') + (6-avgEffect)*(C='C2')
                    + (1-avgEffect)*(C='C3') + eps;
```

In summary, when you write a simulation that includes categorical data, there are many equivalent ways to parameterize the categorical effects. When you use a regression procedure to analyze the simulated data, the procedure and simulation might use different parameterizations. If so, the estimates from the procedure might be quite different from the parameters in your simulation. This article demonstrates this fact by using the GLM parameterization and the EFFECT parameterization, which are two commonly used parameterizations in SAS. See the SAS/STAT documentation for additional details about the different parameterizations of classification variables in SAS.

The post Coding and simulating categorical variables in regression models appeared first on The DO Loop.

- **Training data** is used to fit each model.
- **Validation data** is a random sample that is used for model selection. These data are used to choose among candidate models by balancing the tradeoff between model complexity (complex models fit the training data well) and generality (complex models might not fit the validation data well). These data are potentially used several times to build the final model.
- **Test data** is a hold-out sample that is used to assess the final model and to estimate its prediction error. It is used only at the end of the model-building process.

I've seen many questions about how to use SAS to split data into training, validation, and testing data. (A common variation uses only training and validation.) There are basically two approaches to partitioning data:

- Specify the proportion of observations that you want in each role. For each observation, randomly assign it to one of the three roles. The number of observations assigned to each role is a multinomial random variable with expected value *N p*_{k}, where *N* is the number of observations and *p*_{k} (*k* = 1, 2, 3) is the probability of assigning an observation to the *k*th role. For this method, if you change the random number seed, you will usually get a different number of observations in each role because the size of each group is a random variable.
- Specify the number of observations that you want in each role and randomly allocate that many observations.

This article uses the SAS DATA step to accomplish the first task and uses PROC SURVEYSELECT to accomplish the second. I also discuss how to split data into only two roles: training and validation.

It is worth mentioning that many model-selection routines in SAS enable you to split data by using the PARTITION statement. Examples include the "SELECT" procedures (GLMSELECT, QUANTSELECT, HPGENSELECT, ...) and the ADAPTIVEREG procedure. However, be aware that the procedures might ignore observations that have missing values for the variables in the model.

When you partition data into various roles, you can choose to add an indicator variable, or you can physically create three separate data sets. The following DATA step creates an indicator variable with values "Train", "Validate", and "Test". The specified proportions are 60% training, 30% validation, and 10% testing. You can change the values of the SAS macro variables to use your own proportions. The RAND("Table") function is an efficient way to generate the indicator variable.

```sas
data Have;             /* the data to partition  */
   set Sashelp.Heart;  /* for example, use Heart data */
run;

/* If propTrain + propValid = 1, then no observation is assigned to testing */
%let propTrain = 0.6;  /* proportion of training data */
%let propValid = 0.3;  /* proportion of validation data */
%let propTest = %sysevalf(1 - &propTrain - &propValid);  /* remaining are used for testing */

/* Randomly assign each observation to a role; _ROLE_ is indicator variable */
data RandOut;
   array p[2] _temporary_ (&propTrain, &propValid);
   array labels[3] $ _temporary_ ("Train", "Validate", "Test");
   set Have;
   call streaminit(123);        /* set random number seed */
   /* RAND("table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]);
   _ROLE_ = labels[_k];         /* use _ROLE_ = _k if you prefer numerical categories */
   drop _k;
run;

proc freq data=RandOut order=freq;
   tables _ROLE_ / nocum;
run;
```

As shown by the output of PROC FREQ, the proportion of observations in each role is approximately the same as the specified proportions. For this random number seed, the proportions are 59.1%, 30.4%, and 10.6%. If you change the random number seed, you will get a different assignment of observations to roles and also a different proportion of data in each role.

The observant reader will notice that there are only two elements in the array of probabilities (`p`) that is used in the RAND("Table") call. This is intentional. The documentation for the RAND("Table") function states that if the sum of the specified probabilities is less than 1, then the quantity 1 – Σ *p*_{i} is used as the probability of the last event. By specifying only two values in the `p` array, *the same program works for partitioning the data into two pieces* (training and validation) or three pieces (and testing).
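The same logic is easy to express in other languages. For example, here is a Python analog of the RAND("Table") approach (illustrative only; `random.choices` requires explicit weights, so the implied last probability is computed by hand):

```python
import random

random.seed(123)
N = 5209                                   # e.g., the size of Sashelp.Heart
prop_train, prop_valid = 0.6, 0.3          # the remaining 0.1 is implied for "Test"
labels = ["Train", "Validate", "Test"]
weights = [prop_train, prop_valid, 1 - prop_train - prop_valid]

# analogous to RAND("Table"): each observation independently gets a random role
roles = random.choices(labels, weights=weights, k=N)

counts = {lab: roles.count(lab) for lab in labels}
print(counts)    # group sizes are random variables; the expected sizes are N*p_k
```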

Some practitioners choose to create three separate data sets instead of adding an indicator variable to the existing data. The computation is exactly the same, but you can use the OUTPUT statement to direct each observation to one of three output data sets, as follows:

```sas
/* create a separate data set for each role */
data Train Validate Test;
   array p[2] _temporary_ (&propTrain, &propValid);
   set Have;
   call streaminit(123);        /* set random number seed */
   /* RAND("table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]);
   if      _k = 1 then output Train;
   else if _k = 2 then output Validate;
   else                output Test;
   drop _k;
run;
```

```
NOTE: The data set WORK.TRAIN has 3078 observations and 17 variables.
NOTE: The data set WORK.VALIDATE has 1581 observations and 17 variables.
NOTE: The data set WORK.TEST has 550 observations and 17 variables.
```

This example uses the same random number seed as the previous example. Consequently, the three output data sets have the same observations as are indicated by the partition variable (_ROLE_) in the previous example.

Instead of specifying a proportion,
you might want to specify the exact number of observations that are randomly assigned to each role. The advantage of this technique is that changing the random number seed does not change the number of observations in each role (although it does change *which* observations are assigned to each role). The SURVEYSELECT procedure supports the GROUPS= option, which you can use to specify the number of observations.

The GROUPS= option requires that you specify integer values.
For this example, the original data contains 5209 observations but 60% of 5209 is not an integer. Therefore, the following DATA step computes the number of observations as ROUND(*N p*) for the training and validation sets. These integer values are put into macro variables and used in the GROUPS= option on the PROC SURVEYSELECT statement.
You can, of course, skip the DATA step and specify your own values
such as groups=(3200, 1600, 409).

```sas
/* Specify the sizes of the train/validation/test data from proportions */
data _null_;
   if 0 then set sashelp.heart nobs=N;    /* N = total number of obs */
   nTrain = round(N * &propTrain);        /* size of training data */
   nValid = round(N * &propValid);        /* size of validation data */
   call symputx("nTrain", nTrain);        /* put integer into macro variable */
   call symputx("nValid", nValid);
   call symputx("nTest", N - nTrain - nValid);
run;

/* randomly assign observations to three groups */
proc surveyselect data=Have seed=12345 out=SSOut
                  groups=(&nTrain, &nValid, &nTest);  /* if no Test data, use GROUPS=(&nTrain, &nValid) */
run;

proc freq data=SSOut order=freq;
   tables GroupID / nocum;    /* GroupID is name of indicator variable */
run;
```

The training, validation, and testing groups contain 3125, 1563, and 521 observations, respectively. These numbers are the closest integer approximations to 60%, 30% and 10% of the 5209 observations. Notice that the output from the SURVEYSELECT procedure uses the values 1, 2, and 3 for the GroupID indicator variable. You can use PROC FORMAT to associate those numbers with labels such as "Train", "Validate", and "Test".
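If you want the same two-step computation outside of SAS (round the proportions to get exact sizes, then randomize), it looks like this in Python (an illustrative sketch, not a SAS feature):

```python
import random

random.seed(12345)
N = 5209                        # number of observations in Sashelp.Heart
prop_train, prop_valid = 0.6, 0.3

n_train = round(N * prop_train)     # ROUND(N*p), as in the DATA _NULL_ step
n_valid = round(N * prop_valid)
n_test = N - n_train - n_valid      # the remainder is used for testing

# assign the exact group sizes, then shuffle so the assignment is random
roles = ["Train"] * n_train + ["Validate"] * n_valid + ["Test"] * n_test
random.shuffle(roles)

print(n_train, n_valid, n_test)   # 3125 1563 521
```

Changing the seed changes *which* observations land in each role, but never the group sizes.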

In summary, there are two basic programming techniques for randomly partitioning data into training, validation, and testing roles. One way uses the SAS DATA step to randomly assign each observation to a role according to proportions that you specify. If you use this technique, the size of each group is random. The other way is to use PROC SURVEYSELECT to randomly assign observations to roles. If you use this technique, you must specify the number of observations in each group.

The post Create training, validation, and test data sets in SAS appeared first on The DO Loop.

- A line that connects the 25th and 75th percentiles of the data and reference distributions
- A least squares regression line
- A line whose intercept and slope are determined by maximum likelihood estimates of the location and scale parameters of the target distribution

If you need to review Q-Q plots, see my previous article that describes what a Q-Q plot is, how to construct a Q-Q plot in SAS, and how to interpret a Q-Q plot.

Let me be clear: It is not necessary to overlay a line on a Q-Q plot. You can display only the points on a Q-Q plot and, in fact, that is the default behavior in SAS when you create a Q-Q plot by using the QQPLOT statement in PROC UNIVARIATE.

The following DATA step generates 97 random values from an exponential distribution with scale parameter σ = 2 and three artificial "outliers." The call to PROC UNIVARIATE creates a Q-Q plot, which is shown:

```sas
data Q(keep=y);
call streaminit(321);
do i = 1 to 97;
   y = round( rand("Expon", 2), 0.001);   /* Y ~ Exp(2), rounded to nearest 0.001 */
   output;
end;
do y = 10,11,15;                          /* add outliers */
   output;
end;
run;

proc univariate data=Q;
   qqplot y / exp grid;          /* plot data quantiles against Exp(1) */
   ods select QQPlot;
   ods output QQPlot=QQPlot;     /* for later use: save quantiles to a data set */
run;
```

The vertical axis of the Q-Q plot displays the sorted values of the data; the horizontal axis displays evenly spaced quantiles of the standardized target distribution, which in this case is the exponential distribution with scale parameter σ = 1. Most of the points appear to fall on a straight line, which indicates that these (simulated) data might be reasonably modeled by using an exponential distribution. The slope of the line appears to be approximately 2, which is a crude estimate of the scale parameter (σ). The Y-intercept of the line appears to be approximately 0, which is a crude estimate of the location parameter (the threshold parameter, θ).

Although the basic Q-Q plot provides all the information you need to decide that these data can be modeled by an exponential distribution, some data sets are less clear. The Q-Q plot might show a slight bend or wiggle, and you might want to overlay a reference line to assess how severely the pattern deviates from a straight line. The problem is, what line should you use?

Cleveland (*Visualizing Data*, 1993, p. 31) recommends overlaying a line that connects the first and third quartiles. That is, let p_{25} and p_{75} be the 25th and 75th percentiles of the target distribution, respectively, and let y_{25} and y_{75} be the 25th and 75th percentiles of the ordered data values. Then Cleveland recommends plotting the line through the ordered pairs (p_{25}, y_{25}) and (p_{75}, y_{75}).
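The quartile-line arithmetic is only a few operations. Here is an illustrative Python sketch; it uses the plotting positions (i − 0.375)/(n + 0.25), which are similar to PROC UNIVARIATE's defaults, and the Exp(1) quantile function Q(p) = −log(1 − p):

```python
import numpy as np

rng = np.random.default_rng(321)
y = np.sort(rng.exponential(scale=2, size=100))   # ordered data values

# standardized exponential quantiles at the plotting positions
n = y.size
p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
q = -np.log(1 - p)                 # quantile function of Exp(1)

# line through (q25, y25) and (q75, y75), per Cleveland's recommendation
q25, q75 = np.percentile(q, [25, 75])
y25, y75 = np.percentile(y, [25, 75])
slope = (y75 - y25) / (q75 - q25)
intercept = y25 - slope * q25
print(slope)    # a crude estimate of the scale parameter (about 2)
```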

In SAS, you can use PROC MEANS to compute the 25th and 75th percentiles for the X and Y variables in the Q-Q plot. You can then use the DATA step or PROC SQL to compute the slope of the line that passes between the percentiles. The following statements analyze the Q-Q plot data that was created by using the ODS OUTPUT statement in the previous section:

```sas
proc means data=QQPlot P25 P75;
   var Quantile Data;    /* ODS OUTPUT created the variables Quantile (X) and Data (Y) */
   output out=Pctl P25= P75= / autoname;
run;

data _null_;
   set Pctl;
   slope = (Data_P75 - Data_P25) / (Quantile_P75 - Quantile_P25);  /* dy / dx */
   /* if desired, put point-slope values into macro variables to help plot the line */
   call symputx("x1", Quantile_P25);
   call symputx("y1", Data_P25);
   call symput("Slope", putn(slope,"BEST5."));
run;

title "Q-Q Plot with Reference Line";
title2 "Reference Line through First and Third Quartiles";
title3 "Slope = &slope";
proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=&x1 y=&y1 slope=&slope / lineattrs=(color=Green)
            legendlabel="Percentile Estimate";
   xaxis grid label="Exponential Quantiles";
   yaxis grid;
run;
```

Because the line passes through the first and third quartiles, the slope of the line is robust to outliers in the tails of the data. The line often provides a simple visual guide to help you determine whether the central portion of the data matches the quantiles of the specified probability distribution.

Keep in mind that this is a visual guide. The slope and intercept for this line should not be used as parameter estimates for the location and scale parameters of the probability distribution, although they could be used as an initial guess for an optimization that estimates the location and scale parameters for the model distribution.

Let's be honest, when a statistician sees a scatter plot for which the points appear to be linearly related, there is a Pavlovian reflex to fit a regression line to the values in the plot. However, I can think of several reasons to avoid adding a regression line to a Q-Q plot:

- The values in the Q-Q plot do not satisfy the assumptions of ordinary least squares (OLS) regression. For example, the points are not a random sample and there is no reason to assume that the errors in the Y direction are normally distributed.
- In practice, the tails of the probability distribution rarely match the tails of the data distribution. In fact, the points to the extreme left and right of a Q-Q plot often exhibit a systematic bend away from a straight line. In an OLS regression, these extreme points will be high-leverage points that will unduly affect the OLS fit.

If you choose to ignore these problems, you can use the REG statement in PROC SGPLOT to add a reference line. Alternatively, you can use PROC REG in SAS (perhaps with the NOINT option if the location parameter is zero) to obtain an estimate of the slope:

```sas
proc reg data=QQPlot plots=NONE;
   model Data = Quantile / NOINT;   /* use NOINT when location parameter is 0 */
   ods select ParameterEstimates;
quit;

title2 "Least Squares Reference Line";
proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=0 y=0 slope=2.36558 / lineattrs=(color=Red)
            legendlabel="OLS Estimate";
   xaxis grid label="Exponential Quantiles";
   yaxis grid;
run;
```

For these data, I used the NOINT option to set the threshold parameter to 0. The zero-intercept line with slope 2.36558 is overlaid on the Q-Q plot. As expected, the outliers in the upper-right corner of the Q-Q plot have pulled the regression line upward, so the regression line has a steeper slope than the reference line based on the first and third quartiles. Because the tails of an empirical distribution often differ from the tails of the target distribution, the regression-based reference line can be misleading. I do not recommend its use.

The previous sections describe two ways to overlay a reference line during the exploratory phase of the data analysis. The purpose of the reference line is to guide your eye and help you determine whether the points in the Q-Q plot appear to fall on a straight line. If so, you can move to the modeling phase.

In the modeling phase, you use a parameter estimation method to fit the parameters in the target distribution. Maximum likelihood estimation (MLE) is often the method of choice for estimating parameters from data. You can use the HISTOGRAM statement in PROC UNIVARIATE to obtain a maximum likelihood estimate of the scale parameter for the exponential distribution, which turns out to be 2.21387. If you specify the location and scale parameters in the QQPLOT statement, PROC UNIVARIATE automatically overlays a line that represents the fitted values:

proc univariate data=Q;
   histogram y / exp;
   qqplot y / exp(threshold=0 scale=est)
              odstitle="Q-Q Plot with MLE Estimate" grid;
   ods select ParameterEstimates GoodnessOfFit QQPlot;
run;

The ParameterEstimates table shows the maximum likelihood estimate. The GoodnessOfFit table shows that there is no evidence to reject the hypothesis that these data came from an Exp(σ=2.21) distribution.
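The estimate itself has a closed form: for data assumed to follow an exponential distribution with the threshold fixed at 0, the MLE of the scale parameter σ is simply the sample mean. A minimal Python sketch (with illustrative numbers, not the data in this post):

```python
# Sketch: for data assumed to come from an Exp(sigma) distribution with
# threshold 0, the maximum likelihood estimate of the scale parameter
# sigma is the sample mean. (Illustrative numbers, not the post's data.)
def exp_scale_mle(data):
    return sum(data) / len(data)

sample = [0.8, 1.7, 2.5, 3.1, 4.4]
print(exp_scale_mle(sample))   # 2.5
```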

Notice the distinction between this line and the previous lines. This line is the result of fitting the target distribution to the data (MLE) whereas the previous lines were visual guides. When you display a Q-Q plot that has a diagonal line, you should state how the line was computed.

In conclusion, you can display a Q-Q plot without adding any reference line. If you choose to overlay a line, there are three common methods. During the exploratory phase of analysis, you can display a line that connects the 25th and 75th percentiles of the data and target distributions. (Some practitioners use an OLS regression line, but I do not recommend it.) During the modeling phase, you can use maximum likelihood estimation or some other fitting method to estimate the location and scale of the target distribution. Those estimates can be used as the intercept and slope, respectively, of a line on the Q-Q plot. PROC UNIVARIATE in SAS displays this line automatically when you fit a distribution.

The post Three ways to add a line to a Q-Q plot appeared first on The DO Loop.

The post How to align the Y and Y2 axes in PROC SGPLOT appeared first on The DO Loop.

The simplest situation is a single set of data that you want to display in two different units. For example, you might use one axis to display the data in imperial units (pounds, gallons, degrees Fahrenheit, etc.) and the other axis to display the data in metric units (kilograms, liters, degrees Celsius, etc.).

To plot the data, define one variable for each unit. For example, the Sashelp.Class data records the weight for 19 students in pounds. The following DATA view creates a new variable that records the same data in kilograms. The subsequent call to PROC SGPLOT plots the pounds on the Y axis (left axis) and the kilograms on the Y2 axis (right axis). However, as you will see, there is a problem with the default scaling of the two axes:

data PoundsKilos / view=PoundsKilos;
   set Sashelp.Class(rename=(Weight=Pounds));
   Kilograms = 0.453592 * Pounds;   /* convert pounds to kilos */
run;

title "Independent Axes";
title2 "Markers Do Not Align Correctly!";
/* the tick marks on each axis are independent */
proc sgplot data=PoundsKilos;
   scatter x=Height y=Pounds;
   scatter x=Height y=Kilograms / Y2Axis;
run;

The markers for the kilogram measurements should exactly overlap the markers for pounds, but they don't. The Y and Y2 axes are independently scaled because PROC SGPLOT does not know that pounds and kilograms are linearly related. The SGPLOT procedure displays each variable by using a range of round numbers (multiples of 10 or 20). The range for the Y2 axis is [20, 70] kilograms, which corresponds to a range of [44.1, 154.3] pounds. However, the range for the Y axis is approximately [50, 150] pounds. Because the axes display different ranges, the markers do not overlap.
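You can verify the arithmetic behind these ranges. Assuming the same conversion factor as the DATA step (1 lb = 0.453592 kg), a quick Python check confirms that [20, 70] kilograms corresponds to roughly [44.1, 154.3] pounds:

```python
# Check (assumption: 1 lb = 0.453592 kg, as in the DATA step):
# the Y2 axis range of [20, 70] kilograms corresponds to about
# [44.1, 154.3] pounds, which is wider than the Y axis range of
# roughly [50, 150] pounds -- hence the misaligned markers.
KG_PER_LB = 0.453592

def kg_to_lb(kg):
    return kg / KG_PER_LB

print(round(kg_to_lb(20), 1), round(kg_to_lb(70), 1))   # 44.1 154.3
```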

To improve this graph, use the VALUES= and VALUESDISPLAY= options on the YAXIS statement (or Y2AXIS statement) to force the tick marks on one axis to align with the corresponding tick marks on the other axis. In the following DATA step, I use the kilogram scale as the standard and compute the corresponding pounds.

data Ticks;
   do Kilograms = 20 to 70 by 10;      /* for each Y2 tick */
      Pounds = Kilograms / 0.453592;   /* convert kilos to pounds */
      Approx = round(Pounds, 0.1);     /* use rounded values to display tick values */
      output;
   end;
run;

proc print; run;

You can use the Pounds column in the table to set the VALUES= list on the YAXIS statement. You can use the Approx column to set the VALUESDISPLAY= list, as follows:

/* align tick marks on each axis */
title "Both Axes Use the Same Scale";
proc sgplot data=PoundsKilos noautolegend;
   scatter x=Height y=Pounds;
   /* Make sure the plots overlay exactly! Then you can set SIZE=0 */
   scatter x=Height y=Kilograms / markerattrs=(size=0) Y2Axis;
   yaxis grid values=(44.092 66.139 88.185 110.231 132.277 154.324)
         valuesdisplay=('44.1' '66.1' '88.2' '110.2' '132.3' '154.3');
run;

Success! The markers for the two variables align exactly. After verifying that they align, you can use the MARKERATTRS=(SIZE=0) option to suppress the display of one of the markers.

Notice that the Y axis (pounds) no longer displays "nice numbers" because I put the tick marks at the same vertical heights on both axes. A different way to solve the misalignment problem is to use the MIN=, MAX=, THRESHOLDMIN=, and THRESHOLDMAX= options on both axes. This will enable both axes to use "nice numbers" while still aligning the data. If you want to try this approach, here are the YAXIS and Y2AXIS statements:

/* set the axes ranges to corresponding values */
yaxis  grid thresholdmin=0 thresholdmax=0 min=44.1 max=154.3;
y2axis grid thresholdmin=0 thresholdmax=0 min=20   max=70;

Another situation that requires two Y axes is the case of two series that use different units. For example, you might want to plot the revenue for a US company (in dollars) and the revenue for a Japanese company (in yen) for a certain time period. You can use the conversion rate between yen and dollars to align the values on the axes. Of course, the conversion from Japanese yen to US dollars changes each day, but you can use an average conversion rate to set the correspondence between the axes.

This situation also occurs when two devices use different methods to measure the same quantity. The following example shows measurements for a patient who receives a certain treatment. The quantity of a substance in the patient's blood is measured at baseline and for every hour thereafter. The quantity is measured in two ways: by using a traditional blood test and by using a new noninvasive device that measures electrical impedance. The following statements define and plot the data. The two axes are scaled by using the default method:

data BloodTest1;
   label t = "Hours after Medication"
         x = "micrograms per deciliter"
         y = "kiloOhms";
   input x y @@;
   t = _N_ - 1;
datalines;
169.0 45.5  130.8 33.4  109.0 23.8   94.1 19.8
 86.3 20.4   78.4 18.7   76.1 16.1   72.2 16.7
 70.0 11.9   69.8 14.6   69.5 10.6   68.7 12.7
 67.3 16.9
;

title "Overlay Measurements for Two Medical Devices";
title2 "Default Scaling";
proc sgplot data=BloodTest1;
   series x=t y=x / markers legendlabel="Standard Lab Value";
   series x=t y=y / markers Y2Axis legendlabel="New Device";
   xaxis values=(0 to 12 by 2);
   yaxis grid label="micrograms per deciliter";
   y2axis grid label="kiloOhms";
run;

In this graph, the Y axes are scaled independently. However, the company that manufactures the device used Deming regression to establish that the measurements from the two devices are linearly related by the equation Y = –10.56415 + 0.354463*X, where X is the measurement from the blood test. You can use this linear equation to set the scales for the two axes.

The following DATA step uses the Deming regression estimates to convert the tick marks on the Y axis into values for the Y2 axis. The call to PROC SGPLOT creates a graph in which the Y2 axis is aligned with the Y axis according to the Deming regression estimates.

data Ticks;
   /* use Deming regression to find one set of ticks in terms of the other */
   do Y1 = 60 to 160 by 20;
      Y2 = -10.56415 + 0.354463 * Y1;   /* kiloOhms as a function of micrograms/dL */
      Approx = round(Y2, 0.1);
      output;
   end;
run;

proc print; run;

title "Align Y Axes for Different Series";
title2 "Measurements are Linearly Related";
proc sgplot data=BloodTest1;
   series x=t y=x / markers legendlabel="Standard Lab Value";
   series x=t y=y / markers Y2Axis legendlabel="New Device";
   xaxis values=(0 to 12 by 2);
   /* the same offsets must be used in both YAXIS and Y2AXIS stmts */
   yaxis grid label="micrograms per deciliter" offsetmax=0.1
         values=(60 to 160 by 20);
   y2axis grid label="kiloOhms" offsetmax=0.1
          values=(10.7036 17.7929 24.8822 31.9714 39.0607 46.1499)
          valuesdisplay=('10.7' '17.8' '24.9' '32.0' '39.1' '46.1');
run;

In this new graph, the measurements are displayed on compatible scales and the reference lines connect round numbers on one axis to the corresponding values on the other axis.
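As a sanity check on the tick values, you can apply the Deming regression equation directly to the Y-axis ticks. This Python sketch reproduces (up to rounding) the VALUES= list on the Y2AXIS statement:

```python
# Apply the Deming regression equation from the post,
# Y2 = -10.56415 + 0.354463*Y1, to map the Y-axis ticks
# (micrograms/dL) to Y2-axis values (kiloOhms).
def to_kiloohms(y1):
    return -10.56415 + 0.354463 * y1

ticks = [to_kiloohms(y1) for y1 in range(60, 161, 20)]
print([round(t, 4) for t in ticks])
```

The printed values match the list 10.7036, 17.7929, ..., 46.1499 used in the PROC SGPLOT call.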

