The post The essential guide to bootstrapping in SAS appeared first on The DO Loop.
Recall that a bootstrap analysis enables you to investigate the sampling variability of a statistic without making any distributional assumptions about the population. For example, if you compute the skewness of a univariate sample, you get an estimate for the skewness of the population. You might want to know the range of skewness values that you could observe from a second sample (of the same size) from the population. If the range is large, the original estimate is imprecise. If the range is small, the original estimate is precise. Bootstrapping enables you to estimate the range by using only the observed data.
In general, the basic bootstrap method consists of four steps:
The links in the previous list provide examples of best practices for bootstrapping in SAS. In particular, do not fall into the trap of using a macro loop to "resample, analyze, and append." You will eventually get the correct bootstrap estimates, but you might wait a long time to get them!
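The resample-analyze-summarize pattern is not specific to SAS. As a language-neutral illustration of the basic method, here is a minimal Python/NumPy sketch that bootstraps the skewness of a hypothetical sample; the data, the seed, and the number of resamples are arbitrary choices for illustration:

```python
import numpy as np

def skewness(x):
    """Sample skewness (population formula), computed with NumPy only."""
    z = (x - x.mean()) / x.std()
    return (z**3).mean()

rng = np.random.default_rng(12345)
data = rng.exponential(scale=2.0, size=100)   # hypothetical observed sample

# 1. Compute the statistic on the original data.
estimate = skewness(data)

# 2-3. Resample with replacement many times; analyze each resample.
B = 5000
boot_stats = np.array([skewness(rng.choice(data, size=data.size, replace=True))
                       for _ in range(B)])

# 4. Summarize the bootstrap distribution, e.g., with a 95% percentile interval.
lo, hi = np.percentile(boot_stats, [2.5, 97.5])
print(f"estimate = {estimate:.3f}, 95% percentile CI = [{lo:.3f}, {hi:.3f}]")
```

The loop over resamples is acceptable in Python; the warning in the text is about SAS macro loops, where each iteration pays the overhead of a full procedure call.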
The remainder of this article is organized by the three ways to perform bootstrapping in SAS:
The articles in this section describe how to program the bootstrap method in SAS for basic univariate analyses, for regression analyses, and for related resampling techniques such as the jackknife and permutation tests. This section also links to articles that describe how to generate bootstrap samples in SAS.
When you bootstrap regression statistics, you have two choices for generating the bootstrap samples:
An important part of a bootstrap analysis is generating multiple bootstrap samples from the data. In SAS, there are many ways to obtain the bootstrap samples:
The SAS-supplied macros %BOOT, %JACK, and %BOOTCI can perform basic bootstrap analyses and jackknife analyses. However, they require a familiarity with writing and using SAS macros. If you are interested, I wrote an example that shows how to use the %BOOT and %BOOTCI macros for bootstrapping. The documentation also provides several examples.
Many SAS procedures not only compute statistics but also provide standard errors or confidence intervals that enable you to infer whether an estimate is precise. Many confidence intervals are based on distributional assumptions about the population. ("If the errors are normally distributed, then....") However, the following SAS procedures provide an easy way to obtain a distribution-free confidence interval by using the bootstrap. See the SAS/STAT documentation for the syntax for each procedure.
Resampling techniques such as bootstrap methods and permutation tests are widely used by modern data analysts. But how you implement these techniques can make the difference between getting results in a few seconds and waiting a few hours. This article summarizes and consolidates many previous articles that demonstrate how to perform an efficient bootstrap analysis in SAS. Bootstrapping enables you to investigate the sampling variability of a statistic without making any distributional assumptions. In particular, the bootstrap is often used to estimate standard errors and confidence intervals for parameters.
The post Visualize Christmas songs appeared first on The DO Loop.
In the Christmas movie Elf (2003), Jovie (played by Zooey Deschanel) must "spread Christmas cheer" to help Santa. She chooses to sing "Santa Claus is coming to town," and soon all of New York City is singing along.
The best sing-along songs are short and have lyrics that repeat. Jovie's choice, "Santa Claus is coming to town," satisfies both criteria. The musical structure of the song is simple:
There is a fun way to visualize repetition in song lyrics. For a song that has N words, you can define the repetition matrix to be the N x N matrix whose (i,j)th cell has the value 1 if the ith word is the same as the jth word. Otherwise, the (i,j)th cell equals 0. You can visualize the matrix by using a two-color heat map. Colin Morris has a web site devoted to these visualizations.
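The repetition matrix takes only a few lines to compute. The post's own program is written in SAS, but here is a Python/NumPy sketch of the same idea; the sample lyric below is just an illustrative fragment:

```python
import numpy as np

def repetition_matrix(lyrics):
    """Return the N x N 0/1 matrix whose (i,j) cell is 1 when word i equals word j."""
    words = lyrics.lower().split()
    return np.array([[int(w1 == w2) for w2 in words] for w1 in words])

# a hypothetical repetitive line of lyrics
M = repetition_matrix("you better watch out you better not cry")
print(M)
```

The diagonal is always 1 (every word matches itself), the matrix is symmetric, and a repeated phrase shows up as a short diagonal line away from the main diagonal, which is exactly the structure the heat maps reveal.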
The following image visualizes the lyrics of "Santa Claus is coming to town." I have added some vertical and horizontal lines to divide the lyrics into seven sections: the verses (V1 and V2), the tag line (S), and the bridge (B).
The image shows the structure of the repetition in the song lyrics:
Now that you understand what a repetition matrix looks like and how to interpret it, let's visualize a few other classic Christmas songs that contain repetitive lyrics! To help "spread Christmas cheer," I'll use shades of red and green to visualize the lyrics, rather than the boring white and black colors.
If you make a list of Christmas songs that have repetition, chances are "The Twelve Days of Christmas" will be at the top of the list. The song is formulaic: each new verse adds a few new words before repeating the words from the previous verse. As a result, the repetition matrix is almost boring in its regularity. Here is the visualization of the classic song (click to enlarge):
Another highly repetitive Christmas song is "The Little Drummer Boy," which features an onomatopoeic phrase (Pa rum pum pum pum) that alternates with the other lyrics. A visualization of the classic song is shown below:
In addition to repeating the title, "Silver Bells" repeats several phrases. Most notably, the phrase "Soon it will be Christmas Day" is repeated multiple times at the end of the song. Because only certain phrases are repeated, the visualization has a pleasing structure that complements the song's lyrical qualities:
To contrast the hustle, bustle, and commercialism of Christmas, I enjoy hearing songs that are musically simple. One of my favorites is "Silent Night." Each verse is distinct, yet each begins with "Silent night, holy night!" and ends by repeating a phrase. The resulting visualization is devoid of clutter. It is visually empty and matches the lyrical imagery, "all is calm, all is bright."
You can download the SAS program that creates these images. The program also computes visualizations of some contemporary songs such as "Last Christmas" by Wham!, "Someday at Christmas" (Stevie Wonder version), "Rockin' Around the Christmas Tree" (Brenda Lee version), and "Happy XMas (War Is Over)" by John Lennon and Yoko Ono. If you have access to SAS, you can even add your own favorite lyrics to the program! If you don't have access to SAS, Colin Morris's website enables you to paste in the lyrics and see the visualization.
In a little-known "deleted scene" from Elf, Buddy says that the second-best way to spread Christmas cheer is posting images for all to share! So post a comment and share your favorite visualization of a Christmas song!
Happy holidays to all my readers. I am grateful for you. Merry Christmas to all, and to all a good night!
The post When is a histogram not a histogram? When it's a table! appeared first on The DO Loop.
proc univariate data=Sashelp.Cars noprint;
   var MPG_City;
   histogram MPG_City / barlabel=count outhist=MidPtOut;
run;

proc print data=MidPtOut label;
   label _MIDPT_ = "Midpoint" _COUNT_ = "Frequency";
   var _MIDPT_ _COUNT_;
run;
As I've previously discussed, PROC UNIVARIATE supports two options for specifying the locations of bins. The MIDPOINTS option specifies that "nice" numbers (for example, multiples of 2, 5, or 10) are used for the midpoints of the bins; the ENDPOINTS option specifies that nice numbers are used for the endpoints of the bins. By default, midpoints are used, as shown in the previous section. The following call to PROC UNIVARIATE uses the ENDPOINTS option and writes the new bin counts to a data set. The histogram is not shown.
proc univariate data=Sashelp.Cars noprint;
   var MPG_City;
   histogram MPG_City / barlabel=count endpoints outhist=EndPtOut;
run;

proc print data=EndPtOut;
   label _MINPT_ = "Left Endpoint" _COUNT_ = "Frequency";
   var _MINPT_ _COUNT_;
run;
If you want to "manually" count the number of observations in each bin, you have a few choices. If you already know the bin width and anchor position for the bins, then you can use a DATA step array to accumulate the counts. You can also use PROC FORMAT to define a format to bin the observations and use PROC FREQ to tabulate the counts.
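When the bin width and anchor are known, the tabulation itself is routine in any language. For example, here is a hypothetical Python/NumPy illustration that counts observations in bins of width 5 anchored at 10 (the data values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
mpg = rng.integers(10, 61, size=50)          # hypothetical fuel-economy values

edges = np.arange(10, 65, 5)                 # known bin width (5) and anchor (10)
counts, _ = np.histogram(mpg, bins=edges)    # tabulate observations per bin
for left, k in zip(edges[:-1], counts):
    print(f"[{left}, {left+5}): {k}")
```

Like PROC UNIVARIATE, np.histogram uses half-open intervals for all bins except the last, which is closed on the right.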
The harder problem is when you do not have a prior set of "nice" values to use as the endpoints of bins. It is usually not satisfactory to use the minimum and maximum data values as endpoints of the binning intervals because that might result in intervals whose endpoints are long decimal values such as [3.4546667 4.0108333].
Fortunately, the SAS/IML language provides the GSCALE subroutine, which computes "nice" values from a vector of data and the number of bins. The GSCALE routine returns a three-element vector. The first element is the minimum value of the leftmost interval, the second element is the maximum value of the rightmost interval, and the third element is the bin width. For example, the following SAS/IML statements compute nice intervals for the data in the MPG_City variable:
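The exact algorithm that GSCALE uses is not documented in this post, but the classic "nice numbers" idea behind it is easy to sketch. The following Python function is an approximation of that idea, not the SAS implementation; the candidate multipliers (1, 2, 2.5, 5, 10) are a common convention, chosen here as an assumption:

```python
import math

def nice_scale(x, nbins=10):
    """Return (start, stop, width): a 'nice' bin width and an interval
       [start, stop] that covers the data. A sketch of the GSCALE idea,
       not the exact SAS algorithm."""
    lo, hi = min(x), max(x)
    raw = (hi - lo) / nbins                       # ideal (ugly) bin width
    mag = 10 ** math.floor(math.log10(raw))       # order of magnitude
    for m in (1, 2, 2.5, 5, 10):                  # assumed 'nice' multipliers
        width = m * mag
        if width >= raw:
            break
    start = math.floor(lo / width) * width        # snap endpoints outward
    stop = math.ceil(hi / width) * width
    return start, stop, width

print(nice_scale([10, 23, 41, 60]))
```

For data that range from 10 to 60, the sketch returns (10, 60, 5), which matches the intervals [10, 15), [15, 20), ..., [55, 60] discussed below.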
proc iml;
use Sashelp.Cars;
read all var "MPG_City" into X;
close;

/* GSCALE subroutine computes "nice" tick values: s[1] <= min(x); s[2] >= max(x) */
call gscale(s, x, 10);   /* ask for about 10 intervals */
print s[rowname={"Start" "Stop" "Increment"}];
The output from the GSCALE subroutine suggests that a good set of intervals to use for binning the data are [10, 15), [15, 20), ..., [55, 60]. These are the same endpoints that are generated by using the ENDPOINTS option in PROC UNIVARIATE. (Actually, the procedure uses half-open intervals for all bins, so it adds the extra interval [60, 65) to the histogram.)
I've previously shown how to use the BIN and TABULATE functions in SAS/IML to count the observations in a set of bins. The following statements use the values from the GSCALE routine to form evenly spaced cutpoints for the binning:
cutPoints = do(s[1], s[2], s[3]);        /* use "nice" cutpoints from GSCALE */
*cutPoints = do(s[1], s[2]+s[3], s[3]);  /* ALTERNATIVE: add additional cutpoint to match UNIVARIATE */
b = bin(x, cutPoints);                   /* find bin for each obs */
call tabulate(bins, freq, b);            /* count how many obs in each bin */
binLabels = char(cutPoints[bins]);       /* use left endpoint as labels for bins */
print freq[colname=binLabels label="Count"];
Except for the last interval, the counts are the same as for the ENDPOINTS option in PROC UNIVARIATE. It is a matter of personal preference whether you want to treat the last interval as a closed interval or whether you want all intervals to be half open. If you want to exactly match PROC UNIVARIATE, you can modify the definition of the cutPoints variable, as indicated in the program comments.
Notice that the TABULATE routine only reports the bins that have nonzero counts. If you prefer to obtain counts for ALL bins—even bins with zero counts—you can use the TabulateLevels module, which I described in a previous blog post.
In summary, you can use PROC UNIVARIATE or SAS/IML to create a tabular representation of a histogram. Both procedures provide a way to obtain "nice" values for the bin endpoints. If you already know the endpoints for the bins, you can use other techniques in SAS to produce the table.
The post 5 tips for customizing legends in PROC SGPLOT in SAS appeared first on The DO Loop.
By default, the SGPLOT procedure displays a legend when there are multiple plots that are overlaid in the graph. This can be caused by multiple statements or by using the GROUP= option on a statement. If the information in the default legend is redundant, you might want to suppress it. For example, the following legend is unnecessary because the title explains the data and the regression line. You can uncomment the NOAUTOLEGEND option to suppress the legend.
title "Linear Regression for Weight and Height";
title2 "The legend is unnecessary";
proc sgplot data=Sashelp.Class /* NOAUTOLEGEND */;
   scatter x=Height y=Weight;
   reg x=Height y=Weight / nomarkers;
   footnote J=L "Use the NOAUTOLEGEND option to suppress the legend";
run;
footnote;
In some graphs that overlay multiple components, some components are self-explanatory and do not need to appear in the legend. You can choose which components appear in the legend by using the NAME= option on the statements and using the KEYLEGEND statement to specify the contents of the legend. For example, the following statements create a graph that consists of a scatter plot, a confidence ellipse, and a regression line. If you only want the confidence ellipse and regression line to appear in the legend, use the NAME= option to identify each component and use the KEYLEGEND statement to specify the contents of the legend:
title "Weight versus Height";
title2 "Overlay Least Squares Fit and Confidence Ellipse";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / name="scatter";
   ellipse x=Height y=Weight / name="ellipse";
   reg x=Height y=Weight / name="reg" nomarkers lineattrs=GraphData2;
   keylegend "reg" "ellipse";   /* list items in the order you want them */
run;
The KEYLEGEND statement supports the LOCATION= and POSITION= options, which enable you to place the legend almost anywhere in the graph. The LOCATION= option controls whether the legend appears inside or outside of the graph area. The POSITION= option controls the placement of the legend on the graph (left, right, top, bottom,...). However, I can never remember which option controls which attribute! Therefore, I created a mnemonic, which I hope will help you remember, too:
The following graph is the same as in the previous example, except that the location of the legend is inside the graph area and the position of the legend is in the lower-right corner. When you move the legend to the left or right side of the graph, it is often useful to use the ACROSS=1 option to force the legend to list the items vertically. Similarly, if you position the legend at the top or bottom of a graph, you might want to use the DOWN=1 option to list the items horizontally.
keylegend "reg" "ellipse" / location=inside position=bottomright across=1;
When you use the GROUP= option to display groups, you might want to exclude some of the group categories from the legend. The KEYLEGEND statement supports the EXCLUDE= option that you can use to exclude certain items. Three situations come to mind:
ods graphics / attrpriority=none;
title "Patient Status";
proc sgplot data=Sashelp.Heart(obs=200 where=(Systolic<=200));
   styleattrs datasymbols=(X CircleFilled);
   scatter x=Systolic y=Diastolic / group=Status;
   keylegend / exclude=("Alive");
run;
The previous section shows how to exclude one or more levels in a categorical variable that is specified on the GROUP= option. You also might want to customize the items that appear in the legend in order to combine, for example, marker and line attributes. A situation where this comes up is when you want to overlay a group of curves on a scatter plot.
The LEGENDITEM statement (supported in SAS 9.4M5) enables you to specify what combination of markers and line patterns you want to appear for every item in a legend. It is a "super customization" statement that gives you complete control over the legend items.
The following statements show how to use the LEGENDITEM statement to create a customized legend. By default, if you use the REG statement with the GROUP= option, the legend will show only the colors and line patterns for the regression lines. In the following example, I have used the ATTRPRIORITY=NONE option to force the marker symbols to differ between groups. I want the legend to show not only the colors and patterns of the regression lines but also the marker symbols for each group:
/* ensure order of BP_Status is High, Normal, Optimal */
proc sort data=Sashelp.Heart(obs=200 where=(Systolic<=200)) out=Heart;
   by BP_Status;
run;

ods graphics / attrpriority=none;
title "Patients by Blood Pressure Status";
proc sgplot data=Heart;
   styleattrs datalinepatterns=(solid);
   reg x=Systolic y=Diastolic / group=BP_Status;
   legenditem type=markerline name="H" / label="High"    lineattrs=GraphData1 markerattrs=GraphData1;
   legenditem type=markerline name="N" / label="Normal"  lineattrs=GraphData2 markerattrs=GraphData2;
   legenditem type=markerline name="O" / label="Optimal" lineattrs=GraphData3 markerattrs=GraphData3;
   keylegend "O" "N" "H" / title="BP Status";
run;
In summary, PROC SGPLOT in SAS supports several ways to create, suppress, position, and customize the items in a legend. Do you have a favorite way to customize a legend in PROC SGPLOT? Leave a comment!
The post Singular parameterizations, generalized inverses, and regression estimates appeared first on The DO Loop.
Note: The X'X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations. Terms whose estimates are
followed by the letter 'B' are not uniquely estimable.
Singular matrix? Generalized inverse? Estimates not unique? "What's going on?" I thought.
In spite of the ominous note, nothing is wrong. The note merely tells you that the GLM procedure has computed one particular estimate; other valid estimates also exist. This article explains what the note means in terms of the matrix computations that are used to estimate the parameters in a linear regression model.
The note is caused by the fact that the GLM model includes a classification variable. Recall that a classification variable in a regression model is a categorical variable whose levels are used as explanatory variables. Examples of classification variables (called CLASS variables in SAS) are gender, race, and treatment. The levels of the categorical variable are encoded in the columns of a design matrix. The columns are often called dummy variables. The design matrix is used to form the "normal equations" for least squares regression. In terms of matrices, the normal equations are written as (X`*X)*b = X`*Y, where X is a design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates, which must be computed.
There are many ways to construct a design matrix from classification variables. If X is a design matrix that has linearly dependent columns, the crossproduct matrix X`X is singular. Some ways of creating the design matrix always result in linearly dependent columns; these constructions are said to use a singular parameterization.
The simplest and most common parameterization encodes each level of a categorical variable by using a binary indicator column. This is known as the GLM parameterization. It is a singular parameterization because if X1, X2, ..., Xk are the binary columns that indicate the k levels, then X1 + X2 + ... + Xk = 1 for each observation.
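You can see this linear dependence numerically. The following NumPy sketch builds a GLM-parameterized design matrix for a hypothetical three-level CLASS variable (the six observations are made up) and shows that the crossproduct matrix is rank deficient:

```python
import numpy as np

# hypothetical observations of a three-level CLASS variable
levels = ["Optimal", "Normal", "High", "Normal", "High", "Optimal"]
cats = ["High", "Normal", "Optimal"]

# GLM parameterization: intercept column plus one binary indicator per level
intercept = np.ones((len(levels), 1))
dummies = np.array([[int(v == c) for c in cats] for v in levels])
X = np.hstack([intercept, dummies])

# the indicator columns sum to the intercept column, so X'X is singular
print(dummies.sum(axis=1))              # every row sums to 1
print(np.linalg.matrix_rank(X.T @ X))   # rank 3, although X'X is 4 x 4
```

Because the intercept equals the sum of the indicator columns, X'X has rank 3 rather than 4, which is exactly the singularity the PROC GLM note warns about.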
Not surprisingly, the GLM procedure in SAS uses the GLM parameterization. Here is an example that generates the "scary" note. The data are a subset of the Sashelp.Heart data set. The levels of the BP_Status variable are "Optimal", "Normal", and "High":
data Patients;
   set Sashelp.Heart;
   keep BP_Status Cholesterol;
   if ^cmiss(BP_Status, Cholesterol);   /* discard any missing values */
run;

proc glm data=Patients plots=none;
   class BP_Status;
   model Cholesterol = BP_Status / solution;
quit;
If you have linearly dependent columns among the explanatory variables, the parameter estimates are not unique. The easiest way to see this is to change the reference level for a classification variable. In PROC GLM, you can use the REF=FIRST or REF=LAST option on the CLASS statement to change the reference level. However, the following example uses PROC GLMSELECT (without variable selection) because you can simultaneously use the OUTDESIGN= option to write the design matrix to a SAS data set. The first call writes the design matrix that PROC GLM uses (internally) for the default reference levels. The second call writes the design matrix for an alternate reference level:
/* GLMSELECT can fit the data and output a design matrix in one step */
title "Estimates for GLM Parameterization";
title2 "Default (Last) Reference Levels";
ods select ParameterEstimates(persist);
proc glmselect data=Patients outdesign(fullmodel)=GLMDesign1;
   class BP_Status;
   model Cholesterol = BP_Status / selection=none;
quit;

/* Change reference levels. Different reference levels result in different estimates. */
title2 "Custom Reference Levels";
proc glmselect data=Patients outdesign(fullmodel)=GLMDesign2;
   class BP_Status(ref='Normal');
   model Cholesterol = BP_Status / selection=none;
quit;
ods select all;

/* compare a few rows of the design matrices */
proc print data=GLMDesign1(obs=10 drop=Cholesterol); run;
proc print data=GLMDesign2(obs=10 drop=Cholesterol); run;
The output shows that changing the reference level results in different parameter estimates. (However, the predicted values are identical for the two estimates.) If you use PROC PRINT to display the first few observations in each design matrix, you can see that the matrices are the same except for the order of two columns. Thus, if you have linearly dependent columns, the GLM estimates might depend on the order of the columns.
You might wonder why the parameter estimates change when you change reference levels (or, equivalently, change the order of the columns in the design matrix). The mathematical answer is that there is a whole family of solutions that satisfy the (singular) regression equations, and from among the infinitely many solutions, the GLM procedure chooses the solution for which the estimate of the reference level is exactly zero.
Last week I discussed generalized inverses, including the SWEEP operator and the Moore-Penrose inverse. The SWEEP operator is used by PROC GLM to obtain the parameter estimates. The SWEEP operator produces a generalized inverse that is not unique. In particular, the SWEEP operator computes a generalized inverse that depends on the order of the columns in the design matrix.
The following SAS/IML statements read in the design matrices for each GLM parameterization and use the SWEEP function to reproduce the parameter estimates that are computed by the GLM procedure. For each design matrix, the program computes solutions to the normal equations (X`*X)*b = (X`*Y). The program also computes the Moore-Penrose solution for each design matrix.
proc iml;
/* read design matrix and form X`*X and X`*Y */
use GLMDesign1;
read all var _NUM_ into Z[c=varNames];
close;
p = ncol(Z);
X = Z[, 1:(p-1)];
Y = Z[, p];
vNames = varNames[, 1:(p-1)];
A = X`*X;
c = X`*Y;

/* obtain G2 and Moore-Penrose solutions for this design matrix */
Sweep1 = sweep(A)*c;
GInv1 = ginv(A)*c;
print Sweep1[r=vNames], GInv1;

/* read the other design matrix and form X`*X and X`*Y */
use GLMDesign2;
read all var _NUM_ into Z[c=varNames];
close;
p = ncol(Z);
X = Z[, 1:(p-1)];
Y = Z[, p];
vNames = varNames[, 1:(p-1)];
A = X`*X;
c = X`*Y;

/* obtain G2 and Moore-Penrose solutions for this design matrix */
Sweep2 = sweep(A)*c;
GInv2 = ginv(A)*c;
print Sweep2[r=vNames], GInv2;
The results demonstrate that the SWEEP solution depends on the order of columns in a linearly dependent design matrix. However, the Moore-Penrose solution does not depend on the order. The Moore-Penrose solution is the same no matter which reference levels you choose for the GLM parameterization of classification effects.
It is worth mentioning that although the parameter estimates are not unique, there are other statistics that are unique and that do not depend on the order in which the crossproduct matrix is swept or the reference level. An example is a difference between the means of two class levels. The SAS/STAT documentation contains a chapter that discusses these so-called estimable functions.
In summary, the scary note that PROC GLM produces reminds you of the following mathematical facts:
If the last fact bothers you (it shouldn't), an alternative estimate is available by using the GINV function to compute the Moore-Penrose inverse. The corresponding estimate is unique and does not depend on the reference level.
The post A funnel plot for immunization rates appeared first on The DO Loop.
Last week my colleague, Robert Allison, visualized data regarding immunization rates for kindergarten classes in North Carolina. One of his graphs was a scatter plot that displayed the proportion of unimmunized students versus the size of the class for 1,885 kindergarten classes in NC. This scatter plot is the basis for a statistical plot that is known as a funnel plot for proportions. The plot to the right shows a funnel plot that I created based on Robert's analysis.
The basic idea of a funnel plot is that small random samples have more variation than large random samples from the same population. If students are randomly chosen from a population in which some small proportion of children have an attribute, it might not be unusual if 40% of the students in a five-student class have the attribute (that's 2 out of 5) whereas it might be highly unusual to see such a high proportion in a 100-student class. The funnel plot enhances a scatter plot by adding curves that indicate a reasonable range of values for the proportion, given the size of a random sample.
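The curves of the funnel can be sketched with the usual normal approximation to a binomial proportion. The following Python sketch uses the statewide proportion mentioned later in this article (4.58%); the z = 1.96 multiplier and the normal approximation are assumptions for illustration, not necessarily the exact method behind the graph:

```python
import math

def funnel_limits(p, n, z=1.96):
    """Approximate 95% control limits for the proportion observed in a random
       sample of size n from a population with overall proportion p
       (normal approximation; a sketch, not the exact method in the post)."""
    se = math.sqrt(p * (1 - p) / n)
    return max(0.0, p - z * se), min(1.0, p + z * se)

p = 0.0458   # statewide proportion of unimmunized students
for n in (5, 25, 100, 400):
    lo, hi = funnel_limits(p, n)
    print(f"n={n:4d}: [{lo:.3f}, {hi:.3f}]")
```

Notice how the interval shrinks as the class size grows: a 40% rate in a five-student class is inside the limits, but the same rate in a 100-student class is far outside them.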
For a discussion of funnel plots and how to create a funnel plot in SAS, see the article "Funnel plots for proportions." You can download the immunization data and the SAS program that I used to create the funnel plot for proportions.
This funnel plot uses the shape (and color) of a marker to indicate whether the school is public (circle), private (upward-pointing triangle), or a charter school (right-pointing triangle). The plot includes tool tips so that you can hover the mouse over a marker and see the name and county of the school.
The graph indicates that there are dozens of schools for which the proportion of unimmunized students far exceeds the state average. A graph like this enables school administrators and public health officials to identify schools that have a larger-than-expected proportion of unimmunized students. Identifying schools is the first step to developing initiatives that can improve the immunization rate in school-age children.
You can use a WHERE statement or BY-group processing in PROC SGPLOT to create a funnel plot for each county. A graph that shows only the schools in a particular district is more relevant for local school boards and administrators. The following graphs show the proportion of unimmunized students in Mecklenburg County (near Charlotte) and Wake County (near Raleigh), which are the two largest school districts in NC.
The first graph shows that Mecklenburg County has several schools that contain more than 60 kindergarten students and for which 25% or more of the students are unimmunized. In fact, some large schools have more than 40% unimmunized! In Wake County, fewer schools have extreme proportions, but there are still many schools for which the proportion of unimmunized students is quite large relative to the statewide average.
As Robert pointed out in his blog post, these are not official figures, so it is possible that some of the extreme proportions are caused by data entry errors rather than by hordes of unimmunized students.
The following graph shows a panel of plots for public, private, and charter schools. There are many public schools whose proportion of unimmunized students is well above the statewide average. For the private and charter schools, about 10 schools stand out in each group.
I think the plot of private schools is particularly interesting. When the media reports on outbreaks of childhood diseases in schools, there is often a mention of a "religious exemption," which is when a parent or guardian states that a child will not be immunized because of their religious beliefs. The report often mentions that the school is affiliated with a particular religion or church. I've therefore assumed that private schools have a larger proportion of unimmunized students because of the religious exemption. These data do not indicate which students are unimmunized because of a religious exemption, but the panel of funnel plots indicates that, overall, not many private schools have an abnormally large proportion of unimmunized students. In fact, the private schools show smaller deviations from the expected value than the public and charter schools.
In summary, I started with one of Robert Allison's graphs and augmented it to create a funnel plot for proportions. A funnel plot shows the average proportion and confidence limits for proportions (given the sample size). If the students in the schools were randomly sampled from a population where 4.58% of students are unimmunized, then few schools would be outside of the confidence curve. Of course, in reality, schools are not random samples. Many features of school performance—including unimmunized students—depend on local socioeconomic conditions. By taking into account the size of the classes, the funnel plot identifies schools that have an exceptionally large proportion of unimmunized students. A funnel plot can help school administrators and public health officials identify schools that might benefit from intervention programs and additional resources, or for which the data were incorrectly entered.
The post Generalized inverses for matrices appeared first on The DO Loop.
Recall that the inverse matrix of a square matrix A is a matrix G such that A*G = G*A = I, where I is the identity matrix. When such a matrix exists, it is unique and A is said to be nonsingular (or invertible). If there are linear dependencies in the columns of A, then an inverse does not exist. However, you can define a series of weaker conditions that are known as the Penrose conditions:
Any matrix, G, that satisfies the first condition is called a generalized inverse (or sometimes a "G1" inverse) for A. A matrix that satisfies the first and second condition is called a "G2" inverse for A. The G2 inverse is used in statistics to compute parameter estimates for regression problems (see Goodnight (1979), p. 155). A matrix that satisfies all four conditions is called the Moore-Penrose inverse or the pseudoinverse. When A is square but singular, there are infinitely many matrices that satisfy the first two conditions, but the Moore-Penrose inverse is unique.
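Written out, the four Penrose conditions for a candidate inverse G of A are (1) A*G*A = A, (2) G*A*G = G, (3) (A*G)` = A*G, and (4) (G*A)` = G*A. As a quick numerical check, the following NumPy sketch verifies that the pseudoinverse of the singular matrix used later in this post satisfies all four:

```python
import numpy as np

A = np.array([[100,  50, 20, 10],
              [ 50, 106, 46, 23],
              [ 20,  46, 56, 28],
              [ 10,  23, 28, 14]], dtype=float)   # singular matrix from this post

G = np.linalg.pinv(A)                  # Moore-Penrose pseudoinverse

print(np.allclose(A @ G @ A, A))       # 1. A*G*A = A
print(np.allclose(G @ A @ G, G))       # 2. G*A*G = G
print(np.allclose((A @ G).T, A @ G))   # 3. A*G is symmetric
print(np.allclose((G @ A).T, G @ A))   # 4. G*A is symmetric
```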
In regression problems, the parameter estimates are obtained by solving the "normal equations." The normal equations are the linear system (X`*X)*b = (X`*Y), where X is the design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates. The matrix A = X`*X is symmetric. If the columns of the design matrix are linearly dependent, then A is singular. The following SAS/IML program defines a symmetric singular matrix A and a right-hand-side vector c, which you can think of as X`*Y in the regression context. The call to the DET function computes the determinant of the matrix. A zero determinant indicates that A is singular and that there are infinitely many vectors b that solve the linear system:
proc iml;
A = {100  50 20 10,
      50 106 46 23,
      20  46 56 28,
      10  23 28 14};
c = {130, 776, 486, 243};
det = det(A);   /* demonstrate that A is singular */
print det;
For nonsingular matrices, you can use either the INV or the SOLVE function in SAS/IML to solve for the unique solution of the linear system. However, both functions give errors when called with a singular matrix. SAS/IML provides several ways to compute a generalized inverse, including the SWEEP function and the GINV function. The SWEEP function is an efficient way to use Gaussian elimination to solve the symmetric linear systems that arise in regression. The GINV function computes the Moore-Penrose pseudoinverse. The following SAS/IML statements compute two different solutions for the singular system A*b = c:
b1 = ginv(A)*c;     /* solution even if A is not full rank */
b2 = sweep(A)*c;
print b1 b2;
The SAS/IML language also provides a way to obtain any of the other infinitely many solutions to the singular system A*b = c. Because A is a 4 x 4 matrix of rank 3, it has a one-dimensional kernel (null space). The HOMOGEN function in SAS/IML computes a basis for the null space. That is, it computes a vector that is mapped to the zero vector by A. The following statements compute the unit basis vector for the kernel. The output shows that the vector is mapped to the zero vector:
xNull = homogen(A);   /* basis for nullspace of A */
print xNull (A*xNull)[L="A*xNull"];
All solutions to A*b = c are of the form b + α*xNull, where b is any particular solution and α is an arbitrary scalar.
You can verify that the Moore-Penrose matrix GINV(A) satisfies the four Penrose conditions, whereas the G2 inverse (SWEEP(A)) only satisfies the first two conditions. I mentioned that the singular system has infinitely many solutions, but the Moore-Penrose solution (b1) is unique. It turns out that the Moore-Penrose solution is the solution that has the smallest Euclidean norm. Here is a computation of the norm for three solutions to the system A*b = c:
/* GINV gives the estimate that has the smallest L2 norm: */
GINVnorm = norm(b1);
sweepNorm = norm(b2);
/* you can add alpha*xNull to any solution to get another solution */
b3 = b1 + 2*xNull;    /* here's another solution (alpha=2) */
otherNorm = norm(b3);
print ginvNorm sweepNorm otherNorm;
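If you want to verify these claims outside of SAS, the same computation can be sketched in Python/numpy (illustrative only): numpy's pinv function plays the role of GINV, and the right singular vector for the zero singular value provides the same null-space basis that HOMOGEN computes.

```python
import numpy as np

A = np.array([[100.,  50., 20., 10.],
              [ 50., 106., 46., 23.],
              [ 20.,  46., 56., 28.],
              [ 10.,  23., 28., 14.]])
c = np.array([130., 776., 486., 243.])

b1 = np.linalg.pinv(A) @ c        # Moore-Penrose solution (the role of GINV)
xNull = np.linalg.svd(A)[2][-1]   # right singular vector for the zero singular
                                  # value: a null-space basis (the role of HOMOGEN)

assert np.allclose(A @ b1, c)                 # b1 solves the singular system
assert np.allclose(A @ xNull, 0, atol=1e-8)   # xNull is mapped to the zero vector
assert abs(float(b1 @ xNull)) < 1e-6          # b1 is orthogonal to the null space,
# which is why it has a smaller norm than any other solution b1 + alpha*xNull:
assert np.linalg.norm(b1) < np.linalg.norm(b1 + 2 * xNull)
```

The orthogonality in the third assertion is the geometric reason for the minimal-norm property: by the Pythagorean theorem, ||b1 + α*xNull||² = ||b1||² + α², which is minimized at α = 0.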
Because all solutions are of the form b1 + α*xNull, where xNull is the basis for the nullspace of A, you can graph the norm of the solutions as a function of α. The graph is shown below and indicates that the Moore-Penrose solution is the minimal-norm solution:
alpha = do(-2.5, 2.5, 0.05);
norm = j(1, ncol(alpha), .);
do i = 1 to ncol(alpha);
   norm[i] = norm(b1 + alpha[i] * xNull);
end;

title "Euclidean Norm of Solutions b + alpha*xNull";
title2 "b = Solution from Moore-Penrose Inverse";
title3 "xNull = Basis for Nullspace of A";
call series(alpha, norm) other="refline 0 / axis=x label='b=GINV';refline 1.78885 / axis=x label='SWEEP';";
In summary, a singular linear system has infinitely many solutions. You can obtain a particular solution by using the sweep operator or by finding the Moore-Penrose solution. You can use the HOMOGEN function to obtain the full family of solutions. The Moore-Penrose solution is expensive to compute but has an interesting property: it is the solution that has the smallest Euclidean norm. The sweep solution is more efficient to compute and is used in SAS regression procedures such as PROC GLM to estimate parameters in models that include classification variables and use a GLM parameterization. The next blog post explores regression estimates in more detail.
The post Generalized inverses for matrices appeared first on The DO Loop.
A SAS procedure might produce a dozen or more tables. You might be interested in displaying a subset of those tables. Recall that you can use the ODS TRACE ON statement to obtain a list of all the tables and graphs that a procedure creates. You can then use the ODS SELECT or the ODS EXCLUDE statement to control which tables and graphs are displayed.
Here's an example from the SAS/STAT documentation. The following PROC LOGISTIC call creates 27 tables and graphs, most of which are related to ROC curves. The ODS TRACE ON statement displays the names of each output object in the SAS log:
data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
;

ods graphics on;
ods trace on;
proc logistic data=roc;
   model popind(event='0') = alb tp totscore / nofit;
   roc 'Albumin' alb;
   roc 'K-G Score' totscore;
   roc 'Total Protein' tp;
   roccontrast reference('K-G Score') / estimate e;
run;
The SAS log displays the names of the tables and graphs. A portion of the log is shown below:
Output Added:
-------------
Name: OddsRatios
Label: Odds Ratios
Template: Stat.Logistic.OddsRatios
Path: Logistic.ROC3.OddsRatios
-------------
Output Added:
-------------
Name: ROCCurve
Label: ROC Curve
Template: Stat.Logistic.Graphics.ROC
Path: Logistic.ROC3.ROCCurve
-------------
Output Added:
-------------
Name: ROCOverlay
Label: ROC Curves
Template: Stat.Logistic.Graphics.ROCOverlay
Path: Logistic.ROCComparisons.ROCOverlay
-------------
Output Added:
-------------
Name: ROCAssociation
Label: ROC Association Statistics
Template: Stat.Logistic.ROCAssociation
Path: Logistic.ROCComparisons.ROCAssociation
-------------
Output Added:
-------------
Name: ROCContrastCoeff
Label: ROC Contrast Coefficients
Template: Stat.Logistic.ROCContrastCoeff
Path: Logistic.ROCComparisons.ROCContrastCoeff
-------------
Only a few of the 27 ODS objects are shown here. Notice that each ODS object has four properties: a name, a label, a template, and a path. Most of the time, the name is used on the ODS SELECT statement to filter the output. For example, if you want to display only the ROC curves and the overlay of the ROC curves, you can put the following statement prior to the RUN statement in the procedure:
ods select ROCCurve ROCOverlay;   /* specify the names literally */
Often the ODS objects that you want to display are related to each other. In the LOGISTIC example, you might want to display all the information about ROC curves. Fortunately, the SAS developers often use a common prefix or suffix, such as 'ROC', in the names of the ODS objects. That means that you can display all ROC-related tables and graphs by selecting the ODS objects whose name (or path) contains 'ROC' as a substring.
You can use the WHERE clause to select ODS objects whose name (or label or path) matches a particular pattern. The object's name is available in a special variable named _NAME_. Similarly, the object's label and path are available in variables named _LABEL_ and _PATH_, respectively. You cannot match patterns in the template string; there is no _TEMPLATE_ variable.
In SAS, several operators and functions are useful for matching strings: the "=:" operator matches strings that begin with a specified prefix (and the IN: operator matches a list of prefixes), the LIKE operator matches strings against a pattern that can contain wildcard characters, the "?" (CONTAINS) operator matches strings that contain a specified substring, and the SUBSTR function extracts a substring from a specified position.
For example, the following statements select ODS tables and graphs from the previous PROC LOGISTIC call. You can put one of these statements before the RUN statement in the procedure:
/* use any one of the following statements inside the PROC LOGISTIC call */
ods select where=(_name_ =: 'ROC');               /* name starts with 'ROC' */
ods select where=(_name_ like 'ROC%');            /* name starts with 'ROC' */
ods select where=(_path_ ? 'ROC');                /* path contains 'ROC' */
ods select where=(_label_ ? 'ROC');               /* label contains 'ROC' */
ods select where=(_name_ in: ('Odds', 'ROC'));    /* name starts with 'Odds' or 'ROC' */
ods select where=(substr(_name_,4,8)='Contrast'); /* name has substring 'Contrast' at position 4 */
For additional examples of using pattern matching to select ODS objects, see Warren Kuhfeld's graphics-focused blog post and the section of the SAS/STAT User's Guide that discusses selecting ODS graphics.
Although the CONTAINS and LIKE operators are often sufficient for selecting a table, SAS provides the powerful PRXMATCH function for more complex pattern-matching tasks. The PRXMATCH function uses Perl regular expressions to match strings. SAS provides a Perl Regular Expression "cheat sheet" that summarizes the syntax and common search queries for the PRXMATCH function.
You can put any of the following statements inside the PROC LOGISTIC call:
/* use any one of the following PRXMATCH expressions inside the PROC LOGISTIC call */
ods select where=(prxmatch('/ROC/', _name_));       /* name contains 'ROC' anywhere */
ods select where=(prxmatch('/^ROC/', _name_));      /* name starts with 'ROC' */
ods select where=(prxmatch('/Odds|^ROC/', _name_)); /* name contains 'Odds' anywhere or 'ROC' at the beginning */
ods select where=(prxmatch('/ROC/', _name_)=0);     /* name does NOT contain 'ROC' anywhere */
ods select where=(prxmatch('/Logistic\.ROC2/', _path_)); /* escape special wildcard character '.' */
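The PRXMATCH patterns are standard Perl-style regular expressions, so you can prototype them in any language that supports them. The following Python sketch (with a made-up list of object names, for illustration only) mirrors the logic of the SAS statements above:

```python
import re

# hypothetical ODS object names, mimicking part of the PROC LOGISTIC output
names = ["OddsRatios", "ROCCurve", "ROCOverlay", "ParameterEstimates"]

contains_roc = [n for n in names if re.search(r"ROC", n)]        # _name_ ? 'ROC'
starts_roc   = [n for n in names if re.search(r"^ROC", n)]       # _name_ =: 'ROC'
odds_or_roc  = [n for n in names if re.search(r"Odds|^ROC", n)]  # '/Odds|^ROC/'
not_roc      = [n for n in names if not re.search(r"ROC", n)]    # prxmatch(...) = 0

# escape '.' so it matches a literal period in a path
path = "Logistic.ROCComparisons.ROCOverlay"
assert re.search(r"Logistic\.ROCComparisons", path) is not None
```

Note that an unescaped '.' would match any character, which is why the last pattern escapes it to match the literal period that separates the levels of an ODS path.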
In summary, the WHERE= option on the ODS SELECT (and ODS EXCLUDE) statement is quite powerful. Many SAS programmers know how to list the names of tables and graphs on the ODS SELECT statement to display only a subset of the output. However, the WHERE= option enables you to use wildcards and regular expressions to select objects whose names or paths match a certain pattern. This can be a quick and efficient way to select tables that are related to each other and share a common prefix or suffix in their name.
The post Select ODS tables by using wildcards and regular expressions in SAS appeared first on The DO Loop.
If you want to review the basic construction of an ROC curve, you can see a previous article that constructs an empirical ROC curve from first principles. The PROC LOGISTIC documentation provides the formulas used for constructing an ROC curve.
Before discussing how to create an ROC plot from an arbitrary vector of predicted probabilities, let's review how to create an ROC curve from a model that is fit by using PROC LOGISTIC. The following data and model are taken from the PROC LOGISTIC documentation. The data are for 43 cancer patients who also had an intestinal obstruction. The response variable popInd is a postoperative indicator variable: popInd = 1 for patients who died within two months after surgery. The explanatory variables are three pre-operative screening tests. The goal of the study is to identify patients who might benefit from surgery, where "benefit" is measured by postoperative survival of at least two months.
data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
;

ods graphics on;
proc logistic data=roc plots(only)=roc;
   LogisticModel: model popind(event='0') = alb tp totscore;
   output out=LogiOut predicted=LogiPred;   /* output predicted value, to be used later */
run;
You can see the documentation for details about how to interpret the output from PROC LOGISTIC, but the example shows that you can use the PLOTS=ROC option (or the ROC statement) to create an ROC curve for a model that is fit by PROC LOGISTIC. For this model, the area under the ROC curve is 0.77. Because a random "coin flip" prediction has an expected area of 0.5, this model predicts the survival of surgery patients better than random chance.
A logistic model is not the only way to predict a binary response. You could also use a decision tree, a generalized mixed model, a nonparametric regression model, or even ask a human expert for her opinion. An ROC curve only requires two quantities: for each observation, you need the observed binary response and a predicted probability. In fact, the PROC LOGISTIC documentation points out that the ROC statement can compute an ROC curve from predicted probabilities that are supplied in the input data set, no matter how those probabilities were produced.
In other words, you can use PROC LOGISTIC to create an ROC curve regardless of how the predicted probabilities are obtained! For argument's sake, let's suppose that you ask a human expert to predict the probability of each patient surviving for at least two months after surgery. (Notice that there is no statistical model here, only a probability for each patient.) The following SAS DATA step defines the predicted probabilities, which are then merged with the output from the earlier PROC LOGISTIC call:
data ExpertPred;
   input ExpertPred @@;
   datalines;
0.95 0.2  0.05 0.3  0.1  0.6  0.8  0.5  0.1  0.25
0.1  0.2  0.05 0.1  0.05 0.1  0.4  0.1  0.2  0.25
0.4  0.7  0.1  0.1  0.3  0.2  0.1  0.05 0.1  0.4
0.4  0.7  0.2  0.4  0.1  0.1  0.9  0.7  0.8  0.25
0.3  0.1  0.1
;

data Survival;
   merge LogiOut ExpertPred;
run;

/* create ROC curve from a variable that contains predicted values */
proc logistic data=Survival;
   model popind(event='0') = ExpertPred / nofit;
   roc 'Expert Predictions' pred=ExpertPred;
   ods select ROCcurve;
run;
Notice that you only need to supply two variables on the MODEL statement: the observed responses and the variable that contains the predicted values; the NOFIT option tells the procedure not to fit a model. On the ROC statement, the PRED= option specifies the variable that contains the predicted probabilities. Although PROC LOGISTIC creates many tables, the ODS SELECT statement suppresses all output except for the ROC curve.
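To see why nothing more than these two variables is needed, recall how an empirical ROC curve is computed: sort the observations by predicted probability and sweep a classification threshold from high to low, recording the true-positive and false-positive rates. Here is a minimal sketch in Python (illustrative only; it handles tied probabilities naively, and PROC LOGISTIC uses the formulas given in its documentation):

```python
import numpy as np

def empirical_roc(y, p):
    """Empirical ROC curve. y is 0/1 (here 1 = the event, unlike the
    event='0' coding in the example above); p is the predicted
    probability of the event. Returns (fpr, tpr, auc)."""
    y = np.asarray(y)[np.argsort(-np.asarray(p, dtype=float))]
    tpr = np.r_[0.0, np.cumsum(y == 1) / np.sum(y == 1)]   # sensitivity
    fpr = np.r_[0.0, np.cumsum(y == 0) / np.sum(y == 0)]   # 1 - specificity
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal area
    return fpr, tpr, auc

# four observations: the AUC equals the fraction of (event, nonevent)
# pairs that are correctly ordered by p, which is 3/4 here
fpr, tpr, auc = empirical_roc([0, 1, 0, 1], [0.2, 0.9, 0.6, 0.4])
```

This is why any vector of predicted probabilities works: the construction never consults the model that produced them.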
You might want to overlay and compare ROC curves from multiple predictive models (either from PROC LOGISTIC or from other sources). PROC LOGISTIC can do that as well. You just need to merge the various predicted probabilities into a single SAS data set and then specify multiple ROC statements, as follows:
/* overlay two or more ROC curves by using variables of predicted values */
proc logistic data=Survival;
   model popind(event='0') = LogiPred ExpertPred / nofit;
   roc 'Logistic' pred=LogiPred;
   roc 'Expert'   pred=ExpertPred;
   ods select ROCOverlay;
   /* optional: for a statistical comparison, use ROCCONTRAST stmt and remove the ODS SELECT stmt */
   *roccontrast reference('Expert Model') / estimate e;
run;
This ROC overlay shows that the "expert" prediction is almost always superior or equivalent to the logistic model in terms of true and false classification rates. As noted in the comments of the previous call to PROC LOGISTIC, you can use the ROCCONTRAST statement to obtain a statistical analysis of the difference between the areas under the curves (AUC).
In summary, you can use the ROC statement in PROC LOGISTIC to generate ROC curves for models that were computed outside of PROC LOGISTIC. All you need are the predicted probabilities and observed response for each observation. You can also overlay and compare two or more ROC curves and use the ROCCONTRAST statement to analyze the difference between areas under the curves.
The post Create and compare ROC curves for any predictive model appeared first on The DO Loop.
However, the SGPLOT procedure is not the only way to create color ramps. Recently, I needed to create a custom color ramp and specify a color for values that were outside of a certain range. I also needed to change the color used for missing values in the response variable. This article shows how to perform these tasks by using the RANGEATTRMAP statement in the SAS Graph Template Language (GTL).
SAS provides several ways to visualize the values of a response variable in scatter plots, contour plots, heat maps, and other graphs. An example is shown to the right. In this scatter plot, the markers represent data for patients in a heart study. The two-dimensional coordinates of the markers represent the systolic blood pressure and the MRW ratio (a body-mass index) of 80 patients. The colors indicate measurements of cholesterol. Three observations are highlighted by using arrows. One point has an extreme value of cholesterol (400). Two others have missing values of cholesterol and are shown in a grey color by default.
The code that generates the graph follows. The COLORRESPONSE= option is used to name the response variable. The values of the response variable determine the color for each marker according to the color ramp that is specified on the COLORMODEL= option.
data Have;   /* example data: a subset of the Heart data set */
   set Sashelp.Heart(firstobs=5000
                     rename=(Systolic=X MRW=Y Cholesterol=Z)
                     where=(X < 200 AND Y < 200));
   label X="Systolic" Z="Cholesterol";
   keep X Y Z;
run;

title "Markers Colored by Using the COLORRESPONSE= Option";
proc sgplot data=Have;
   scatter x=X y=Y / colorresponse=Z              /* specify response variable */
                     colormodel=ThreeAltColorRamp /* specify color ramp */
                     filledoutlinedmarkers
                     markerattrs=(symbol=CircleFilled size=12);
   xaxis grid;
   yaxis grid;
run;
PROC SGPLOT provides basic functionality for creating a custom color ramp. For more control, you can use the RANGEATTRMAP and RANGEATTRVAR statements in the SAS Graph Template Language (GTL), as demonstrated in the rest of this article.
As shown in the previous example, using a color ramp requires that you specify two pieces of information: a response variable and a color ramp. In GTL, use the RANGEATTRVAR statement to specify the response variable. Use the RANGEATTRMAP statement to define a custom color ramp.
Although the documentation discusses it, I was initially confused about the difference between color ramps and "alt" color ramps. I erroneously assumed that they are functionally equivalent. After all, when I use the COLORMODEL= option in PROC SGPLOT, I can choose any predefined color ramp. For example, the graph at the start of this article is created by using COLORMODEL=ThreeAltColorRamp, which is a blue-black-red color ramp. However, I could just as easily specify COLORMODEL=ThreeColorRamp to use a blue-white-red color ramp.
However, in the RANGEATTRMAP statement, there is a substantive difference between the RANGECOLOR= option and the RANGEALTCOLOR= option: the RANGECOLOR= and RANGECOLORMODEL= options define colors for "area" elements such as heat maps and contour plots, whereas the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options define colors for lines and markers.
The following GTL defines a template that creates a scatter plot that is similar to the one at the beginning of this article. Because I want to color markers, I use the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options. The RANGEATTRMAP statement contains three RANGE statements. Each RANGE statement associates a color or colors with ranges of data values:
proc template;
define statgraph scatter3Dcol;
begingraph;
   rangeattrmap name="ResponseRange";
   range min-350 / rangeAltColorModel=(CXFFFFB2 CXFED976 CXFEB24C CXFD8D3C
                                       CXFC4E2A CXE31A1C CXB10026);
   range OTHER   / rangeAltColor=Black;   /* or use the OVER or UNDER keyword */
   range MISSING / rangeAltColor=Lime;    /* color for missing values */
   endrangeattrmap;

   rangeattrvar var=Z                     /* specify response variable in data set */
                attrmap="ResponseRange"   /* specify custom color ramp */
                attrvar=RangeVar;         /* alias for this variable/ramp combination */

   entrytitle "Markers Colored by Using the RANGEATTRMAP Statement";
   layout overlay;
      scatterplot x=X y=Y / markercolorgradient=RangeVar  /* color by Z and custom color ramp */
                            filledoutlinedmarkers=true
                            markerattrs=(symbol=circlefilled size=12)
                            name="Scatter";
      continuouslegend "Scatter" / title='Cholesterol';
   endlayout;
endgraph;
end;
run;

proc sgrender data=Have template=scatter3Dcol;
run;
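Conceptually, the RANGEATTRMAP above defines a piecewise-linear interpolation through the seven color stops, with fallback colors for out-of-range and missing values. The following Python sketch (purely illustrative; the function is a simplification of what GTL does internally, and zmin=0 is an assumed data minimum standing in for the "min" keyword) mimics that logic:

```python
def hex_to_rgb(h):
    # "#RRGGBB" -> (r, g, b) integer tuple
    h = h.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def ramp_color(z, zmin, zmax, stops, other="#000000", missing="#00FF00"):
    """Map z to a hex color by linear interpolation between the ramp stops.
    Out-of-range values get `other` (cf. RANGE OTHER) and missing values
    get `missing` (cf. RANGE MISSING)."""
    if z is None:
        return missing
    if z < zmin or z > zmax:
        return other
    t = (z - zmin) / (zmax - zmin) * (len(stops) - 1)
    i = min(int(t), len(stops) - 2)    # index of the left color stop
    f = t - i                          # fractional position in [0, 1]
    lo, hi = hex_to_rgb(stops[i]), hex_to_rgb(stops[i + 1])
    return "#%02X%02X%02X" % tuple(round(a + f * (b - a)) for a, b in zip(lo, hi))

# the seven stops from the RANGEATTRMAP (a CXrrggbb code written as #rrggbb)
stops = ["#FFFFB2", "#FED976", "#FEB24C", "#FD8D3C",
         "#FC4E2A", "#E31A1C", "#B10026"]
```

With these stops, values at the endpoints return the first and last stop colors, a value above 350 maps to black, and a missing value maps to lime, just as the OTHER and MISSING ranges specify.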
You can do more with the RANGEATTRMAP statement, but I'll stop here. In summary, you can use the RANGEATTRMAP (and RANGEATTRVAR) statement in the Graph Template Language to define a custom color ramp. The RANGEATTRMAP statement supports features that are not surfaced in PROC SGPLOT, such as enabling you to specify a color for out-of-range values and missing values. If you are going to use the color ramp in an "area plot" such as a heat map, use the RANGECOLOR= and RANGECOLORMODEL= options to define the color ramp. If you are going to use the color ramp to color lines and markers, use the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options.
The post Define custom color ramps by using the RANGEATTRMAP statement in GTL appeared first on The DO Loop.