The post Visualize a design matrix appeared first on The DO Loop.

It occurs to me that you can visualize the structure of a design matrix by using the same technique (heat maps) that I used to visualize missing value structures.
In a design matrix, each categorical variable is replaced by several dummy variables. However, there are multiple parameterizations or *encodings* that result in different design matrices.

Heat maps require several pixels for each row and column of the design matrix, so they are limited to small or moderately sized data. The following SAS DATA step extracts the first 150 observations from the Sashelp.Heart data set and renames some variables. It also adds a fake response variable because the regression procedures that generate design matrices (GLMMOD, LOGISTIC, GLMSELECT, TRANSREG, and GLIMMIX) require a response variable even though the goal is to create a design matrix for the explanatory variables. In the following statements, the OUTDESIGN option of the GLMSELECT procedure generates the design matrix. The matrix is then read into PROC IML where the HEATMAPDISC subroutine creates a discrete heat map.

/* add fake response variable; for convenience, shorten variable names */
data Temp / view=Temp;
   set Sashelp.heart(obs=150 keep=BP_Status Chol_Status Smoking_Status Weight_Status);
   rename BP_Status=BP Chol_Status=Chol Smoking_Status=Smoking Weight_Status=Weight;
   FakeY = 0;
run;

ods exclude all;
/* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=GLM;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;

ods graphics / width=500px height=800px;
proc iml;
/* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="GLM Design Matrix"
    xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
QUIT;

Each row of the design matrix indicates a patient in a research study. If any explanatory variable has a missing value, the corresponding row of the design matrix is missing (shown as gray). In
the design matrix for the GLM parameterization, a categorical variable with *k* levels is represented by *k* columns. The black and white heat map shows the structure of the design matrix. Black indicates a 1 and white indicates a 0. In particular:

- The first column is all black; it is the intercept column.
- Columns 2-4 represent the BP variable. Each row has one black rectangle in one of those columns. You can see that there are few black squares in column 4, which indicates that few patients in the study have optimal blood pressure.
- In a similar way, you can see that there are many nonsmokers (column 11) in the study. There are also many overweight patients (column 14) and few underweight patients (column 15).

The GLM parameterization is called a "singular parameterization" because it contains redundant columns. For example, the BP_Optimal column is redundant because that column contains a 1 only when the BP_High and BP_Normal columns are both 0. Similarly, if either the BP_High or the BP_Normal column is 1, then BP_Optimal is automatically 0. The next section removes the redundant columns.
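You can check the redundancy numerically. The following sketch assumes the Design data set and the dummy-column names (Intercept, BP_High, BP_Normal, BP_Optimal) created by the earlier PROC GLMSELECT step; rows with missing values are skipped by the MAX function:

```sas
/* Sketch: verify that the three BP dummy columns sum to the intercept
   column, which shows that any one of them is linearly redundant.
   Assumes the Design data set from the earlier PROC GLMSELECT call. */
proc iml;
use Design;
read all var {"Intercept" "BP_High" "BP_Normal" "BP_Optimal"} into D;
close;
maxDiff = max(abs( D[,1] - D[,2:4][,+] ));  /* 0 ==> columns sum to intercept */
print maxDiff;
quit;
```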

There is a binary design matrix that contains only the independent columns of the GLM design matrix. It is called a reference parameterization and you can generate it by using PARAM=REF in the CLASS statement, as follows:

ods exclude all;
/* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=REF;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;

Again, you can use the HEATMAPDISC call in PROC IML to create the heat map. The matrix is similar, but each categorical variable that has *k* levels is represented by *k*–1 dummy variables. Because the reference level was not specified in the CLASS statement, the last level of each category is used as the reference level. Thus the REFERENCE design matrix is similar to the GLM design, except that the last column for each categorical variable has been dropped. For example, there are columns for BP_High and BP_Normal, but no column for BP_Optimal.

The previous design matrices were binary 0/1 matrices. The EFFECT parameterization, which is the default parameterization for PROC LOGISTIC, creates a nonbinary design matrix. In the EFFECT parameterization, the reference level is represented by using a -1 and a nonreference level is represented by 1. Thus there are three values in the design matrix.

If you do not specify the reference levels, the last level for each categorical variable is used, just as for the REFERENCE parameterization. The following statements generate an EFFECT design matrix and use the REF= suboption to specify the reference level. Again, you can use the HEATMAPDISC subroutine to display a heat map for the design. For this visualization, light blue is used to indicate -1, white for 0, and black for 1.

ods exclude all;
/* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable')
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=EFFECT;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;

proc iml;
/* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="Effect Design Matrix"
    xvalues=varNames displayoutlines=0 colorramp={"LightBlue" "White" "Black"};
QUIT;

In the adjacent graph, blue indicates that the value for the patient was the reference category. White and black indicate that the value for the patient was a nonreference category, and the black rectangle appears in the column that corresponds to the value of the nonreference category. For me, this design matrix takes some practice to "read." For example, compared to the GLM matrix, it is harder to determine the most frequent levels for a categorical variable.

In the example, I have used the HEATMAPDISC subroutine in SAS/IML to visualize the design matrices. But you can also create heat maps in Base SAS.

If you have SAS 9.4m3, you can use the HEATMAPPARM statement in PROC SGPLOT to create these heat maps. First you have to convert the data from wide form to long form, which you can do by using the following DATA step:

/* convert from wide (matrix) to long (row, col, value) */
data Long;
   set Design;
   array dummy[*] _NUMERIC_;
   do varNum = 1 to dim(dummy);
      rowNum = _N_;
      value = dummy[varNum];
      output;
   end;
   keep varNum rowNum value;
run;

proc sgplot data=Long;
   /* the observation values are in the order {1, 0, -1}; use STYLEATTRS to set colors */
   styleattrs DATACOLORS=(Black White LightBlue);
   heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
   xaxis type=discrete; /* values=(1 to 11) valuesdisplay=("A" "B" ... "J" "K"); */
   yaxis reverse;
run;

The heat map is similar to the one in the previous section, except that the columns are labeled 1, 2, 3, and so forth. If you want the columns to contain the variable names, use the VALUESDISPLAY= option, as shown in the comments.

If you are running an earlier version of SAS, you will need to use the Graph Template Language (GTL) to create a template for the discrete heat maps.

In summary, you can use the OUTDESIGN= option in PROC GLMSELECT to create design matrices that use dummy variables to encode classification variables. If you have SAS/IML, you can use the HEATMAPDISC subroutine to visualize the design matrix. Otherwise, you can use the HEATMAPPARM statement in PROC SGPLOT (SAS 9.4m3) or the GTL to create the heat maps. The visualization is useful for teaching and understanding the different parameterization schemes for classification variables.


The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.

Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups. The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG. The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:

data recover;
   input T1 $ T2 $ Time @@;
   datalines;
0 0 20.2   0 0 23.9   0 0 21.9   0 0 42.4
1 0 27.2   1 0 34.0   1 0 27.4   1 0 28.5
0 1 25.9   0 1 34.5   0 1 25.1   0 1 34.2
1 1 35.0   1 1 33.9   1 1 38.3   1 1 39.9
;

The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the "category" variable and T2 to be a "group" variable in a clustered box plot, as follows:

title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;

The graph shows the distribution of response for the four joint combinations of T1 and T2. The graph is a little hard to interpret because the category levels are 0/1. The two box plots on the left are for T1=0, which means "Did not receive the T1 treatment." The two box plots on the right are for mice who received the T1 treatment. Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2. Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.

Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I'd prefer to see a box plot of the levels of interaction variable T1*T2. If possible, I'd also like to optionally plot the raw response values.

The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement). The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:

/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction / obs(jitter);  /* or use interaction plot to see raw data */
run;

The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an "interaction plot" that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.

Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the "interaction plot".) The next section shows an alternative way to create the box plots.

You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are '0 0', '0 1', '1 0', and '1 1'. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:

data recover2 / view=recover2;
   length Treatment $3;            /* specify length of concatenated variable */
   set recover;
   Treatment = catx(' ', T1, T2);  /* combine into one group */
run;

proc format;                       /* make the joint levels more readable */
value $TreatFmt '0 0' = 'Control'
                '1 0' = 'T1 Only'
                '0 1' = 'T2 Only'
                '1 1' = 'T1 and T2';
run;

proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;

By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicates that the "Control" group seems to contain an outlier: a mouse who lived longer than would be expected for his treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.

In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.

The post Regression with restricted cubic splines in SAS appeared first on The DO Loop.

Since SAS 9.3, many SAS regression procedures provide a native implementation of restricted cubic splines by using the EFFECT statement in SAS. This article provides examples of using splines in regression models. Because some older procedures (such as PROC REG) do not support the EFFECT statement, the article also shows how to use SAS procedures to generate the spline basis, just like the %RCSPLINE macro does.

If you are not familiar with splines and knots, read the overview article "Understanding splines in the EFFECT statement." You can also read the documentation for the EFFECT statement, which includes an overview of spline effects as well as a brief description of restricted cubic splines, which are also called "natural cubic splines." I think the fact that the SAS documentation refers to the restricted cubic splines as "natural cubic splines" has prevented some practitioners from realizing that SAS supports restricted cubic splines.

This section provides an example of using splines in PROC GLMSELECT to fit a GLM regression model. Because the functionality is contained in the EFFECT statement, the syntax is the same for other procedures. For example, if you have a binary response you can use the EFFECT statement in PROC LOGISTIC. If you are fitting a generalized linear model or a mixed model, you can use PROC GLIMMIX.

This example fits the MPG_CITY variable as a function of the WEIGHT variable for vehicles in the Sashelp.Cars data set. The data and a scatter plot smoother are shown in the adjacent graph. (To produce the graph, the following statements sort the data, but that is not required.) The smoother is based on a restricted spline basis with five knots. Notice that the SELECTION=NONE option in the MODEL statement turns off the variable selection features of PROC GLMSELECT, which makes the procedure fit one model just like PROC GLM.

/* create sample data; sort by independent variable for graphing */
proc sort data=sashelp.cars out=cars(keep=mpg_city weight model);
   by weight;
run;

/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures:
   GLIMMIX, GLMSELECT, LOGISTIC, PHREG, PLS, QUANTREG, ROBUSTREG, ... */
ods select ANOVA ParameterEstimates SplineKnots;
proc glmselect data=cars;
   effect spl = spline(weight / details naturalcubic basis=tpf(noint)
                       knotmethod=percentiles(5));
   model mpg_city = spl / selection=none;  /* fit model by using spline effects */
   output out=SplineOut predicted=Fit;     /* output predicted values for graphing */
quit;

title "Restricted Cubic Spline Regression";
proc sgplot data=SplineOut noautolegend;
   scatter x=weight y=mpg_city;
   series x=weight y=Fit / lineattrs=(thickness=3 color=red);
run;

The EFFECT statement with the SPLINE option is used to generate spline effects from the WEIGHT variable. The effects are assigned the collective name 'spl'. Putting 'spl' on the MODEL statement says "use the spline effects that I created." You can specify multiple EFFECT statements. Spline effects depend on three quantities: the kind of spline, the number of knots, and the placement of the knots.

- You specify restricted cubic splines by using the NATURALCUBIC BASIS=TPF(NOINT) options. Technically you do not need the NOINT suboption, but I recommend it because it makes the parameter estimates table simpler.
- The KNOTMETHOD= option enables you to specify the number and placement of knots. In this example, PERCENTILES(5) places knots at five evenly spaced percentiles of the explanatory variable, which are the 16.6%, 33.3%, 50%, 66.6%, and 83.3% percentiles. Equivalently, the knots are placed at the 1/6, 2/6, 3/6, 4/6, and 5/6 quantiles.

Most researchers use a small number of knots, often three to five.
The exact location of knots is not usually critical: if you change the positions slightly, the predicted values of the regression change only slightly. A common scheme is to place the knots at fixed percentiles of the data, as shown above. Alternatively, Harrell (*Regression Modeling Strategies*, 2010 and 2015) suggests heuristic percentiles (for example, the 5th, 27.5th, 50th, 72.5th, and 95th percentiles for five knots), and this scheme seems to be popular among biostatisticians.

The KNOTMETHOD= option on the EFFECT statement provides several options for specifying the locations of knots. Try each of the following options and look at the "SplineKnots" table to see the positions of the knots:

- KNOTMETHOD=PERCENTILES(5): Places knots at the percentiles 1/6, 2/6, ..., 5/6. An example is shown above.
- KNOTMETHOD=EQUAL(5): Places knots evenly within the range of the data. If δ = (max-min)/6, then the knots are located at min + *i*δ for *i*=1, 2, ..., 5.
- KNOTMETHOD=RANGEFRACTIONS(0.05 0.275 0.50 0.725 0.95): If you want knots placed unevenly within the range of the data, use this option. For example, the value 0.05 means "place a knot 5% along the data range" and 0.275 means "place a knot 27.5% along the data range." You can separate list values by using spaces or commas.
- KNOTMETHOD=LIST(2513, 3174, 3474.5, 3903, 5000): This option enables you to list the locations (in data coordinates) of the knots. You can use this option to specify Harrell's schemes. For example, Harrell suggests the 5th, 27.5th, 50th, 72.5th, and 95th percentiles when you use five knots. You can use PROC UNIVARIATE to find these percentiles for the WEIGHT variable and then type the results into the KNOTMETHOD=LIST option.

The adjacent graph shows the predicted values for the four different knot placement methods. You can see that the general shape of the regression curves is similar regardless of the placement of knots. The differences can be understood by looking at the placement of the first and last knots. The slope of the natural cubic spline fit is linear past the extreme knots, so the placement of the extreme knots dictates where the predictions become linear. For the EQUAL and RANGEFRACTION methods, the largest knots are placed at WEIGHT=6300 and WEIGHT=6923, respectively. Consequently, the predictions for large values of WEIGHT look more cubic than linear. In contrast, the largest knot for the PERCENTILES method is placed at 4175 and the largest LIST knot is 5000. The predictions past those values are linear.

In the previous section, I manually typed in the values that correspond to the 5th, 27.5th, 50th, 72.5th, and 95th percentiles of the WEIGHT variable, as suggested by Harrell's scheme. However, it is not difficult to automate that step. You can use PROC UNIVARIATE to write the percentile values to a SAS data set and then use a short DATA _NULL_ step to store those values into a macro variable. The macro variable can then be used as the argument to the KNOTMETHOD=LIST option, as follows:

proc univariate data=cars noprint;
   var weight;
   output out=percentiles pctlpts=5 27.5 50 72.5 95 pctlpre=p_;  /* specify the percentiles */
run;

data _null_;
   set percentiles;
   call symput('pctls', catx(',', of _numeric_));  /* put all values into a comma-separated list */
run;
%put &=pctls;   /* optional: display the list of values */

...
effect spl = spline(weight / ... knotmethod=list(&pctls));  /* use the macro variable here */
...

PCTLS=2513,3174,3474.5,3903,5000

As mentioned earlier, not every SAS procedure supports the EFFECT statement. However, the GLMSELECT, LOGISTIC, and GLIMMIX procedures all provide an OUTDESIGN= option, which enables you to output the design matrix that contains the spline basis functions. Furthermore, PROC LOGISTIC supports the OUTDESIGNONLY option and PROC GLIMMIX supports the NOFIT option so that the procedures do not fit the model but only output the design matrix.

You can use the variables in the design matrix to perform regression or other analyses. For example, the following example writes the restricted cubic basis functions to a data set, then uses PROC REG to fit the model:

/* create SplineBasis = data set that contains spline basis functions */
proc glmselect data=cars outdesign(addinputvars fullmodel)=SplineBasis;
   effect spl = spline(weight / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model mpg_city = spl / selection=none;
quit;

/* use design variables in other procedures */
proc reg data=SplineBasis;
   model mpg_city = spl_:;
   output out=out p=p;
run;

Notice the options to the OUTDESIGN option in PROC GLMSELECT. The ADDINPUTVARS option copies the original variables into the design matrix. The FULLMODEL option tells the procedure to output the design matrix for all variables on the MODEL statement, regardless of whether they appear in the final "selected" model.

One last comment: the basis functions that are generated by the EFFECT statement are not equal to the basis functions created by the %RCSPLINE macro, but they are equivalent. The EFFECT statement uses the definition from *The Elements of Statistical Learning* (Hastie, Tibshirani, and Friedman, 2009, 2nd Ed., pp. 144-146). The %RCSPLINE macro implements scaled versions of the basis functions. Thus the parameter estimates will be different, but the predicted values will be the same.
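For reference, the natural cubic spline basis from that source can be written as follows, where ξ1 < ... < ξK are the knots and (·)+ denotes the positive part:

```latex
N_1(X) = 1, \quad N_2(X) = X, \quad N_{k+2}(X) = d_k(X) - d_{K-1}(X), \quad k = 1, \ldots, K-2,
```
```latex
\text{where } d_k(X) = \frac{(X - \xi_k)_+^3 - (X - \xi_K)_+^3}{\xi_K - \xi_k}.
```

Each basis function is linear beyond the extreme knots, which is why the predictions in the earlier graphs become linear past the first and last knots.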

You can use the NATURALCUBIC BASIS=TPF(NOINT) option in the EFFECT statement in SAS to perform regression with restricted cubic splines, which are also called natural cubic splines. You can use the KNOTMETHOD= option to specify the number and placement of the knots. The EFFECT statement is supported by the GLMSELECT, LOGISTIC, and GLIMMIX procedures, among others. These procedures enable you to output a design matrix that contains the spline basis functions, which you can use in procedures that do not support the EFFECT statement.

This article was inspired by several talks that I heard at SAS Global Forum. For more information on this topic, see the following:

- Bilenas, J. V. (2014), "Scatter Plot smoothing using PROC LOESS and Restricted Cubic Splines."
- Croxford, R. (2016), "Restricted Cubic Spline Regression: A Brief Introduction."


The post Visualize the 68-95-99.7 rule in SAS appeared first on The DO Loop.

A reader commented on last week's article about constructing symmetric intervals. He wanted to know whether I created the graph in SAS.

Yes, the graph, which illustrates the so-called 68-95-99.7 rule for the normal distribution, was created by using several statements in the SGPLOT procedure in Base SAS:

- The SERIES statement creates the bell-shaped curve.
- The BAND statement creates the shaded region under the curve.
- The DROPLINE statement creates the vertical lines from the curve to the X axis.
- The HIGHLOW statement creates the horizontal lines that indicate interval widths.
- The TEXT statement creates the text labels "68%", "95%", and "99.7%".

If you follow some simple rules, it is easy to use PROC SGPLOT to overlay multiple curves and lines on a graph. The key is to organize the underlying data into a block form, as shown conceptually in the plot to the right. I use different variable names for each component of the plot, and I set the values of the other variables to missing when they are no longer relevant. Each statement uses only the variables that are relevant for that overlay.

The following DATA step creates the data for the 68-95-99.7 graph. Can you match up each section with the SGPLOT statements that overlay various components to create the final image?

data NormalPDF;
mu = 50;  sigma = 8;            /* parameters for the normal distribution N(mu, sigma) */
/* 1. Data for the SERIES and BAND statements */
do m = -4 to 4 by 0.05;         /* x in [mu-4*sigma, mu+4*sigma] */
   x = mu + m*sigma;
   f = pdf("Normal", x, mu, sigma);      /* height of normal curve */
   output;
end;
x = .;  f = .;
/* 2. Data for vertical lines at mu + m*sigma, m=-3, -2, -1, 1, 2, 3 */
do m = -3 to 3;
   if m=0 then continue;        /* skip m=0 */
   Lx = mu + m*sigma;           /* horiz location of segment */
   Lf = pdf("Normal", Lx, mu, sigma);    /* vertical height of segment */
   output;
end;
Lx = .;  Lf = .;
/* 3. Data for horizontal lines. Heights are 1.1, 1.2, and 1.3 times the max height of the curve */
Tx = mu;                                 /* text centered at mu */
fMax = pdf("Normal", mu, mu, sigma);     /* highest point of curve */
Text = "68% ";                           /* 68% interval */
TL = mu - sigma;    TR = mu + sigma;     /* Left/Right endpoints of interval */
Ty = 1.1 * fMax;                         /* height of label and segment */
output;
Text = "95% ";                           /* 95% interval */
TL = mu - 2*sigma;  TR = mu + 2*sigma;
Ty = 1.2 * fMax;
output;
Text = "99.7%";                          /* 99.7% interval */
TL = mu - 3*sigma;  TR = mu + 3*sigma;
Ty = 1.3 * fMax;
output;
keep x f Lx Lf Tx Ty TL TR Text;
run;

proc sgplot data=NormalPDF noautolegend;
   band x=x upper=f lower=0;
   series x=x y=f / lineattrs=(color=black);
   dropline x=Lx y=Lf / dropto=x lineattrs=(color=black);
   highlow y=Ty low=TL high=TR / lowcap=serif highcap=serif lineattrs=(thickness=2);
   text x=Tx y=Ty text=Text / backfill fillattrs=(color=white) textattrs=(size=14);
   yaxis offsetmin=0 min=0 label="Density";
   xaxis values=(20 to 80 by 10) display=(nolabel);
run;


The post Quadratic optimization in SAS appeared first on The DO Loop.

The Newton-Raphson algorithm is one of my favorite optimizers. However, for a quadratic objective function there is a simpler choice. You can use the NLPQUA function in SAS/IML to optimize a quadratic objective function.

Let's see how this works by specifying a quadratic polynomial in two variables. The function is

f(x,y) = (9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6
       = 0.5*x`*H*x + L*x + 6

where x = (x, y) is the vector of variables and x` denotes its transpose.

To use the NLPQUA subroutine, you need to write down the Hessian matrix, **H**, which is the symmetric matrix of second derivatives. Of course, for a quadratic function the second derivatives are constants. In the following SAS/IML program, the matrix **H** contains the second derivatives and the vector **lin** contains the coefficients for the linear and constant terms of *f*:

/* objective function is
   f(x,y) = (3*x-2)##2 + (2*y-1)##2 + x#y + 1
          = (9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6
   grad(f) = g(x,y) = [ 18*x + y - 12, x + 8*y - 4 ]
   hess(f) = [ dg1/dx dg2/dx ] = [ 18  1 ]
             [ dg1/dy dg2/dy ]   [  1  8 ]
*/
proc iml;
H = {18 1,            /* matrix of second derivatives */
      1 8};
lin = {-12 -4 6};     /* vector of linear terms and the constant term */

The NLPQUA subroutine enables you to specify the Hessian matrix and the linear coefficients. From those values, the routine uses an efficient solver to find the global minimum of the quadratic polynomial, which occurs at x=92/143, y=60/143:

x0 = j(1, 2, 1);    /* initial guess */
opt = {0 1};        /* minimize, print final parameters */
call nlpqua(rc, xOpt, H, x0, opt) lin=lin;
print xOpt;
print xOpt[format=Fract.];
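As a quick sanity check, the unconstrained minimum also solves the first-order conditions: the gradient H*x + L` is zero, so x = H^{-1}(-L`). The following sketch assumes the H and lin matrices defined in the PROC IML session above:

```sas
/* Sketch: solve grad(f) = H*x + L` = 0 directly.
   Assumes H and lin from the PROC IML session above. */
L = lin[, 1:2];          /* linear coefficients (drop the constant term) */
xMin = solve(H, -L`);    /* solves 18x + y = 12 and x + 8y = 4 */
print xMin[format=Fract.];   /* (92/143, 60/143), as stated above */
```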

You can also use the NLPQUA subroutine to solve a constrained problem. Suppose that you are only interested in values of (x,y) in the unit square and you require that the solution satisfies the linear constraint x + y = 1. In that case you can define a matrix that specifies the linear constraints as follows:

blc = {0 0 . .,    /* lower bound: 0 <= x_i */
       1 1 . .,    /* upper bound: x_i <= 1 */
       1 1 0 1};   /* linear constraint x + y = 1 */
call nlpqua(rc, conOpt, H, x0, opt, blc) lin=lin;
print conOpt[format=Fract.];

Notice how the linear constraints are specified for this (and every other) NLP function. Let *p* be the number of parameters in the problem. The first *p* elements of the first row specify the lower bounds for the parameters; use a missing value for variables that are not bounded below. The last two columns of the first row are not used, so you should always set those elements to missing values.
Similarly, the first *p* elements of the second row specify the upper bound for the variables. The last two columns of the second row are not used, so set those elements to missing values.

Additional rows specify coefficients for the linear constraints. An equality constraint of the form
a1*x1 + a2*x2 + ... + an*xn = c
is encoded as the row of coefficients
{a1 a2 ... an 0 c},
where the 0 in the penultimate location indicates equality.
A "less than" inequality of the form
a1*x1 + a2*x2 + ... + an*xn ≤ c
is encoded as the row
{a1 a2 ... an -1 c},
where the -1 indicates the "less than or equal" symbol. Similarly, a value of +1 in the penultimate location indicates the "greater than or equal" symbol (≥). See the documentation for the NLP routines for additional details about specifying constraints.
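For instance, here is a hypothetical encoding (not part of the example above) for a two-parameter problem with bounds 0 ≤ x, y ≤ 10 and the inequality 2x + 3y ≤ 5:

```sas
/* Hypothetical constraint matrix for a two-parameter problem */
blc2 = { 0  0  .  .,    /* lower bounds: 0 <= x, 0 <= y */
        10 10  .  .,    /* upper bounds: x <= 10, y <= 10 */
         2  3 -1  5};   /* 2*x + 3*y <= 5; the -1 means "less than or equal" */
```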

If you have access to SAS/OR software, PROC OPTMODEL provides a simple and natural language for solving simple and complex optimization problems. For this problem, PROC OPTMODEL detects that the objective function is quadratic and automatically chooses an efficient QP solver. Here is one way to solve the unconstrained and constrained problems in PROC OPTMODEL. The answers are the same as before and are not shown.

/* variable names x and y */
proc optmodel;
   /* unconstrained quadratic optimization */
   var x, y;
   min f = 9*x^2 + x*y + 4*y^2 - 12*x - 4*y + 6;
   solve;                    /* call QP solver */
   print x Fract. y Fract.;
   /* constrained quadratic optimization */
   x.lb = 0;  x.ub = 1;      /* bounds on variables */
   y.lb = 0;  y.ub = 1;
   con SumToOne:             /* linear constraint */
      x + y = 1;
   solve;                    /* call QP solver */
   print x Fract. y Fract.;
quit;

In summary, if you want to optimize a quadratic objective function (with or without constraints), use the NLPQUA subroutine in SAS/IML, which is a fast and efficient way to maximize or minimize a quadratic function of many variables. If you have access to SAS/OR software, PROC OPTMODEL provides an even simpler syntax.


The post A simple trick to construct symmetric intervals appeared first on The DO Loop.

proc iml;
mu = 50;  delta = 1.5;
CI = mu - delta || mu + delta;   /* horizontal concatenation ==> 1x2 vector */

Last week it occurred to me that there is a simple trick that is even easier: use the fact that SAS/IML is a matrix-vector language to encode the "±" sign as a vector {-1, 1}. When SAS/IML sees a scalar multiplied by a vector, the result will be a vector:

CI = mu + {-1 1}*delta; /* vector operation ==> 1x2 vector */ print CI; |

You can extend this example to compute many intervals by using a single statement. For example, in elementary statistics we learn the "68-95-99.7 rule" for the normal distribution. The rule says that in a random sample drawn from a normal population, about 68% of the observations will be within 1 standard deviation of the mean, about 95% will be within 2 standard deviations, and about 99.7% will be within 3 standard deviations of the mean. You can construct those intervals by using a "multiplier matrix" whose first row is {-1, +1}, whose second row is {-2, +2}, and whose third row is {-3, +3}. The following SAS/IML statements construct the three intervals for the 68-95-99.7 rule for a normal population with mean 50 and standard deviation 8:

    mu = 50;  sigma = 8;
    m = {-1 1,
         -2 2,
         -3 3};
    Intervals = mu + m*sigma;
    ApproxPct = {"68%", "95%", "99.7%"};
    print Intervals[rowname=ApproxPct];

Just for fun, let's simulate a large sample from the normal population and empirically confirm the 68-95-99.7 rule. You can use the RANDFUN function to generate a random sample and use the BIN function to detect which observations are in each interval:

    call randseed(12345);
    n = 10000;                            /* sample size */
    x = randfun(n, "Normal", mu, sigma);  /* simulate normal sample */
    ObservedPct = j(3,1,0);
    do i = 1 to 3;
       b = bin(x, Intervals[i,]);         /* b[i]=1 if x[i] in interval */
       ObservedPct[i] = sum(b) / n;       /* percentage of x in interval */
    end;
    results = Intervals || {0.68, 0.95, 0.997} || ObservedPct;
    print results[colname={"Lower" "Upper" "Predicted" "Observed"}
                  label="Probability of Normal Variate in Intervals: X ~ N(50, 8)"];

The simulation confirms the 68-95-99.7 rule. Remember that the rule is a mnemonic device. You can compute the exact probabilities by using the CDF function. In SAS/IML, the exact computation is
`p = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);`
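For example, the following statements (a short sketch that reuses the multiplier matrix from the earlier example) compute the exact probabilities that correspond to the 68-95-99.7 rule:

```sas
proc iml;
m = {-1 1,
     -2 2,
     -3 3};                                       /* z-scores for 1, 2, and 3 std devs */
p = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);  /* exact probabilities */
print p;                                          /* 0.6827, 0.9545, 0.9973 */
```

The exact values show why "68-95-99.7" is a convenient approximation.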

In summary, the SAS/IML language provides an easy syntax to construct intervals that are symmetric about a central value. You can use a vector such as {-1, 1} to construct an interval of the form *p* ± δ, or you can use a *k* x 2 matrix to construct *k* symmetric intervals.

The post A simple trick to construct symmetric intervals appeared first on The DO Loop.


The classical ANOVA is one way to analyze data that are collected before and after a known event. For example, you might record gas mileage for a car before and after a tune-up. You might collect patient data before and after they undergo a medical or surgical treatment. You might have data about real estate prices before and after some natural disaster. In all these cases, you might suspect that the response changes abruptly because of the event.

To give a simple example, suppose that a driver records the fuel economy (in miles per gallon) for a car for 12 weeks. Because the car engine is coughing and knocking, the owner brings the car to a mechanic for maintenance. After the maintenance, the car seems to run better and he records the fuel economy for another six weeks. The hypothetical data are below:

    data MPG;
    input mpg @@;
    week = _N_;
    period = ifc(week > 12, "After ", "Before");
    label mpg="Miles per Gallon";
    datalines;
    30.5 28.1 27.1 31.2 25.2 31.1 27.7 28.2 29.6 30.6 28.9 25.9
    30.6 33.0 31.2 29.7 32.7 31.1
    ;

Notice that the data contains a binary indicator variable (period) that records whether the data are from before or after the tune-up. You can use PROC GLM to perform a simple ANOVA analysis to determine whether there was a significant change in the mean fuel economy after the maintenance. The following call to PROC GLM runs an ANOVA and indicates that the mean fuel economy is about 2.7 mpg better after the tune-up.

    proc glm data=MPG plots=fitplot;
       class period / ref=first;
       model mpg = period / solution;
       output out=Out predicted=Pred;
    quit;

Graphically, this regression analysis is usually visualized by using two box plots. (PROC GLM creates the box plots automatically when ODS graphics are enabled.) However, because the independent variable is time, you could also use a series plot to show the observed data and the mean response before and after the maintenance. By using the GROUP= option on the SERIES statement, you can get two lines for the "before" and "after" time periods.

    title "Piecewise Constant Regression with Jump Discontinuity";
    proc sgplot data=Out;
       block x=week block=period / transparency=0.8;
       scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black);
       series x=week y=pred / group=period lineattrs=(thickness=3);
    run;

The graph shows that the model has a jump discontinuity at the time point at which the maintenance intervention occurred. If you include the WEEK variable in the analysis, you can model the response as a linear function of time, with a jump at the time of the tune-up.
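For example, a sketch of that model (not fit in the original analysis) adds the WEEK variable as a continuous effect, which fits a common slope over time together with an intercept that jumps at the maintenance event:

```sas
/* a sketch: linear trend in time plus a jump at the tune-up */
proc glm data=MPG;
   class period / ref=first;
   model mpg = week period / solution;  /* common slope; intercept jumps at week 13 */
quit;
```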

All this is probably very familiar. However, did you know that you can use splines to model the data as a continuous function that has a kink or "corner" at the time of the maintenance event? You can use this feature when the model is continuous, but the slope changes at a known time.

Several SAS procedures support the EFFECT statement, which enables you to build spline effects. The paper "Rediscovering SAS/IML Software" (Wicklin 2010, p. 4) has an example where splines are used to construct a highly nonlinear curve for a scatter plot.

A spline effect is determined by the placement of certain points called "knots." Often knots are evenly spaced within the range of the explanatory variable, but the EFFECT statement supports many other ways to position the knots. In fact, the documentation for the EFFECT statement says:
"If you remove the restriction that the knots of a spline must be distinct and allow repeated knots, then you can obtain functions with less smoothness and even discontinuities at the repeated knot location. For a spline of degree *d* and a repeated knot with multiplicity *m* ≤ *d*, the piecewise polynomials that join such a knot are required to have only *d* – *m* matching derivatives."

The degree of a linear regression is *d*=1, so if you specify a knot position once you obtain a piecewise linear function that contains a "kink" at the knot. The following call to PROC GLIMMIX demonstrates this technique. (I use GLIMMIX because neither PROC GLM nor PROC GENMOD support the EFFECT statement.) You can manually specify the position of knots by using the KNOTMETHOD=LIST(*list*) option on the EFFECT statement.

    proc glimmix data=MPG;
       effect spl = spline(week / degree=1 knotmethod=list(1 13 18));      /* piecewise linear */
       *effect spl = spline(week / degree=2 knotmethod=list(1 13 13 18));  /* piecewise quadratic */
       model mpg = spl / solution;
       output out=Out predicted=Pred;
    quit;

    title "Piecewise Linear Regression with Kink";
    proc sgplot data=Out noautolegend;
       block x=week block=period / transparency=0.8;
       scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black);
       series x=week y=pred / lineattrs=(thickness=3);
    run;

The graph shows that the model is piecewise linear, but that the slope of the model changes at week=13. In contrast, the second EFFECT statement in the PROC GLIMMIX code (which is commented out) specifies piecewise quadratic polynomials (*d*=2) and repeats the knot at week=13. That results in two quadratic models that give the same predicted value at week=13, but the model is not smooth at that location. Try it out!

If you are using a SAS procedure that does not support the EFFECT statement, you can use the GLIMMIX procedure to output the dummy variables that are associated with the spline effects. A nice paper by David Pasta (2003) describes how to use dummy variables in a variety of models. The paper was written before the EFFECT statement was introduced; many of the ideas in the paper are easier to implement by using the EFFECT statement.

Lastly, the TRANSREG procedure in SAS supports spline effects but has its own syntax. See the TRANSREG documentation, which includes an example of repeating knots to build a regression model for discontinuous data.

Have you ever needed to construct a nonsmooth regression model? Tell your story by leaving a comment.

The post Nonsmooth models and spline effects appeared first on The DO Loop.


The TablePrint subroutine provides many options that give you control over the output:

- The FIRSTOBS= and NUMOBS= options specify the range of observations to print.
- The ID= option specifies a variable to use for the row header on the left side of the table.
- The VAR= option specifies the columns (and the column order) for the table.
- The LABEL= option specifies a spanning header for the table.

For example, the following DATA step creates random birthdays for the students in the Sashelp.Class data set. The `Birthday` variable is formatted with the DATE9. format. The data are then read into a SAS/IML table by using the TableCreateFromDataSet function. Finally, the TablePrint subroutine prints a customized portion of the table:

    data class;
    set sashelp.class;
    Birthday = '03APR2017'd - age*365 - floor(365*uniform(1)); /* create birthday */
    format Birthday DATE9.;
    run;

    proc iml;
    tClass = TableCreateFromDataSet("Class");   /* read data into table */
    run TablePrint(tClass) firstobs=3 numobs=5
        var={"Age" "Birthday"}
        ID="Name"
        label="Subset of Class";

Notice that the table contains both numeric and character columns. Furthermore, the numeric columns have different formats. The TablePrint subroutine has some distinct advantages over the traditional PRINT statement in SAS/IML:

- The TablePrint subroutine supports an easy way to display a range of observations. When you use the PRINT statement for multiple vectors, you have to use row subscripts in each vector, such as `PRINT (X[3:8,]) (Y[3:8,]);`
- The TablePrint subroutine supports printing any columns in any order. When you use the PRINT statement on a matrix, you have to use column subscripts to change the order of the matrix columns: `PRINT (X[, {2 3 1}]);`
- The PRINT statement supports the ROWNAME= option (for specifying row headers), the COLNAME= option (for specifying column headers), and the LABEL= option. Those options are easy to work with when you print a single matrix. However, you can't store mixed-type data in a matrix, and those options are less convenient when you print a set of vectors.

The SAS/IML documentation has several sections of documentation devoted to advanced printing of SAS/IML tables. For example, you can use the TablePrint subroutine to do the following:

- Use a custom template to display a table.
- Specify values for dynamic variables in templates.
- Use an existing ODS template to format and display data that are in SAS/IML.

The documentation contains an example of "traffic lighting," in which the color of each cell depends on the value in the cell.

For statistical programmers, the ability to use ODS templates means that output from PROC IML can look the same as output from some other SAS procedure. For example, suppose that you have a table that contains parameter estimates for a linear regression. The following example prints that table by using the same ODS template that PROC REG uses, which is the Stat.Reg.ParameterEstimates template:

    proc iml;
    vars  = {"Intercept", X, Z};
    stats = {32.19  5.08    21.42   42.97,
              0.138 0.0348   0.0644  0.2117,
              1.227 0.5302   0.1027  2.3506 };
    tbl = TableCreate("Variable", vars);
    call TableAddVar(tbl, {"Estimate" "StdErr" "LowerCL" "UpperCL"}, stats);
    Confidence = 95;
    call TablePrint(tbl) template="Stat.Reg.ParameterEstimates"
                         dynamic={Confidence};

This example works because the column names in the SAS/IML table match the names that are expected by the Stat.Reg.ParameterEstimates template. The DYNAMIC= option specifies a dynamic variable (Confidence) that the template requires. See the documentation for further details.

In summary, the TablePrint subroutine in SAS/IML gives programmers control over many options for printing tables of data and statistics. For complex layouts, you can use an existing ODS template or create your own template to customize virtually every aspect of your tabular displays. For more information about SAS/IML tables, see the SAS Global Forum paper "More Than Matrices" (Wicklin, 2017).

The post Print tables in SAS/IML appeared first on The DO Loop.


You can create a list by using the ListCreate function. If you know how many items the list will contain, you can specify the list length when you create the list. You can then use the ListSetItem subroutine to assign a value to each item:

    proc iml;
    L = ListCreate(2);                   /* L is two-item list */
    call ListSetItem(L, 1, 1:3);         /* 1st item is 1x3 vector */
    call ListSetItem(L, 2, {4 3, 2 1});  /* 2nd item is 2x2 matrix */

The arguments for the ListSetItem subroutine are *list*, *index*, and *value*, which is the same order that you use when you assign a value to a matrix element, such as `A[2] = 5`.

If you do not know how many items the list will contain, or if you later want to add additional items, you can use the ListAddItem subroutine to append new items to an existing list. The list will automatically grow to accommodate the new item:

    X = {3 1 4,
         2 5 3};              /* create a 2x3 matrix */
    call ListAddItem(L, X);   /* add 3rd item; list grows */

Notice that the syntax for the ListAddItem subroutine only requires the list and the value because the new item is always added to the end of the list. At this point, the list contains three items: the 1x3 vector, the 2x2 matrix, and the 2x3 matrix X.

A convenient feature of SAS/IML lists is that they automatically resize. You can insert new items into any position in the list. You can also replace an existing item or delete an item. The following statements demonstrate modifying an existing list.

    call ListInsertItem(L, 2, -1:1);     /* insert new item before item 2; list grows */
    call ListSetItem(L, 1, {A B, C D});  /* replace 1st item with 2x2 character matrix */
    call ListDeleteItem(L, 3);           /* delete 3rd item; list shrinks */

The arguments for the ListInsertItem subroutine are again *list*, *index*, and *value*. Inserting an item at position *k* means that the new item becomes the *k*th item and existing higher-indexed items are renumbered. (Conceptually, imagine some items "move to the right" to "make room" for the new item.) If you delete the item at position *k*, existing higher-indexed items are also renumbered. (Conceptually, items "move to the left.") After the previous sequence of operations, the list contains three items: the 2x2 character matrix, the vector {-1 0 1}, and the matrix X.
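If your release of SAS/IML provides the ListPrint subroutine, you can display the current items of a list at any point, which is a handy way to follow a sequence of operations like these:

```sas
call ListPrint(L);   /* display every item in the list L */
```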

In the previous section, the list acted like a dynamic array: it stored items that you could access by using indices such as 1, 2, and 3. For some applications, it makes more sense to name the list items and refer to the items by their names rather than their positions. This is similar to "structs" in some languages, where you can refer to members of a struct by name.

For example, suppose a teacher wants to store information about students in her classes. For each student, she might want to store the student's name, class, and test scores. The following statements create a SAS/IML list that has three items named "Name", "Class", and "Scores". The items are then assigned values:

    Student = ListCreate({"Name" "Class" "Scores"});   /* create named list with 3 items */
    call ListSetItem(Student, "Name", "Ron");          /* set "name" value for student */
    call ListSetItem(Student, "Class", "Statistics");  /* set "class" value for student */
    Tests = {100 97 94 100 100};                       /* test scores */
    call ListSetItem(Student, "Scores", Tests);        /* set "scores" value for student */

Although you can still use positional indices to access list items ("Ron" is the first item, "Statistics" is the second item, ...), you can also use the name of items. For example, to extract the test scores into a SAS/IML matrix or vector, you can use the ListGetItem function, as follows:

    s = ListGetItem(Student, "Scores");   /* get test scores from list */

Lists are for storing items, not for computing. You cannot add, subtract, or multiply lists. However, the example shows that you can extract an item into a matrix and subsequently perform algebraic operations on the matrix.
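For example, after you extract the scores into a vector, ordinary matrix operations apply. The following sketch computes the mean test score by using the subscript reduction operator:

```sas
s = ListGetItem(Student, "Scores");   /* extract item into a vector */
avgScore = s[:];                      /* subscript reduction: mean of all elements */
print avgScore;                       /* 98.2 for the scores {100 97 94 100 100} */
```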

Lists can contain sublists. In this way you can represent nested or hierarchical data. For example, the teacher might want to store information about several students. If information about each student is stored in a list, then she can use a list of lists to store information about multiple students.

The following statements create a list called `All`. The first student ("Ron") is added to the list, which means that the item is a *copy* of the `Student` list. Then information about a second student ("Karl") is stored in the `Student` list and copied into the `All` list as a second item. This process could continue until all students are stored in the list.

    All = ListCreate();                /* empty list */
    call ListAddItem(All, Student);    /* add "Ron" to list */
    call ListSetItem(Student, "Name", "Karl");
    call ListSetItem(Student, "Class", "Calculus");
    call ListSetItem(Student, "Scores", {90 92 84 70 80});
    call ListAddItem(All, Student);    /* add "Karl" to list */

At this point, the `All` list contains two items. Each item is a list that contains three items.

SAS/IML 14.2 supports lists, which are containers that store other objects. Lists dynamically grow or shrink as necessary. You can index items by their position in the list (using indices) or you can create a named list and reference items by their names. You can use built-in functions to extract items from a list or add new items to a list. You can insert, delete, and modify list items. You can represent hierarchical data by using lists of lists. For more information about SAS/IML lists, see the SAS Global Forum paper "More Than Matrices" (Wicklin, 2017) or the chapter "Lists and Data Structures" in the SAS/IML documentation. The documentation shows how lists can be used to emulate other data structures, such as stacks, queues, and trees.
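For instance, here is a minimal sketch of how a list can emulate a stack (assuming the ListLen function is available to query the number of items):

```sas
/* emulate a stack: push and pop at the end of a list */
stack = ListCreate();
call ListAddItem(stack, 1:3);                 /* push a vector */
call ListAddItem(stack, "top");               /* push a string */
top = ListGetItem(stack, ListLen(stack));     /* peek at the top item */
call ListDeleteItem(stack, ListLen(stack));   /* pop the top item */
```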

The post Lists: Nonmatrix data structures in SAS/IML appeared first on The DO Loop.


For example, the following SAS/IML program checks to see if ODS graphics is on. If so, it creates a bar chart of a categorical variable. If not, it computes the count for each category and prints a frequency table:

    ods graphics on;   /* -OR- ods graphics off */
    proc iml;
    use sashelp.cars;
       read all var "Origin";
    close;
    if &SYSODSGRAPHICS then    /* is ODS graphics enabled? */
       call bar(Origin);       /* optionally display a graphical summary */
    /* always display a tabular summary */
    call tabulate(Categories, Counts, Origin);
    print Counts[colname=Categories];

The SYSODSGRAPHICS automatic macro variable is a recent addition to the automatic macro variables in SAS. You can read Rick Langston's 2015 paper to discover other (relatively) new macro features in SAS 9.3 and SAS 9.4.

SAS has many other automatic macro variables. In general, you can use automatic macro variables in SAS to check the status of the SAS system at run time. Some of my other favorite automatic macro variables (which some people call system macro variables) are the following:

- SYSERR, SYSERRORTEXT, and SYSWARNINGTEXT: Provide information about errors and warnings from SAS procedures. In my book *Statistical Programming with SAS/IML*, I show how to use these macro variables in conjunction with the SUBMIT/ENDSUBMIT statements in SAS/IML.
- SYSVER and SYSVLONG: Provide information about your version of SAS. You can use these macro variables to execute newer, more efficient code if the user has a recent version of SAS.
- SYSSCP and SYSSCPL: Provide information about the operating system. For example, is SAS running on Windows or Linux?
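For example, the following sketch (the version number and messages are arbitrary) uses these macro variables to branch at run time:

```sas
/* a sketch: report the operating system and branch on the SAS version */
%put Running SAS &SYSVLONG on &SYSSCPL;
%macro UseNewFeatures;
   %if %sysevalf(&SYSVER >= 9.4) %then
      %put NOTE: Features that require SAS 9.4 are available.;
   %else
      %put NOTE: Using fallback code for older SAS releases.;
%mend;
%UseNewFeatures
```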

Do you have a favorite automatic variable? Leave a comment and tell me how you use automatic variables in your SAS programs.

The post Is ODS graphics enabled? Use automatic macro variables to determine the state of SAS appeared first on The DO Loop.
