The post The distribution of the R-square statistic appeared first on The DO Loop.
In general, when you receive advice on the internet, you should always scrutinize the information. In this case, the Olkin-Finn formula does not match my intuition and experience. The formula assumes that the sampling distribution of the R-square statistic is normal, but that is not correct in all cases. For simple one-variable regression, the R-square value is equal to the squared Pearson correlation between the explanatory variable, X, and the response variable, Y. Furthermore, the distribution of the correlation statistic is fairly complicated and is not normal.
To better understand the sampling distribution of the R-square statistic and when it is approximately normal, I ran a simulation study. This article presents the results of the simulation study and discusses when to use and when to avoid using the Olkin-Finn formula.
A confidence interval is based on the sampling distribution of a statistic on random samples of data. Often (but not always) the sampling distribution depends on the distribution of the data. In order to obtain a formula, researchers often make simplifying assumptions. Common assumptions include that the data are multivariate normal, that the model is correctly specified, that the sample is very large, and so forth. The last assumption (large sample) is often used to obtain formulas that are asymptotically valid as the sample size increases. I eventually found a reference that states that the Olkin-Finn formula should not be used for small samples (degrees of freedom less than 60), so it is definitely an asymptotic formula.
By definition, the R^{2} value is in the interval [0,1]. As we will see, the distribution of the R^{2} statistic is highly skewed when R^{2} is near 0 or 1. Consequently, the distribution isn't normal, and you shouldn't use the Olkin-Finn formula when R^{2} is near 0 or 1. However, "nearness" depends on the sample size: when the sample size is large you can use the formula on a wider interval than when the sample size is small.
The best way to understand the distribution of the R^{2} statistic is to use a simulation. For data, I will choose a bivariate normal (BVN) sample for which the Pearson correlation is ρ. As mentioned earlier, if (X,Y) ~ BVN(0, ρ), then the regression of Y onto X has an R^{2} value equal to the squared correlation between X and Y (=ρ^{2}). For simplicity, I will break the simulation study into two procedure calls. First, for each value of ρ, I will use PROC IML to simulate 1,000 samples of size n=100 from BVN(0,ρ). Then, I will use PROC REG to run a regression on each sample and save the R^{2} statistic into an output data set.
/* output a simulation data set: For each rho, generate 1,000 random
   samples of size N in which (X,Y) are bivariate normal with correlation rho */
%let NumSamples = 1000;
%let N = 100;
proc iml;
N = &N;                            /* sample size */
mu = {0 0};
Sigma = I(2);
rhoList = round( do(0, 0.9, 0.1), 0.1) || 0.95;
rho = j(N,1);  SampleID = j(N,1);
create Sim var {'rho' 'SampleID' 'Y' 'X'};
do i = 1 to ncol(rhoList);
   Sigma[{2 3}] = rhoList[i];      /* set off-diagonal elements of cov matrix */
   rho[,] = rhoList[i];
   do ID = 1 to &NumSamples;
      Z = randnormal(N, mu, Sigma);
      X = Z[,1];  Y = Z[,2];
      SampleID[,] = ID;
      append;
   end;
end;
close;
QUIT;

/* run a procedure to obtain statistics for each random sample. In this case,
   obtain the R-square statistic for the regression model Y = b0 + b1*X */
proc reg data=Sim noprint outest=PE;
   by rho SampleID;
   model Y = X / rsquare;          /* add R-square statistic to OUTEST= data set */
quit;
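If you want a quick sanity check of the simulation design outside SAS, the same study can be sketched with the Python standard library. (The function names and tolerances below are my own; this is not part of the SAS program.)

```python
import math
import random

def r_square(x, y):
    """R-square for the one-variable regression of y on x = squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def simulate_r2(rho, n=100, num_samples=1000, seed=1234):
    """Approximate the sampling distribution of R^2 for bivariate normal data with correlation rho."""
    rng = random.Random(seed)
    c = math.sqrt(1 - rho * rho)
    r2s = []
    for _ in range(num_samples):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rho * a + c * rng.gauss(0, 1) for a in x]  # corr(X, Y) = rho
        r2s.append(r_square(x, y))
    return r2s
```

For ρ=0.6, the simulated R^{2} values cluster near ρ^{2}=0.36; for ρ=0, the values pile up near 0 and the distribution is strongly right-skewed.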
At this point, the PE data set contains a variable named _RSQ_. There is a value of _RSQ_ for each sample, and there are 1,000 samples for each ρ value. To visualize the distribution of the R-square statistic, you can use box plots, as follows:
/* plot the distribution of the R-square statistics for each rho */
proc sgplot data=PE noautolegend;
   vbox _RSQ_ / category=rho nooutliers;
   scatter x=rho y=_RSQ_ / transparency=0.8 jitter;
   yaxis grid min=0 max=1;
run;
For n=100, the box plots show that the distribution of the R^{2} statistic is skewed except when ρ is near sqrt(0.5) ≈ 0.7. Equivalently, ρ^{2} is near 0.5, which is as far as ρ^{2} can be from 0 and 1. To make the distributions easier to visualize, you can create a panel of histograms:
title "Distribution of R-Square Statistics";
title2 "n = &n";
proc sgpanel data=PE noautolegend;
   where rho in (0, 0.2, 0.4, 0.6, 0.8);
   panelby rho / rows=5 onepanel uniscale=column;
   histogram _RSQ_ / binwidth=0.02 binstart=0.01;
   density _RSQ_ / type=kernel;
   colaxis min=0 offsetmin=0.01;
run;
The graph demonstrates that for a small sample (n=100), the distribution of the R^{2} statistic is not normal when R^{2} is near 0. However, a normal approximation seems reasonable for ρ=0.6 and ρ=0.8. If you increase the sample size to, say, n=500, and rerun the program, you will notice that the R^{2} distributions become narrower and more normal. The distribution remains skewed for extreme values of ρ.
This article reminds us that most formulas have assumptions behind them. Sometimes formulas are applicable to our data, and sometimes they are not. In this article, I use simulation to visualize the sampling distribution of the R^{2} statistic for a simple linear regression on a small set of bivariate normal data. The simulations suggest that you should use care when using the Olkin-Finn formula for small samples. The Olkin-Finn formula is asymptotic and assumes large n. Even for large samples, the formula assumes that the sampling distribution of R-square is approximately normal, which means that you shouldn't use it when R^{2} is near 0 or near 1.
I have no issues with using the Olkin-Finn formula for large data sets when the observed R-square statistic is not too extreme. But what should you do to obtain a confidence interval for small and mid-sized data? One possibility is to use a bootstrap analysis to estimate a confidence interval. In SAS, the GLMSELECT procedure supports the MODELAVERAGE statement, which enables you to perform a simple bootstrap analysis of many regression statistics. A bootstrap analysis does not require large samples and can correctly handle situations where the sampling distribution of R-square is skewed.
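To sketch the bootstrap idea, here is a minimal percentile bootstrap in standard-library Python. This is an illustration of the general technique, not the MODELAVERAGE implementation, and the function names are my own.

```python
import random

def r_square(x, y):
    """R-square for a one-variable regression = squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def bootstrap_ci_r2(x, y, num_boot=2000, alpha=0.05, seed=5):
    """Percentile bootstrap CI for R^2: resample (x, y) PAIRS with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(num_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(r_square([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int(num_boot * alpha / 2)]            # 2.5th percentile
    hi = stats[int(num_boot * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lo, hi
```

Because the percentile interval is built from the empirical bootstrap distribution, it automatically respects the [0,1] range of R-square and adapts to skewness.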
As discussed in this article, the Olkin-Finn formula provides an asymptotic confidence interval for R-square. The following text is from p. 88 of Cohen et al. (2003):
The text uses R_{2} for the observed estimate, which is slightly confusing, but I will stick with their notation.
If R^{2} is the point estimate for a linear regression that has k parameters (not counting the intercept) and the sample has n observations, the formula approximates the sampling distribution by a normal distribution that is centered at R^{2} and has a standard deviation equal to
SE = sqrt( (4 R^{2} (1-R^{2})^{2} (n-k-1)^{2}) / ((n^{2}-1)(n+3)) )
To obtain a 100(1-α) confidence interval, you can use a normal quantile for the α level, or you can use a quantile from a t distribution with n-k-1 degrees of freedom. The following SAS DATA step shows both estimates:
/* Online citation for the formula is Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003).
   Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.)
   The formula for the squared standard error of R^2 seems to originate from
   Olkin and Finn (1995), "Correlations Redux", Psychological Bulletin, 118(1), pp. 155-164.
   This is an asymptotic interval, so best for n - k - 1 > 60.
   For simple 1-var regression, R-sq = corr(X,Y)**2, so for
      rho = 0.2, R-sq = 0.04
      rho = 0.6, R-sq = 0.36
*/
%let N = 50;            /* smaller than recommended */
data AsympR2;
alpha = 0.05;           /* obtain 100*(1-alpha)% confidence interval */
n = &N;                 /* sample size */
k = 1;                  /* number of parameters, not including intercept */
do R2 = 0.04, 0.36;     /* observed R^2 for sample */
   SE = sqrt((4*R2*((1-R2)**2)*((n-k-1)**2))/((n**2-1)*(n + 3)));
   /* use normal quantile to estimate CI (too conservative) */
   z = quantile("Normal", 1-alpha/2);
   LCL_N = R2 - z*SE;
   UCL_N = R2 + z*SE;
   /* use t quantile to estimate CI */
   t = quantile("T", 1-alpha/2, n-k-1);
   LCL_T = R2 - t*SE;
   UCL_T = R2 + t*SE;
   output;
end;
run;

title "Asymptotic Confidence Interval (Normal Quantile)";
proc print noobs data=AsympR2;
   var SE z LCL_N UCL_N;
   ID R2;
run;

title "Asymptotic Confidence Interval (t Quantile)";
proc print noobs data=AsympR2;
   var SE t LCL_T UCL_T;
   ID R2;
run;
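As a cross-check of the arithmetic, here is the same computation sketched in standard-library Python (the function names are my own):

```python
import math
from statistics import NormalDist

def olkin_finn_se(R2, n, k):
    """Asymptotic standard error of R^2 (Olkin & Finn, 1995)."""
    return math.sqrt(4 * R2 * (1 - R2) ** 2 * (n - k - 1) ** 2
                     / ((n ** 2 - 1) * (n + 3)))

def olkin_finn_ci(R2, n, k, alpha=0.05):
    """100(1-alpha)% CI that uses a normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = olkin_finn_se(R2, n, k)
    return R2 - z * se, R2 + z * se
```

For n=50, k=1, and R^{2}=0.04, the lower limit of the normal-quantile interval is negative, which matches the SAS output.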
Notice that a naïve application of the Olkin-Finn formula can lead to a confidence interval (CI) that includes negative values of R-square, although R-square can never be negative. You can compare the two examples to the panel of histograms shown earlier in this article. When ρ=0.2, the R-square value is 0.04; when ρ=0.6, the R-square value is 0.36. You can see how the Olkin-Finn CI compares to the true distribution in the histograms.
The post Visualize a multivariate regression model when using spline effects appeared first on The DO Loop.
This article explains how to score the model on new data and how to visualize the predicted values by plotting them against the explanatory variable that you used to generate the spline effects. Although the model uses spline effects, everything in this article also applies to models that do not use splines.
I have previously discussed how to use the EFFECT statement in SAS regression procedures to build regression models that use cubic splines to model nonlinear relationships between the response variable and an explanatory variable. For procedures that do not support an EFFECT statement, you can write the spline basis functions to a data set and then use those effects as explanatory variables in a model.
Visualizing a regression model that has multiple covariates can be a challenge because the "obvious" visualization does not work! The "obvious" (but wrong!) visualization uses an OUTPUT statement to write the predicted values to a data set, then plots the predicted values versus one of the continuous variables. But the following example shows that this naïve approach does not give you an effective visualization when there are multiple covariates:
/* sort the data and choose a subset of 4-, 6-, and 8-cylinder vehicles */
proc sort data=sashelp.cars out=cars(where=(Type^="Hybrid" and Cylinders in (4,6,8)));
   by EngineSize;
run;

/* Fit data by using restricted cubic splines. The EFFECT statement is
   supported by many procedures: GLIMMIX, GLMSELECT, LOGISTIC, PHREG, ... */
proc glmselect data=cars;
   effect spl = spline(EngineSize / details    /* define spline effects */
                naturalcubic basis=tpf(noint)  /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));          /* 4 evenly spaced interior knots */
   class Cylinders;
   model mpg_city = spl Cylinders Weight / selection=none;  /* include categorical and continuous covariates */
   output out=GLSOut pred=p_MPG_City;          /* try to use predicted values on original data */
quit;

title "Predicted Values at Data Values from OUTPUT Statement";
title2 "This Output Is Not Useful!";
proc sgplot data=GLSOut;
   scatter x=EngineSize y=MPG_City;
   series x=EngineSize y=p_MPG_City / group=Cylinders;
   xaxis grid;
   yaxis grid label="Predicted MPG (City)";
run;
This graph looks strange, but the strangeness is not caused by the splines. Every multivariate regression model looks similar when you project the predicted values onto one of the covariates. The reason is that two observations that have the same (or nearly the same) value for the horizontal variable in the graph (EngineSize, in this example) might have very different values for one of the other covariates. This means that for every horizontal value, there are multiple response values. Consequently, when you use a SERIES statement to plot the predicted values, the line looks jagged as it traces out multiple Y values for each X value.
You can see the range of predicted values by using PROC PRINT. The following statements display the predicted values for observations at the extreme right edge of the graph:
proc print data=GLSOut;
   where EngineSize=6.0;
   var Make Model EngineSize Cylinders Weight MPG_City p_MPG_City;
run;
The table shows that the Weight values are different for the five vehicles that have 6.0-liter engines. Because the Weight variable is included in the model, the predicted values are also different. Thus, when EngineSize=6, the SERIES statement draws a vertical segment that spans the range [9.84, 14.34] in the vertical direction. The same explanation is valid for other values of EngineSize, which leads to the jagged line.
The solution to this problem is to use a sliced fit plot. In a sliced fit plot, you evaluate the model on a set of points that is constructed as follows: a sequence of values for the continuous variable that you want to visualize, a fixed reference value (such as the mean or median) for each of the other continuous covariates, and, optionally, each level of a categorical variable that you want to "slice" by.
The good news is that there is a straightforward way to obtain a sliced fit plot in SAS! Many SAS regression procedures support the STORE statement, which enables you to create an item store for the model. You can use the EFFECTPLOT statement in PROC PLM to create a sliced fit plot. For more control over the visualization, you can use the SCORE statement in PROC PLM to evaluate the model at any set of points that you want.
This works because the item store contains not only the parameter estimates for the spline basis, but also the transformation that generates the spline effects from the original explanatory variable.
For example, the following model includes spline effects. The program uses the STORE statement to save the model to an item store, then uses PROC PLM to plot the predicted values against the EngineSize variable for various values of the covariates:
proc glmselect data=cars;
   effect spl = spline(EngineSize / details    /* define spline effects */
                naturalcubic basis=tpf(noint)  /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));          /* 4 evenly spaced interior knots */
   class Cylinders Origin;
   model mpg_city = spl Cylinders Weight / selection=none;  /* include categorical and continuous covariates */
   store out=GLSModel;
run;

title "Predicted Values at Specified Values from PROC PLM";
title2 "Model Uses Splines";
proc plm restore=GLSModel;
   effectplot slicefit(x=EngineSize sliceby=Cylinders) / at(Weight=3500);
run;
Notice the information that you can supply to the SLICEFIT keyword in the EFFECTPLOT statement: the X= option specifies the variable for the horizontal axis, the SLICEBY= option specifies a classification variable (the procedure draws one curve for each level), and the AT option (after the slash) specifies fixed values for the remaining covariates.
You can also use PROC PLM to score the model on a new set of data. This is useful when you want complete control over the values at which to evaluate the model. To score the model at arbitrary locations, generate a scoring data set and use the SCORE statement in PROC PLM, as follows:
/* Create a scoring data set. For cars, mean(Weight)=3570 and
   median(Weight)=3473, so use 3500 as a "round" value. */
data ScoreIn;
Weight = 3500;                     /* a value close to the mean */
do Cylinders = 4,6,8;              /* important modes */
   do EngineSize = 1.5 to 6 by 0.1;
      output;
   end;
end;
run;

/* score the model by using PROC PLM */
proc plm restore=GLSModel;
   score data=ScoreIn out=ScoreOut;
quit;
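The scoring data set is simply a Cartesian product of covariate values. For comparison, the same grid can be sketched in a few lines of Python (a hypothetical illustration; the dictionary keys mirror the SAS variable names):

```python
# Weight fixed at 3500, Cylinders in {4, 6, 8}, EngineSize from 1.5 to 6.0 by 0.1
score_in = [
    {"Weight": 3500, "Cylinders": c, "EngineSize": round(1.5 + 0.1 * i, 1)}
    for c in (4, 6, 8)
    for i in range(46)   # 46 values: 1.5, 1.6, ..., 6.0
]
```

The grid has 3 x 46 = 138 rows, one for each combination of a slice level and a position along the EngineSize axis.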
You can use the ScoreOut data set to create your own sliced fit plot by using PROC SGPLOT. Scoring the data enables you to overlay observations, add labels, and enhance the plot in other ways. For example, you can combine the predicted values with the original data and overlay them, as follows:
/* you can combine the original data and the predicted values to visualize the sliced model */
data Combine;
set cars ScoreOut(rename=(
    Weight     = scoreWeight
    Cylinders  = scoreCylinders
    EngineSize = scoreEngineSize ));
run;

title "Model Scored at Weight=3500";
proc sgplot data=Combine;
   scatter x=EngineSize y=MPG_City / markerattrs=(symbol=CircleFilled)
           colorresponse=Weight colormodel=(yellow red);
   series x=scoreEngineSize y=Predicted / group=scoreCylinders name="pred" nomissinggroup;
   xaxis grid;
   yaxis grid label="MPG (City)";
   gradlegend / position=bottom;
   keylegend "pred" / title="Cylinders" across=1 position=right;
run;
This graph shows a potential pitfall of evaluating a model at the mean or median of a covariate. For these data, the EngineSize and Weight variables are highly correlated (r=0.83). A consequence of that fact is that small vehicles tend to have small engines and small weights, and large vehicles tend to have large engines and large weights. (There are also sports cars, which have large engines and small weights.) When you evaluate the model at the value Weight=3500 and overlay the predicted values on the data, the fit does not seem to pass through the bulk of the data. It under-predicts the MPG for small cars and over-predicts the MPG for large cars. This happens because many vehicles have weights that are far from 3,500 pounds. You should not expect the predicted MPG_City (which is for the reference value Weight=3500) to be close to the observed MPG_City when the observed weight is far from the reference value.
This article shows how to use the EFFECTPLOT statement in PROC PLM to visualize multivariate regression models, with an emphasis on models that include spline effects. The EFFECTPLOT SLICEFIT statement enables you to specify the variable to use for the X axis and the values of the covariates at which to evaluate the model. Optionally, you can specify a classification variable to get several curves, one for each level of the classification variable.
In a similar way, you can use the SCORE statement in PROC PLM to evaluate the model at an arbitrary set of values for the explanatory variable. When used correctly and in conjunction with PROC SGPLOT, this gives you absolute control over the visualization of multivariate regression models.
The ideas in this article also apply to models that do NOT use spline effects.
The post Find the label of a variable in SAS appeared first on The DO Loop.
However, you can easily look up a label. This article shows how to look up a variable's label and copy it into a macro variable so that it can be used in reports and graphs. I recently used this technique to display a variable's label on the axis of a graph.
Did you know that there are DATA step functions in SAS that get the attributes of variables?
For example, the following DATA step uses the VLABEL function, followed by a call to the SYMPUT subroutine to copy the value into a macro variable.
%let DSName = Sashelp.Iris;     /* data set name */
%let Var = SepalLength;         /* response variable */

/* look up the label of the variable; create a macro variable */
data _null_;
set &DSName(obs=1);             /* load attributes of variables */
label = vlabel(&Var);           /* get the variable's label */
call symput("Labl", label);
run;

%put &=Labl;
LABL=Sepal Length (mm)
The call to the VLABEL function is straightforward. My implementation reads one observation from the input data set. There is, in fact, a sneaky way to load the attributes without reading ANY observations, but it is somewhat esoteric. For the interested reader, I direct you to the common SAS trick for getting the number of observations into a macro variable, which uses an "IF 0 THEN SET" clause to set the variable attributes and uses a STOP statement to handle the fact that the DATA step does not encounter an end-of-file token. See the link for details.
The previous DATA step is short and easy to understand, but I can hear some of my readers exclaim, "But wait! What about DICTIONARY tables?"
Experienced SAS programmers might use DICTIONARY tables for this task. You can get metadata about variables (formats, lengths, labels, etc.) by using PROC SQL to examine the Dictionary.columns table, or by using the DATA step and the Sashelp.vcolumn table. DICTIONARY tables are powerful resources, but the VLABEL function is much simpler in this instance. For a comparison, I include an Appendix that demonstrates how to use dictionaries.
You can use the VLABEL function in a SAS DATA step to get the label for a variable. Similar functions exist to get other variable attributes, such as formats (VFORMAT), type (VTYPE), and name (VNAME). These functions provide a simple way to get a variable's metadata. An example that uses DICTIONARY tables is shown in the Appendix.
If you have ever been to a SAS conference, you know that DICTIONARY tables are a favorite topic for SAS programmers. DICTIONARY tables are read-only tables that provide information about the state of the SAS session, including libraries, data sets, variables, and system options.
For completeness, here is some DATA step code that gets the label for a variable. The first DATA step extracts the libref and the data set name from a two-level specification of the data set. In some situations, you can skip this step.
%let DSName = Sashelp.Iris;     /* two-level data set name */
%let Var = SepalLength;         /* response variable */

/* if necessary, extract libname and member name from two-level name */
data _null_;
k = find("&DSName", ".");                              /* find the position of the dot */
if k > 0 then lib = upcase(substr("&DSName", 1, k-1)); /* if two-level name */
else lib = "WORK";                                     /* if one-level name */
mem = upcase(substr("&DSName", k+1));
call symput("Lib", lib);                               /* LIB contains the libref */
call symput("Mem", mem);                               /* MEM contains the data set name */
run;
The result of the previous DATA step is to parse a two-level data set name and create two macro variables: LIB contains the libref and MEM contains the name of the data set. You can use these values to access the DICTIONARY table that contains information about columns. Here is the DATA step approach:
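The parsing logic is language-independent. For readers who think more easily in another language, here is the same idea sketched in Python (the function name is my own):

```python
def parse_dataset_name(ds_name):
    """Split a SAS data set name into (libref, member); default libref is WORK."""
    lib, dot, mem = ds_name.partition(".")
    if not dot:
        return "WORK", ds_name.upper()   # one-level name: WORK library
    return lib.upper(), mem.upper()      # two-level name
```

As in the DATA step, a one-level name such as "Iris" is assigned to the WORK libref, and both parts are upper-cased to match the values stored in the DICTIONARY tables.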
/* look up the label of the original variable; create a macro variable */
data _null_;
set Sashelp.vcolumn(where=( libname="&Lib" and memname="&Mem"
                            and upcase(name)=upcase("&Var")));
call symput("Labl", label);
put label;
run;

%put &=Labl;
If you prefer, you can use PROC SQL to obtain the same information:
/* Alternative: use PROC SQL and the DICTIONARY tables */
proc sql;
select label into :Labl
   from dictionary.columns
   where libname="&Lib" AND memname="&Mem" AND upcase(name)=upcase("&Var");
quit;

%put &=Labl;
The post Create filled density plots in SAS appeared first on The DO Loop.
A SAS programmer wanted to visualize a density estimate for some univariate data. The data had several groups, so he wanted to create a panel of density estimates, which you can easily do by using PROC SGPANEL in SAS. However, the programmer's boss wanted to see filled density estimates, such as the panel shown to the right. These graphs are also known as shaded density plots. This article shows how to create a graph like this in SAS.
The first step in any visualization is to prepare data that are suitable to visualize. In my blog, I often use data sets in the SASHELP library, which includes familiar data sets such as Sashelp.Cars, Sashelp.Heart, and Sashelp.Iris. For this task, I will use Sashelp.BWeight, which includes birthweights for 50,000 live singleton births in the US, along with information about the mothers. One of the variables is MomSmoke, which is a binary (0/1) variable that records whether the mother smokes cigarettes. It is well known that babies of smoking mothers tend to weigh less than babies of non-smoking mothers.
You can use PROC SGPANEL to visualize the weight of the babies for smoking and non-smoking mothers. The following statements overlay a histogram and a kernel density estimate (KDE) for each group. To make it easier for you to use your own data, I include a few macro variables. You can specify the names of your data set and the variable whose distribution you want to visualize. You can also specify the classification (grouping) variable and the number of levels, which equals the number of rows in the panel of density estimates.
/* generalize the problem: Specify the name of the data set, the name of the
   response variable, and the name of the classification variable */
%let DSName = Sashelp.BWeight;  /* data set name */
%let Var = Weight;              /* response variable */
%let ClassVar = MomSmoke;       /* classification variable */

ods graphics / width=480px height=480px;
title "Panel from PROC SGPANEL";
proc sgpanel data=&DSName noautolegend;
   panelby &ClassVar / layout=ROWLATTICE rowheaderpos=right;
   histogram &Var;
   density &Var / type=kernel;  /* or TYPE=NORMAL for a normal estimate */
   rowaxis grid;
   colaxis grid;
run;
This is the graph that I recommend that you use to demonstrate the empirical difference between the weights of babies for smoking and non-smoking mothers. However, the SAS programmer wanted to display ONLY the KDE and wanted to fill the area underneath the KDE curve. Unfortunately, the DENSITY statement does not support a "FILL" option. However, the BAND statement can be used to display a curve and to fill the area beneath the curve.
I have previously shown how to use the BAND statement to display the upper or lower tail densities for a distribution. You can use the same ideas to display a panel of shaded density estimates. However, I first want to show that you can get a nice-looking panel by using only PROC UNIVARIATE.
The HISTOGRAM statement supports many options and suboptions. Many people use the KERNEL option to overlay a kernel density estimate on a histogram. However, you can use the NOBARS option to hide the histogram bars, leaving only the KDE. Furthermore, you can use the KERNEL(FILL) suboption to display shaded density curves. Lastly, you can use the STATREF=MEAN option to display a reference line at the mean of each group, as follows:
title "Panel of Kernel Density Estimates";
proc univariate data=&DSName;
   class &ClassVar;
   var &Var;
   histogram &Var / odstitle=title nocurvelegend
                    nobars        /* hide histogram */
                    kernel(fill)  /* display KDE, or use NORMAL(FILL) */
                    statref=mean  /* optional: display mean for each group */
                    ncol=1 nrow=2 /* optional: set NROW=(number of levels in ClassVar) */
                    ;
   ods select histogram;
run;
This panel of KDEs includes reference lines for the mean of each group, which can be an effective way to visualize the difference between the means of the groups. The histograms do not appear, and the KDEs are filled. If you do not like the default layout of ticks on the horizontal axes, you can use the ENDPOINTS= or MIDPOINTS= options to change the tick locations.
Although PROC UNIVARIATE does a good job creating a panel of shaded density estimates, PROC SGPANEL supports more ways to control colors, placement of headers, and other attributes. The HISTOGRAM statement in PROC UNIVARIATE supports the OUTKERNEL= option, which you can use to write a data set that contains the coordinates for the KDE curves of each group. You can then use PROC SGPANEL to read that data and display a panel of curves.
If you want to obtain shaded density estimates, use the BAND statement to specify the upper limit of a band. Use zero (0) as the lower limit. The following statements create a basic panel of shaded KDEs.
/* Write the KDEs to a data set */
proc univariate data=&DSName;
   class &ClassVar;
   var &Var;
   histogram &Var / odstitle=title nobars overlay
                    kernel outkernel=OutKer;  /* write kernel estimates to data set */
   ods select histogram;                      /* ignore this graph */
run;

/* Read the KDE coordinates into PROC SGPLOT or PROC SGPANEL. If you use
   LAYOUT=ROWLATTICE, you do not need to specify the levels of the grouping variable */
proc sgpanel data=OutKer;
   *label _VALUE_ = "Weight of Baby (g)" _PERCENT_="Percent";  /* optional: set labels */
   panelby &ClassVar / layout=ROWLATTICE rowheaderpos=right onepanel;
   band x=_VALUE_ upper=_Percent_ lower=0;
   rowaxis grid offsetmin=0;
   colaxis grid;
run;
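If you are curious about what the OUTKERNEL= data set contains, it stores the (x, y) coordinates of each kernel density curve. The following Python sketch computes analogous coordinates for a Gaussian KDE. Note the assumptions: it returns a raw density (total area 1) and uses Silverman's rule of thumb for the bandwidth, whereas PROC UNIVARIATE scales the curve to the histogram's percent axis and uses its own default bandwidth.

```python
import math

def kde_curve(data, grid, bandwidth=None):
    """Return (x, density) coordinates of a Gaussian kernel density estimate."""
    n = len(data)
    if bandwidth is None:
        mean = sum(data) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-1 / 5)   # Silverman's rule of thumb
    c = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [(x, c * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data))
            for x in grid]
```

To reproduce a panel of filled densities, you would compute one curve per group and fill the region between the curve and zero, which is exactly what the BAND statement does with the OUTKERNEL= coordinates.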
For brevity, I did not change any default attributes in this plot, other than adding grid lines. The graph at the top of this article includes small changes to the axis labels and the tick marks.
If you want to try out this idea on your own data, change the values of the macro variables and rerun only the program in the previous section. For example, here is the same code applied to the SepalLength variable in the Sashelp.Iris data, grouped by the Species variable:
/* run the program on different data */
%let DSName = Sashelp.Iris;     /* data set name */
%let Var = SepalLength;         /* response variable */
%let ClassVar = Species;        /* classification variable */

/* Write the KDEs to a data set */
proc univariate data=&DSName;
   class &ClassVar;
   var &Var;
   histogram &Var / odstitle=title nobars overlay
                    kernel outkernel=OutKer;  /* write kernel estimates to data set */
   ods select histogram;
run;

/* Read the KDE coordinates into PROC SGPANEL. If you use LAYOUT=ROWLATTICE,
   you do not need to specify the levels of the grouping variable */
title "Distribution of &Var Variable for Each Level of &ClassVar";
proc sgpanel data=OutKer;
   label _VALUE_ = "Sepal Length (mm)" _PERCENT_="Percent";
   panelby &ClassVar / layout=ROWLATTICE rowheaderpos=right onepanel;
   band x=_VALUE_ upper=_Percent_ lower=0;
   rowaxis grid offsetmin=0;
   colaxis grid;
run;
Instead of hard-coding the label, you could look up the label for the response variable, copy it into a macro variable, and automatically use it on the LABEL statement. This is a nice application of using dictionaries, so I will demonstrate it in a future article.
This article shows how to create a panel of filled kernel density estimates for subpopulations that belong to groups. The UNIVARIATE procedure can create this panel by itself, but for more control you might want to write the density estimates to a SAS data set and use PROC SGPANEL to display the KDEs. To display filled KDEs, you need to use the BAND statement instead of the SERIES statement.
The post On the correctness of a discrete simulation appeared first on The DO Loop.
I recently demonstrated this technique by simulating many tosses of a pair of fair six-sided dice. The sum can be any of the values {2, 3, 4, ..., 12}, and the theoretical probabilities for this discrete distribution are known for each value. For example, the sum is 2 with probability 1/36, the sum is 3 with probability 2/36, and so forth. In a previous article, I simulated 1,000 tosses and used PROC FREQ in SAS to compare the empirical distribution of the sums to the theoretical distribution. For that example, the p-value for the chi-square test is large, which indicates that the empirical and theoretical distributions are similar.
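The theoretical probabilities form a discrete triangular distribution, which you can compute exactly by enumerating the 36 equally likely outcomes. Here is a quick outside-SAS sketch of that enumeration in Python:

```python
from fractions import Fraction
from itertools import product

# Exact distribution of the sum of two fair six-sided dice (discrete triangular)
faces = range(1, 7)
probs = {s: Fraction(0) for s in range(2, 13)}
for d1, d2 in product(faces, faces):
    probs[d1 + d2] += Fraction(1, 36)   # each (d1, d2) outcome has probability 1/36
```

The distribution rises linearly from P(sum=2)=1/36 to P(sum=7)=6/36 and falls symmetrically back to P(sum=12)=1/36.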
However, I ended the article with an unsettling question. A hypothesis test will REJECT the null hypothesis for 5% of the random samples that you generate. So, if the test rejects the null hypothesis, how do you know whether your simulation is wrong or if you just got unlucky?
The obvious answer is to generate more than one random sample. For example, if you generate 100 random samples, expect about 5% of them to reject the null hypothesis. I recently discussed scenarios in which the p-values are uniformly distributed when the null hypothesis is true. The example in that article was a test of the mean in a continuous sample. However, the ideas in that article generalize to other hypothesis tests. This article shows how to examine the distribution of the p-values for a chi-square test that assesses the correctness of a discrete distribution.
In a simulation, you specify the model from which you will draw random samples. In this example, the sum of the two (virtual) dice are supposed to have a discrete triangular distribution. The following SAS program generates 10,000 random samples. Each sample simulates tossing two dice a total of 1,000 times. After generating the samples, you can use PROC FREQ to count the proportion of times that the sum was 2, 3, ..., and 12 in each sample. You can then use a chi-square test to compare the empirical distribution to the known theoretical probabilities for the sum. Notice that the program generates all samples in a single data set and uses a BY-group analysis to analyze each sample.
/* Simulate 10,000 random samples. Each sample is 1,000 tosses
   of two six-sided dice. Report the sum of the dice. */
%let seed = 1234;
%let NumSamples = 10000;   /* number of samples */
%let N = 1000;             /* size of each sample */
data SumDice2;
call streaminit(&seed);
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      d1 = rand("Integer", 1, 6);
      d2 = rand("Integer", 1, 6);
      sum = d1 + d2;       /* simulate the sum of two six-sided dice */
      output;
   end;
end;
keep SampleID sum;
run;

/* For each sample, perform a chi-square test to see if the sample seems to
   come from the model with known theoretical probabilities.
   Save the chi-square statistic and p-value in a data set. */
%let probSum = 0.0277777778 0.0555555556 0.0833333333 0.1111111111 0.1388888889
               0.1666666667 0.1388888889 0.1111111111 0.0833333333 0.0555555556
               0.0277777778;
proc freq data=SumDice2 noprint;
   by SampleID;
   tables sum / nocum chisq testp=(&probSum);   /* run 10,000 chi-square tests */
   output out=ChiSqStat pchi;
run;
The ChiSqStat data set contains 10,000 chi-square statistics in the _PCHI_ variable. The distribution of the _PCHI_ variable approximates the sampling distribution of the chi-square test statistic. Since these data are all simulated from the model specified in the hypothesis test, the test statistics should obey a chi-square distribution with 10 degrees of freedom. Recall that the chi-square distribution with d degrees of freedom is equivalent to a gamma(d/2, 2) distribution. Therefore, the following call to PROC UNIVARIATE overlays a chi-square(10) distribution on the sampling distribution:
/* Recall that chisq(DF) = gamma(DF/2, 2).
   See https://blogs.sas.com/content/iml/2015/12/07/proc-univariate-distributions.html
   The possible number of sums is 11, so DF = 11 - 1 = 10.
   Fit a chisq(10) distribution by using PROC UNIVARIATE */
title "Sampling Distribution of Chi-Square Statistic";
proc univariate data=ChiSqStat;
   var _PCHI_;
   histogram _PCHI_ / gamma(shape=5 scale=2) odstitle=title;   /* overlay chisq(10) */
run;
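The chi-square/gamma identity that justifies the GAMMA(SHAPE=5 SCALE=2) option is easy to verify numerically. The following sketch (in stdlib Python, used here only as a language-neutral cross-check, not part of the SAS workflow) evaluates both density formulas and confirms they agree:

```python
import math

def chisq_pdf(x, df):
    # chi-square density: x^(df/2 - 1) * exp(-x/2) / (2^(df/2) * Gamma(df/2))
    return x**(df/2 - 1) * math.exp(-x/2) / (2**(df/2) * math.gamma(df/2))

def gamma_pdf(x, shape, scale):
    # gamma density: x^(k-1) * exp(-x/theta) / (Gamma(k) * theta^k)
    return x**(shape - 1) * math.exp(-x/scale) / (math.gamma(shape) * scale**shape)

# chisq(10) and gamma(shape=5, scale=2) agree at every evaluation point
diffs = [abs(chisq_pdf(x, 10) - gamma_pdf(x, 5, 2)) for x in (0.5, 5, 10, 20)]
```

Substituting k = df/2 and theta = 2 into the gamma density reproduces the chi-square density term by term, which is why the two formulas agree to floating-point precision.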
The histogram shows that a chi-square(10) distribution is a perfect fit for the sampling distribution of the chi-square statistic. As I explained in a previous article, this implies that the associated p-values are uniformly distributed. You can visualize this situation by plotting a histogram of the P_PCHI variable, as follows:
title "Distribution of P-Values for Hypothesis Test";
proc sgplot data=ChiSqStat;
   histogram P_PCHI / binwidth=0.05 binstart=0.025   /* center first bin at h/2 */
                      scale=percent transparency=0.5;
   refline 5 / axis=y;
run;
The p-values are uniformly distributed. This should give you confidence that the dice rolls are correctly simulated from the theoretical discrete probability distribution.
This article combines the ideas in two previous articles. In the first article, I showed how to simulate dice rolls in the SAS DATA step. In the second article, I showed that if data are simulated from a model for a hypothesis test, the p-values for the test are uniformly distributed. For simplicity, I showed that the p-values are uniformly distributed for a familiar hypothesis test of the mean of a continuous distribution. However, this article shows that the same ideas apply to some hypothesis tests for discrete distributions, provided that the test statistic is continuous.
The post On the correctness of a discrete simulation appeared first on The DO Loop.
]]>The post Rank, order, and sorting appeared first on The DO Loop.
]]>For clarity, this article assumes that the data do not contain any tied (duplicate) values. For simplicity, all sorts are assumed to be in ascending order.
Ranks are related to the sort order for a set of values. The smallest value is ranked 1, the next smallest is ranked 2, and so on. This is the "race ranking," where the athlete with the smallest time wins first place; the largest time is considered last place. (Golf also rewards low scores.) Mathematically, you sort the data values in ascending order and assign the rank as the ordinal position of the sorted data. If there are no tied values, the rank is unique.
For example, in the data {75, 88, 82, 94, 68}, the corresponding ranks are {2, 4, 3, 5, 1} because 68 is the first (sorted) value, 75 is the second (sorted) value, and so forth. For n unique values, the ranks are integers in the range [1,n].
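Because ranks are language-agnostic, the definition is easy to check outside SAS. Here is a minimal cross-check in stdlib Python (a sketch, not part of the original SAS workflow), which works because the data contain no ties:

```python
x = [75, 88, 82, 94, 68]
s = sorted(x)                        # [68, 75, 82, 88, 94]
# rank of each value = its 1-based position in the sorted list
rank = [s.index(v) + 1 for v in x]   # valid only when there are no ties
```

The list `rank` evaluates to [2, 4, 3, 5, 1], matching the example in the text.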
In SAS you can use the RANK procedure to assign ranks. In the SAS IML language, the RANK function supports the same functionality.
proc iml;
/* for clarity, assume no tied values */
x = {75, 88, 82, 94, 68};
/* what is the sorted order? Copy x to s, then sort, so that x is not overwritten */
s = x;
call sort(s);
/* RANK is the order in which the values appear if you sort them.
   The smallest value gets rank=1; the largest gets rank=N. */
rank = rank(x);
position = T(1:nrow(x));
print position x rank s[L='sort(x)'];
For each observation in the data, the table shows the position (observation number), the data value, and the rank. The last column is the vector of sorted data values. I added arrows to visualize how the rank works. For each value, you look at the corresponding rank. That tells you the row of the last (sorted) column in which the value will appear when you sort the values in ascending order.
The order() function in R returns what I call the "sort indices." These are also called the "anti-ranks" or the "inverse ranks" because they represent the inverse operation for the ranking process. You start with the data in sorted order. You then ask, for each sorted value, what was the original position of that value. This is the sort index.
For example, in the data {75, 88, 82, 94, 68}, the sorted order is {68, 75, 82, 88, 94}. The first sort index is 5 because the first sorted value (68) was originally in the 5th position. The second sort index is 1 because the second sorted value (75) was originally in the first position. In the SAS IML language, you can use the SORTNDX subroutine to compute the sort index, as follows:
/* the sort index is the vector of indices, ndx, such that x[ndx] is sorted.
   This is equivalent to the order() function in R. */
call sortndx(ndx, x);
print position x ndx s[L='sort(x)'];
The table shows the position (observation number), the data value, and the sort index. The last column is the vector of sorted data values. Again, I added arrows to visualize how the sort index works. Note that the arrows now move from the right to the left. For each sorted value, you look at the corresponding sort index. That tells you the row (observation number) of the data value in the original order.
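The same computation is easy to express in other languages. As a cross-check (a stdlib-Python sketch, not part of the SAS program), the sort index is what NumPy calls an argsort; here it is built with `sorted` over the index set:

```python
x = [75, 88, 82, 94, 68]
# sort index (1-based, like SORTNDX in SAS IML or order() in R):
# the i_th entry is the original position of the i_th smallest value
ndx = [i + 1 for i in sorted(range(len(x)), key=lambda i: x[i])]
```

Indexing the data by the sort index recovers the sorted vector, which is the defining property described above.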
Both the rank and the sort index enable you to sort the data. This is most easily seen by considering the sort index. If ndx is a vector that contains the sort index for the vector x, then x[ndx] is the sorted vector. This is why the name "sort index" is used: it is the vector of indices that sorts the data.
You can also use the rank to sort the data. However, you need to use the rank on the left-hand side of an assignment statement! That is, if y is a vector that is the same dimension as x and if r is the vector of ranks, then the assignment y[r] = x puts the sorted values of x into the vector y. The following SAS IML statements show both techniques for sorting the data.
/* the sort index sorts the data when it is used as a subscript on the RHS */
z = x[ndx];
print x ndx z[L='x[ndx]'];

/* the rank sorts the data when it is used as a subscript on the LHS */
y = j(nrow(x),1,.);   /* y is the same size as x */
y[rank] = x;
print x rank y[L='y[rank]=x'];
For both tables, the last column shows the data in (ascending) sorted order. The first table used the expression x[ndx], where ndx is the sort index. The second table used the assignment y[r]=x, where r is the rank of x.
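The claim that the rank and the sort index are inverse permutations can be verified directly. The following stdlib-Python sketch (an illustration, not the SAS code) performs both sorting techniques and checks the inverse relationship:

```python
x = [75, 88, 82, 94, 68]
s = sorted(x)
rank = [s.index(v) + 1 for v in x]                               # ranks (no ties)
ndx = [i + 1 for i in sorted(range(len(x)), key=lambda i: x[i])]  # sort index

# technique 1: sort via the sort index on the right-hand side (x[ndx] in IML)
z = [x[i - 1] for i in ndx]

# technique 2: sort via the rank on the left-hand side (y[rank] = x in IML)
y = [None] * len(x)
for r, v in zip(rank, x):
    y[r - 1] = v

# the two permutations are inverses: rank composed with ndx is the identity
inverse_ok = all(rank[ndx[i] - 1] == i + 1 for i in range(len(x)))
```

Both `z` and `y` equal the sorted data, and the composition check confirms that ranking and the sort index undo each other.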
There are two ways to express the permutation that transforms a data vector into a sorted vector. The sort index (also called the anti-rank or inverse rank) is the vector of indices that sorts the data. The rank is the position of each element after sorting the data in ascending order. The sort index and the rank are inverse operations. In SAS, you can use the SORTNDX subroutine to compute the sort index. You can use the RANK function to compute the ranks. The SORTNDX subroutine is equivalent to the order() function in R.
The post Rank, order, and sorting appeared first on The DO Loop.
]]>The post The distribution of p-values under the null hypothesis appeared first on The DO Loop.
]]>I think data simulation is a great way to discuss the conditions for which p-values are (or are not) uniformly distributed. This article does the following:
This section discusses some of the theory, use, and misuse of p-values. If you are familiar with p-values or just want to jump straight to an example, you can skip this section.
P-values can be a source of confusion, misinterpretation, and misuse. Even professional statisticians get confused. Students and scientists often don't understand them. It is for these reasons that the American Statistical Association (ASA) released a statement in 2016 clarifying the appropriate uses of p-values (Wasserstein and Lazar, 2016, "The ASA Statement on p-Values: Context, Process, and Purpose", TAS). This statement was a response to several criticisms about how p-values are used in science. It was also a response to decisions by several reputable scientific journals to "ban" p-values in the articles that they publish.
I encourage the interested reader to read the six points in Wasserstein and Lazar (2016) that describe interpretations, misinterpretations, applications, and misapplications of p-values.
The ASA statement on p-values is written for practitioners. Wasserstein and Lazar (2016) define a p-value as follows: "Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value." Under this definition, the p-value is one number that is computed by using one set of data. Unfortunately, this definition does not enable you to discuss the distribution of the p-value.
To discuss a distribution, you must adopt a broader perspective: You must think of a p-value as a random variable. This is discussed in the excellent article by Greenland (2019) and in the equally excellent article by Murdoch, Tsai, & Adcock (2008).
The idea is that the data (X) are a random sample from the assumed model distribution. When you generate a random sample, the data determine the test statistic for the null hypothesis. Therefore, the test statistic is a random variable. Each test statistic has a corresponding p-value. Therefore, the p-value, too, is a random variable. If the distribution of the test statistic is known, then the p-value is uniformly distributed, by definition, because it is the probability integral transform of the test statistic.
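The probability integral transform is easy to demonstrate with any distribution whose CDF has a closed form. The following stdlib-Python sketch (an illustration of the general principle, not the t-test setting used below) draws from an exponential distribution, applies its own CDF, and checks that the transformed values behave like Uniform(0,1) draws:

```python
import math
import random

random.seed(1)
n = 100_000
# draw from Exp(1); its CDF is F(x) = 1 - exp(-x)
pvals = [1 - math.exp(-random.expovariate(1.0)) for _ in range(n)]

# if a statistic follows its assumed distribution, F(statistic) is Uniform(0,1)
mean_p = sum(pvals) / n                           # should be near 0.5
frac_below_05 = sum(p < 0.05 for p in pvals) / n  # should be near 0.05
```

The fraction of transformed values below 0.05 is about 5%, which is exactly the Type 1 error behavior discussed later in this article.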
This section shows a familiar example of a hypothesis test and a p-value: the one-sample t test for the mean. Recall that the one-sample t test assumes that a random sample contains independent draws from a normal distribution. (You can also use the t test for large samples from a non-normal continuous distribution, but, for simplicity, let's use normal data.) Assume that the data are drawn from N(72, σ). The null hypothesis is H0: μ ≥ 72 versus the alternative hypothesis Ha: μ < 72. Let's generate N=20 independent normal variates and use PROC TTEST in SAS to run the one-sided t test on the simulated data:
/* define the parameters */
%let N = 20;       /* sample size */
%let MU0 = 72;     /* true value of parameter */
data Sim1;         /* random sample from normal distrib with mu=&MU0 */
call streaminit(1234);
do i = 1 to &N;
   x = rand("Normal", &MU0, 10);
   output;
end;
run;

/* H0: generating distribution has mu >= &MU0.
   Since N is small, also assume data is normal. */
proc ttest data=Sim1 alpha=0.05 H0=&MU0 sides=L plots=none;
   var X;
run;
The output includes a table of descriptive statistics (the top table) for the sample. The middle table shows inferences about the parameters, such as that the true mean is probably less than 73.31. The last table is the one I want to focus on. It summarizes the test statistic for the hypothesis test. The test statistic has the value -1.06. Under the assumption that the statistic is distributed according to the t distribution with 19 degrees of freedom, the probability that we would observe a value less than -1.06 is Pr(T < -1.06) = 0.1510.
Here's an important point: Since we simulated the data from N(72, σ), we KNOW that the null hypothesis is true, and we know that the test statistic REALLY DOES have a T(DF=19) distribution! Therefore, we are not surprised when the t test has a large p-value and we fail to reject the null hypothesis.
Recall, however, that we might get unlucky when we simulate the sample. For 5% of normal samples, the t test will produce a small p-value and we will reject the null hypothesis. For that reason, when I check the correctness of a simulation, I like to generate many random samples, not just one.
If you want to understand the distribution of the p-values, you can perform a Monte Carlo simulation study in which you generate many random samples of size N from the same distribution. You run a t test on each sample, which generates many test statistics and many p-values. The following SAS program carries out this simulation study, as suggested by Murdoch, Tsai, & Adcock (2008):
/* simulation of many random samples */
%let NumSim = 10000;   /* number of random samples */
data SimH0;
call streaminit(1234);
do SampleID = 1 to &NumSim;
   do i = 1 to &N;
      x = rand("Normal", &MU0, 10);
      output;
   end;
end;
run;

ods exclude all;
proc ttest data=SimH0 alpha=0.05 H0=&MU0 sides=L plots=none;
   by SampleID;
   var X;
   ods output TTests=TTests;
run;
ods exclude none;

title "Test Statistics for One-Sample T-Test";
title2 "H0: mu >= &MU0";
proc sgplot data=TTests;
   histogram tValue / scale=percent transparency=0.5;
run;
The data satisfy the assumptions of the t test and are sampled from a normal distribution for which the true parameter value satisfies the null hypothesis. Consequently, the histogram shows that the distribution of the test statistic looks very much like a t distribution with 19 degrees of freedom.
What does this have to do with p-values? Well, a p-value is just a transformation of a test statistic. The p-value is a way to standardize the test statistic so that it can be interpreted more easily. For any t, the corresponding left-tailed p-value is p = CDF("T", t, 19), or 1-CDF for a right-tailed p-value. As explained in an earlier article, if t is distributed as T(DF=19), then p is U(0,1). You can demonstrate that fact by plotting the p-values:
As promised, the p-values are uniformly distributed in (0,1). If you reject the null hypothesis when the p-value is less than 0.05, you will make a Type 1 error 5% of the time.
Notice that many things must happen simultaneously for the p-values to be uniformly distributed. The data must conform to the assumptions of the test and contain independent random variates from the model that is used to construct the test statistic. If so, then the test statistic follows a known t distribution. (Sometimes we only know an asymptotic distribution of the test statistic.) If all these stars align, then the p-values will be uniformly distributed.
If any of the previous assumptions are not true, then the p-values do not need to be uniformly distributed. The easiest situation to visualize is if the data are not sampled from a distribution whose mean equals the mean of the null hypothesis. For example, we generated the data from N(72, σ), but what happens if we run a t test for the hypothesis H0: μ ≥ 73? In this scenario, we expect more samples to reject the new null hypothesis. How many more? As suggested by Murdoch, Tsai, & Adcock (2008), we don't need to guess or compute, we can simply run the new hypothesis test for all 10,000 samples (which are from N(72, 10)) and plot the p-values:
/* what happens if the samples are NOT from the null hypothesis? */
%let MU1 = %sysevalf(1+&MU0);
ods exclude all;
proc ttest data=SimH0 alpha=0.05 H0=&MU1 sides=L plots=none;
   by SampleID;
   var x;
   ods output TTests=TTestsNotH0;
run;
ods exclude none;

title "P-Values for One-Sample T-Test";
title2 "H0: mu >= &MU1, but actually mu = &MU0";
proc sgplot data=TTestsNotH0;
   histogram probT / datalabel binwidth=0.05 binstart=0.025   /* center first bin at h/2 */
                     scale=percent transparency=0.5;
   refline 5 / axis=y;
run;
For the new hypothesis test, about 11.62% of the samples have a p-value that is less than 0.05. That means that if you use 0.05 as a cutoff value, you will reject the null hypothesis 11.62% of the time. This is good, because the samples are actually generated from a distribution whose mean is LESS than the hypothetical value of 73. In this case, the null hypothesis is false, so you hope to reject it. If you further increase the value used in the null hypothesis (try using 75 or 80!) the test will reject even more samples.
This article uses simulation to illustrate the oft-repeated assertion that "p-values are uniformly distributed under the null hypothesis." The article shows an example of a hypothesis test for a one-sample t test. When the data satisfy the assumptions of the t test and are generated so that the null hypothesis is correct, then the test statistic has a known sampling distribution. The p-value is a transformation of the test statistic. When the test statistic follows its assumed distribution, then the p-values are uniform. Otherwise, the p-values are not uniform, which means that using a p-value cutoff such as 0.05 will reject the null hypothesis with a probability that is different than the cutoff value.
For a second example, see "On the correctness of a discrete simulation."
The post The distribution of p-values under the null hypothesis appeared first on The DO Loop.
]]>The post Dice and the correctness of a simulation appeared first on The DO Loop.
]]>It is easy to simulate the sum of two six-sided dice by using the DATA step in SAS. For fair dice, each face has a probability of 1/6 of appearing. You can simulate rolling a die in either of two ways. Older SAS programmers will probably use an expression such as CEIL(6*RAND("Uniform")), which generates the integers 1–6 with equal probability. However, an equivalent (and clearer!) syntax is to use the "Integer" distribution to generate the integers directly, as follows:
/* Simulate 1,000 rolls of two dice. Report the sum of the faces. */
data Dice1;
call streaminit(1234);
do i = 1 to 1000;
   d1 = rand("Integer", 1, 6);
   d2 = rand("Integer", 1, 6);
   sum = d1 + d2;
   output;
end;
run;
The Dice1 data set contains 1,000 simulated dice rolls. Each row displays the face of each die and the sum of the faces.
When you create a simulated data set, it is important to assess whether the simulation is correct. There are a few techniques that I like to use. In increasing order of complexity, they are:
To assess the correctness of the simulation, it is useful to know the exact distribution of the model from which you are drawing random samples. For the sum of two dice, you can use a simple counting argument to show that the discrete probability distribution is triangular:
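The counting argument is short enough to carry out by machine. The following stdlib-Python sketch (a language-neutral check of the probabilities, not part of the SAS program) enumerates all 36 equally likely outcomes of two fair dice and tabulates the exact probability of each sum:

```python
from collections import Counter
from fractions import Fraction

# enumerate all 36 equally likely outcomes of two fair six-sided dice
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
probs = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
```

The counts rise linearly from 1 (for a sum of 2) to 6 (for a sum of 7) and fall back to 1 (for a sum of 12), which is the triangular shape described above. These fractions are exactly the TESTP= values used later in the article.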
For a discrete distribution, you can use PROC FREQ to count the number of observations for each category and to graph the frequency distribution, as follows:
proc freq data=Dice1;
   tables sum / nocum plots=freqplot(scale=percent);
run;
The FREQ procedure outputs a table of counts and percentages (not shown) and a bar chart that shows the same information. The bar chart gives you faith that the simulation is correct because the heights of the bars are close to the theoretical probabilities. The sums range from 2 to 12, with 6, 7, and 8 being the most common rolls. The distribution of the sums is roughly symmetric.
However, you might be concerned that the empirical distribution does not show a prominent peak at 7. Does the graph show the kind of variation that you should expect in a random sample of size N=1,000, or is the asymmetry showing some insidious error in the program or in the RAND function in SAS?
To answer that question, you can run a hypothesis test that tests the empirical frequencies of the samples against the expected values for the discrete triangular distribution. The null hypothesis is that the observed frequencies are from a discrete probability distribution that has a specified table of probabilities. In PROC FREQ, the TESTP= option enables you to specify the expected probabilities for each category, and the CHISQ option requests a chi-square test that compares the empirical and expected frequencies, as follows:
/* exact probabilities for the sum of two dice */
%let probSum = 0.0277777778 0.0555555556 0.0833333333 0.1111111111
               0.1388888889 0.1666666667 0.1388888889 0.1111111111
               0.0833333333 0.0555555556 0.0277777778;
proc freq data=Dice1;
   tables sum / nocum chisq testp=(&probSum);
run;
For each of the 11 possible sums, PROC FREQ displays the sample percentages and the expected percentages under the hypothesis that the sums are distributed according to the probabilities specified in the TESTP= option. You can see that the empirical percentages are close to the expected values. Furthermore, the CHISQ option produces a table that shows the result of a chi-square test for the specified proportions. The test statistic is the sum of the quantities (frequency[i] – expected[i])^{2} / expected[i], where frequency[i] is the observed count and expected[i] is the expected value under the null hypothesis. For these simulated data, the p-value for the test (0.2039) is not very small, so there is no reason to reject the null hypothesis. We conclude that there is no evidence to doubt that these simulated data are drawn from the distribution for the sum of two fair six-sided dice.
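The chi-square statistic itself is simple to compute by hand. As an illustration (a stdlib-Python sketch that simulates its own dice rolls, so the statistic's value differs from the PROC FREQ output above), the following applies the formula sum of (observed – expected)^2 / expected to a sample of 1,000 rolls:

```python
import random
from collections import Counter

random.seed(1234)
n = 1000
# simulate 1,000 rolls of two fair dice and tally each sum
rolls = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(n))

# expected counts under the null hypothesis (triangular distribution)
counts36 = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
expected = {s: n * c / 36 for s, c in counts36.items()}

# chi-square statistic: sum of (observed - expected)^2 / expected over all 11 sums
chi2 = sum((rolls[s] - expected[s])**2 / expected[s] for s in range(2, 13))
```

With 11 categories, the statistic is compared against a chi-square distribution with 10 degrees of freedom to obtain the p-value.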
In summary, this article shows how to use the DATA step to simulate tossing a pair of six-sided dice. After simulating the data, you might wonder whether the simulation was correct. You can visualize the distribution of the sample to check whether it has the correct shape. You can use statistical hypothesis tests if you need further evidence that the sample has the statistical properties that you would expect in random sample from a known probability distribution.
But here's a dirty little secret: Even if the simulation is correct, the hypothesis test will REJECT the null hypothesis 5% of the time (assuming an α = 0.05 significance level). So, if the test rejects the null hypothesis, how do you know whether your simulation is wrong or if you just got unlucky? One thing you can do is to change the random number seed and rerun the test again. Did the test reject the null again? If so, you probably have a problem in your simulation. But if the second run does not reject the null, your simulation might be okay.
I am not suggesting that you rerun a test until you get a p-value that you like! In practice, you need to have confidence in your simulation. There are principled ways to check the validity of a simulation. They require that you generate many samples and look at the distribution of statistics across the random samples. I discuss these issues in a subsequent article.
The post Dice and the correctness of a simulation appeared first on The DO Loop.
]]>The post Visualize patterns of missing values appeared first on The DO Loop.
]]>This article shows how to use only Base SAS procedures to create a heat map that reveals patterns of missing values in data.
Let's start by showing the graph we want to create. The graph to the right shows an example for the numerical variables in the Sashelp.Heart data. The black tick marks indicate the missing values in the data. The X axis is the observation number. The Y axis shows the variables in the analysis.
This graph shows the patterns of missing values in the data. The graph shows that the AgeAtStart, Diastolic, and Systolic variables do not contain any missing values. The Height, Weight, and MRW variables contain a few missing values, and MRW (which is a measurement of obesity) is missing whenever Weight is missing. The Smoking and Cholesterol variables contain the most missing values, but primarily for observations early in the data collection process. It looks like some change occurred during the study that improved the collection of nonmissing data for the Smoking and Cholesterol variables.
Some readers might prefer to rotate the plot so that variables become columns and the observation numbers correspond to rows. That is easily doable, but I prefer to display the categorical variables on the vertical axis.
To create this heat map, we need, for each observation number, the names of the variables that contain missing values. It will save disk space (or RAM) if we store only the observation-variable pairs that contain missing values, not the observation-variable pairs that are nonmissing. (This is called a sparse representation of the missing value pattern.) To detect missing values, I will use the CMISS function in the SAS DATA step. You can pass multiple variables to the CMISS function, and it returns the number of arguments that are missing. If the variables are in a SAS array, then you can use the OF keyword to pass in all variables. For example, if the array is called x, then use the syntax CMISS(of x[*]). You can also loop through the array elements to discover which variables are missing.
A sparse representation stores only the variables and values that are missing. However, it is convenient to use a non-sparse representation for the first and last observations in the data. This ensures that ALL variables appear in the heat map (even those with no missing values), and that the heat map shows the complete range of observation numbers (1–N) instead of a range that involves only observations that have missing values. Thus, the following program uses a non-sparse representation for the first and last observations.
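The idea of a sparse representation is language-independent. As a minimal sketch (stdlib Python with hypothetical data, used only to illustrate the data structure, not the SAS program), you can store just the (observation number, variable name) pairs where a value is missing:

```python
# hypothetical rows; None marks a missing value
rows = [
    [1.0, 2.0, None],
    [4.0, None, 6.0],
    [7.0, 8.0, 9.0],
]
varnames = ["x1", "x2", "x3"]

# sparse representation: keep only (obs number, variable) pairs that are missing
missing = [(i + 1, varnames[j])
           for i, row in enumerate(rows)
           for j, v in enumerate(row) if v is None]
```

For a 3x3 table with two missing cells, the sparse list has only two entries instead of nine, which is the storage savings described above.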
To make the program general, I wrap the DATA step and the graphics code in a macro that takes a few parameters:
/* show how to create a heat map of the missing values
   for each numeric variable for each observation */
%macro VizMissingNumericVar(DSName=, NumericVars=, colorBands=0);
data Heat;
length Variable $32;
set &DSName nobs=numObs;
label obsNum = "Observation Number";
array X[*] &NumericVars;
/* output all missing/nonmissing for the first and last obs
   so that X axis ranges over all observation numbers */
if _N_=1 | _N_ = numObs then do;
   /* ensure that the first output value is nonmissing (white)
      and the second is missing (black) */
   obsNum = .; Variable = " ";
   Value = 0; output;
   Value = 1; output;
   /* ensure that all variable names are included, in order */
   obsNum = _N_;
   do i = 1 to dim(X);
      Variable = vname(X[i]);
      Value = cmiss(X[i]);
      output;
   end;
end;
/* otherwise, output only the rows and variables that are missing */
else if cmiss(of X[*]) > 0 then do;
   obsNum = _N_;
   do i = 1 to dim(X);
      if cmiss(X[i]) then do;
         Variable = vname(X[i]);
         Value = 1;
         output;
      end;
   end;
end;
keep obsNum Variable Value;
run;

proc sgplot data=Heat;
   styleattrs DATACOLORS=(white black);
   heatmapparm x=obsNum y=Variable colorgroup=Value;
   xaxis grid min=0 integer;
   yaxis reverse display=(nolabel)
   %if &colorBands %then %do;
      colorbands=even colorbandsattrs=(transparency=0.8)
   %end;
   ;
run;
%mend;
With this definition, we can call the macro and specify the name of a data set and the name of several numeric variables:
It is helpful to demonstrate new code on a small example that can be easily checked for correctness. The following statements create a small 10-observation data set that contains five numeric variables. The missing value pattern is shown without using alternating color bands:
data Test;
input x1-x5;
datalines;
1 2 3 4 .
. . 3 4 5
1 2 3 4 5
1 2 3 . 5
1 2 3 4 5
. . 3 4 .
1 2 3 4 .
. . 3 4 .
. . 3 4 5
1 2 3 4 .
;

ods graphics / width=400px height=150px;
title "Example of Missing Value Pattern";
%VizMissingNumericVar(DSName=Test, NumericVars=x1-x5);
You can inspect the data set and verify that the black cells correspond to the locations of missing values in the data. Notice that the x1 and x2 variables are missing or nonmissing simultaneously. The x3 variable is never missing. The x4 variable has only one missing value, whereas half of the x5 values are missing.
The following statements visualize the missing values for the Sashelp.Heart data. The resulting heat map was shown and discussed at the beginning of this article.
ods graphics / width=500px height=300px;
title "Missing Value Pattern and Color Bands";
%VizMissingNumericVar(DSName = Sashelp.Heart,
   NumericVars = AgeAtStart Diastolic Systolic Height Weight MRW Smoking Cholesterol,
   ColorBands=1);
This article shows how to use Base SAS procedures to visualize the missing-value patterns for numerical variables in a SAS data set. This implementation uses the DATA step and the CMISS function. With additional work, you can extend the program to analyze numerical variables, character variables, or both types in a single graph. I leave the details to the motivated reader.
The post Visualize patterns of missing values appeared first on The DO Loop.
]]>The post Estimate a proportion and a confidence interval in SAS appeared first on The DO Loop.
]]>This formula defines the Wald confidence interval. It assumes that the sampling distribution of the statistic is normal. In general, you can obtain a (1-α)100% confidence interval by specifying the z-value in the formula as the (1-α/2)th quantile of the standard normal distribution. You might need to truncate the interval so that it is always a subset of the interval [0, 1].
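Before turning to SAS, the Wald formula is simple enough to compute directly. The following stdlib-Python sketch (a cross-check of the arithmetic, using the renter counts from the example discussed later in this article) computes the 95% Wald interval for a binomial proportion:

```python
import math

x, n = 340, 500          # 340 renters in a sample of 500 families
phat = x / n             # point estimate of the proportion
z = 1.959963984540054    # (1 - 0.05/2) quantile of the standard normal

# Wald interval: phat +/- z * sqrt(phat*(1-phat)/n), truncated to [0, 1]
se = math.sqrt(phat * (1 - phat) / n)
lower = max(0.0, phat - z * se)
upper = min(1.0, phat + z * se)
```

For these counts, the point estimate is 0.68 and the 95% Wald interval is approximately (0.639, 0.721).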
In SAS, this problem is called an estimate of a binomial proportion, and PROC FREQ is a SAS procedure that can provide the estimate. To use PROC FREQ, you must construct a data set that has the counts of both events and non-events.
Let's use a typical example. A researcher wants to estimate the proportion of families in New York City who live in a rental unit. The researcher obtains a random sample of n = 500 housed families, and discovers that x=340 families rent (rather than own) their unit. In a "Stat 101" course, you learn the following:
The challenge for the SAS programmer is threefold: What procedure do you use? How do you create the input data set? And how do you ask for a Wald confidence interval?
As mentioned earlier, PROC FREQ can estimate the proportion when you correctly specify the input data. PROC FREQ is not designed to use the counts of the number of events and the sample size. Instead, it uses the count for the events (x) and the non-events (n – x). The following DATA step reads one observation (the number of events and the sample size) and outputs TWO observations. In the output data set, Response is a binary variable with values "Yes" and "No," and Count is a variable that specifies the frequency of each response. In PROC FREQ, you can use the WEIGHT statement to specify the counts for each level of the response variable:
/* estimate of proportion and CI for proportion:
   n = 500 number of families in NYC in random sample
   x = 340 number that rent housing */
data Have;
input x n;   /* read in events and trials */
/* convert input obs to TWO output obs that record the Response (Yes/No) and Count */
Response = "Yes"; Count = x;   output;
Response = "No "; Count = n-x; output;
datalines;
340 500
;

proc freq data=Have;
   tables Response / nocum binomial(level='Yes' CL=Wald);
   weight Count / zeros;
run;
The output from PROC FREQ is shown:
In the previous sections, I said that the third table shows an estimate for the CI, rather than the estimate. The Wald interval is just one of several ways to estimate a confidence interval for the population proportion. Although the Wald interval is shown in almost every statistics textbook, the Wald interval is not the best one to use. I prefer to use the Wilson interval, which you can obtain by specifying the CL=WILSON suboption in the BINOMIAL option on the TABLES statement.
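For comparison with the Wald interval, here is a stdlib-Python sketch of the Wilson score interval for the same renter counts (an arithmetic cross-check of the textbook formula, not the PROC FREQ implementation):

```python
import math

x, n = 340, 500
phat = x / n
z = 1.959963984540054    # 97.5th percentile of the standard normal, for 95% CI

# Wilson score interval: recenters the estimate and widens it near 0 and 1
denom = 1 + z**2 / n
center = (phat + z**2 / (2 * n)) / denom
half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
wilson = (center - half, center + half)
```

For these counts, the 95% Wilson interval is approximately (0.638, 0.719), slightly shifted toward 0.5 relative to the Wald interval. Unlike the Wald interval, the Wilson interval never extends outside [0, 1], which is one reason it is preferred.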
You can also choose to run a hypothesis test for a specified proportion. For example, suppose you read an article that claims that 69% of New York City families rent. Does the sample support this assertion? You can specify the P=0.69 suboption in the BINOMIAL option to get a fourth table that shows the statistics for the null hypothesis H0: p=0.69. If you do not specify the P= suboption, you will get a table for the hypothesis test for H0: p=0.5.
You can combine multiple options into one BINOMIAL statement. For example:
binomial(level='Yes' CL=Wald CL=Wilson P=0.69)
New programmers face three challenges when they use a language such as SAS to solve an elementary statistics problem. What procedure do you use? How do you create the input data set? And how do you specify options to obtain additional information such as confidence intervals or hypothesis tests? This article shows the answers to these questions for the simple problem of estimating a binomial proportion. You can use PROC FREQ and the BINOMIAL option to obtain the estimates. You need to create an input data set that specifies the number of events and non-events, as opposed to the number of events and the sample size.
The post Estimate a proportion and a confidence interval in SAS appeared first on The DO Loop.
]]>