The post Quantile estimates and the difference of medians in SAS appeared first on The DO Loop.
]]>Historically, PROC UNIVARIATE and PROC NPAR1WAY are two procedures in SAS that analysts used for univariate analysis. PROC UNIVARIATE performs standard parametric tests. In contrast, PROC NPAR1WAY performs nonparametric tests and distribution-free analyses. An internet search reveals many resources that describe how to use UNIVARIATE and NPAR1WAY for analyzing quantiles.
However, there is an alternative way to analyze univariate quantiles: PROC QUANTREG. Although QUANTREG is designed for quantile regression, the same procedure can easily analyze quantiles of univariate data. All you need to do is omit the regressors from the right side of the MODEL statement and the procedure will analyze the "response" variable.
Be aware that the QUANTREG procedure uses an optimization algorithm to perform its analysis. This can sometimes result in different estimates than a traditional computation. For example, if the data set has an even number of observations and the middle values are a and b, one estimate for the median is the average of the two middle values (a+b)/2. The QUANTREG procedure might provide a different estimate, which could be any value in [a, b]. This difference is most noticeable in small samples. (Don't let this bother you too much. There are many definitions for quantile estimates. SAS supports five different definitions for calculating quantiles.)
I have previously shown how to compute confidence intervals for percentiles in SAS by using PROC UNIVARIATE. The following statements compute the 20th, 50th, and 90th percentiles for the cholesterol levels of 5209 patients in a medical study, along with 95% confidence intervals for the quantiles. The computation is shown twice: first with PROC UNIVARIATE, then with PROC QUANTREG.
/* 1. Use PROC UNIVARIATE to get 95% CIs for 20th, 50th, and 90th pctls */ proc univariate data=Sashelp.Heart noprint; var Cholesterol; output out=pctl pctlpts=20 50 90 pctlpre=p cipctldf=(lowerpre=LCL upperpre=UCL); /* 12.1 options (SAS 9.3m2) */ run; data QUni; /* rearrange the statistics into a table */ set pctl; Quantile = 0.2; Estimate = p20; Lower = LCL20; Upper = UCL20; output; Quantile = 0.5; Estimate = p50; Lower = LCL50; Upper = UCL50; output; Quantile = 0.9; Estimate = p90; Lower = LCL90; Upper = UCL90; output; keep Quantile Estimate Lower Upper; run; title "UNIVARIATE Results"; proc print noobs; run; /**************************************/ /* 2. Alternative: Use PROC QUANTREG! */ ods select none; ods output ParameterEstimates=QReg ; proc quantreg data=Sashelp.Heart; model Cholesterol = / quantile=0.2 0.5 .9; run; ods select all; title "QUANTREG Results"; proc print noobs; var Quantile Estimate LowerCL UpperCL; run; |
The output shows that the confidence intervals (CIs) for the quantiles are similar, although the QUANTREG intervals are slightly wider. Although UNIVARIATE can produce CIs for these data, the situation changes if you add a weight variable. The UNIVARIATE procedure supports estimates for weighted quantiles, but does not produce confidence intervals. However, the QUANTREG procedure can provide CIs even for a weighted analysis.
In general, PROC QUANTREG can compute statistics for quantiles that UNIVARIATE cannot. For example, you can use the ESTIMATE statement in QUANTREG to get a confidence interval for the difference between medians in two independent samples. If the confidence interval does not contain 0, you can conclude that the medians are significantly different.
The adjacent box plot shows the distribution of diastolic blood pressure for male and female patients in a medial study. Reference lines are drawn at the median values for each gender. You might want to estimate the median difference in diastolic blood pressure between male and female patients and compute a confidence interval for the difference. The following call to PROC QUANTREG estimates those quantities:
ods select ParameterEstimates Estimates; proc quantreg data=Sashelp.Heart; class sex; model diastolic = sex / quantile=0.5; estimate 'Diff in Medians' sex 1 -1 / CL; run; |
The syntax should look familiar to programmers who use PROC GLM to compare the means of groups. However, this computation compares medians of groups. The analysis indicates that female patients have a diastolic blood pressure that is 3 points lower than male patients. The 95% confidence interval for the difference does not include 0, therefore the difference is statistically significant. By changing the value on the QUANTILE= option, you can compare quantiles other than the median. No other SAS procedure provides that level of control over quantile estimation.
PROC QUANTREG provides another tool for the SAS programmer who needs to analyze quantiles. Although QUANTREG was written for quantile regression, the procedure can also analyze univariate samples. You can use the ESTIMATE statement to compare quantiles across groups and to obtain confidence intervals for the parameters.
In general, SAS regression procedures enable you to conduct univariate analyses that are not built into any univariate procedure.
The post Quantile estimates and the difference of medians in SAS appeared first on The DO Loop.
]]>The post The distribution of colors for plain M&M candies appeared first on The DO Loop.
]]>Many introductory courses in probability and statistics encourage students to collect and analyze real data. A popular experiment in categorical data analysis is to give students a bag of M&M® candies and ask them to estimate the proportion of colors in the population from the sample data. In some classes, the students are also asked to perform a chi-square analysis to test whether the colors are uniformly distributed or whether the colors match a hypothetical set of proportions.
M&M's® have a long history at SAS. SAS is the worldâ€™s largest corporate consumer of M&M's. Every Wednesday a SAS employee visits every breakroom on campus and fill two large containers full of M&M's. This article uses SAS software to analyze the classic "distribution of colors" experiment.
The "plain" M&M candies (now called "milk chocolate M&M's") are produced by the Mars, Inc. company. The distribution of colors in M&M's has a long and colorful history. The colors and proportions occasionally change, and the distribution is different for peanut and other varieties. A few incidents from my lifetime that made the national news are:
The breakroom containers at SAS are filled from two-pound bags. So as to not steal all the M&M's in the breakroom, I conducted this experiment over many weeks in late 2016 and early 2017, taking one scoop of M&M's each week. The following data set contains the cumulative counts for each of the six colors in a sample of size N = 712:
data MandMs; input Color $7. Count; datalines; Red 108 Orange 133 Yellow 103 Green 139 Blue 133 Brown 96 ; |
A bar chart that shows the observed distribution of colors in M&M's is shown at the top of this article.
To estimate the proportion of colors in the population, simply divide each count by the total sample size, or use the FREQ procedure in SAS. PROC FREQ also enables you to run a chi-square test that compares the sample counts to the expected counts under a specified distribution. The most recent published distribution is from 2008, so let's test those proportions:
proc freq data = MandMs order=data; weight Count; tables Color / nocum chisq /* 2008 proportions: red orange yellow green blue brown */ testp=(0.13 0.20 0.14 0.16 0.24 0.13); run; |
The observed and expected proportions are shown in the table to the right. The chi-square test rejects the test hypothesis at the α = 0.05 significance level (95% confidence). In other words, the distribution of colors for M&M's in this 2017 sample does NOT appear to be the same as the color distribution from 2008! You can see this visually from the bar chart: the red and green bars are too tall and the blue bar is too short compared with the expected values.
You need a large sample to be confident that this empirical deviation is real. After collecting data for a few weeks, I did a preliminary analysis that analyzed about 300 candies. With that smaller sample, the difference between the observed and expected proportions could be attributed to sampling variability and so the chi-square test did not reject the null hypothesis. However, while running that test I noticed that the green and blue colors accounted for the majority of the difference between the observed and theoretical proportions, so I decided to collect more data.
As I explained in a previous article, you can use the sample proportions to construct simultaneous confidence intervals for the population proportions. The following SAS/IML statements load and call the functions from the previous post:
%include "conint.sas"; /* define the MultCI and MultCIPrint modules */ proc iml; load module=(MultCI MultCIPrint); use MandMs; read all var {"Color" "Count"}; close; alpha = 0.05; call MultCIPrint(Color, Count, alpha, 2); /* construct CIs using Goodman's method */ |
The table indicates that the published 2008 proportion for blue (0.24) is far outside the 95% confidence interval, and the proportion for green (0.16) is just barely inside its interval. That by itself does not prove that the 2008 proportion are no longer valid (we might have gotten unlucky during sampling), but combined with the earlier chi-square test, it seems unlikely that the 2008 proportions are applicable to these data.
The published proportions for green and blue do not seem to match the sample proportions from 2008. For this large sample, the published proportion of blue is too large whereas the published proportion of green is too small.
From reading previous articles, I know that the Customer Care team at M&M/Mars is very friendly and responsive. Apparently they get asked about the distribution of colors quite often, so I sent them a note. The next day they sent a breakdown of the colors for all M&M candies.
Interestingly, plain (and peanut) M&M's are now produced at two different factories in the US, and the factories do not use the same mixture of colors! You need to look on the packaging for the manufacturing code, which is usually stamped inside a rectangle. In the middle of the code will be the letters HKP or CLV. For example, the code might read 632GCLV20.
Although I did not know about the manufacturing codes when I collected the data, I think it is clear that the bulk of my data came from the CLV plant. You can create a graph that shows the sample proportions, the 95% simultaneous confidence intervals, and vertical hash marks to indicate the CLV population parameters, as follows:
The graph shows that the observed proportions are close to the proportions from the CLV plant. All proportions are well within the 95% simultaneous confidence intervals from the data. If you rerun the PROC FREQ chi-square analysis with the CLV proportions, the test does not reject the null hypothesis.
The experimental evidence indicates that the colors of plain M&M's in 2017 do not match the proportions that were published in 2008.
After contacting the M&M/Mars Customer Care team, I was sent a new set of proportions for 2017. The color proportions now depend on where the candies were manufactured. My data matches the proportion of colors from the Cleveland plant (manufacturing code CLV).
If you are running this analysis yourself, be sure to record whether your candies came from the HKP or CLV plant. If you want to see my analysis, you can download the complete SAS program that analyzes these data.
Educators who use M&M's to teach probability and statistics need to record the manufacturing plant, but this is still a fun (and yummy!) experiment. What do you think? Do you prefer the almost-equal orange-blue-green distribution from the CLV plant? Or do you like the orange-blue dominance from the HKP plant? Or do you just enjoy the crunchy shell and melt-in-your-mouth goodness, regardless of what colors the candies are?
The post The distribution of colors for plain M&M candies appeared first on The DO Loop.
]]>The post Simultaneous confidence intervals for multinomial proportions appeared first on The DO Loop.
]]>In their paper, May and Johnson present data for a random sample of 220 psychiatric patients who were categorized as either neurotic, depressed, schizophrenic or having a personality disorder. The observed counts for each diagnosis are as follows:
data Psych; input Category $21. Count; datalines; Neurotic 91 Depressed 49 Schizophrenic 37 Personality disorder 43 ; |
If you divide each count by the total sample size, you obtain estimates for the proportion of patients in each category in the population. However, the researchers wanted to compute simultaneous confidence intervals (CIs) for the parameters. The next section shows several methods for computing the CIs.
May and Johnson discussed six different methods for computing simultaneous CIs. In the following, 1–α is the desired overall coverage probability for the confidence intervals, χ^{2}(α, k-1) is the upper 1–α quantile of the χ^{2} distribution with k-1 degrees of freedom, and π_{1}, π_{2}, ..., π_{k} are the true population parameters. The methods and the references for the methods are:
In a separate paper, May and Johnson used simulations to test the coverage probability of each of these formulas. They conclude that the simple Bonferroni-adjusted formula of Goodman (second in the list) "performs well in most practical situations when the number of categories is greater than 2 and each cell count is greater than 5, provided the number of categories is not too large." In comparison, the methods that use the sample variance (fourth and fifth in the list) are "poor." The remaining methods "perform reasonably well with respect to coverage probability but are often too wide."
A nice feature of the Q&H and Goodman methods (first and second on the list) is that they procduce unsymmetric intervals that are always within the interval [0,1]. In contrast, the other intervals are symmetric and might not be a subset of [0,1] for extremely small or large sample proportions.
You can download SAS/IML functions that are based on May and Johnson's paper and macro. The original macro used SAS/IML version 6, so I have updated the program to use a more modern syntax. I wrote two "driver" functions:
Let's demonstrate how to call these functions on the psychiatric data. The following program assumes that the function definitions are stored in a file called CONINTS.sas; you might have to specify a complete path. The PROC IML step loads the definitions and the data and calls the MultCIPrint routine and requests the Goodman method (method=2):
%include "conint.sas"; /* specify full path name to file */ proc iml; load module=(MultCI MultCIPrint); use Psych; read all var {"Category" "Count"}; close; alpha = 0.05; call MultCIPrint(Category, Count, alpha, 2); /* Goodman = 2 */ |
The table shows the point estimates and 95% simultaneous CIs for this sample of size 220. If the intervals are wider than you expect, remember the goal: for 95% of random samples of size 220 this method should produce a four-dimensional region that contains all four parameters.
You can visualize the width of these intervals by creating a graph. The easiest way is to write the results to a SAS data set. To get the results in a matrix, call the MultCI function, as follows:
CI = MultCI(Count, alpha, 2); /*or simply CI = MultCI(Count) */ /* write estimates and CIs to data set */ Estimate = CI[,1]; Lower = CI[,2]; Upper = CI[,3]; create CIs var {"Category" "Estimate" "Lower" "Upper"}; append; close; quit; ods graphics / width=600px height=240px; title 'Simultaneous Confidence Intervals for Multinomial Proportions'; title2 'Method of Goodman (1965)'; proc sgplot data=CIs; scatter x=Estimate y=Category / xerrorlower=Lower xerrorupper=upper markerattrs=(Size=12 symbol=DiamondFilled) errorbarattrs=GraphOutlines(thickness=2); xaxis grid label="Proportion" values=(0.1 to 0.5 by 0.05); yaxis reverse display=(nolabel); run; |
The graph shows intervals that are likely to enclose all four parameters simultaneously. The neurotic proportion in the population is probably in the range [0.33, 0.50] and at the same time the depressed proportion is in the range [0.16, 0.30] and so forth. Notice that the CIs are not symmetric about the point estimates; this is most noticeable for the smaller proportions such as the schizophrenic category.
Because the cell counts are all relatively large and because the number of categories is relatively small, Goodman's CIs should perform well.
I will mention that it you use the Fitzpatrick and Scott method (method=4), you will get different CIs from those reported in May and Johnson's paper. The original May and Johnson macro contained a bug that was corrected in a later version (personal communication with Warren May, 25FEB2016).
This article presents a set of SAS/IML functions that implement six methods for computing simultaneous confidence intervals for multinomial proportions. The functions are updated versions of the macro %CONINT, which was presented in May and Johnson (1987). You can use the MultCIPrint function to print a table of statistics and CIs, or you can use the MultCI function to retrieve that information into a SAS/IML matrix.
The post Simultaneous confidence intervals for multinomial proportions appeared first on The DO Loop.
]]>The post An easy way to run thousands of regressions in SAS appeared first on The DO Loop.
]]>As I've written before, BY-group analysis is also an efficient way to analyze simulated sample or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.
In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of time are relatively slow.
As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.
Stuck in a macro loop? BY-group processing can be more efficient. #SASTip
Click To Tweet
There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=X_{i}, where i = 1 to 1000.
One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:
%macro RunReg(DSName, NumVars); ... %do i = 1 %to &NumVars; /* repeat for each x&i */ proc reg data=&DSName noprint outest=PE(rename=(x&i=Value)); /* save parameter estimates */ model Y = x&i; /* model Y = x_i */ quit; /* ...then accumulate statistics... */ %end; %mend; |
The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.
If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.
An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.
To run a BY-group analysis:
In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the i_th element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.
/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */ data Long; set Wide; /* <== specify data set name HERE */ array x [*] x1-x&nCont; /* <== specify explanatory variables HERE */ do varNum = 1 to dim(x); VarName = vname(x[varNum]); /* variable name in char var */ Value = x[varNum]; /* value for each variable for each obs */ output; end; drop x:; run; |
In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:
/* 2. Sort by BY-group variable */ proc sort data=Long; by VarName; run; |
You can now call a SAS procedure one time to compute all regression models:
/* 3. Call PROC REG and use BY statement to compute all regressions */ proc reg data=Long noprint outest=PE; by VarName; model Y = Value; quit; /* Look at the results */ proc print data=PE(obs=5); var VarName Intercept Value; run; |
The PE data set contains the parameter estimates for every single-variable regression of Y onto X_{i}. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.
You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).
This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=X_{i}, but a similar result would work for GLM models such as Y=A|X_{i}, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.
Can you think of other ways to use this trick? Leave a comment.
The post An easy way to run thousands of regressions in SAS appeared first on The DO Loop.
]]>The post Winsorization: The good, the bad, and the ugly appeared first on The DO Loop.
]]>It is clear from the questions that the programmer wants to modify the extreme values of dozens or hundreds of variables. As we will soon learn, neither of these requests satisfy the standard definition of Winsorization. What is Winsorization of data? What are the pitfalls and what are alternative methods?
Winsorization: Definition, pitfalls, and alternatives #StatWisdom
Click To Tweet
The process of replacing a specified number of extreme values with a smaller data value has become known as Winsorization or as Winsorizing the data. Let's start by defining Winsorization.
Winsorization began as a way to "robustify" the sample mean, which is sensitive to extreme values. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of this new set of numbers is called the Winsorized mean. If the data are from a symmetric population, the Winsorized mean is a robust unbiased estimate of the population mean.
The graph to right provides a visual comparison. The top graph shows the distribution of the original data set. The bottom graph shows the distribution of Winsorized data for which the five smallest and five largest values have been modified. The extreme values were not deleted but were replaced by the sixth smallest or largest data value.
I consulted the Encyclopedia of Statistical Sciences (Kotz et al. (Eds), 2nd Ed, 2006) which has an article "Trimming and Winsorization " by David Ruppert (Vol 14, p. 8765). According to the article:
As shown by the quotes at the top of this article, posts on discussion forums sometimes muddle the definition of Winsorization. If you modify the data in an unsymmetric fashion, you will produce biased statistics.
Why do some people want to Winsorize their data? There are a few reasons:
There is no built-in procedure in SAS that Winsorizes variables, but there are some user-defined SAS macros on the internet that claim to Winsorize variables. BE CAREFUL! Some of these macros do not correctly handle missing values. Others use percentiles to determine the extreme values that are modified. If you must Winsorize, I have written a SAS/IML function that Winsorizes data and correctly handles missing values.
As an alternative to Winsorizing your data, SAS software provides many modern robust statistical methods that have advantages over a simple technique like Winsorization:
If the data contains extreme values, then classical statistics are influenced by those values. However, modifying the data is a draconian measure. Recently I read an article by John Tukey, one of the early investigator of robust estimation. In the article "A survey of sampling from contaminated distributions" (1960), Tukey says (p. 457) that when statisticians encounter a few extreme values in data,
we are likely to think of them as 'strays' [or]'wild shots' ... and to focus our attention on how normally distributed the rest of the distribution appears to be. One who does this commits two oversights, forgetting Winsor's principle that 'all distributions are normal in the middle,' and forgetting that the distribution relevant to statistical practice is that of the values actually provided and not of the values which ought to have been provided.
A little later in the essay (p. 458), he says
Sets of observations which have been de-tailed by over-vigorous use of a rule for rejecting outliers are inappropriate, since they are not samples.
I love this second quote. All of the nice statistical formulas that are used to make inferences (such as standard errors and confidence intervals) are based on the assumption that the data are a random sample that contains all of the observed values, even extreme values. The tails of a distribution are extremely important, and indiscriminately modifying large and small values invalidates many of the statistical analyses that we take for granted.
Should you Winsorize data? Tukey argues that indiscriminately modifying data is "inappropriate." In SAS, you can get the Winsorized mean directly from PROC UNIVARIATE. SAS also provides alternative robust methods such the ones in the ROBUSTREG and QUANTREG procedures.
If you decide to use Winsorization to modify your data, remember that the standard definition calls for the symmetric replacement of the k smallest (largest) values of a variable with the (k+1)st smallest (largest). If you download a program from the internet, be aware that some programs use quantiles and others do not handle missing values correctly.
What are your thoughts about Winsorizing data? Share them in the comments.
The post Winsorization: The good, the bad, and the ugly appeared first on The DO Loop.
]]>The post What colors does PROC SGPLOT use for markers? appeared first on The DO Loop.
]]>data A; /* example data with groups 1, 2, ..., 5 */ do Color = 1 to 5; x = Color; y = Color; output; end; run; title "Marker Colors Used for GROUP= Option"; title2 "HTMLBlue Style"; proc sgplot data=A; xaxis grid; yaxis grid; scatter x=x y=y / group=Color markerattrs=(size=24 symbol=SquareFilled); run; |
Notice that these marker colors are not fully saturated colors, so they are not the SAS color names RED, BLUE, GREEN, BROWN, and MAGENTA. So what colors are these? What are their RGB values?
What colors does PROC SGPLOT uses for groups? #SASTip
Click To Tweet
Colors are defined by styles, and you can use ODS style elements to set marker colors. A style defines elements called GraphDataDefault, GraphData1, GraphData2, GraphData3, and so forth. Each element contains several attributes such as colors and line patterns. The complete list of style elements and attributes for ODS graphics is in the documentation, but for this article, the important fact is that the GraphDatan:ContrastColor attribute determines the marker color for the nth group. These are the colors in the previous scatter plot.
Styles are defined by ODS templates. You can use the SOURCE statement in PROC TEMPLATE to display a template. In the style template, the GraphDatan:ContrastColor attributes are set by using keywords named gcdata1, gcdata2, gcdata3, etc.
If you display the template for the styles.HTMLBlue template, you will see that the HTMLBlue style inherits from the Statistical style, and it is the Statistical style that defines the contrast colors. The following statements display the contents of the Statistical template to the SAS log:
proc template; source styles.statistical; quit; |
The template is long, and it is hard to scroll through the log to discover which colors are associated with each attribute. But that's no problem: you can use SAS to find and display only the information about contrast colors.
For years Warren Kuhfeld has been showing SAS customers how to view, edit, and use ODS templates to customize the graphs that are produced by SAS statistical procedures. A powerful technique that he uses is to write a template to a file and then use the DATA step to modify the template.
I will not modify the template but merely display information from it. The following DATA step writes the template to a text file and then uses the DATA step to find all instances of the keyword 'gcdata' in the template. For lines that contain the string 'gcdata', the program extracts the color for each keyword. The keyword-value pairs are saved to a data set, which is sorted and displayed:
libname temp "C:/temp"; proc template; source styles.statistical / file='temp.tmp'; /* write template to text file */ quit; data Colors; keep Num Name Color R G B; length Name Color $8; infile 'temp.tmp'; /* read from text file */ input; /* example string: 'gcdata1' = cx445694 */ k = find(_infile_,'gcdata','i'); /* if k=0 then string not found */ if k > 0 then do; /* Found line that contains 'gcdata' */ s = substr(_infile_, k); /* substring from 'gcdata' to end of line */ j = index(s, "'"); /* index of closing quote */ Name = substr(s, 1, j-1); /* keyword */ if j = 7 then Num = 0; /* string is 'gcdata' */ else /* extract number 1, 2, ... for strings */ Num = inputn(substr(s, 7, j-7), "best2."); /* gcdata1, gcdata2,... */ j = index(s, "="); /* index of equal sign */ Color = compress(substr(s, j+1)); /* color value for keyword */ R = inputn(substr(Color, 3, 2), "HEX2."); /* convert hex to RGB */ G = inputn(substr(Color, 5, 2), "HEX2."); B = inputn(substr(Color, 7, 2), "HEX2."); end; if k > 0; run; proc sort data=Colors; by Num; run; proc print data=Colors; var Name Color R G B; run; |
Success! The output shows the contrast colors for the HTMLBlue style. The 'gcdata' color is the fill color (a dark blue) for markers when no GROUP= option is specified. The 'gcdatan' colors are used for markers that are colored by group membership. Obviously you could use this same technique to display other style attributes, such as line patterns or bar colors ('gdata').
If you prefer a visual summary of the attributes for an ODS style, see section "ODS Style Comparisons" in the SAS/STAT documentation. That section is part of the chapter "Statistical Graphics Using ODS," which could have been titled "Everything you always wanted to know about ODS graphics but were afraid to ask."
I prefer to style elements and discrete attribute maps to set colors for markers. But if you are rushed for time, you might want to use the STYLEATTRS statement to set the colors that are used for the GROUP= option. The STYLEATTRS statement requires a color list of hexadecimal colors or SAS color names. The following call to PROC SGPLOT uses the RGB/hex values for GraphData1:ContrastColor and so forth:
/* use colors for HTMLBlue style */ %let gcdata1 = cx445694; %let gcdata2 = cxA23A2E; %let gcdata3 = cx01665E; title "Origin in {Europe, USA}"; proc sgplot data=sashelp.cars; where origin^='Asia' && type^="Hybrid"; /* omit first category */ styleattrs DataContrastColors = (&gcdata2 &gcdata3); /* use 2nd and 3rd colors */ scatter x=weight y=mpg_city / group=Origin markerattrs=(symbol=CircleFilled); keylegend / location=inside position=TopRight across=1; run; |
It would be great if you could specify a style-independent syntax such as
styleattrs DataContrastColors=(GraphData2:ContrastColor GraphData3:ContrastColor); |
Unfortunately, that syntax is not supported. The STYLEATTRS statement requires a list of color values or SAS color names.
Although this trick is interesting, in general I prefer to use styles (rather than hard-coded color values) in production code. However, if you want to know the RGB/hex values for a style, this trick shows how you can get them from an ODS template.
The post What colors does PROC SGPLOT use for markers? appeared first on The DO Loop.
]]>The post Simulate many samples from a linear regression model appeared first on The DO Loop.
]]>This article shows how to simulate many samples efficiently. Efficient simulation is the main emphasis of my book Simulating Data with SAS. For a detailed discussion about simulating data from regression models, see chapters 11 and 12.
The SAS DATA step in my previous post contains four steps. To simulate multiple samples, put a DO loop around the steps that generate the error term and the response variable for each observation in the model. The following program modifies the previous program and creates a single data set that contains NumSamples (=100) samples. Each sample is identified by an ordinal variable named SampleID.
/* Simulate many samples from a linear regression model */ %let N = 50; /* N = sample size */ %let nCont = 10; /* p = number of continuous variables */ %let NumSamples = 100; /* number of samples */ data SimReg(keep= SampleID i Y x:); call streaminit(54321); array x[&nCont]; /* explanatory variables are named x1-x&nCont */ /* 1. Specify model coefficients. You can hard-code values such as array beta[0:&nCont] _temporary_ (-4 2 -1.33 1 -0.8 0.67 -0.57 0.5 -0.44 0.4 -0.36); or you can use a formula such as the following */ array beta[0:&nCont] _temporary_; do j = 0 to &nCont; beta[j] = 4 * (-1)**(j+1) / (j+1); /* formula for beta[j] */ end; do i = 1 to &N; /* for each observation in the sample */ do j = 1 to &nCont; x[j] = rand("Normal"); /* 2. Simulate explanatory variables */ end; eta = beta[0]; /* model = intercept term */ do j = 1 to &nCont; eta = eta + beta[j] * x[j]; /* + sum(beta[j]*x[j]) */ end; /* 5. simulate response for each sample */ do SampleID = 1 to &NumSamples; /* <== LOOP OVER SAMPLES */ epsilon = rand("Normal", 0, 1.5); /* 3. Specify error distrib*/ Y = eta + epsilon; /* 4. Y = model + error */ output; end; end; run; |
The efficient way to analyzed simulated samples with SAS is to use BY-group processing. With By-group processing you can analyze all samples with a single procedure call. The following statements sort the data by the SampleID variable and call PROC REG to analyze all samples. The NOPRINT option ensures that the procedure does not spew out thousands of tables and graphs. (For procedures that do not support the NOPRINT option, there are other ways to turn off ODS when analyzing simulated data.) The OUTEST= option saves the parameter estimates for all samples to a SAS data set.
proc sort data=SimReg; by SampleID i; run; proc reg data=SimReg outest=PE NOPRINT; by SampleID; model y = x:; quit; |
The PE data set contains NumSamples rows. Each row contains the p parameter estimates for the analysis of one simulated sample. The distribution of estimates is an approximation to the true (theoretical) sampling distribution of the statistics. The following image visualizes the joint distribution of the estimates of four regression coefficients. You can see that the distribution of the estimates appears to be multivariate normal and centered at the values of the population parameters.
You can download the SAS program that simulates the data, analyzes it, and produces the graph. The program is very efficient. For 10,000 random samples of size N=50 that contain p=10 variables, it takes about one second to run the Monte Carlo simulation and analyses.
The post Simulate many samples from a linear regression model appeared first on The DO Loop.
]]>The post Automate the creation of a discrete attribute map appeared first on The DO Loop.
]]>A simple example demonstrates the problem. The following scatter plots are colored by the ORIGIN variable in the SasHelp.Cars data. (Click to enlarge.) The ORIGIN variable has three levels: Asia, Europe, and USA. On the left, all values of the ORIGIN variable are present in the graph. On the right, the Asian vehicles are excluded by using WHERE ORIGIN^="Asia". Notice that the colors of the markers on the right are not consistent with the values on the left.
Warren Kuhfeld wrote an excellent introduction to legend order and group attributes, and he describes several other ways that group colors can change from one plot to another. To solve this problem, Kuhfeld and other experts recommend that you create a discrete attribute map. A discrete attribute map is a SAS data set that specifies the colors to use for each group value. If you are not familiar with discrete attribute maps, I provide several references at the end of this article.
Automatically create a discrete attribute map for PROC SGPLOT #SASTip
Click To Tweet
The discrete attribute map is powerful, flexible, and enables the programmer to completely determine the legend order and color for all categories. However, I rarely use discrete attribute maps in my work because the process requires the manual creation of a data set. The data set has to contain all categories (spelled and capitalized correctly) and you have to remember (or look up) the structure of the data set. Furthermore, many examples use hard-coded color values such as CXFFAAAA or "LightBlue," whereas I prefer to use the GraphDatan style elements in the current ODS style.
However, I recently realized that PROC FREQ and the SAS DATA step can lessen the burden of creating a discrete attribute map. A colleague pointed out that the documentation for the discrete attribute map mentions a column named MarkerStyleElement (or MarkerStyle), which specifies the names of style elements such as GraphData1, GraphData2, and so on. Therefore, you can use PROC FREQ to write the category levels to a data set, and use a simple DATA step to add the MarkerStyleElement variable. For example, you can create a discrete attribute map for the ORIGIN variable, as follows:
/* semi-automatic way to create a DATTRMAP= data set */ %let VarName = Origin; /* specify name of grouping variable */ proc freq data=sashelp.cars ORDER=FORMATTED; /* or ORDER=DATA|FREQ */ tables &VarName / out=Attrs(rename=(&VarName=Value)); run; data DAttrs; ID = "&VarName"; /* or "ID_&VarName" */ set Attrs(keep=Value); length MarkerStyleElement $11.; MarkerStyleElement = cats("GraphData", 1+mod(_N_-1, 12)); /* GraphData1, GraphData2, etc */ run; proc print; run; |
Voila! The result is a valid discrete attribute data set for the ORIGIN variable. The DATTRS data set contains all the information you need to ensure that the first category is always displayed by using the GraphData1 element, the second category is displayed by using GraphData2, and so on. The program does not require that you manually type the categories or even know how many categories there are. Obviously, you could write a macro that makes it easy to generate these statements.
This data set uses the alphabetical order of the formatted values to determine the group order. However, you can use the ORDER=DATA option in PROC FREQ to order by the order of categories in the data set. You can also use the ORDER=FREQ option to order by the most frequent categories. Because most SAS-supplied styles define 12 style elements, the MOD function is used to handle categorical variable that have more than 12 levels.
To use the discrete attribute map, you need to specify the DATTRMAP= option on the PROC SGPLOT statement. You also need to specify the ATTRID= option on every SGPLOT statements that will use the map. Notice that I set the value of the ID variable to be the name of the GROUP= variable. (If that is confusing, you could choose a different value, as noted in the comments of the program.) The following statements are similar to the statements that create the right-hand graph at the top of this article, except this call to PROC SGPLOT uses the DATTRS discrete attribute map:
proc sgplot data=sashelp.cars DATTRMAP=DAttrs; where origin^='Asia' && type^="Hybrid"; scatter x=weight y=mpg_city / group=Origin ATTRID=Origin markerattrs=(symbol=CircleFilled); keylegend / location=inside position=TopRight across=1; run; |
Notice that the colors in this scatter plot are the same as for the left-hand graph at the top of this article. The group colors are now consistent, even though the number of groups is different.
The previous section showed how to create a discrete attribute map for one variable. You can use a similar approach to automatically create a discrete data map that contains several variables. The main steps are as follows:
ods select none; proc freq data=sashelp.cars; tables Type Origin; /* specify VARS here */ ods output OneWayFreqs=Freqs; run; ods select all; data Freqs2; set Freqs; length ID $32.; ID = substr(Table, 7); /* original values are "Table VarName" */ Value = COALESCEC(F_Type, F_Origin); /* also specify F_VARS here */ keep ID Value; run; data DAttrs(drop=count); set Freqs2; length MarkerStyleElement $11.; by ID notsorted; if first.ID then count = 0; count + 1; MarkerStyleElement = cats("GraphData", 1 + mod(count-1, 12)); run; |
The preceding program is not completely general, but it shows the main ideas. You can adapt the program to your own data. If you are proficient with the SAS macro language, you can even write a macro that generates appropriate code for an arbitrary number of variables. Leave a comment if this technique proves useful in your work or if you have ideas for improving the technique.
The post Automate the creation of a discrete attribute map appeared first on The DO Loop.
]]>The post Simulate data for a linear regression model appeared first on The DO Loop.
]]>When you simulate to create "synthetic" (or "fake") data, you (the programmer) control the true parameter values, the form of the model, the sample size, and magnitude of the error term. You can use simulated data as a quick-and-easy way to generate an example. You can use simulation to test the performance of an algorithm on very wide or very long data sets.
The least squares regression model with continuous explanatory variables is one of the simplest regression models. The book Simulating Data with SAS describes how to simulate data from dozens of regression models. I have previously blogged about how to simulate data from a logistic regression model in SAS.
It is useful to be able to generate data that fits a known model. Suppose you want to fit a regression model in which the response variable is a linear combination of 10 explanatory variables, plus random noise. Furthermore, suppose you don't need to use real X values; you are happy to generate random values for the explanatory variables.
The following SAS statements specify the sample size (N) and the number of explanatory variables (nCont) as macro variables. With the SAS DATA step, you can do the following:
%let N = 50; /* Specify sample size */ %let nCont = 10; /* Specify the number of continuous variables */ data SimReg1(keep= Y x:); call streaminit(54321); /* set the random number seed */ array x[&nCont]; /* explanatory vars are named x1-x&nCont */ /* 1. Specify model coefficients. You can hard-code values such as array beta[0:&nCont] _temporary_ (-4 2 -1.33 1 -0.8 0.67 -0.57 0.5 -0.44 0.4 -0.36); or you can use a formula such as the following */ array beta[0:&nCont] _temporary_; do j = 0 to &nCont; beta[j] = 4 * (-1)**(j+1) / (j+1); /* formula for beta[j] */ end; do i = 1 to &N; /* for each observation in the sample */ do j = 1 to dim(x); x[j] = rand("Normal"); /* 2. Simulate explanatory variables */ end; eta = beta[0]; /* model = intercept term */ do j = 1 to &nCont; eta = eta + beta[j] * x[j]; /* + sum(beta[j]*x[j]) */ end; epsilon = rand("Normal", 0, 1.5); /* 3. Specify error distrib */ Y = eta + epsilon; /* 4. Y = model + error */ output; end; run; |
After you simulate data, it is a good idea to run a regression analysis and examine the parameter estimates. For large samples, the estimates should be close to the true value of the parameters. The following call to PROC REG fits the known model to the simulated data and displays the parameter estimates, confidence intervals for the parameters, and the root mean square error (root MSE) statistic.
proc reg data=SimReg1 plots=none; model Y = x: / CLB; ods select FitStatistics ParameterEstimates; quit; |
For a simple model a "large sample" does not have to be very large. Although the sample size is merely N = 50, the parameter estimates are close to the parameter values. Furthermore, the root MSE value is close to 1.5, which is the true magnitude of the error term. For this simulation, each 95% confidence interval contain the true value of the parameters, but that does not happen in general.
For more complex regression models, you might need to generate larger samples to verify that the simulation correctly generates data from the model.
If you write the ParametereEstimates table to a SAS data set, you can create a plot that shows the parameters overlaid on a plot of the estimates and the 95% confidence limits. The plot shows that the parameters and estimates are close to each other, which should give you confidence that the simulated data are correct.
Notice that this DATA step can simulate any number of observations and any number of continuous variables. Do you want a sample that has 20, 50, or 100 variables? No problem: just change the value of the nCont macro variable! Do you want a sample that has 1000 observations? Change the value of N.
In summary, the SAS DATA step provides an easy way to simulate data from regression models in which the explanatory variables are uncorrelated and continuous. Download the complete program and modify it to your needs. For example, if you want more significant effects, use sqrt(j+1) in the denominator of the regression coefficient formula.
In a future article, I will show how to generalize this program to efficiently simulate and analyze many samples a part of a Monte Carlo simulation study.
The post Simulate data for a linear regression model appeared first on The DO Loop.
]]>The post Five reasons to check out the new SAS analytical documentation appeared first on The DO Loop.
]]>The SAS analytical documentation has a new look.
Beginning with the 14.2 release of the SAS analytical products (which shipped with SAS 9.4m4 in November 2016), the HTML version of the online documentation has moved to
a new framework called the Help Center. The URL for the online documentation is easy to remember:
http://support.sas.com/documentation/
This article shows the 14.2 documentation for the SAS analytical products, as highlighted in the adjacent image. Documentation for previous releases is also available.
The 14.2 link takes you to a new page that contains links for the User's Guides for each SAS analytical product, such as SAS/STAT, SAS/ETS, SAS/IML, and so on. When you click on a User's Guide, you are taken to the new Help Center.
An example page for the SAS/STAT documentation is shown in the following image (click to enlarge). As in previous versions of the help system, the Help Center provides drop-down lists (Overview, Getting Started, Syntax, etc.) for quick navigation within a procedure. There are also arrows (now in the upper right corner) that take you to the previous or next page in the book.
The following list describes five features of the Help Center that are either new or that extend features of the older HTML format. The locations of these features are highlighted with red rectangles in the previous image.
Five new features in the #SAS analytical documentation
Click To Tweet
In summary, the new Help Center framework provides additional ways for SAS customers to learn about the syntax, options, and output of SAS analytical procedures. At the present time, only analytical products use the Help Center. The documentation for Base SAS continues to be provided in HTML and PDF formats.
Check out the SAS analytical products 14.2 documentation and let me know what you think. Do you like something that I didn't mention? Post a comment.
The post Five reasons to check out the new SAS analytical documentation appeared first on The DO Loop.
]]>