Learn how SAS can help identify, govern and protect personal data
In their paper, May and Johnson present data for a random sample of 220 psychiatric patients who were categorized as either neurotic, depressed, schizophrenic or having a personality disorder. The observed counts for each diagnosis are as follows:
data Psych;
input Category & $21. Count;   /* & modifier: category names can contain single blanks */
datalines;
Neurotic  91
Depressed  49
Schizophrenic  37
Personality disorder  43
;
If you divide each count by the total sample size, you obtain estimates for the proportion of patients in each category in the population. However, the researchers wanted to compute simultaneous confidence intervals (CIs) for the parameters. The next section shows several methods for computing the CIs.
May and Johnson discussed six different methods for computing simultaneous CIs. In the following, 1–α is the desired overall coverage probability for the confidence intervals, χ^{2}(α, k-1) is the upper 1–α quantile of the χ^{2} distribution with k-1 degrees of freedom, and π_{1}, π_{2}, ..., π_{k} are the true population parameters. The methods and the references for the methods are:
In a separate paper, May and Johnson used simulations to test the coverage probability of each of these formulas. They conclude that the simple Bonferroni-adjusted formula of Goodman (second in the list) "performs well in most practical situations when the number of categories is greater than 2 and each cell count is greater than 5, provided the number of categories is not too large." In comparison, the methods that use the sample variance (fourth and fifth in the list) are "poor." The remaining methods "perform reasonably well with respect to coverage probability but are often too wide."
A nice feature of the Q&H and Goodman methods (first and second on the list) is that they produce asymmetric intervals that always lie within the interval [0,1]. In contrast, the other intervals are symmetric and might not be a subset of [0,1] for extremely small or large sample proportions.
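Goodman's method is simple enough to verify independently: with A equal to the upper α/k quantile of the χ² distribution with 1 degree of freedom, the interval for each proportion consists of the two roots of a quadratic in π. The following pure-Python sketch (my own cross-check illustration, not part of May and Johnson's macro) reproduces the Goodman (method=2) intervals for the psychiatric data:

```python
# Goodman's (1965) Bonferroni-adjusted simultaneous CIs for multinomial
# proportions. Pure standard library: the chi-square quantile with 1 df
# is obtained from the inverse normal CDF, since chi2(q, 1) = Phi^{-1}((1+q)/2)^2.
from math import sqrt
from statistics import NormalDist

def goodman_ci(counts, alpha=0.05):
    """Return a list of (estimate, lower, upper) triples."""
    N = sum(counts)
    k = len(counts)
    # upper alpha/k quantile of chi-square with 1 df
    A = NormalDist().inv_cdf(1 - alpha / (2 * k)) ** 2
    cis = []
    for n in counts:
        p = n / N
        # roots of (N + A) pi^2 - (2n + A) pi + n^2/N = 0
        half = sqrt(A * (A + 4 * n * (N - n) / N))
        lower = (A + 2 * n - half) / (2 * (N + A))
        upper = (A + 2 * n + half) / (2 * (N + A))
        cis.append((p, lower, upper))
    return cis

# Psychiatric data: Neurotic, Depressed, Schizophrenic, Personality disorder
for est, lo, hi in goodman_ci([91, 49, 37, 43]):
    print(f"{est:.3f}  [{lo:.3f}, {hi:.3f}]")
```

For the neurotic category this gives approximately [0.33, 0.50], which agrees with the SAS/IML output discussed below.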
You can download SAS/IML functions that are based on May and Johnson's paper and macro. The original macro used SAS/IML version 6, so I have updated the program to use a more modern syntax. I wrote two "driver" functions: MultCI, which returns the estimates and simultaneous CIs in a matrix, and MultCIPrint, which prints them in a table.
Let's demonstrate how to call these functions on the psychiatric data. The following program assumes that the function definitions are stored in a file called CONINT.sas; you might have to specify a complete path. The PROC IML step loads the definitions and the data, then calls the MultCIPrint routine and requests the Goodman method (method=2):
%include "conint.sas"; /* specify full path name to file */
proc iml;
load module=(MultCI MultCIPrint);
use Psych; read all var {"Category" "Count"}; close;
alpha = 0.05;
call MultCIPrint(Category, Count, alpha, 2); /* Goodman = 2 */
The table shows the point estimates and 95% simultaneous CIs for this sample of size 220. If the intervals are wider than you expect, remember the goal: for 95% of random samples of size 220 this method should produce a four-dimensional region that contains all four parameters.
You can visualize the width of these intervals by creating a graph. The easiest way is to write the results to a SAS data set. To get the results in a matrix, call the MultCI function, as follows:
CI = MultCI(Count, alpha, 2); /*or simply CI = MultCI(Count) */
/* write estimates and CIs to data set */
Estimate = CI[,1]; Lower = CI[,2]; Upper = CI[,3];
create CIs var {"Category" "Estimate" "Lower" "Upper"};
append;
close;
quit;
ods graphics / width=600px height=240px;
title 'Simultaneous Confidence Intervals for Multinomial Proportions';
title2 'Method of Goodman (1965)';
proc sgplot data=CIs;
scatter x=Estimate y=Category / xerrorlower=Lower xerrorupper=Upper
markerattrs=(Size=12 symbol=DiamondFilled)
errorbarattrs=GraphOutlines(thickness=2);
xaxis grid label="Proportion" values=(0.1 to 0.5 by 0.05);
yaxis reverse display=(nolabel);
run;
The graph shows intervals that are likely to enclose all four parameters simultaneously. The neurotic proportion in the population is probably in the range [0.33, 0.50] and at the same time the depressed proportion is in the range [0.16, 0.30] and so forth. Notice that the CIs are not symmetric about the point estimates; this is most noticeable for the smaller proportions such as the schizophrenic category.
Because the cell counts are all relatively large and because the number of categories is relatively small, Goodman's CIs should perform well.
I will mention that if you use the Fitzpatrick and Scott method (method=4), you will get different CIs from those reported in May and Johnson's paper. The original May and Johnson macro contained a bug that was corrected in a later version (personal communication with Warren May, 25FEB2016).
This article presents a set of SAS/IML functions that implement six methods for computing simultaneous confidence intervals for multinomial proportions. The functions are updated versions of the macro %CONINT, which was presented in May and Johnson (1987). You can use the MultCIPrint function to print a table of statistics and CIs, or you can use the MultCI function to retrieve that information into a SAS/IML matrix.
proc sgplot data=sashelp.enso;
title 'El Nin(*ESC*){unicode tilde}o Southern Oscillation';
reg y=Pressure x=Month;
pbspline y=Pressure x=Month / nomarkers;
loess y=Pressure x=Month / nomarkers;
reg y=Pressure x=Month / nomarkers degree=4 legendlabel='Polynomial';
run;
You usually specify one of these statements, but you can specify any number, and it is instructive to compare multiple functions in one plot. The linear regression function (DEGREE=1) is a straight line. The DEGREE=4 polynomial regression function has some curvature. The penalized B-spline and the loess fit are almost identical for these data. Both are highly flexible and automatically find the seasonal changes.
The PBSPLINE statement can also be used to fit polynomial spline functions with or without knots, and with or without automatic smoothing. These are the same kinds of functions that you can fit by using PROC TRANSREG. The REG statement and the first PBSPLINE statement (SMOOTH=0, NKNOTS=0, DEGREE=3) both fit a degree three polynomial with no smoothing or knots. Both produce the same cubic polynomial fit function. The SERIES statement displays the PROC TRANSREG predicted values, which are the same as the penalized B-spline with SMOOTH=0, DEGREE=3, and NKNOTS=9. The SMOOTH=0 option disables the automatic smoothing. You can vary the number of knots to fit more- or less-smooth piecewise polynomial spline functions.
proc transreg data=sashelp.enso plots=none noprint;
model identity(pressure) = spline(month / nknots=9 evenly);
output out=b p;
run;
proc sgplot data=b;
pbspline y=Pressure x=Month / smooth=0 nknots=0 degree=3
legendlabel='B-Spline' lineattrs=GraphData1(thickness=10);
reg y=Pressure x=Month / nomarkers degree=3 legendlabel='Polynomial'
lineattrs=GraphData2(thickness=3);
series y=PPressure x=Month / legendlabel='Transreg'
lineattrs=GraphData3(thickness=10);
pbspline y=Pressure x=Month / nomarkers smooth=0 nknots=9 degree=3
legendlabel='B-Spline with Knots' lineattrs=GraphData5(thickness=3);
keylegend / location=inside across=1 position=bottomleft noborder;
yaxis label='Pressure';
run;
You can use the REG statement to fit linear or polynomial functions, the PBSPLINE or LOESS statements to fit smooth functions with automatic smoothing, or the PBSPLINE statement along with DEGREE= and NKNOTS= options to explicitly control the fit. This example also shows the SERIES statement, which connects the points with no smoothing. (The TRANSREG function appears smooth, even though there is no smoothing, because the line segments are short.)
Now consider the problem of displaying a curve that passes through a small number of points.
data tests;
input Day : mmddyy8. Results;
y3 = Results - 0.015;
y2 = y3 + 0.015;
y1 = y2 + 0.015;
cards;
9/16/15 .115
9/21/15 .108
10/19/15 .102
3/24/16 .082
5/12/16 .06
8/18/16 .027
11/28/16 .027
;
The following steps fit a line, a cubic polynomial, and a penalized B-spline.
proc sgplot data=tests;
title 'Sparse Data - Unsatisfactory Results from Using the Wrong Methods';
reg y=Results x=day / legendlabel='Linear';
reg y=Results x=day / nomarkers degree=3 legendlabel='Cubic';
pbspline y=Results x=day / nomarkers;
format day mmddyy8.;
xaxis display=(nolabel);
yaxis min=0 offsetmin=0;
run;
None of these is satisfactory, particularly the penalized B-spline, which is not designed for sparse data sets. The following step shows three alternatives:
proc sgplot data=tests;
title 'Sparse Data';
scatter y=y1 x=day / markerattrs=GraphData1;
scatter y=y2 x=day / markerattrs=GraphData2;
scatter y=y3 x=day / markerattrs=GraphData3;
series y=y1 x=day / lineattrs=GraphData1 legendlabel='Series' name='a';
spline y=y2 x=day / lineattrs=GraphData2 legendlabel='Spline' name='b';
series y=y3 x=day / lineattrs=GraphData3 legendlabel='Smooth' name='c' smoothconnect;
format day mmddyy8.;
xaxis display=(nolabel);
yaxis min=0 offsetmin=0 label='Results' display=(noticks novalues);
keylegend 'a' 'b' 'c' / location=inside across=1 position=topright;
run;
In the interest of space and clarity, all three functions are displayed in the same graph, but each is vertically offset so that they do not intersect. The first SERIES statement connects the points. The SPLINE statement is similar to the SERIES statement, but it draws a curve instead of a series of line segments, and the curve is not required to touch every point. The SERIES statement along with the SMOOTHCONNECT option connects every point by using a smooth function.
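Conceptually, the SMOOTHCONNECT option draws a spline interpolant: a smooth curve that passes through every point. As a language-neutral sketch of that idea (my own illustration; ODS Graphics does not necessarily use this exact algorithm), here is a minimal natural cubic-spline interpolator in Python:

```python
def natural_cubic_spline(xs, ys):
    """Return a function that evaluates the natural cubic spline
    interpolant through the points (xs[i], ys[i]). xs must be increasing."""
    n = len(xs)
    h = [xs[i+1] - xs[i] for i in range(n - 1)]
    # Tridiagonal system for the second derivatives M at the knots,
    # with natural boundary conditions M[0] = M[n-1] = 0.
    a = [0.0] * n            # sub-diagonal
    b = [1.0] * n            # diagonal
    c = [0.0] * n            # super-diagonal
    d = [0.0] * n            # right-hand side
    for i in range(1, n - 1):
        a[i] = h[i-1]
        b[i] = 2.0 * (h[i-1] + h[i])
        c[i] = h[i]
        d[i] = 6.0 * ((ys[i+1] - ys[i]) / h[i] - (ys[i] - ys[i-1]) / h[i-1])
    # Thomas algorithm: forward sweep, then back substitution
    for i in range(1, n):
        w = a[i] / b[i-1]
        b[i] -= w * c[i-1]
        d[i] -= w * d[i-1]
    M = [0.0] * n
    M[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        M[i] = (d[i] - c[i] * M[i+1]) / b[i]

    def s(t):
        # locate the interval [xs[i], xs[i+1]] that contains t
        i = n - 2
        for j in range(n - 1):
            if t <= xs[j+1]:
                i = j
                break
        A = (xs[i+1] - t) / h[i]
        B = (t - xs[i]) / h[i]
        return (A * ys[i] + B * ys[i+1]
                + ((A**3 - A) * M[i] + (B**3 - B) * M[i+1]) * h[i]**2 / 6.0)
    return s

# Demo: a smooth curve through five points (hypothetical values)
s = natural_cubic_spline([0, 1, 2, 3, 4], [0.115, 0.108, 0.082, 0.060, 0.027])
print(s(2.5))
```

Between knots the curve is a cubic polynomial, and matching first and second derivatives at the knots is what makes the connected curve look smooth.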
The next steps illustrate groups. The following step reads some artificial data for patients in two groups: treatment and control.
The following step displays the results for each patient:
proc sgplot data=patients;
series y=results x=date / group=id;
run;
The patient ID is specified as the GROUP= variable. When a new group is encountered, the previous series plot ends and a new one starts. If you want to display a different group variable (such as treatment versus control), then you need a different way to mark the end of each patient's series. You can add a row that has a missing value for the Y= variable after each patient's last observation:
data p2;
set patients;
by id;
output;
if last.id then do;
results = .;
output;
end;
run;
ods graphics on / attrpriority=none;
proc sgplot data=p2;
styleattrs datasymbols=(circlefilled squarefilled) datalinepatterns=(solid);
series y=results x=date / break group=group markers;
run;
You can use the BREAK option to break each series plot at the missing value. Then you can use the GROUP= variable to differentiate the treatment and control groups.
In summary, ODS Graphics provides you with many ways to add linear and nonlinear fit functions to your graphs. Furthermore, it provides ways to connect points using both lines and smooth functions. They have many uses. You can even use them to make art! This next example comes from my Advanced ODS Graphics Examples book. The code shows what happens when you connect all pairs of 20 evenly spaced points along a circle.
data x(drop=t);
do id = 1 to 20;
t = (id - 1) * 2 * constant('pi') / 20;
x = cos(t);
y = sin(t);
output;
end;
run;
data curves(drop=id t: m d);
do id1 = 1 to 20;
do id2 = id1 + 1 to 20;
g + 1;
set x(rename=(x=t1 y=t2)) point=id1;
set x(rename=(x=t3 y=t4)) point=id2;
d = (t4 - t2) ** 2 + (t3 - t1) ** 2;
x1 = t1;
y1 = t2;
output; /* output the starting point */
td = ifn(abs(t4 - t2) lt 1e-12, 1e-12, t4 - t2);
m = -(t3 - t1) / td;
t1 = mean(t1, t3);
t2 = mean(t2, t4);
x1 = t1 + ifn(t1 gt 0 or td eq 1e-12, -1, 1) *
sqrt(0.1 * d / (1 + m * m));
y1 = m * (x1 - t1) + t2;
output; /* output the midpoint */
x1 = t3;
y1 = t4;
output; /* output the ending point */
end;
end;
stop;
run;
title;
ods graphics on / width=480px height=480px;
proc sgplot data=curves noautolegend;
series y=y1 x=x1 / group=g lineattrs=graphdata1(pattern=solid);
spline y=y1 x=x1 / group=g lineattrs=graphdata2(pattern=solid);
xaxis display=none;
yaxis display=none;
run;
For more information, see my free web books: Basic ODS Graphics Examples and Advanced ODS Graphics Examples. Also see the documentation for ODS Graphics.
]]>"Voters are removed from the voter rolls due to a biennial list maintenance process that is mandated by federal and state law. If a county board of elections has not had any contact with a voter for a period of two federal election cycles, then the voter will be sent a forwardable address confirmation mailing. The voter will be required to return the confirmation mailing within 30 days of the mailing. If the confirmation mailing is not returned by the voter within that time, or the mailing is returned by the postal system as undeliverable, then the voter’s record will be marked inactive in the voter registration database. Inactive voters are still registered voters. If an inactive voter presents to vote, the person will be asked to update his or her address with the board of elections. In the event that an inactive voter remains in this status for another two federal election cycles (meaning the county board still has no contact with the voter), then the voter will be removed as a voter in the county."
Now the dip in the graph (#1) made more sense. There was a big increase in voter registrations for the 2008 election (#5), when Obama won his first term, and if people hadn't voted in the four federal elections after that (#4, #3, #2, and #1), then their registration was purged. So it seemed people got excited and registered for and voted in the 2008 election, and then lost interest after that and stopped voting (became inactive).
But... as I studied the graph more closely, I noticed an even bigger dip after the 2014 election (#2). How could that be, since it was only three elections after the big one in 2008?!? And then I had my eureka moment! Perhaps many people had signed up during the "register to vote" drives in 2008 ... but they hadn't actually voted in the 2008 election (the best of intentions, but no follow-through). And then, being inactive in elections #5, #4, #3, and #2, their registration was purged after the 2014 election (#2). It had never dawned on me that many people who registered specifically to vote in the 2008 election might not actually vote! Perhaps this is an important phenomenon that people organizing voter-registration drives should keep in mind - in addition to getting people to register, you need to get a commitment from them to actually go vote!
Here's a similar graph for the NC totals. The same trend is state-wide, but the after-election dips are a little smoother here because different counties tend to update their numbers at slightly different times. Click the image below to go to my webpage, which has graphs for all 100 counties. You can navigate the graphs by clicking the counties in the map, or by scrolling through the graphs alphabetically.
Hopefully my theory about the increases and decreases in the graph is correct, but if you have some alternate theories, feel free to share them in the comments section!
(I know that some people get a bit passionate when it comes to elections and voter registration - but please remember to keep your comments civil and analytics-related!) :-)
“...an agreement between the customer and the organization that basically says the organization will protect the customer’s data and only use it in ways the customer has authorized – in return for the customer ensuring the data’s accuracy and specifying preferences for its use. This model empowers customers to take ownership of their data and assume responsibility for its quality. Clearly articulating each party’s responsibilities for data stewardship benefits both the organization and the customer by ensuring that customer data is high-quality and properly maintained.”
This approach would openly demonstrate compliance with industry regulations and personal privacy requirements, enabling your enterprise to be open and compliant at the same time.
As I've written before, BY-group analysis is also an efficient way to analyze simulated samples or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.
In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample at a time. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of times are relatively slow.
As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.
There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=X_{i}, where i = 1 to 1000.
One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:
%macro RunReg(DSName, NumVars);
...
%do i = 1 %to &NumVars; /* repeat for each x&i */
proc reg data=&DSName noprint
outest=PE(rename=(x&i=Value)); /* save parameter estimates */
model Y = x&i; /* model Y = x_i */
quit;
/* ...then accumulate statistics... */
%end;
%mend;
The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.
If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.
An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.
To run a BY-group analysis, transpose the data from wide form to long form, sort the long data by the BY-group variable, and then call the procedure with a BY statement.
In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the i_th element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.
/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */
data Long;
set Wide; /* <== specify data set name HERE */
array x [*] x1-x&nCont; /* <== specify explanatory variables HERE */
do varNum = 1 to dim(x);
VarName = vname(x[varNum]); /* variable name in char var */
Value = x[varNum]; /* value for each variable for each obs */
output;
end;
drop x:;
run;
In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:
/* 2. Sort by BY-group variable */
proc sort data=Long; by VarName; run;
You can now call a SAS procedure one time to compute all regression models:
/* 3. Call PROC REG and use BY statement to compute all regressions */
proc reg data=Long noprint outest=PE;
by VarName;
model Y = Value;
quit;
/* Look at the results */
proc print data=PE(obs=5);
var VarName Intercept Value;
run;
The PE data set contains the parameter estimates for every single-variable regression of Y onto X_{i}. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.
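The "BY way" is not specific to SAS. As a hypothetical illustration (my own sketch, not part of the downloadable program), the following Python code groups long-form rows by variable name and computes each single-variable least-squares fit in one pass, which is the same idea as sorting by VarName and using a BY statement:

```python
# Fit one simple linear regression per group of long-form data.
# Each row is (varname, x, y); the result maps varname -> (intercept, slope).
from collections import defaultdict

def by_group_regressions(rows):
    """rows: iterable of (varname, x, y) triples."""
    groups = defaultdict(list)
    for name, x, y in rows:          # one pass to form the "BY groups"
        groups[name].append((x, y))
    estimates = {}
    for name, pts in groups.items():
        n = len(pts)
        mx = sum(x for x, _ in pts) / n
        my = sum(y for _, y in pts) / n
        sxy = sum((x - mx) * (y - my) for x, y in pts)
        sxx = sum((x - mx) ** 2 for x, _ in pts)
        slope = sxy / sxx            # least-squares slope
        estimates[name] = (my - slope * mx, slope)
    return estimates

# Two "variables" with known relationships: y = 1 + 2x and y = 3 - x
rows = [("x1", x, 1 + 2 * x) for x in range(5)] + \
       [("x2", x, 3 - x) for x in range(5)]
print(by_group_regressions(rows))
```

As with the SAS BY-group approach, the cost is one grouping step plus one cheap fit per group, instead of relaunching an entire analysis for every variable.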
You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).
This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=X_{i}, but a similar result would work for GLM models such as Y=A|X_{i}, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.
Can you think of other ways to use this trick? Leave a comment.