The post Principal component regression in SAS appeared first on The DO Loop.

Near collinearity among the explanatory variables in a regression model requires special handling because:

- The crossproduct matrix X`X is ill-conditioned (nearly singular), where X is the data matrix.
- The standard errors of the parameter estimates are very large. The variance inflation factor (VIF), which is computed by PROC REG, is one way to measure how collinearities inflate the variances of the parameter estimates.
- The model parameters are highly correlated, which makes interpretation of the parameters difficult.

Principal component regression keeps only the most important principal components and discards the others.
This means that you compute the principal components for the explanatory variables and drop the components that correspond to the smallest eigenvalues of X`X.
If you keep *k* principal components, then those components enable you to form a rank-*k* approximation to the crossproduct matrix.
If you regress the response variable onto those *k* components, you obtain a PCR. Usually the parameter estimates are expressed in terms of the original variables, rather than in terms of the principal components.
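In matrix terms, the construction above can be sketched as follows. The notation is introduced only for this illustration and is not part of any SAS output: $Z$ is the centered and scaled data matrix, $y$ is the centered response, $V_k$ holds the first $k$ eigenvectors, and $\Lambda_k$ holds the corresponding eigenvalues.

$$
\begin{aligned}
Z^{\mathsf T} Z &= V \Lambda V^{\mathsf T}, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0 \\
Z^{\mathsf T} Z &\approx V_k \Lambda_k V_k^{\mathsf T} \qquad \text{(rank-}k\text{ approximation)} \\
T_k &= Z V_k \qquad \text{(scores on the first } k \text{ components)} \\
\hat\gamma &= (T_k^{\mathsf T} T_k)^{-1} T_k^{\mathsf T} y = \Lambda_k^{-1} V_k^{\mathsf T} Z^{\mathsf T} y \\
\hat\beta_{\mathrm{PCR}} &= V_k \hat\gamma
\end{aligned}
$$

The vector $\hat\beta_{\mathrm{PCR}}$ is on the standardized scale; dividing each element by the standard deviation of the corresponding variable gives estimates for the raw variables. When $k = p$, the formula reduces to the ordinary least squares estimate $(Z^{\mathsf T} Z)^{-1} Z^{\mathsf T} y$.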

In SAS there are two easy ways to compute principal component regression:

- The PLS procedure supports the METHOD=PCR option to perform principal component regression. You can use the NFAC= option to specify the number of principal components to keep.
- The MODEL statement in PROC REG supports the PCOMIT= option. (This option is read as "PC omit.") The argument to the PCOMIT= option is the number of principal components to drop (omit) from the regression.

Notice that neither of these methods calls PROC PRINCOMP. You could call PROC PRINCOMP, but it would be more complicated than the previous methods. You would have to extract the first principal components (PCs), then use PROC REG to compute the regression coefficients for the PCs, then use matrix computations to convert the parameter estimates from the PCs to the original variables.

Principal component regression is also sometimes used for general dimension reduction. Instead of projecting the response variable onto a *p*-dimensional space of raw variables, PCR projects the response onto a *k*-dimensional space where *k* is less than *p*. For dimension reduction, you might want to consider another approach such as variable selection by using PROC GLMSELECT or PROC HPGENSELECT. The reason is that the PCR model retains all of the original variables whereas variable selection procedures result in models that have fewer variables.

I recommend using the PLS procedure to compute a principal component regression in SAS. As mentioned previously, you need to use the METHOD=PCR and NFAC= options. The following data for 31 men at a fitness center are from the documentation for PROC REG. The goal of the study is to predict oxygen consumption from age, weight, and various physiological measurements before and during exercise. The following call to PROC PLS computes a PCR that keeps four principal components:

    data fitness;
    input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
    datalines;
    44 89.47 44.609 11.37 62 178 182
    40 75.07 45.313 10.07 62 185 185
    44 85.84 54.297  8.65 45 156 168
    42 68.15 59.571  8.17 40 166 172
    38 89.02 49.874  9.22 55 178 180
    47 77.45 44.811 11.63 58 176 176
    40 75.98 45.681 11.95 70 176 180
    43 81.19 49.091 10.85 64 162 170
    44 81.42 39.442 13.08 63 174 176
    38 81.87 60.055  8.63 48 170 186
    44 73.03 50.541 10.13 45 168 168
    45 87.66 37.388 14.03 56 186 192
    45 66.45 44.754 11.12 51 176 176
    47 79.15 47.273 10.60 47 162 164
    54 83.12 51.855 10.33 50 166 170
    49 81.42 49.156  8.95 44 180 185
    51 69.63 40.836 10.95 57 168 172
    51 77.91 46.672 10.00 48 162 168
    48 91.63 46.774 10.25 48 162 164
    49 73.37 50.388 10.08 67 168 168
    57 73.37 39.407 12.63 58 174 176
    54 79.38 46.080 11.17 62 156 165
    52 76.32 45.441  9.63 48 164 166
    50 70.87 54.625  8.92 48 146 155
    51 67.25 45.118 11.08 48 172 172
    54 91.63 39.203 12.88 44 168 172
    51 73.71 45.790 10.47 59 186 188
    57 59.08 50.545  9.93 49 148 155
    49 76.32 48.673  9.40 56 186 188
    48 61.24 47.920 11.50 52 170 176
    52 82.78 47.467 10.50 53 170 172
    ;

    proc pls data=fitness method=PCR nfac=4;   /* PCR onto 4 factors */
       model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / solution;
    run;

The output includes the parameter estimates table, which gives the estimates for the four-component regression in terms of the original variables. Another table (not shown) shows that the first four principal components explain 93% of the variation in the explanatory variables and 78% of the variation in the response variable.
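For comparison, here is a minimal sketch of the first two steps of the PROC PRINCOMP route that was described earlier. It assumes that the `fitness` data set already exists; the data set name `PCScores` is made up for this illustration. The sketch stops before the matrix computations that convert the estimates back to the original variables, so the coefficients it reports are for the components `Prin1`-`Prin4`, not for the raw variables.

```sas
/* Step 1: extract the first four principal components.               */
/* By default, PROC PRINCOMP analyzes the correlation matrix, so the  */
/* variables are centered and scaled before the PCs are computed.     */
proc princomp data=fitness n=4 out=PCScores noprint;
   var Age Weight RunTime RunPulse RestPulse MaxPulse;
run;

/* Step 2: regress the response onto the component scores. */
proc reg data=PCScores plots=none;
   model Oxygen = Prin1-Prin4;
quit;
```

If both procedures standardize the explanatory variables in the same way, the R-square value from this regression should agree with the proportion of response variation that PROC PLS reports for four factors.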

For another example of using PROC PLS to combat collinearity, see Yu (2011), "Principal Component Regression as a Countermeasure against Collinearity."

I recommend PROC PLS for principal component regression, but you can also compute a PCR by using the PCOMIT= option on the MODEL statement in PROC REG. However, the parameter estimates are not displayed in any table but must be written to an OUTEST= data set, as follows:

    proc reg data=fitness plots=none outest=PE;   /* write PCR estimates to PE data set */
       model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
             / PCOmit=2;                          /* omit 2 PCs ==> keep 6-2=4 PCs */
    quit;

    proc print data=PE(where=(_Type_="IPC")) noobs;
       var Intercept--MaxPulse;
    run;

Notice that the PCOMIT=2 option specifies that two PCs should be dropped, which is equivalent to keeping four components in this six-variable model. The parameter estimates are written to the PE data set and are displayed by PROC PRINT. The estimates are the same as those found by PROC PLS. In the PE data, the PCR estimates are indicated by the value "IPC" for the _TYPE_ variable, which stands for *incomplete principal component* regression. The word "incomplete" indicates that not all the principal components are used.

It is worth noting that even though the principal components themselves are based on centered and scaled data, the parameter estimates are reported for the original (raw) variables. It is also worth noting that you can use the OUTSEB option on the PROC REG statement to obtain standard errors for the parameter estimates.
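As a sketch of the OUTSEB suggestion, the following call repeats the PROC REG analysis and adds the OUTSEB option. The data set name `PE2` is made up for this example. Rather than assume the exact _TYPE_ values that identify the standard-error rows, the sketch prints the whole data set so that you can inspect them.

```sas
/* Same model as before, but also write standard errors to the OUTEST= data set */
proc reg data=fitness plots=none outest=PE2 outseb;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / PCOmit=2;
quit;

/* Print every row; the _TYPE_ column distinguishes estimates from standard errors */
proc print data=PE2 noobs;
run;
```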

This article shows you how to perform principal component regression in SAS by using PROC PLS with METHOD=PCR. However, I must point out that there are statistical drawbacks to using principal component regression. The primary issue is that principal component regression does not use any information about the response variable when choosing the principal components. Before you decide to use PCR, I urge you to read my next post about the drawbacks with the technique. You can then make an informed decision about whether you want to use principal component regression for your data.


The post The diffogram and other graphs for multiple comparisons of means appeared first on The DO Loop.
The diffogram (also called a *mean-mean scatter diagram*) is automatically created when you use the PDIFF=ALL option on the LSMEANS statement in several SAS/STAT regression procedures such as PROC GLM and PROC GLIMMIX, assuming that you have enabled ODS graphics.
The documentation for PROC GLM contains an example that uses data about the chemical composition of shards of pottery from four archaeological sites in Great Britain. Researchers want to determine which sites have shards that are chemically similar. The data are contained in the `pottery` data set. The following call to PROC GLM performs an ANOVA of the calcium oxide in the pottery at the sites.
The PDIFF=ALL option requests an analysis of all pairwise comparisons between the LS-means of calcium oxide for the different sites. The ADJUST=TUKEY option is one way to adjust the confidence intervals for multiple comparisons.

    ods graphics on;
    proc glm data=pottery plots(only)=( diffplot(center)          /* diffogram */
                                        meanplot(cl ascending) ); /* plot of means and CIs */
       label Ca = "Calcium Oxide (%)";
       class Site;
       model Ca = Site;
       lsmeans Site / pdiff=all cl adjust=tukey;  /* all pairwise comparisons of means w/ adjusted CL */
       ods output LSMeanDiffCL=MeanDiff;          /* optional: save mean differences and CIs */
    quit;

Two graphs are requested: the diffogram (or "diffplot") and a "mean plot" that shows the group means and 95% confidence intervals. The ODS OUTPUT statement creates a data set from a table that contains the mean differences between pairs of groups, along with 95% confidence intervals for the differences. You can use that information to construct a plot of the mean differences, as shown later in this article.

The diffogram, which is shown to the right, is my favorite graph for multiple comparisons of means. Every diffogram displays a diagonal reference line that has unit slope. Horizontal and vertical reference lines are placed along the axes at the location of the means of the groups. For these data, there are four vertical and four horizontal reference lines. At the intersection of most reference lines there is a small colored line segment. These segments indicate which of the 4(4-1)/2 = 6 pairwise comparisons of means are significant.

Let's start at the top of the diffogram. The mean for the Caldecot site is about 0.3, as shown by a horizontal reference line near Y=0.3. On the reference line for Caldecot are three line segments that are centered at the mean values for the other groups, which are (from left to right) IslandThorns, AshleyRails, and Llanederyn. The first two line segments (in blue) do not intersect the dashed diagonal reference line, which indicates that the means for the (IslandThorns, Caldecot) and (AshleyRails, Caldecot) pairs are significantly different. The third line segment (in red) intersects the diagonal reference line, which indicates that the (Llanederyn, Caldecot) comparison is not significant.

Similarly, the mean for the Llanederyn site is about 0.2, as shown by a horizontal reference line near Y=0.2. On the reference line for Llanederyn are two line segments, which represent the (IslandThorns, Llanederyn) and (AshleyRails, Llanederyn) comparisons. Neither line segment intersects the diagonal reference line, which indicates that the means for those pairs are significantly different.

Lastly, the mean for the AshleyRails site is about 0.05, as shown by a horizontal reference line near Y=0.05. The line segment on that reference line represents the (IslandThorns, AshleyRails) comparison. It intersects the diagonal reference line, which indicates that those means are not significantly different.

The colors for the significant/insignificant pairs depend on the ODS style, but it is easy to remember the simple rule: if a line segment intersects the diagonal reference line, then the corresponding group means are not significantly different. A mnemonic is "Intersect? Insignificant!"

The previous section shows how to use the diffogram to visually determine which pairs of means are significantly different. This section reminds you that you should *not* try to use the mean plot (shown at the right) for making those inferences.

I have seen presentations in which the speaker erroneously claims that "the means of these groups are significantly different because their 95% confidence intervals do not overlap." **That is not a correct inference.**
In general, the overlap (or lack thereof) between two (1 – α)100% confidence intervals does not give sufficient information about whether the difference between the means is significant at the α level. In particular, you can construct examples where

- Two confidence intervals overlap, but the difference of means is significant. (See Figure 2 in High (2014).)
- Two confidence intervals do not overlap, but the difference of means is not significant.

The reason is twofold. First, the confidence intervals in the plot are constructed by using the sample sizes and standard deviations for each group, whereas tests for the difference between the means are constructed by using pooled standard deviations and sample sizes. Second, if you are making multiple comparisons, you need to adjust the widths of the intervals to accommodate the multiple (simultaneous) inferences.
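The first reason can be illustrated with a simplified case. Suppose two groups have the same sample size $n$ and the same sample standard deviation $s$, no multiplicity adjustment is applied, and (as an approximation for this sketch) the same critical value $t$ is used for the individual intervals and for the test of the difference. Then

$$
\text{intervals do not overlap} \iff |\bar y_1 - \bar y_2| > 2\,t\,\frac{s}{\sqrt{n}},
\qquad
\text{difference is significant} \iff |\bar y_1 - \bar y_2| > \sqrt{2}\,t\,\frac{s}{\sqrt{n}}
$$

The factor $\sqrt{2}$ appears because the standard error of the difference is $s\sqrt{1/n + 1/n}$. Any observed difference between $\sqrt{2}\,t\,s/\sqrt{n}$ and $2\,t\,s/\sqrt{n}$ is therefore significant even though the two intervals overlap. An adjustment for multiple comparisons enlarges the critical value for the test, which is how the opposite situation can arise.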

"But Rick," you might say, "in the mean plot for these data, the Llanederyn and Caldecot confidence intervals overlap. The intervals for IslandThorns and AshleyRails also overlap. And these are exactly the two pairs that are not significantly different, as shown by the diffogram!" Yes, that is true *for these data*, but it is not true in general. Use the diffogram, not the means plot, to visualize multiple comparisons of means.

In addition to the diffogram, you can visualize comparisons of means by plotting the confidence intervals for the pairwise mean differences. Those intervals that contain 0 represent insignificant differences. Recall that the call to PROC GLM included an ODS OUTPUT statement that created a data set (`MeanDiff`) that contains the mean differences. The following DATA step constructs labels for each pair and computes whether each pairwise difference is significant:

    data Intervals;
    length Pair $28.;
    length S $16.;
    set MeanDiff;
    /* The next line is data dependent. For class variable 'C', concatenate C and _C variables */
    Pair = catx(' - ', Site, _Site);
    Significant = (0 < LowerCL | UpperCL < 0);    /* 1 if the interval does not contain 0 */
    S = ifc(Significant, "Significant", "Not significant");
    run;

    title "Pairwise Difference of LSMeans (Tukey Adjustment)";
    title2 "95% Confidence Intervals of Mean Difference";
    footnote J=L "Pairs Whose Intervals Contain 0 Are Not Significantly Different";
    proc sgplot data=Intervals;
       scatter y=Pair x=Difference / group=S name="CIs"
               xerrorlower=LowerCL xerrorupper=UpperCL;
       refline 0 / axis=x;
       yaxis reverse colorbands=odd display=(nolabel)
             offsetmin=0.06 offsetmax=0.06;
       keylegend "CIs" / sortorder=ascending;
    run;

The resulting graph is shown to the right. For each pair of groups, the graph shows an estimate for the difference of means and the Tukey-adjusted 95% confidence intervals for the difference. Intervals that contain 0 indicate that the difference of means is not significant. Intervals that do not contain 0 indicate significant differences.

Although the diffogram and the difference-of-means plot provide the same information, I prefer the diffogram because it shows the values for the means rather than for the mean differences. Furthermore, the height of the difference-of-means plot needs to be increased as more groups are added (there are *k*(*k*-1)/2 rows for *k* groups), whereas the diffogram can accommodate a moderate number of groups without being rescaled. On the other hand, it can be difficult to read the reference lines in the diffogram when there are many groups, especially if some of the groups have similar means.

For more about multiple comparisons tests and how to visualize the differences between means, see the following references:

- High, R. (2014) "Plotting Differences among LSMEANS in Generalized Linear Models," *Proceedings of the SAS Global Forum 2014 Conference*.
- Hsu, J. (1996) *Multiple Comparisons: Theory and Methods*.
- Westfall, Tobias, and Wolfinger (2011) *Multiple Comparisons and Multiple Tests Using SAS*, Second Edition.


The post Graphs for multiple comparisons of means: The lines plot appeared first on The DO Loop.

Warren's focus was on the plot itself, with an emphasis on how to create it. However, the plot is also interesting for the statistical information it provides. This article discusses how to interpret the lines plot in a multiple comparisons of means analysis.

You can use the LINES option in the LSMEANS statement to request a lines plot in SAS 9.4M5.
The following data are taken from *Multiple Comparisons and Multiple Tests* (pp. 42-53 of the First Edition).
Researchers are studying the effectiveness of five weight-loss diets, denoted by A, B, C, D, and E. Ten male subjects are randomly assigned to each diet. After a fixed length of time, the weight loss of each subject is recorded, as follows:

    /* Data and programs from _Multiple Comparisons and Multiple Tests_
       Westfall, Tobias, Rom, Wolfinger, and Hochberg (1999, First Edition) */
    data wloss;
    do diet = 'A','B','C','D','E';
       do i = 1 to 10;
          input WeightLoss @@;
          output;
       end;
    end;
    datalines;
    12.4 10.7 11.9 11.0 12.4 12.3 13.0 12.5 11.2 13.1
     9.1 11.5 11.3  9.7 13.2 10.7 10.6 11.3 11.1 11.7
     8.5 11.6 10.2 10.9  9.0  9.6  9.9 11.3 10.5 11.2
     8.7  9.3  8.2  8.3  9.0  9.4  9.2 12.2  8.5  9.9
    12.7 13.2 11.8 11.9 12.2 11.2 13.7 11.8 11.5 11.7
    ;

You can use PROC GLM to perform a balanced one-way analysis of variance and use the MEANS or LSMEANS statement to request pairwise comparisons of means among the five diet groups:

    proc glm data=wloss;
       class diet;
       model WeightLoss = diet;
       *means diet / tukey LINES;
       lsmeans diet / pdiff=all adjust=tukey LINES;
    quit;

In general, I use the LSMEANS statement rather than the MEANS statement because LS-means are more versatile and handle unbalanced data. (More about this in a later section.) The PDIFF=ALL option requests an analysis of all pairwise comparisons between the LS-means of weight loss for the different diets. The ADJUST=TUKEY option is a common way to adjust the widths of confidence intervals to accommodate the multiple comparisons. The analysis produces several graphs and tables, including the following lines plot.

In the lines plot, the vertical lines visually connect groups whose LS-means are "statistically indistinguishable." Two means are statistically indistinguishable when their pairwise difference is not statistically significant.

If you have *k* groups, there are *k*(*k*-1)/2 pairwise differences that you can examine. The lines plot attempts to summarize those comparisons by connecting groups whose means are statistically indistinguishable. Often there are fewer lines than pairwise comparisons, so the lines plot displays a summary of which groups have similar means.

In the previous analysis, there are five groups, so there are 10 pairwise comparisons of means. The lines plot summarizes the results by using three vertical lines. The leftmost line (blue) indicates that the means of the 'B' and 'C' groups are statistically indistinguishable (they are not significantly different). Similarly, the upper right vertical bar (red) indicates that the means of the pairs ('E','A'), ('E','B'), and ('A','B') are not significantly different from each other. Lastly, the lower right vertical bar (green) indicates that the means for groups 'C' and 'D' are not significantly different. Thus in total, the lines plot indicates that five pairs of means are not significantly different.

The remaining pairs of mean differences (for example, 'E' and 'D') are significantly different. By using only three vertical lines, the lines plot visually associates pairs of means that are essentially the same. Those pairs that are not connected by a vertical line are significantly different.
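One way to cross-check the lines plot is to tabulate the Tukey-adjusted confidence intervals for all pairwise differences and flag the intervals that contain 0. The following sketch assumes that the `wloss` data set exists and that the CL option produces an ODS table named LSMeanDiffCL that contains the `LowerCL` and `UpperCL` columns, as in PROC GLM analyses of this kind. The data set names `DietDiff` and `DietPairs` are made up for the example.

```sas
proc glm data=wloss plots=none;
   class diet;
   model WeightLoss = diet;
   lsmeans diet / pdiff=all adjust=tukey cl;   /* CL requests intervals for the differences */
   ods output LSMeanDiffCL=DietDiff;           /* save the pairwise differences and CIs */
quit;

data DietPairs;
   set DietDiff;
   NotSignificant = (LowerCL <= 0 & 0 <= UpperCL);  /* 1 if the interval contains 0 */
run;

proc freq data=DietPairs;
   tables NotSignificant;   /* how many of the pairs are not significantly different? */
run;
```

If the table agrees with the lines plot, the pairs that are flagged as not significant are the pairs that are connected by a vertical line.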

Advantages of the lines plot include the following:

- The groups are ordered according to the observed means of the groups.
- The number of vertical lines is often much smaller than the number of pairwise comparisons between groups.

Notice that the groups in this example are the same size (10 subjects). When the group sizes are equal (the so-called "balanced ANOVA" case), the lines plot can always correctly represent the relationships between the group means. However, that is not always true for unbalanced data. Westfall et al. (1999, p. 69) provide an example in which using the LINES option on the MEANS statement produces a misleading plot.

The situation is less severe when you use the LSMEANS statement, but for unbalanced data it is sometimes impossible for the lines plot to accurately connect all groups that have insignificant mean differences. In those cases, SAS appends a footnote to the plot that alerts you to the situation and lists the additional significances not represented by the plot.

In my next blog post, I will show some alternative graphical displays that are appropriate for multiple comparisons of means for unbalanced groups.

In summary, the new lines plot in SAS/STAT software is a graphical version of an analysis that has been in SAS for decades. You can create the plot by using the LINES option in the LSMEANS statement. The lines plot indicates which groups have mean differences that are not significant. For balanced data (or nearly balanced), it does a good job of summarizing which differences of means are not significant. For highly unbalanced data, there are other graphs that you can use. Those graphs will be discussed in a future article.

- Westfall, Tobias, and Wolfinger (2011) *Multiple Comparisons and Multiple Tests Using SAS*, Second Edition.


The post Simulate correlations by using the Wishart distribution appeared first on The DO Loop.

- Simulate *B* samples of size *N* from a bivariate normal distribution with correlation ρ.
- Use PROC CORR to compute the sample correlation matrix for each of the *B* samples.
- Use the DATA step to extract the off-diagonal elements from the correlation matrices.

After the three steps, you obtain a distribution of *B* sample correlation coefficients that approximates the sampling distribution of the Pearson correlation coefficient for bivariate normal data.
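The three steps can be sketched as follows. This is an illustrative version, not necessarily the program that was used for the original simulation: it generates the bivariate normal data in a DATA step, and the data set names (`BivNorm`, `CorrOut`, `Corr3Step`) and variable names are made up for the example.

```sas
%let rho = 0.8;            /* correlation for bivariate normal distribution */
%let N = 20;               /* sample size */
%let NumSamples = 2500;    /* number of simulated samples */

/* Step 1: simulate B samples of size N from a bivariate normal distribution */
data BivNorm;
call streaminit(12345);
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      z1 = rand("Normal");
      z2 = rand("Normal");
      x = z1;
      y = &rho*z1 + sqrt(1 - &rho**2)*z2;   /* corr(x,y) = rho */
      output;
   end;
end;
keep SampleID x y;
run;

/* Step 2: compute the sample correlation matrix for each sample */
proc corr data=BivNorm noprint outp=CorrOut;
   by SampleID;
   var x y;
run;

/* Step 3: extract the off-diagonal element from each correlation matrix */
data Corr3Step;
   set CorrOut(where=(_TYPE_="CORR" and _NAME_="x"));
   Corr = y;               /* in the row for x, the y column holds corr(x,y) */
   keep SampleID Corr;
run;
```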

There is a simpler way to simulate the correlation estimates: You can *directly* simulate from the Wishart distribution. Each draw from the Wishart distribution with *N* - 1 degrees of freedom is, up to a scale factor, a sample covariance matrix for a multivariate normal sample of size *N*. A correlation matrix does not depend on that scale factor, so if you convert the draw to a correlation matrix, you can immediately extract the off-diagonal element, as shown in the following SAS/IML statements:

    %let rho = 0.8;            /* correlation for bivariate normal distribution */
    %let N = 20;               /* sample size */
    %let NumSamples = 2500;    /* number of simulated samples */

    /* generate sample correlation coefficients by using Wishart distribution */
    proc iml;
    call randseed(12345);
    NumSamples = &NumSamples;
    DF = &N - 1;               /* X ~ N obs from MVN(0, Sigma) */
    Sigma = {1    &rho,        /* covariance for MVN samples */
             &rho 1   };
    S = RandWishart(NumSamples, DF, Sigma);   /* each row is 2x2 matrix */
    Corr = j(NumSamples, 1);   /* allocate vector for correlation estimates */
    do i = 1 to nrow(S);       /* convert to correlation; extract off-diagonal */
       Corr[i] = cov2corr( shape(S[i,], 2, 2) )[1,2];
    end;

You can create a comparative histogram of the sample correlation coefficients. In the following graph, the histogram at the top of the panel displays the distribution of the simulated correlation coefficients from the three-step method. The bottom histogram displays the distribution of correlation coefficients that are generated from the Wishart distribution.

Visually, the histograms appear to be similar. You can use PROC NPAR1WAY to run various hypothesis tests that compare the distributions; none of the tests rejects the hypothesis that the two distributions are the same.

If you'd like to see the complete analysis, you can download the SAS program that runs both simulations and compares the resulting distributions.

Although the Wishart distribution is more efficient for this simulation, recall that the Wishart distribution *assumes that the underlying data distribution is multivariate normal*. In contrast, the three-step simulation is more general. It can be used to generate correlation coefficients for *any* data distribution. So although the three-step simulation is not necessary for multivariate normal data, it is still an important technique to store in your simulation toolbox.


The post Order correlations by magnitude appeared first on The DO Loop.

Neither graph addresses a related problem: for each variable, which other variables are strongly correlated with it? Which are weakly correlated? In SAS, the CORR procedure supports a little-known option that answers that question. You can use the RANK option in the PROC CORR statement to order the correlations for each variable (independently) according to the magnitude of the correlations.

Notice that the RANK option does *not* compute "rank correlation." You can compute rank correlation by using the SPEARMAN option.
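For contrast, the following sketch requests Spearman rank correlations for several variables in the Sashelp.Heart data set. The OUTS= option writes the Spearman correlations to a data set; the name `SpearmanCorr` is made up for this example.

```sas
/* SPEARMAN computes rank correlations; it does not reorder the output */
proc corr data=sashelp.Heart spearman noprob outs=SpearmanCorr;
   var Height Weight MRW Smoking Diastolic Systolic Cholesterol;
run;
```

The SPEARMAN option changes which statistic is computed, whereas the RANK option changes only the order in which the correlations are displayed.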

Consider the Sashelp.Heart data set, which contains data for 5209 patients who enrolled in the Framingham Heart Study. You might want to know which variables are highly correlated with weight, or smoking, or blood pressure, to name a few examples. The following call to PROC CORR uses the RANK option to order each row of correlations in the output:

    /* Note: The MRW variable is similar to the body-mass index (BMI). */
    proc corr data=sashelp.Heart RANK noprob;
       var Height Weight MRW Smoking Diastolic Systolic Cholesterol;
    run;

The output orders each row according to the magnitude of the correlations. For example, look at the row for the Weight variable, which is highlighted by a red rectangle. Scanning across the row, you can see that the variables that are the most strongly correlated with Weight are MRW (which measures whether a patient is overweight) and Height. At the end of the row are the variables that are essentially uncorrelated with Weight, namely Smoking and Cholesterol. The numbers at the bottom of each cell indicate the number of nonmissing pairwise observations.

In a similar way, look at the row for the Smoking variable. That variable is most strongly correlated with Height and MRW. Notice that the correlation with MRW is negative, which shows that the correlations are ordered by absolute values (magnitude). The highly correlated variables, whether positively or negatively correlated, appear first and the uncorrelated variables (correlations near zero) occur last in each row.

It is common to want to examine the correlations between groups of variables. For example, a clinician might want to look at the correlations between clinical measurements (blood pressure, cholesterol,...) and genetic or lifestyle choices (weight, smoking habits,...). The following call to PROC CORR uses the VAR and WITH statements to compare groups of variables, and uses the RANK option to order the correlations along each row:

    proc corr data=sashelp.Heart RANK noprob;
       var Height Weight MRW Smoking;         /* genetic and lifestyle factors */
       with Diastolic Systolic Cholesterol;   /* clinical measurements */
    run;

Notice that the variables in the VAR statement determine the columns. The variables in the WITH statement determine the rows. Within each row, the variables in the VAR statement are ordered by the magnitude of the correlation. For these data, the Diastolic and Systolic variables are similar with respect to how they correlate with the column variables. The order of the column variables is the same for the first two rows. In contrast, the correlations between the Cholesterol variable and the column variables appear in a different order.

In summary, you can use the RANK option in the PROC CORR statement to order the rows of a correlation matrix according to the magnitude (absolute value) of the correlations between each variable and the others. This makes it easy to find pairs of variables that are strongly correlated and pairs that are weakly correlated.


The post Create and interpret a weighted histogram appeared first on The DO Loop.
Before constructing a weighted histogram, let's review the construction of an unweighted histogram.
A histogram requires that you specify a set of evenly spaced *bins* that cover the range of the data.
An unweighted histogram of frequencies is constructed by counting the number of observations that are in each bin. Because counts are dependent on the sample size, *n*, histograms often display the proportion (or percentage) of values in each bin. The proportions are the counts divided by *n*.
On the proportion scale, the height of each bin is the sum of the quantity 1/*n*, where the sum is taken over all observations in the bin.

That fact is important because it reveals that the unweighted histogram is a special case of the weighted histogram. An unweighted histogram is equivalent to a weighted histogram in which each observation receives a unit weight. Therefore the quantity 1/*n* is the standardized weight of each observation: the weight divided by the sum of the weights. The formula is the same for non-unit weights: the height of each bin is the sum of the quantity $w_i / \sum_{j} w_j$, where the sum in the numerator is taken over all observations in the bin and the sum in the denominator is taken over all observations. That is, you add up all the standardized weights in each bin to produce the bin height.
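Written as a formula (the symbols $B_k$ for the $k$th bin and $h_k$ for its height are notation introduced here for illustration), the height of each bin on the proportion scale is

$$
h_k = \frac{\sum_{i:\, x_i \in B_k} w_i}{\sum_{j=1}^{n} w_j},
\qquad \sum_k h_k = 1 .
$$

When every weight is $w_i = 1$, the numerator is the count $n_k$ of observations in the bin and the denominator is $n$, so $h_k = n_k / n$, which is the usual unweighted proportion.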

The SAS documentation for the WEIGHT statement includes the following example. Twenty subjects estimate the diameter of an object that is 30 cm across. Some people are placed closer to the object than others. The researcher believes that the precision of the estimate is inversely proportional to the distance from the object. Therefore the researcher weights each subject's estimate by using the inverse distance.

The following DATA step creates the data, and PROC SGPLOT creates a weighted histogram of the data by using the WEIGHT= option on the HISTOGRAM statement. (The WEIGHT= option was added in SAS 9.4M1.)

    data Size;
    input Distance ObjectSize @@;
    Wt = 1 / distance;        /* precision */
    x = ObjectSize;
    label x = "Estimate of Size";
    datalines;
    1.5 30   1.5 20   1.5 30   1.5 25
    3   43   3   33   3   25   3   30
    4.5 25   4.5 36   4.5 48   4.5 33
    6   43   6   36   6   23   6   48
    7.5 30   7.5 25   7.5 50   7.5 38
    ;

    title "Weighted Histogram of Size Estimate";
    proc sgplot data=size noautolegend;
       histogram x / WEIGHT=Wt scale=proportion datalabel binwidth=5;
       fringe x / lineattrs=(thickness=2 color=black) transparency=0.6;
       yaxis grid offsetmin=0.05 label="Weighted Proportion";
       refline 30 / axis=x lineattrs=(pattern=dash);
    run;

The weighted histogram is shown to the right. The data values are shown in the fringe plot beneath the histogram. The height of each bin is the sum of the standardized weights of the observations in that bin. The dashed line represents the true diameter of the object. Most estimates are clustered around the true value, except for a small cluster of larger estimates. Notice that I use the SCALE=PROPORTION option to plot the weighted proportion of observations in each bin, although the default behavior (SCALE=PERCENT) would also be acceptable.

If you remove the WEIGHT= option and study the unweighted graph, you will see that the average estimate for the unweighted distribution (33.6) is not as close to the true diameter as the weighted estimate (31.1). Furthermore, the weighted standard deviation is about half the unweighted standard deviation, which shows that the weighted distribution of these data has less variance than the unweighted distribution.
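You can compute the unweighted and weighted statistics yourself. The following sketch uses PROC MEANS with and without a WEIGHT statement and assumes that the `Size` data set exists. The output data set name `WtStats` and its variable names are made up for this example. Be aware that the weighted standard deviation depends on the VARDEF= option; the sketch uses the default.

```sas
/* unweighted statistics */
proc means data=Size n mean std;
   var x;
run;

/* weighted statistics: each estimate is weighted by Wt = 1/Distance */
proc means data=Size n mean std;
   var x;
   weight Wt;
   output out=WtStats mean=WtMean std=WtStd;   /* also save the weighted statistics */
run;
```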

By the way, although PROC UNIVARIATE can produce weighted statistics, it does not create weighted graphics as of SAS 9.4M5. One reason is that the graphics statements (CDFPLOT, HISTOGRAM, QQPLOT, etc.) not only create graphs but also fit distributions and produce goodness-of-fit statistics, and those analyses do not support weight variables.

Although a weighted histogram is not conceptually complex, I understand a computation better when I program it myself. You can write a SAS program that computes a weighted histogram by using the following algorithm:

- Construct the bins. For this example, there are seven bins of width 5 (defined by eight endpoints), and the first bin starts at x=17.5. (It is centered at x=20.) Initialize all bin heights to zero.
- For each observation, find the bin that contains it. Increment the bin height by the weight of that observation.
- Standardize the heights by dividing by the sum of weights. You can skip this step if the weights sum to unity.

A SAS/IML implementation of this algorithm requires only a few lines of code. A DATA step implementation that uses arrays is longer, but probably looks more familiar to many SAS programmers:

    data BinHeights(keep=height:);
    array EndPt[8] _temporary_;
    binStart = 17.5;  binWidth = 5;     /* anchor and width for bins */
    do i = 1 to dim(EndPt);             /* define endpoints of bins */
       EndPt[i] = binStart + (i-1)*binWidth;
    end;
    array height[7];                    /* height of each bin */

    set Size end=eof;                   /* for each observation ... */
    sumWt + Wt;                         /* compute sum of weights */
    Found = 0;
    do i = 1 to dim(EndPt)-1 while (^Found);   /* find bin for each obs */
       Found = (EndPt[i] <= x < EndPt[i+1]);
       if Found then height[i] + Wt;    /* increment bin height by weight */
    end;
    if eof then do;
       do i = 1 to dim(height);         /* scale heights by sum of weights */
          height[i] = height[i] / sumWt;
       end;
       output;
    end;
    run;

    proc print noobs data=BinHeights;
    run;

The computations from the DATA step match the data labels that appear on the weighted histogram in PROC SGPLOT.
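For comparison, here is a minimal SAS/IML sketch of the same algorithm. It assumes the Size data set from earlier and the same seven bins. Instead of searching for the bin, it computes the bin index directly from the anchor and the bin width. This is one possible implementation and is not necessarily the one in the downloadable program.

```sas
proc iml;
use Size;  read all var {x Wt};  close Size;

binStart = 17.5;  binWidth = 5;  numBins = 7;   /* same bins as the DATA step */

/* bin index for each observation: bin k covers                        */
/* [binStart + (k-1)*binWidth, binStart + k*binWidth)                  */
bin = floor((x - binStart) / binWidth) + 1;

height = j(1, numBins, 0);             /* initialize bin heights to zero */
do i = 1 to nrow(x);
   height[bin[i]] = height[bin[i]] + Wt[i];   /* add weight to the bin */
end;
height = height / sum(Wt);             /* standardize by sum of weights */
print height[format=6.4];
quit;
```

The direct index computation works only for equal-width bins and assumes every observation falls inside the binned range, which is true for these data.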

In SAS, the HISTOGRAM statement in PROC SGPLOT supports the WEIGHT= option, which enables you to create a weighted histogram. A weighted histogram shows the weighted distribution of the data. If the histogram displays proportions (rather than raw counts), then the heights of the bars are the sum of the standardized weights of the observations within each bin. You can download the SAS program that computes the quantities in this article.

How can you interpret a weighted histogram? That depends on the meaning of the weight variables. For survey data and sampling weights, the weighted histogram estimates the distribution of a quantity in the population. For inverse variance weights (such as were used in this article), the weighted histogram overweights precise measurements and underweights imprecise measurements. When the weights are correct, the weighted histogram is a better estimate of the density of the underlying population and the weighted statistics (mean, variance, quantiles,...) are better estimates of the corresponding population quantities.

Have you ever plotted a weighted histogram? What was the context? Leave a comment.

The post Create and interpret a weighted histogram appeared first on The DO Loop.

The post How to understand weight variables in statistical analyses appeared first on The DO Loop.

How can you specify weights for a statistical analysis? Hmmm, that's a "weighty" question! Many people on discussion forums ask "What is a weight variable?" and "How do you choose a weight for each observation?" This article gives a brief overview of weight variables in statistics and includes examples of how weights are used in SAS.

One source of confusion is that different areas of statistics use weights in different ways. All weights are not created equal! The weights in survey statistics have a different interpretation from the weights in a weighted least squares regression.

Let's start with a basic definition.
A *weight variable* provides a value (the *weight*) for each observation in a data set.
The *i*_th weight value, *w*_{i}, is the weight for the *i*_th observation.
For most applications, a valid weight is nonnegative. A zero weight usually means that you want to exclude the observation from the analysis. Observations that have relatively large weights have more influence in the analysis than observations that have smaller weights. An unweighted analysis is the same as a weighted analysis in which all weights are 1.

There are several kinds of weight variables in statistics. At the 2007 Joint Statistical Meetings in Denver, I discussed weighted statistical graphics for two kinds of statistical weights: survey weights and regression weights. An audience member informed me that STATA software provides four definitions of weight variables, as follows:

- **Frequency weights:** A frequency variable specifies that each observation is repeated multiple times. Each frequency value is a nonnegative integer.
- **Survey weights:** Survey weights (also called *sampling weights* or *probability weights*) indicate that an observation in a survey represents a certain number of people in a finite population. Survey weights are often the reciprocals of the selection probabilities for the survey design.
- **Analytical weights:** An analytical weight (sometimes called an *inverse variance weight* or a *regression weight*) specifies that the *i*_th observation comes from a sub-population with variance σ^{2}/*w*_{i}, where σ^{2} is a common variance and *w*_{i} is the weight of the *i*_th observation. These weights are used in multivariate statistics and in meta-analyses where each "observation" is actually the mean of a sample.
- **Importance weights:** According to a STATA developer, an "importance weight" is a STATA-specific term that is intended "for programmers, not data analysts." The developer says that the formulas "may have no statistical validity" but can be useful as a programming convenience. Although I have never used STATA, I imagine that a primary use is to downweight the influence of outliers. The REWEIGHT statement in PROC REG served a similar purpose in the years before robust regression methods were implemented in SAS.

I have previously argued that **a frequency variable is not a weight variable**. I provided an example that shows the distinction between a frequency variable and a weight variable in regression. Briefly,
a frequency variable is a notational convenience that enables you to compactly represent the data. A frequency variable determines the sample size (and the degrees of freedom), but using a frequency variable is always equivalent to "expanding" the data set. (To expand the data, create *f*_{i} identical observations when the *i*_th value of the frequency variable is *f*_{i}.) An analysis of the expanded data is identical to the same analysis on the original data that uses a frequency variable.
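The equivalence between a frequency variable and an expanded data set can be checked with a small example. The following sketch uses made-up data; the data set names Compact and Expanded are hypothetical and chosen only for illustration.

```sas
/* compact form: each row has a value and a frequency */
data Compact;
   input x f;
   datalines;
1 2
2 3
5 1
;

/* expanded form: write f identical copies of each row */
data Expanded;
   set Compact;
   do j = 1 to f;
      output;
   end;
   drop j f;
run;

/* analysis of the compact data with a FREQ statement ... */
proc means data=Compact N Mean StdDev;
   freq f;
   var x;
run;

/* ... should match the same analysis of the expanded data */
proc means data=Expanded N Mean StdDev;
   var x;
run;
```

Both calls should report the same sample size (the sum of the frequencies), mean, and standard deviation.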

In SAS, the FREQ statement enables you to specify a frequency variable in most procedures. Ironically, in PROC FREQ you use the WEIGHT statement to specify frequencies. Because weights can be non-integer, the WEIGHT statement enables you to analyze tables that contain expected counts, percentages, and other non-integer values.
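As a small illustration, the following sketch builds a two-way table from cell values that are not integers. The data set CellCounts and its values are invented for this example.

```sas
/* each row is one cell of a 2 x 2 table; Count need not be an integer */
data CellCounts;
   input Row $ Col $ Count;
   datalines;
A X 10.5
A Y 4
B X 7
B Y 12.5
;

proc freq data=CellCounts;
   weight Count;          /* cell values supplied through the WEIGHT statement */
   tables Row*Col;
run;
```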

If you have survey data, you should analyze it by using survey weights. The sum of the survey weights equals the population size. Using survey weights enables you to make correct inferences about the finite population that is represented by the survey.

In SAS, you can use the SAS SURVEY procedures to analyze survey data. The SURVEY procedures (including SURVEYMEANS, SURVEYFREQ, and SURVEYREG) also support stratified samples and strata weights.

Inverse variance weights are appropriate for regression and other multivariate analyses. When you include a weight variable in a multivariate analysis, the crossproduct matrix is computed as X`WX, where W is the diagonal matrix of weights and X is the data matrix (possibly centered or standardized). In these analyses, the weight of an observation is assumed to be inversely proportional to the variance of the subpopulation from which that observation was sampled. You can "manually" reproduce a lot of formulas for weighted multivariate statistics by multiplying each row of the data matrix (and the response vector) by the square root of the appropriate weight.

In particular, if you use a weight variable in a regression procedure, you get a weighted regression analysis. For regression, the right side of the normal equations is X`WY.
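The "manual" approach can be sketched as follows for a simple linear regression. The data are made up, and the names Have, sw, one, xs, and ys are hypothetical. The first PROC REG call uses a WEIGHT statement. The second call fits an unweighted, no-intercept model to data in which the response, the regressor, and the intercept column have each been multiplied by the square root of the weight.

```sas
data Have;
   input x y w;
   sw  = sqrt(w);      /* square root of the weight */
   one = sw;           /* scaled intercept column   */
   xs  = sw * x;       /* scaled regressor          */
   ys  = sw * y;       /* scaled response           */
   datalines;
1  2.1 1
2  3.9 2
3  6.2 1
4  7.8 0.5
5 10.1 2
;

/* weighted least squares: solves (X`WX) b = X`WY */
proc reg data=Have plots=none;
   weight w;
   model y = x;
run; quit;

/* ordinary least squares on the scaled data, no automatic intercept */
proc reg data=Have plots=none;
   model ys = one xs / noint;
run; quit;
```

The coefficient of `one` in the second model plays the role of the intercept in the first, and the two sets of parameter estimates should agree.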

You can also use weights to analyze a set of means, such as you might encounter in meta-analysis or an analysis of means. The weight that you specify for the *i*_th mean should be inversely proportional to the variance of the *i*_th sample mean. Equivalently, when the groups share a common variance, the weight for the *i*_th group is (approximately) proportional to the sample size of the *i*_th group.

In SAS, most regression procedures support WEIGHT statements. For example, PROC REG performs a weighted least squares regression. The multivariate analysis procedures (DISCRIM, FACTOR, PRINCOMP, ...) use weights to form a weighted covariance or correlation matrix. You can use PROC GLM to compute a meta-analysis of data that are the means from previous studies.

Analysts can (and do!) create weights arbitrarily based on "gut feelings." You might say, "I don't trust the value of this observation, so I'm going to downweight it." Suppose you assign Observation 1 twice as much weight as Observation 2 because you feel that Observation 1 is twice as "trustworthy." How does a multivariate procedure interpret those weights?

In statistics, precision is the inverse of the variance. When you use those weights you are implicitly stating that you believe that Observation 2 is from a population whose variance is twice as large as the population variance for Observation 1. In other words, "less trust" means that you have less faith in the precision of the measurement for Observation 2 and more faith in the precision of Observation 1.

In SAS, many procedures support a WEIGHT statement. The documentation for the procedure describes how the procedure incorporates weights. In addition to the previously mentioned procedures, many Base SAS procedures compute weighted descriptive statistics. For some examples of weighted statistical analyses in SAS and how to interpret the results, see the following articles:

- How to compute and interpret a weighted mean
- How to compute and interpret weighted quantiles or weighted percentiles
- How to compute and visualize a weighted linear regression
- Create and interpret a weighted histogram

The post Data-driven simulation appeared first on The DO Loop.

In a large simulation study, it can be convenient to have a "control file" that contains the parameters for the study. My recent article about how to simulate multivariate normal clusters demonstrates a simple example of this technique. The simulation in that article uses an input data set that contains the parameters (mean, standard deviations, and correlations) for the simulation. A SAS procedure (PROC SIMNORMAL) simulates data based on the parameters in the input data set.

This is a powerful paradigm. Instead of hard-coding the parameters in the program (or as macro variables), the parameters are stored in a data set that is processed by the program. This is sometimes called *data-driven programming*. (Some people call it *dynamic programming*, but there is an optimization technique of the same name so I will use the term "data-driven.") In a data-driven program, when you want to run the program with new parameters, you merely modify the data set that contains the control parameters.

I have previously written about a different way to control a batch program by passing in parameters on the command line when you invoke the SAS program.

Before looking at data-driven programming, let's review the static approach. I will simulate clusters of univariate normal data as an example.

Suppose that you want to simulate normal data for three different groups. Each group has its own sample size (N), mean, and standard deviation. In my book *Simulating Data with SAS* (p. 206), I show how to simulate this sort of ANOVA design by using arrays, as follows.

    /* Static simulation: Parameters embedded in the simulation program */
    data AnovaStatic;
    /* define parameters for three simulated groups */
    array N[3]      _temporary_ (50, 50, 50);         /* sample sizes */
    array Mean[3]   _temporary_ (14.6, 42.6, 55.5);   /* center for each group */
    array StdDev[3] _temporary_ ( 1.7,  4.7,  5.5);   /* spread for each group */

    call streaminit(12345);
    do k = 1 to dim(N);              /* for each group */
       do i = 1 to N[k];             /* simulate N[k] observations */
          x = rand("Normal", Mean[k], StdDev[k]);   /* from k_th normal distribution */
          output;
       end;
    end;
    run;

The DATA step contains two loops, one for the groups and the other for the observations within each group. The parameters for each group are stored in arrays. Notice that if you want to change the parameters (including the number of groups), you need to edit the program. I call this method "static programming" because the behavior of the program is determined at the time that the program is written. This is a perfectly acceptable method for most applications. It has the advantage that you know exactly what the program will do by looking at the program.

An alternative is to put the parameters for each group into a file or data set. If the *k*_th row in the data set contains the parameters for the *k*_th group, then the implicit loop in the DATA step will iterate over all groups, regardless of the number of groups.
The following DATA step creates the parameters for three groups, which are read and processed by the second DATA step. The parameter values are the same as for the static example, but are transposed and processed row-by-row instead of via arrays:

    /* Data-driven simulation: Parameters in a data set, processed by the simulation program */
    data params;                     /* define parameters for each simulated group */
    input N Mean StdDev;
    datalines;
    50 14.6 1.7
    50 42.6 4.7
    50 55.5 5.5
    ;

    data AnovaDynamic;
    call streaminit(12345);
    set params;                      /* implicit loop over groups k=1,2,... */
    do i = 1 to N;                   /* simulate N[k] observations */
       x = rand("Normal", Mean, StdDev);   /* from k_th normal distribution */
       output;
    end;
    run;

Notice the difference between the static and dynamic techniques. The static technique simulates data from three groups whose parameters are specified in temporary arrays. The dynamic technique simulates data from an arbitrary number of groups. Currently, the PARAMS data specifies three groups, but if I change the PARAMS data set to represent 10 or 1000 groups, the AnovaDynamic DATA step will simulate data from the new design without any modification.

The data-driven technique is useful when the parameters are themselves the results of an analysis. For example, a common simulation technique is to compute the moments of real data (mean, variance, skewness, and so forth) and to use those statistics in place of the population parameters that they estimate. (See Chapter 16, "Moment Matching," in *Simulating Data with SAS*.)

The following call to PROC MEANS generates the sample mean and standard deviation for real data and writes those values to a data set:

    proc means data=sashelp.iris N Mean StdDev stackods;
       class Species;
       var PetalLength;
       ods output Summary=params;
    run;

The ODS OUTPUT statement in the call to PROC MEANS creates a PARAMS data set that contains the variables (N, MEAN, and STDDEV) that are read by the simulation program. Therefore, you can immediately run the AnovaDynamic DATA step to simulate normal data from the sample statistics. A visualization of the resulting simulated data is shown below.
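A minimal sketch of that workflow follows. It reruns the AnovaDynamic DATA step against the PARAMS data set that PROC MEANS just created and then draws one box plot per group. The choice of a box plot is an assumption for illustration and may differ from the graph in the original article. The Species variable is carried along from the PARAMS data set.

```sas
/* simulate from the sample statistics in PARAMS */
data AnovaDynamic;
   call streaminit(12345);
   set params;                          /* one row per species */
   do i = 1 to N;
      x = rand("Normal", Mean, StdDev); /* normal data with the sample mean and std dev */
      output;
   end;
run;

/* visualize the simulated groups */
title "Simulated Petal Lengths by Species";
proc sgplot data=AnovaDynamic;
   vbox x / category=Species;
run;
```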

You can run PROC MEANS on other data and other variables and the AnovaDynamic step will continue to work without *any* modification. The simulation is controlled entirely by the values in the "control file," which is the PARAMS data set.

You can generalize this technique by wrapping the program in a SAS macro in which the name of the parameter file and the name of the simulated data set are provided at run time. With a macro implementation, you can read from multiple input files and write to multiple output data sets. You could use such a macro, for example, to break up a large simulation study into smaller independent sub-simulations, each controlled by its own file of input parameters. In a gridded environment, each sub-simulation can be processed independently and in parallel, thus reducing the total time required to complete the study.
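A macro wrapper along these lines might look like the following sketch. The macro name SimGroups and its parameters (params=, out=, seed=) are hypothetical. The body is the AnovaDynamic DATA step with the data set names replaced by macro parameters.

```sas
/* Hypothetical macro: simulate normal groups from a control data set.   */
/* The control data set must contain the variables N, Mean, and StdDev.  */
%macro SimGroups(params=, out=, seed=12345);
   data &out;
      call streaminit(&seed);
      set &params;                          /* one row per group */
      do i = 1 to N;
         x = rand("Normal", Mean, StdDev);
         output;
      end;
   run;
%mend SimGroups;

/* example call: reproduces the AnovaDynamic data set */
%SimGroups(params=params, out=AnovaDynamic);
```

Each sub-simulation can then be launched with its own control file and its own seed, for example `%SimGroups(params=params2, out=Sim2, seed=54321)`, where `params2` is another hypothetical control data set.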

Although this article discusses control files in the context of statistical simulation, other applications are possible. Have you used a similar technique to control a program by using an input file that contains the parameters for the program? Leave a comment.


The post Simulate multivariate normal data in SAS by using PROC SIMNORMAL appeared first on The DO Loop.

Most SAS procedures read and analyze raw data. However, some SAS procedures read and write special data sets that represent a statistical summary of data.
PROC SIMNORMAL can read a TYPE=CORR or TYPE=COV data set. Usually, these special data sets are created as an output data set from another procedure. For example, the following SAS statements compute the correlation between four variables from a sample of 50 *Iris versicolor* flowers:

    proc corr data=sashelp.iris(where=(Species="Versicolor"))  /* input raw data */
              nomiss noprint outp=OutCorr;                     /* output statistics */
       var PetalLength PetalWidth SepalLength SepalWidth;
    run;

    proc print data=OutCorr; run;

The output data set contains summary statistics including the mean, standard deviations, and correlation matrix for the four variables in the analysis. PROC PRINT does not display the 'TYPE' attribute of this data set, but if you run PROC CONTENTS you will see a field labeled "Data Set Type," which has the value "CORR".

You can also create a TYPE=CORR or TYPE=COV data set by using the DATA step as shown in the documentation for PROC SIMNORMAL.
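As a rough sketch of the DATA step approach, the following statements build a small TYPE=CORR data set by hand, modeled on the layout of the PROC CORR output data set (rows identified by _TYPE_ and _NAME_). The variable names x1 and x2 and all of the numbers are invented for illustration. Consult the PROC SIMNORMAL documentation for the authoritative layout.

```sas
/* hand-built correlation data set for two hypothetical variables */
data MyCorr(type=CORR);
   input _TYPE_ $ _NAME_ $ x1 x2;
   datalines;
MEAN . 0   10
STD  . 1   2
CORR x1 1   0.6
CORR x2 0.6 1
;

/* simulate 100 observations from the implied bivariate normal distribution */
proc simnormal data=MyCorr outsim=Sim numreal=100 seed=54321;
   var x1 x2;
run;
```

In list input, a single period is read as a missing (blank) value for the character variable _NAME_ on the MEAN and STD rows.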

Recall that you can use the standard deviations and correlations to construct a covariance matrix. When you call PROC SIMNORMAL, it internally constructs the covariance matrix from the information in the OutCorr data set and uses the mean and covariance matrix to simulate multivariate normal data. The following call to PROC SIMNORMAL simulates 50 observations from a multivariate normal population. The DATA step combines the original and simulated data; the call to PROC SGSCATTER overlays the original and the simulated samples.

    proc simnormal data=OutCorr outsim=SimMVN
                   numreal = 50           /* number of realizations = size of sample */
                   seed = 12345;          /* random number seed */
       var PetalLength PetalWidth SepalLength SepalWidth;
    run;

    /* combine the original data and the simulated data */
    data Both;
       set sashelp.iris(where=(Species="Versicolor"))  /* original */
           SimMVN(in=sim);                             /* simulated */
       Simulated = sim;
    run;

    ods graphics / attrpriority=none;    /* use different markers for each group */
    title "Overlay of Original and Simulated MVN Data";
    proc sgscatter data=Both;
       matrix PetalLength PetalWidth SepalLength SepalWidth / group=Simulated;
    run;
    ods graphics / attrpriority=color;   /* reset markers */

Notice that the original data are rounded whereas the simulated data are not. Except for that minor difference, the simulated data appear to be similar to the original data. Of course, the simulated data will not match unless the original data are approximately multivariate normal.

The SIMNORMAL procedure supports the NUMREAL= option, which you can use to specify the size of the simulated sample. (NUMREAL stands for "number of realizations," which is the number of independent draws.) You can use this option to generate *multiple* samples from the same multivariate normal population. For example, suppose you are conducting a Monte Carlo study and you want to generate 100 samples of size N=50, each drawn from the same multivariate normal population. This is equivalent to drawing 50*100 observations where the first 50 observations represent the first sample, the next 50 observations represent the second sample, and so on. The following statements generate 50*100 observations and then construct an ID variable that identifies each sample:

    %let N = 50;                /* sample size */
    %let NumSamples = 100;      /* number of samples */
    proc simnormal data=OutCorr outsim=SimMVN
                   numreal = %sysevalf(&N*&NumSamples)
                   seed = 12345;          /* random number seed */
       var PetalLength PetalWidth SepalLength SepalWidth;
    run;

    data SimMVNAll;
       set SimMVN;
       ID = floor((_N_-1) / &N) + 1;   /* ID = 1,1,...,1, 2,2,...,2, etc */
    run;

After adding the ID variable, you can efficiently analyze all samples by using a single call to a procedure. The procedure should use a BY statement to analyze each sample. For example, you could use PROC CORR with a BY ID statement to obtain a Monte Carlo estimate of the sampling distribution of the correlation for multivariate normal data.
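The BY-group analysis can be sketched as follows. The code assumes the SimMVNAll data set from the previous step. The data set names CorrOut and CorrEst are hypothetical, and the choice of the PetalLength and PetalWidth pair is only an example.

```sas
/* one correlation matrix per simulated sample */
proc corr data=SimMVNAll noprint outp=CorrOut;
   by ID;
   var PetalLength PetalWidth;
run;

/* extract corr(PetalLength, PetalWidth) for each sample */
data CorrEst;
   set CorrOut(where=(_TYPE_="CORR" & _NAME_="PetalLength"));
   r = PetalWidth;           /* off-diagonal element of the 2 x 2 matrix */
   keep ID r;
run;

/* summarize the Monte Carlo distribution of the sample correlation */
proc means data=CorrEst N Mean StdDev P5 P95;
   var r;
run;
```

The NOPRINT option suppresses the 100 tables that PROC CORR would otherwise display, which is the usual practice when a BY statement drives a simulation.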

In summary, although the SAS/IML language is the best tool for general multivariate simulation tasks, you can use the SIMNORMAL procedure in SAS/STAT software to simulate multivariate normal data. The key is to construct a TYPE=CORR or TYPE=COV data set, which is then processed by PROC SIMNORMAL.


The post Fisher's transformation of the correlation coefficient appeared first on The DO Loop.

Pearson's correlation measures the linear association between two variables. Because the correlation is bounded between -1 and 1, the sampling distribution for highly correlated variables is highly skewed. Even for bivariate normal data, the skewness makes it challenging to estimate confidence intervals for the correlation, to run one-sample hypothesis tests ("Is the correlation equal to 0.5?"), and to run two-sample hypothesis tests ("Do these two samples have the same correlation?").

In 1921, R. A. Fisher studied the correlation of bivariate normal data and discovered a wonderful transformation (shown to the right) that converts the skewed distribution of the sample correlation (*r*) into a distribution that is approximately normal.
Furthermore, whereas the variance of the sampling distribution of *r* depends on the correlation, the variance of the transformed distribution is *independent* of the correlation.
The transformation is called *Fisher's z transformation*.
This article describes Fisher's *z* transformation and shows how it transforms a skewed distribution into a normal distribution.

The following graph shows the sampling distribution of the correlation coefficient for bivariate normal samples of size 20 for four values of the population correlation, rho (ρ). You can see that the distributions are very skewed when the correlation is large in magnitude.

The graph was created by using simulated bivariate normal data as follows:

- For rho=0.2, generate M random samples of size 20 from a bivariate normal distribution with correlation rho. (For this graph, M=2500.)
- For each sample, compute the Pearson correlation.
- Plot a histogram of the M correlations.
- Overlay a kernel density estimate on the histogram and add a reference line to indicate the correlation in the population.
- Repeat the process for rho=0.4, 0.6, and 0.8.
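The steps above can be sketched in a DATA step followed by a BY-group analysis. This is a minimal sketch and not necessarily the program that produced the graph. It generates correlated normal pairs by using the standard construction y = ρ z1 + sqrt(1-ρ²) z2, where z1 and z2 are independent standard normal variates. The data set names BivNorm, CorrOut, and Corr are hypothetical. The Fisher-transformed value *z* is also computed so that the same data can be reused later.

```sas
%let N = 20;        /* sample size */
%let M = 2500;      /* number of samples for each value of rho */

data BivNorm;
   call streaminit(12345);
   do rho = 0.2, 0.4, 0.6, 0.8;
      do SampleID = 1 to &M;
         do i = 1 to &N;
            z1 = rand("Normal");
            z2 = rand("Normal");
            x = z1;
            y = rho*z1 + sqrt(1 - rho**2)*z2;   /* corr(x,y) = rho */
            output;
         end;
      end;
   end;
   keep rho SampleID x y;
run;

/* Pearson correlation for each sample */
proc corr data=BivNorm noprint outp=CorrOut;
   by rho SampleID;
   var x y;
run;

data Corr;
   set CorrOut(where=(_TYPE_="CORR" & _NAME_="x"));
   r = y;                              /* sample correlation of x and y */
   z = 0.5*log((1 + r)/(1 - r));       /* Fisher's z = arctanh(r) */
   keep rho SampleID r z;
run;

/* histogram and kernel density estimate for each value of rho */
proc sgpanel data=Corr;
   panelby rho / columns=2;
   histogram r;
   density r / type=kernel;
run;
```

Replacing `r` by `z` in the HISTOGRAM statement and using `density z / type=normal` shows the transformed distributions.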

The histograms approximate the sampling distribution of the correlation coefficient (for bivariate normal samples of size 20) for the various values of the population correlation. The distributions are not simple. Notice that the variance and the skewness of the distributions depend on the value of the underlying correlation (ρ) in the population.

Fisher sought to transform these distributions into normal distributions. He proposed the transformation f(*r*) = arctanh(*r*), which is the inverse hyperbolic tangent function. The graph of arctanh is shown at the top of this article. Fisher's transformation can also be written as (1/2)log( (1+*r*)/(1-*r*) ). This transformation is sometimes called Fisher's "z transformation" because the letter *z* is used to represent the transformed correlation: *z* = arctanh(*r*).

How he came up with that transformation is a mystery to me, but he was able to show that arctanh is a normalizing and variance-stabilizing transformation. That is, when *r* is the sample correlation for bivariate normal data and *z* = arctanh(*r*) then the following statements are true (See Fisher, *Statistical Methods for Research Workers*, 6th Ed, pp 199-203):

- The distribution of *z* is approximately normal and "tends to normality rapidly as the sample is increased" (p 201).
- The standard error of *z* is approximately 1/sqrt(N-3), which is independent of the value of the correlation.

The graph to the right demonstrates these statements. The graph is similar to the preceding panel, except these histograms show the distributions of the transformed correlations *z* = arctanh(*r*). In each cell, the vertical line is drawn at the value arctanh(ρ). The curves are normal density estimates with σ = 1/sqrt(N-3), where N=20.

The two features of the transformed variables are apparent. First, the distributions are normally distributed, or, to quote Fisher, "come so close to it, even for a small sample..., that the eye cannot detect the difference" (p. 202). Second, the variances of these distributions are constant and are independent of the underlying correlation.

From the graph of the transformed variables, it is clear why Fisher's transformation is important. If you want to test some hypothesis about the correlation, the test can be conducted in the *z* coordinates where all distributions are normal with a known variance. Similarly, if you want to compute a confidence interval, the computation can be made in the *z* coordinates and the results "back transformed" by using the inverse transformation, which is *r* = tanh(*z*).
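For reference, the standard normal-theory formulas in the *z* coordinates can be written as follows. The symbols are notational assumptions: $q_{1-\alpha/2}$ denotes the standard normal quantile, $\rho_0$ is the hypothesized correlation, and no bias adjustment is applied.

```latex
\begin{align*}
z &= \operatorname{arctanh}(r) = \tfrac{1}{2}\log\frac{1+r}{1-r},
  & \operatorname{SE}(z) &\approx \frac{1}{\sqrt{N-3}}, \\
[z_L,\, z_U] &= z \pm \frac{q_{1-\alpha/2}}{\sqrt{N-3}},
  & [r_L,\, r_U] &= [\tanh(z_L),\, \tanh(z_U)], \\
Z_0 &= \bigl(z - \operatorname{arctanh}(\rho_0)\bigr)\sqrt{N-3}
  & &\text{is approximately } N(0,1) \text{ under } H_0\colon \rho = \rho_0 .
\end{align*}
```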

You can perform the calculations by applying the standard formulas for normal distributions (see p. 3-4 of Shen and Lu (2006)), but most statistical software provides an option to use the Fisher transformation to compute confidence intervals and to test hypotheses. In SAS, the CORR procedure supports the FISHER option to compute confidence intervals and to test hypotheses for the correlation coefficient.

The following call to PROC CORR computes a sample correlation between the length and width of petals for 50 *Iris versicolor* flowers. The FISHER option specifies that the output should include confidence intervals based on Fisher's transformation. The RHO0= suboption tests the null hypothesis that the correlation in the population is 0.75. (The BIASADJ= suboption turns off a bias adjustment; a discussion of the bias in the Pearson estimate will have to wait for another article.)

    proc corr data=sashelp.iris fisher(rho0=0.75 biasadj=no);
       where Species='Versicolor';
       var PetalLength PetalWidth;
    run;

The output shows that the Pearson estimate is *r*=0.787. A 95% confidence interval for the correlation is [0.651, 0.874]. Notice that *r* is not the midpoint of that interval. In the transformed coordinates, *z* = arctanh(0.787) = 1.06 *is* the center of a symmetric confidence interval (based on a normal distribution with standard error 1/sqrt(N-3)). However, the inverse transformation (tanh) is nonlinear, and the right half-interval gets compressed more than the left half-interval.
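The interval can be reproduced by hand in a DATA step. The following sketch uses the rounded estimate r=0.787 and N=50 from the text and applies no bias adjustment. The data set name FisherCI and the variable names are hypothetical.

```sas
data FisherCI;
   r = 0.787;  N = 50;  alpha = 0.05;
   z  = 0.5*log((1 + r)/(1 - r));            /* Fisher's z = arctanh(r) */
   se = 1/sqrt(N - 3);                       /* standard error in z coordinates */
   q  = quantile("Normal", 1 - alpha/2);     /* normal critical value */
   zL = z - q*se;                            /* symmetric interval for z */
   zU = z + q*se;
   rL = tanh(zL);                            /* back-transform the endpoints */
   rU = tanh(zU);
run;

proc print data=FisherCI noobs;
   var r z se zL zU rL rU;
run;
```

The values of rL and rU should be close to the endpoints that PROC CORR reports, with small differences possible because r is rounded to three decimals. The output also makes the asymmetry visible: zL and zU are equidistant from z, but rL and rU are not equidistant from r.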

For the hypothesis test of ρ = 0.75, the output shows that the p-value is 0.574. The data do not provide evidence to reject the hypothesis that ρ = 0.75 at the 0.05 significance level. The computations for the hypothesis test use only the transformed (z) coordinates.

This article shows that Fisher's "*z* transformation," which is *z* = arctanh(*r*), is a normalizing transformation for the Pearson correlation of bivariate normal samples of size N. The transformation converts the skewed and bounded sampling distribution of *r* into a normal distribution for *z*. The standard error of the transformed distribution is 1/sqrt(N-3), which does not depend on the correlation. You can perform hypothesis tests in the *z* coordinates. You can also form confidence intervals in the *z* coordinates and use the inverse transformation (r=tanh(z)) to obtain a confidence interval for ρ.

The Fisher transformation is exceptionally useful for small sample sizes because, as shown in this article, the sampling distribution of the Pearson correlation is highly skewed for small N. When N is large, the sampling distribution of the Pearson correlation is approximately normal except for extreme correlations. Although the theory behind the Fisher transformation assumes that the data are bivariate normal, in practice the Fisher transformation is useful as long as the data are not too skewed and do not contain extreme outliers.

You can download the SAS program that creates all the graphs in this article.
