The post Use a bar chart to visualize pairwise correlations appeared first on The DO Loop.
Recently a SAS programmer asked how to construct a bar chart that displays the pairwise correlations between variables. This visualization enables you to quickly identify pairs of variables that have large negative correlations, large positive correlations, and insignificant correlations.
In SAS, PROC CORR can compute the correlations between variables, which are stored in matrix form in the output data set. The following call to PROC CORR analyzes the correlations between all pairs of numeric variables in the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. Because of missing values, some pairwise correlations use more observations than others.
ods exclude all;
proc corr data=sashelp.Heart;      /* pairwise correlation */
   var _NUMERIC_;
   ods output PearsonCorr = Corr;  /* write correlations, p-values, and sample sizes to data set */
run;
ods exclude none;
The CORR data set contains the correlation matrix, p-values, and sample sizes. The statistics are stored in "wide form," with few rows and many columns. As I previously discussed, you can use the HEATMAPCONT subroutine in SAS/IML to quickly visualize the correlation matrix:
proc iml;
use Corr;
   read all var "Variable" into ColNames;  /* get names of variables */
   read all var (ColNames) into mCorr;     /* matrix of correlations */
   ProbNames = "P"+ColNames;               /* variables for p-values are named PX, PY, PZ, etc */
   read all var (ProbNames) into mProb;    /* matrix of p-values */
close Corr;

call HeatmapCont(mCorr) xvalues=ColNames yvalues=ColNames
                        colorramp="ThreeColor" range={-1 1}
                        title="Pairwise Correlation Matrix";
The heat map gives an overall impression of the correlations between variables, but it has some shortcomings. First, you can't determine the magnitudes of the correlations with much precision. Second, it is difficult to compare the relative sizes of correlations. For example, which is stronger: the correlation between systolic and diastolic blood pressure or the correlation between weight and MRW? (MRW is a body-weight index.)
These shortcomings are resolved if you present the pairwise correlations as a bar chart. To create a bar chart, it is necessary to convert the output into "long form." Each row in the new data set will represent a pairwise correlation. To identify the row, you should also create a new variable that identifies the two variables whose correlation is represented. Because the correlation matrix is symmetric and has 1 on the diagonal, the long-form data set only needs the statistics for the lower-triangular portion of the correlation matrix.
Let's extract the data in SAS/IML. The following statements construct a new ID variable that identifies each new row and extract the correlations and p-values for the lower-triangular elements. The statistics are written to a SAS data set called CorrPairs. (In Base SAS, you can transform the lower-triangular statistics by using the DATA step and arrays, similar to the approach in this SAS note; feel free to post your Base SAS code in the comments.)
numCols = ncol(mCorr);                /* number of variables */
numPairs = numCols*(numCols-1) / 2;
length = 2*nleng(ColNames) + 5;       /* max length of new ID variable */
PairNames = j(NumPairs, 1, BlankStr(length));
i = 1;
do row = 2 to numCols;                /* construct the pairwise names */
   do col = 1 to row-1;
      PairNames[i] = strip(ColNames[col]) + " vs. " + strip(ColNames[row]);
      i = i + 1;
   end;
end;

lowerIdx = loc(row(mCorr) > col(mCorr));  /* indices of lower-triangular elements */
Corr = mCorr[ lowerIdx ];
Prob = mProb[ lowerIdx ];
Significant = choose(Prob > 0.05, "No ", "Yes");  /* use alpha=0.05 signif level */

create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"};
append;
close;
QUIT;
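As a cross-check, the same lower-triangular extraction can be sketched outside SAS. The following Python/NumPy snippet (with hypothetical variable names and a made-up 3 x 3 correlation matrix) builds the long-form pair names and pulls the strictly lower-triangular correlations:

```python
import numpy as np

names = ["X", "Y", "Z"]   # hypothetical variable names
corr = np.array([[ 1.0, 0.5, -0.2],
                 [ 0.5, 1.0,  0.3],
                 [-0.2, 0.3,  1.0]])

# Indices of the strictly lower-triangular elements (row > col)
rows, cols = np.tril_indices(len(names), k=-1)

# One row per pair, named "Col vs. Row" as in the IML program
pair_names = [f"{names[c]} vs. {names[r]}" for r, c in zip(rows, cols)]
pair_corrs = corr[rows, cols]

print(pair_names)            # ['X vs. Y', 'X vs. Z', 'Y vs. Z']
print(pair_corrs.tolist())   # [0.5, -0.2, 0.3]
```

Because the matrix is symmetric with 1s on the diagonal, these three numbers carry all the pairwise information.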
You can use the HBAR statement in PROC SGPLOT to construct the bar chart. This bar chart contains 45 rows, so you need to make the graph tall and use a small font to fit all the labels without overlapping. The call to PROC SORT and the DISCRETEORDER=DATA option on the YAXIS statement ensure that the categories are displayed in order of increasing correlation.
proc sort data=CorrPairs; by Corr; run;

ods graphics / width=600px height=800px;
title "Pairwise Correlations";
proc sgplot data=CorrPairs;
   hbar PairNames / response=Corr group=Significant;
   refline 0 / axis=x;
   yaxis discreteorder=data display=(nolabel)
         labelattrs=(size=6pt) fitpolicy=none
         offsetmin=0.012 offsetmax=0.012   /* half of 1/k, where k=number of categories */
         colorbands=even colorbandsattrs=(color=gray transparency=0.9);
   xaxis grid display=(nolabel);
   keylegend / position=topright location=inside across=1;
run;
The bar chart (click to enlarge) enables you to see which pairs of variables are highly correlated (positively and negatively) and which have correlations that are not significantly different from 0. You can use additional colors or reference lines if you want to visually emphasize other features, such as the correlations that are larger than 0.25 in absolute value.
The bar chart is not perfect. This example, which analyzes 10 variables, is very tall with 45 rows. Among k variables there are k(k-1)/2 correlations, so the number of pairwise correlations (rows) increases quadratically with the number of variables. In practice, this chart would be unreasonably tall when there are 14 or 15 variables (about 100 rows).
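The quadratic growth is easy to verify. A quick Python sketch of the pair-count formula shows why the chart becomes unwieldy around 14 or 15 variables:

```python
# Pairwise correlations among k variables grow quadratically: k*(k-1)/2
def num_pairs(k):
    return k * (k - 1) // 2

print(num_pairs(10))   # 45  (the height of the bar chart in this example)
print(num_pairs(14))   # 91
print(num_pairs(15))   # 105
```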
Nevertheless, for 10 or fewer variables, a bar chart of the pairwise correlations provides an alternative visualization that has some advantages over a heat map of the correlation matrix. What do you think? Would this graph be useful in your work? Leave a comment.
The post What is rank correlation? appeared first on The DO Loop.
Why would anyone want a different estimate of correlation? Well, the Pearson correlation, which is also known as the product-moment correlation, uses empirical moments of the data (means and standard deviations) to estimate the linear association between two variables. However, means and standard deviations can be unduly influenced by outliers in the data, so the Pearson correlation is not a robust statistic.
A simple robust alternative to the Pearson correlation is called the Spearman rank correlation, which is defined as the Pearson correlation of the ranks of each variable. (If a variable contains tied values, replace those values by their average rank.) The Spearman rank correlation is simple to compute and conceptually easy to understand. Among its advantages, the rank correlation is robust to outliers and is invariant under monotonically increasing transformations of the data.
PROC CORR in SAS supports several measures of correlation, including the Pearson and Spearman correlations. For data without outliers, the two measures are often similar. For example, the following call to PROC CORR computes the Spearman rank correlation between three variables in the Sashelp.Class data set:
/* Compute PEARSON and SPEARMAN rank correlation by using PROC CORR in SAS */
proc corr data=sashelp.class noprob nosimple PEARSON SPEARMAN;
   var height weight age;
run;
According to both statistics, these variables are very positively correlated, with correlations in the range [0.7, 0.88]. Notice that the rank correlations (the lower table) are similar to the Pearson correlations for these data. However, if the data contain outliers, the rank correlation estimate is less influenced by the magnitude of the outliers.
As mentioned earlier, the Spearman rank correlation is conceptually easy to understand. It consists of two steps: compute the ranks of each variable and compute the Pearson correlation between the ranks. It is instructive to reproduce each step in the Spearman computation. You can use PROC RANK in SAS to compute the ranks of the variables, then use PROC CORR with the PEARSON option to compute the Pearson correlation of the ranks. If the data do not contain any missing values, then the following statements implement the two steps that compute the Spearman rank correlation:
/* Compute the Spearman rank correlation "manually" by explicitly computing ranks */
/* First compute ranks; use average rank for ties */
proc rank data=sashelp.class out=classRank ties=mean;
   var height weight age;
   ranks RankHeight RankWeight RankAge;
run;

/* Then compute Pearson correlation on the ranks */
proc corr data=classRank noprob nosimple PEARSON;
   var RankHeight RankWeight RankAge;
run;
The resulting table of correlations is the same as in the previous section and is not shown. Although PROC CORR can compute the rank correlation directly, it is comforting that these two steps produce the same answer. Furthermore, this two-step method can be useful if you decide to implement a rank-based statistic that is not produced by any SAS procedure. This two-step method is also the way to compute the Spearman correlation of character ordinal variables because PROC CORR does not analyze character variables. However, PROC RANK supports both character and numeric variables.
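The two-step idea is language-agnostic. Here is a minimal Python sketch (not SAS code; the small data vectors are made up for illustration) that mirrors PROC RANK with TIES=MEAN followed by a Pearson correlation of the ranks, and also demonstrates the invariance under monotone increasing transformations:

```python
# Two-step Spearman: rank each variable (average ranks for ties),
# then compute the Pearson correlation of the ranks.
def ranks_with_ties(x):
    """1-based ranks; tied values receive their average rank (like TIES=MEAN)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1                      # extend the block of tied values
        avg = (i + j) / 2 + 1           # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

def spearman(x, y):
    return pearson(ranks_with_ties(x), ranks_with_ties(y))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(round(spearman(x, y), 6))                    # 0.8
print(round(spearman([v**3 for v in x], y), 6))    # 0.8: cubing x preserves ranks
```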
If you have missing values in your data, then make sure you delete the observations that contain missing values before you call PROC RANK. Equivalently, you can use a WHERE statement to omit the missing values. For example, you could insert the following statement into the PROC RANK statements:
where height^=. & weight^=. & age^=.;
In the SAS/IML language, the CORR function computes the Spearman rank correlation directly, as follows. The results are the same as the results from PROC CORR, and are not shown.
proc iml;
use sashelp.class;
   read all var {height weight age} into X;
close;
RankCorr = corr(X, "Spearman");   /* compute rank correlation */
If you ever need to compute a rank-based statistic manually, you can also use the RANKTIE function to compute the ranks of the elements in a numerical vector, such as
ranktie(X[ ,1], "Mean");
The Spearman rank correlation is a robust measure of the linear association between variables. It is related to the classical Pearson correlation because it is defined as the Pearson correlation between the ranks of the individual variables. It has some very nice properties, including being robust to outliers and being invariant under monotonic increasing transformations of the data. For other measures of correlation that are supported in SAS, see the PROC CORR documentation.
The post Robust principal component analysis in SAS appeared first on The DO Loop.
This article shows how to implement a classical (non-robust) PCA by using the SAS/IML language and how to modify that classical analysis to create a robust PCA.
The PRINCOMP procedure in SAS computes a classical principal component analysis. You can analyze the correlation matrix (the default) or the covariance matrix of the variables (the COV option). You can create scree plots, pattern plots, and score plots automatically by using ODS graphics. If you want to apply rotations to the factor loadings, you can use the FACTOR procedure.
Let's use PROC PRINCOMP to perform a simple PCA. The PROC PRINCOMP results will be the basis of comparison when we implement the PCA in PROC IML. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation. The procedure analyzes seven crime rates for the 50 US states in 1977, based on the correlation matrix. (The documentation provides the data.)
proc princomp data=Crime       /* add COV option for covariance analysis */
     outstat=PCAStats_CORR out=Components_CORR plots=score(ncomp=4);
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
   id State;
   ods select Eigenvectors ScreePlot '2 by 1' '4 by 3';
run;

proc print data=Components_CORR(obs=5);
   var Prin:;
run;
To save space, the output from PROC PRINCOMP is not shown, but it includes a table of the eigenvalues and principal component vectors (eigenvectors) of the correlation matrix, as well as a plot of the scores of the observations, which are the projection of the observations onto the principal components. The procedure writes two data sets: the eigenvalues and principal components are contained in the PCAStats_CORR data set and the scores are contained in the Components_CORR data set.
Assume that the data consists of n observations and p variables and assume all values are nonmissing. If you are comfortable with multivariate analysis, a principal component analysis is straightforward: the principal components are the eigenvectors of a covariance or correlation matrix, and the scores are the projection of the centered data onto the eigenvector basis. However, if it has been a few years since you studied PCA, Abdi and Williams (2010) provides a nice overview of the mathematics. The following matrix computations implement the classical PCA in the SAS/IML language:
proc iml;
reset EIGEN93;   /* use "9.3" algorithm, not vendor BLAS (SAS 9.4m3) */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
   read all var varNames into X;   /* X = data matrix (assume nonmissing) */
close;

S = cov(X);                      /* estimate of covariance matrix */
R = cov2corr(S);                 /* = corr(X) */
call eigen(D, V, R);             /* D=eigenvalues; columns of V are PCs */
scale = T( sqrt(vecdiag(S)) );   /* = std(X) */
B = (X - mean(X)) / scale;       /* center and scale data about the mean */
scores = B * V;                  /* project standardized data onto the PCs */

print V[r=varNames c=("Prin1":"Prin7") L="Eigenvectors"];
print (scores[1:5,])[c=("Prin1":"Prin7") format=8.5 L="Scores"];
The principal components (eigenvectors) and scores for these data are identical to the same quantities that were produced by PROC PRINCOMP. In the preceding program I could have directly computed R = corr(X) and scale = std(X), but I generated those quantities from the covariance matrix because that is the approach used in the next section, which computes a robust PCA.
If you want to compute the PCA from the covariance matrix, you would use S in the EIGEN call, and define B = X - mean(X) as the centered (but not scaled) data matrix.
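The same sequence of matrix computations can be sketched in Python/NumPy (random stand-in data; this is an illustration of the algebra, not the crime-data analysis). The sketch also verifies a defining property of PCA: the scores are uncorrelated, and their sample variances equal the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))          # stand-in for the n x p data matrix

S = np.cov(X, rowvar=False)           # covariance matrix, like cov(X)
scale = np.sqrt(np.diag(S))           # standard deviations, like std(X)
R = S / np.outer(scale, scale)        # cov2corr(S)
evals, V = np.linalg.eigh(R)          # eigh returns ascending eigenvalues...
evals, V = evals[::-1], V[:, ::-1]    # ...so reverse to largest-first, like EIGEN

B = (X - X.mean(axis=0)) / scale      # center and scale the data
scores = B @ V                        # project onto the principal components

# The scores are uncorrelated; their variances are the eigenvalues
print(np.allclose(np.cov(scores, rowvar=False), np.diag(evals)))   # True
```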
This section is based on a similar robust PCA computation in Wicklin (2010). The main idea behind a robust PCA is that if there are outliers in the data, the covariance matrix will be unduly influenced by those observations. Therefore you should use a robust estimate of the covariance matrix for the eigenvector computation. Also, the arithmetic mean is unduly influenced by the outliers, so you should center the data by using a robust estimate of center before you form the scores.
SAS/IML software provides the MCD subroutine for computing robust covariance matrices and robust estimates of location. If you replace the COV and MEAN functions in the previous section by a call to the MCD subroutine, you obtain a robust PCA, as follows:
/* get robust estimates of location and covariance */
N = nrow(X);  p = ncol(X);     /* number of obs and variables */
optn = j(8,1,.);               /* default options for MCD */
optn[4] = floor(0.75*N);       /* override default: use 75% of data for robust estimates */
call MCD(rc, est, dist, optn, x);  /* compute robust estimates */
RobCov = est[3:2+p, ];         /* robust estimate of covariance */
RobLoc = est[1, ];             /* robust estimate of location */

/* use robust estimates to perform PCA */
RobCorr = cov2corr(RobCov);    /* robust correlation */
call eigen(D, V, RobCorr);     /* D=eigenvalues; columns of V are PCs */
RobScale = T( sqrt(vecdiag(RobCov)) );  /* robust estimates of scale */
B = (x - RobLoc) / RobScale;   /* center and scale data */
scores = B * V;                /* project standardized data onto the PCs */
Notice that the SAS/IML code for the robust PCA is very similar to the classical PCA. The only difference is that the robust estimates of covariance and location (from the MCD call) are used instead of the classical estimates.
If you want to compute the robust PCA from the covariance matrix, you would use RobCov in the EIGEN call, and define B = X - RobLoc as the centered (but not scaled) data matrix.
You can create a score plot to compare the classical scores to the robust scores. The Getting Started example in the PROC PRINCOMP documentation shows the classical scores for the first three components. The documentation suggests that Nevada, Massachusetts, and New York could be outliers for the crime data.
You can write the robust scores to a SAS data set and use the SGPLOT procedure to plot the scores of the classical and robust PCA on the same scale. The first and third component scores are shown to the right, with abbreviated state names serving as labels for the markers. (Click to enlarge.) You can see that the robust first-component scores for California and Nevada are separated from the other states, which makes them outliers in that dimension. (The first principal component measures overall crime rate.) The robust third-component scores for New York and Massachusetts are also well-separated and are outliers for the third component. (The third principal component appears to contrast murder with rape and auto theft with other theft.)
Because the crime data does not have huge outliers, the robust PCA is a perturbation of the classical PCA, which makes it possible to compare the analyses. For data that have extreme outliers, the robust analysis might not resemble its classical counterpart.
If you run an analysis like this on your own data, remember that eigenvectors are not unique and so there is no guarantee that the eigenvectors for the robust correlation matrix will be geometrically aligned with the eigenvectors from the classical correlation matrix. For these data, I multiplied the second and fourth robust components by -1 because that seems to make the score plots easier to compare.
In summary, you can implement a robust principal component analysis by using robust estimates for the correlation (or covariance) matrix and for the "center" of the data. The MCD subroutine in the SAS/IML language is one way to compute a robust estimate of the covariance and location of multivariate data. The SAS/IML language also provides ways to compute eigenvectors (principal components) and project the (centered and scaled) data onto the principal components.
You can download the SAS program that computes the analysis and creates the graphs.
As discussed in Hubert, Rousseeuw, and Branden (2005), the MCD algorithm is most useful for 100 or fewer variables. They describe an alternative computation that can support a robust PCA in higher dimensions.
The post The curse of non-unique eigenvectors appeared first on The DO Loop.
I've been asked variations of this question dozens of times. The answer is usually "both answers are correct."
The mathematical root of the problem is that eigenvectors are not unique. It is easy to show this: If v is an eigenvector of the matrix A, then by definition A v = λ v for some scalar eigenvalue λ. Notice that if you define u = α v for a scalar α ≠ 0, then u is also an eigenvector because A u = α A v = α λ v = λ u. Thus a multiple of an eigenvector is also an eigenvector.
Most statistical software (including SAS) tries to partially circumvent this problem by standardizing an eigenvector to have unit length (|| v || = 1). However, note that v and -v are both eigenvectors that have the same length. Thus even a standardized eigenvector is only unique up to a ± sign, and different software might return eigenvectors that differ in sign. In fact for some problems, the same software can return different answers when run on different operating systems (Windows versus Linux), or when using vendor-supplied basic linear algebra subroutines such as the Intel Math Kernel Library (MKL).
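The sign ambiguity is easy to verify numerically. The following Python/NumPy sketch checks that for each unit-length eigenvector v returned by the solver, -v satisfies the eigenvector equation just as well:

```python
import numpy as np

# If v is an eigenvector for eigenvalue lam, so is -v (or any nonzero multiple)
M = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
evals, evecs = np.linalg.eig(M)   # columns of evecs are unit-length eigenvectors

for lam, v in zip(evals, evecs.T):
    assert np.allclose(M @ v, lam * v)          # A v    = lam v
    assert np.allclose(M @ (-v), lam * (-v))    # A (-v) = lam (-v)
    assert np.isclose(np.linalg.norm(v), 1.0)   # v and -v have the same unit length
print("v and -v are both valid unit-length eigenvectors")
```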
To further complicate the issue, software might sort the eigenvalues and eigenvectors in different ways. Some software (such as MATLAB) orders eigenvalues by magnitude, which is the absolute value of the eigenvalue. Other software (such as SAS) orders eigenvalues according to the value (of the real part) of the eigenvalues. (For most statistical computations, the matrices are symmetric and positive definite (SPD). For SPD matrices, which have real nonnegative eigenvalues, these two orderings are the same.) The situation becomes more complicated if the eigenvalues are not distinct, and that case is not dealt with in this article.
To illustrate the fact that different software and numerical algorithms can produce different eigenvectors, let's examine the eigenvectors of the 3 x 3 matrix M whose rows are (1, 2, 3), (4, 5, 6), and (7, 8, 9).
The eigenvectors of this matrix will be computed by using five different software packages: SAS, Intel's MKL, MATLAB, Mathematica, and R. The eigenvalues for this matrix are unique and are approximately 16.1, 0, and -1.1. Notice that this matrix is not positive definite, so the order of the eigenvectors will vary depending on the software. Let's compute the eigenvectors in five different ways.
Method 1: SAS/IML EIGEN Call: The following statements compute the eigenvalues and eigenvectors of M by using a built-in algorithm in SAS. This algorithm was introduced in SAS version 6 and was the default algorithm until SAS 9.4.
proc iml;
reset FUZZ;      /* print very small numbers as 0 */
M = {1 2 3,
     4 5 6,
     7 8 9};
reset EIGEN93;   /* use "9.3" algorithm; no vendor BLAS (option required for SAS 9.4m3) */
call eigen(EigVal, SASEigVec, M);
print EigVal, SASEigVec[colname=("E1":"E3")];
Notice that the eigenvalues are sorted by their real part, not by their magnitude. The eigenvectors are returned in a matrix. The i_th column of the matrix is an eigenvector for the i_th eigenvalue. Notice that the eigenvector for the largest eigenvalue (the first column) has all positive components. The eigenvector for the zero eigenvalue (the second column) has a negative component in the second coordinate. The eigenvector for the negative eigenvalue (the third column) has a negative component in the third coordinate.
Method 2: Intel MKL BLAS: Starting with SAS/IML 14.1, you can instruct SAS/IML to call the Intel Math Kernel Library for eigenvalue computation if you are running SAS on a computer that has the MKL installed. This feature is the default behavior in SAS/IML 14.1 (SAS 9.4m3), which is why the previous example used RESET EIGEN93 to get the older "9.3 and before" algorithm. The output for the following statements assumes that you are running SAS 9.4m3 or later and your computer has Intel's MKL.
reset NOEIGEN93;   /* use Intel MKL, if available */
call eigen(EigVal, MKLEigVec, M);
print MKLEigVec[colname=("E1":"E3")];
This is a different result than before, but it is still a valid set of eigenvectors. The first and third eigenvectors are the negatives of the eigenvectors in the previous experiment. The eigenvectors are sorted in the same order, but that is because SAS (for consistency with earlier releases) internally sorts the eigenvectors that the MKL returns.
Method 3: MATLAB: The following MATLAB statements compute the eigenvalues and eigenvectors for the same matrix:
M = [1, 2, 3; 4, 5, 6; 7, 8, 9];
[EigVec, EigVal] = eig(M);
EigVec =
   -0.2320   -0.7858    0.4082
   -0.5253   -0.0868   -0.8165
   -0.8187    0.6123    0.4082
The eigenvalues are not displayed, but you can tell from the output that the eigenvalues are ordered by magnitude: 16.1, -1.1, and 0. The eigenvectors are the same as the MKL results (within rounding precision), but they are presented in a different order.
Method 4: Mathematica: This example matrix is used in the Mathematica documentation for the Eigenvectors function:
Eigenvectors[{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}]
0.283349   -1.28335    1
0.641675   -0.141675  -2
1           1          1
This is a different result, but still correct. The symbolic computations in Mathematica do not standardize the eigenvectors to unit length. Instead, they standardize them to have a 1 in the last component. The eigenvalues are sorted by magnitude (like the MATLAB output), but the first column has opposite signs from the MATLAB output.
Method 5: R: The R documentation states that the eigen function in R calls the LAPACK subroutines. Thus I expect it to get the same result as MATLAB.
M <- matrix(c(1:9), nrow=3, byrow=TRUE)
r <- eigen(M)
r$vectors
           [,1]        [,2]       [,3]
[1,] -0.2319707 -0.78583024  0.4082483
[2,] -0.5253221 -0.08675134 -0.8164966
[3,] -0.8186735  0.61232756  0.4082483
Except for rounding, this result is the same as the MATLAB output.
This article used a simple 3 x 3 matrix to demonstrate that different software packages might produce different eigenvectors for the same input matrix. There were four different answers produced, all of which are correct. This is a result of the mathematical fact that eigenvectors are not unique: any multiple of an eigenvector is also an eigenvector! Different numerical algorithms can produce different eigenvectors, and this is compounded by the fact that you can standardize and order the eigenvectors in several ways.
Although it is hard to compare eigenvectors from different software packages, it is not impossible. First, make sure that the eigenvectors are ordered the same way. (You can skip this step for symmetric positive definite matrices.) Then make sure they are standardized to unit length. If you do those two steps and the eigenvalues are distinct, then the eigenvectors will agree up to a ± sign.
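The sign-matching step of this comparison can be automated. Here is a hedged Python sketch (the helper name `align_signs` and the toy matrices are mine, not from the post) that flips the sign of each column of one eigenvector matrix so that it agrees with a reference matrix:

```python
import numpy as np

def align_signs(ref, other):
    """Flip the sign of each column of `other` so it matches the corresponding
    column of `ref`. Assumes eigenvectors are stored in columns, in the same
    order, and standardized to unit length."""
    signs = np.sign(np.sum(ref * other, axis=0))  # sign of column-wise dot products
    return other * signs

ref   = np.eye(2)               # toy "package A" eigenvectors
other = np.array([[-1., 0.],
                  [ 0., 1.]])   # "package B": first column is sign-flipped
print(align_signs(ref, other))  # [[1. 0.]
                                #  [0. 1.]]
```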
The post Dimension reduction: Guidelines for retaining principal components appeared first on The DO Loop.
Many researchers have proposed methods for choosing the number of principal components. Some methods are heuristic, others are statistical. No method is perfect. Often different techniques result in different suggestions.
This article uses SAS to implement the broken stick model and compares that method with three other simple rules for dimension reduction. A few references are provided at the end of this article.
Let's start with an example. In SAS, you can use the PRINCOMP procedure to conduct a principal component analysis. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation. The program analyzes seven crime rates for the 50 US states in 1977. (The documentation shows how to generate the data set.) The following call generates a scree plot, which shows the proportion of variance explained by each component. It also writes the Eigenvalues table to a SAS data set:
proc princomp data=Crime plots=scree;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
   id State;
   ods output Eigenvalues=Evals;
run;
The panel shows two graphs that plot the numbers in the "Eigenvalues of the Correlation Matrix" table. The plot on the left is the scree plot, which is a graph of the eigenvalues. The sum of the eigenvalues is 7, which is the number of variables in the analysis. If you divide each eigenvalue by 7, you obtain the proportion of variance that each principal component explains. The graph on the right plots the proportions and the cumulative proportions.
The scree plot is my favorite graphical method for deciding how many principal components to keep. If the scree plot contains an "elbow" (a sharp change in the slopes of adjacent line segments), that location might indicate a good number of principal components (PCs) to retain. For this example, the scree plot shows a large change in slopes at the second eigenvalue and a smaller change at the fourth eigenvalue. From the graph of the cumulative proportions, you can see that the first two PCs explain 76% of the variance in the data, whereas the first four PCs explain 91%.
If "detect the elbow" is too imprecise for you, a more precise algorithm is to start at the right-hand side of the scree plot and look at the points that lie (approximately) on a straight line. The leftmost point along the trend line indicates the number of components to retain. (In geology, "scree" is rubble at the base of a cliff; the markers along the linear trend represent the rubble that can be discarded.) For the example data, the markers for components 4–7 are linear, so components 1–4 would be kept. This rule (and the scree plot) was proposed by Cattell (1966) and revised by Cattell and Jaspers (1967).
D. A. Jackson (1993) says that the broken-stick method is one of the better methods for choosing the number of PCs. The method provides "a good combination of simplicity of calculation and accurate evaluation of dimensionality relative to the other statistical approaches" (p. 2212).
The broken-stick model retains components that explain more variance than would be expected by randomly dividing the variance into p parts. As I discussed last week, if you randomly divide a quantity into p parts, the expected proportion of the kth largest piece is (1/p)Σ(1/i) where the summation is over the values i=k..p.
For example, if p=7 then
E1 = (1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.37,
E2 = (1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.228,
E3 = (1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.156, and so forth.
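The arithmetic above is straightforward to check. A small Python sketch of the broken-stick formula reproduces E1, E2, and E3 and confirms that the expected proportions sum to 1:

```python
# Broken-stick expected proportions: E_k = (1/p) * sum_{i=k}^{p} 1/i
def broken_stick(p):
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]

E = broken_stick(7)
print([round(e, 3) for e in E[:3]])   # [0.37, 0.228, 0.156]
print(round(sum(E), 6))               # 1.0: the expected proportions sum to 1
```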
I think of the "expected proportions" as corresponding to a null model that contains uncorrelated (noise) variables. If you plot the eigenvalues of the correlation matrix against the broken-stick proportions, the observed proportions that are higher than the expected proportions indicate which principal components to keep.
The plot to the right shows the scree plot overlaid on a dashed curve that indicates the expected proportions that result from a broken-stick model. An application of the broken-stick model keeps one PC because only the first observed proportion of variance is higher than the corresponding broken-stick proportion.
How can you compute the points on the dashed curve? The expected proportions in the broken-stick model for p variables are proportional to the cumulative sums of the sequence of ratios {1/p, 1/(p-1), ..., 1}. You can use the CUSUM function in SAS/IML to compute a cumulative sum of a sequence, as shown below. Notice that the previous call to PROC PRINCOMP used the ODS OUTPUT statement to create a SAS data set that contains the values in the Eigenvalue table. The SAS/IML program reads in that data and compares the expected proportions to the observed proportions. The number of components to retain is computed as the largest integer k for which the first k components each explain more variance than the broken-stick model (null model).
proc iml;
use Evals;  read all var {"Number" "Proportion"};  close;

/* Broken Stick (Jolliffe 1986; J. E. Jackson, p. 47) */
/* For random p-1 points in [0,1], expected lengths of the p subintervals are: */
p = nrow(Proportion);
g = cusum(1 / T(p:1)) / p;   /* expected lengths of intervals (smallest to largest) */
ExpectedLen = g[p:1];        /* reverse order: largest to smallest */

keep = 0;   /* find first k for which ExpectedLen[i] < Proportion[i] if i<=k */
do i = 1 to p while(ExpectedLen[i] < Proportion[i]);
   keep = i;
end;
print Proportion ExpectedLen keep;
As seen in the graph, only the first component is retained under the broken-stick model.
The average-eigenvalue test (Kaiser-Guttman test) retains the eigenvalues that exceed the average eigenvalue. For a p x p correlation matrix, the sum of the eigenvalues is p, so the average value of the eigenvalues is 1. To account for sampling variability, Jolliffe (1972) suggested a more liberal criterion: retain eigenvalues greater than 0.7 times the average eigenvalue. These two suggestions are implemented below:
/* Average Root (Kaiser 1960; Guttman 1954; J. E. Jackson, p. 47) */
mean = mean(Proportion);
keepAvg = loc( Proportion >= mean )[<>];

/* Scaled Average Root (Jolliffe 1972; J. E. Jackson, p. 47-48) */
keepScaled = loc( Proportion >= 0.7*mean )[<>];
print keepAvg keepScaled;
For completeness, the following statements write the broken-stick proportions to a SAS data set and call PROC SGPLOT to overlay the proportion of variance for the observed data on the broken-stick model:
/* write expected proportions for broken-stick model */
create S var {"Number" "Proportion" "ExpectedLen"};
append;
close;
quit;

title "Broken Stick Method for Retaining Principal Components";
proc sgplot data=S;
   label ExpectedLen = "Broken-Stick Rule"
         Proportion  = "Proportion of Variance"
         Number      = "Number of Components";
   series x=Number y=ExpectedLen / lineattrs=(pattern=dash);
   series x=Number y=Proportion / lineattrs=(thickness=2);
   keylegend / location=inside position=topright across=1;
   xaxis grid;
   yaxis label = "Proportion of Variance";
run;
Sometimes people ask why PROC PRINCOMP doesn't automatically choose the "correct" number of PCs to use for dimension reduction. This article describes four popular heuristic rules, all of which give different answers! The rules in this article are the scree test (2 or 4 components), the broken-stick rule (1 component), the average eigenvalue rule (2 components), and the scaled eigenvalue rule (3 components).
So how should a practicing statistician decide how many PCs to retain? First, remember that these guidelines do not tell you how many components to keep, they merely make suggestions. Second, recognize that any reduction of dimension requires a trade-off between accuracy (high dimensions) and interpretability (low dimensions). Third, these rules—although helpful—cannot replace domain-specific knowledge of the data. Try each suggestion and see if the resulting model contains the features in the data that are important for your analysis.
The post Dimension reduction: Guidelines for retaining principal components appeared first on The DO Loop.
]]>The post Flip it. Flip it good. appeared first on The DO Loop.
]]>No, SAS/IML does not have equivalent built-in functions, but as Devo once sang: "When a problem comes along, you must whip it." This article shows how to implement these functions in SAS/IML. Or, to paraphrase Devo's maxim: "When a matrix comes along, you must flip it. It's not too late to flip it. Flip it good!"
The key to reversing the columns or rows of a matrix is to use subscript notation. Recall that in the SAS/IML language, the expression A[ ,1] extracts the first column of A. (When the row subscript is not specified, it means "use all rows.") Similarly, A[ ,1:3] extracts the first three columns of A (in order), whereas A[ ,3:1] extracts the first three columns of A in reverse order. A similar notation is valid for rows. The expression A[3:1, ] extracts the first three rows of A in reverse order. Therefore, the following SAS/IML functions implement the functionality of the fliplr and flipud functions for matrices:
proc iml;
/* reverse columns of a matrix */
start fliplr(A);
   return A[ , ncol(A):1];
finish;

/* reverse rows of a matrix */
start flipud(A);
   return A[nrow(A):1, ];
finish;

/* test the functions on a 2x3 matrix */
x = {1 2 3,
     4 5 6};
lr = fliplr(x);
ud = flipud(x);
print x, lr, ud;
You can also reverse all elements of a matrix. Recall that a SAS/IML matrix is stored in row-major order, which means that the elements are enumerated by counting across the first row, then across the second row, and so forth. Thus if A is an n x m matrix, then A[n*m:1] returns the elements of A in reverse order. However, the elements are returned in a column vector. If you want the elements in a matrix that has the same dimensions as A, you can "shape it up" by using the SHAPE function, as follows:
/* reverse elements of a matrix */
start reverse(A);
   n = nrow(A);  m = ncol(A);
   return shape(A[n*m:1], n, m);   /* Now flip it into SHAPE */
finish;

rev = reverse(x);
print rev;
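For readers who work in both languages, the three reversals have direct NumPy equivalents via slice notation. This is an illustrative aside, not part of the SAS program; note that numpy's default row-major order matches the SAS/IML storage convention:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

lr  = x[:, ::-1]                        # fliplr: reverse columns
ud  = x[::-1, :]                        # flipud: reverse rows
rev = x.ravel()[::-1].reshape(x.shape)  # reverse all elements, keep the shape

print(lr)
print(ud)
print(rev)
```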
The MATLAB language also supports the rot90 function, which rotates a matrix. I have previously shown how to rotate the elements of a matrix in the SAS/IML language.
In conclusion, although SAS/IML does not contain built-in functions for reversing the rows and columns of a matrix, these functions are easily defined in terms of matrix subscript operations. In short, when given a matrix, it is not hard to flip it, flip it good!
The post Flip it. Flip it good. appeared first on The DO Loop.
]]>The post Random segments and broken sticks appeared first on The DO Loop.
]]>You can find a discussion and solution to these problems on many websites, but I like the Cut-The-Knot.org website, which includes proofs and interactive Java applets.
You can simulate these problems in SAS by writing a DATA step or a SAS/IML program. I discuss the DATA step at the end of this article. The body of this article presents a SAS/IML simulation and constructs helper modules that solve the general problem. The simulation does the following: generate k random points in (0, 1) for each of many samples, sort the points in each sample, compute the lengths of the k+1 segments, and find the length of the largest segment in each sample.
In many languages (including the SAS DATA step), you would write a loop that performs these operations for each random sample. You would then estimate the expected length by computing the mean value of the largest segment for each sample. However, in the SAS/IML language, you can use matrices instead of using a loop. Each sample of random points can be held in the column of a matrix. The lengths of the segments can also be held in a matrix. The largest segment for each trial is stored in a row vector.
The following SAS/IML modules help solve the general simulation problem for k random points. Because the points are ordered, the lengths of the segments are the differences between adjacent rows. You can use the DIF function for this computation, but the following program uses the DifOp module to construct a small difference operator, and it uses matrix multiplication to compute the differences.
proc iml;
/* Independently sort columns in a matrix. See
   http://blogs.sas.com/content/iml/2011/03/14/sorting-rows-of-a-matrix.html */
start SortCols(A);
   do i = 1 to ncol(A);
      v = A[ ,i];  call sort(v);  A[ ,i] = v;   /* get i_th col and sort it */
   end;
finish;

/* Generate a random (k x NSim) matrix of points, then sort each column. */
start GenPts(k, NSim);
   x = j(k, NSim);               /* allocate k x NSim matrix */
   call randgen(x, "Uniform");   /* fill with random uniform in (0,1) */
   if k > 1 then run SortCols(x);
   return x;
finish;

/* Return matrix for difference operator. See
   http://blogs.sas.com/content/iml/2017/07/24/difference-operators-matrices.html */
start DifOp(dim);
   D = j(dim-1, dim, 0);         /* allocate zero matrix */
   n = nrow(D);  m = ncol(D);
   D[do(1, n*m, m+1)] = -1;      /* assign -1 to diagonal elements */
   D[do(2, n*m, m+1)] =  1;      /* assign +1 to super-diagonal elements */
   return D;
finish;

/* Find lengths of segments formed by k points in the columns of x.
   Assume each column of x is sorted and all points are in (0,1). */
start SegLengths(x);
   /* append 0 and 1 to top and bottom (respectively) of each column */
   P = j(1, ncol(x), 0) // x // j(1, ncol(x), 1);
   D = DifOp(nrow(P));           /* construct difference operator */
   return D*P;                   /* use difference operator to find lengths */
finish;

P = {0.1 0.2 0.3,
     0.3 0.8 0.5,
     0.7 0.9 0.8 };
L = SegLengths(P);
print L[label="Length (k=3)"];
The table shows the lengths of three different sets of points for k=3. The first column of P corresponds to points at locations {0.1, 0.3, 0.7}. These three points divide the interval [0, 1] into four segments of lengths 0.1, 0.2, 0.4, and 0.3. Similar computations hold for the other columns.
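The segment-length computation is easy to cross-check in Python, where numpy's diff function can prepend the 0 and append the 1 directly. This sketch mirrors the idea of the SegLengths module; it is my illustration, not the article's code:

```python
import numpy as np

def seg_lengths(pts):
    """Lengths of the segments that sorted points in (0,1) cut from [0,1].
       pts has one sample per column; points are sorted within each column."""
    return np.diff(pts, axis=0, prepend=0.0, append=1.0)

P = np.array([[0.1, 0.2, 0.3],
              [0.3, 0.8, 0.5],
              [0.7, 0.9, 0.8]])
L = seg_lengths(P)
print(L[:, 0])   # first column: segment lengths 0.1, 0.2, 0.4, 0.3
```

In each column the k+1 lengths sum to 1, which is a handy sanity check.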
For k=1, the problem generates a random point in (0, 1) and asks for the expected length of the longer segment. Obviously the expected length is greater than 1/2, and you can read the Cut-The-Knot website to find a proof that shows that the expected length is 3/4 = 0.75.
The following SAS/IML statements generate one million random points and compute the larger of the segment lengths. The average value of the larger segments is computed and is very close to the expected value:
call randseed(54321);
k = 1;  NSim = 1E6;
x = GenPts(k, NSim);     /* simulations of 1 point dropped onto (0,1) */
L = SegLengths(x);       /* lengths of segments */
Largest = L[<>, ];       /* max length among the segments */
mean = mean(Largest`);   /* average of the max lengths */
print mean;
You might not be familiar with the SAS/IML max subscript operator (<>) and the min subscript operator (><). These operators compute the minimum or maximum values for each row or column in a matrix.
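If it helps to see the idea in another language, the analogous column-wise reductions in numpy are the axis-aware max and min. This is an aside for comparison; the SAS/IML subscript operators themselves are the ones described above:

```python
import numpy as np

L = np.array([[0.1, 0.6],
              [0.5, 0.3],
              [0.4, 0.1]])
col_max = L.max(axis=0)   # analogous to L[<>, ]: max of each column
col_min = L.min(axis=0)   # analogous to L[><, ]: min of each column
print(col_max, col_min)
```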
For k=2, the problem generates two random points in (0, 1) and asks for the expected length of the longest segment. You can also ask for the average shortest length. The Cut-The-Knot website shows that the expected length for the longest segment is 11/18 = 0.611, whereas the expected length of the shortest segment is 2/18 = 0.111.
The following SAS/IML statements simulate choosing two random points on one million unit intervals. The program computes the one million lengths for the resulting longest and shortest segments. Again, the average values of the segments are very close to the expected values:
k = 2;  NSim = 1E6;
x = GenPts(k, NSim);     /* simulations of 2 points dropped onto (0,1) */
L = SegLengths(x);       /* lengths of segments */
maxL = L[<>, ];          /* max length among the segments */
meanMax = mean(maxL`);   /* average of the max lengths */
minL = L[><, ];          /* min length among the segments */
meanMin = mean(minL`);   /* average of the min lengths */
print meanMin meanMax;
You can use the previous simulation to estimate the broken stick probability. Recall that three line segments can form a triangle provided that they satisfy the triangle inequality: the sum of the two smaller lengths must be greater than the third length. If you randomly choose two points in (0,1), the probability that the resulting three segments can form a triangle is 1/4, which is smaller than what most people would guess.
The vectors maxL and minL each contain one million lengths, so it is trivial to compute the vector that contains the third (middle) lengths.
/* what proportion of randomly broken sticks form triangles? */
medL = 1 - maxL - minL;               /* compute middle length */
isTriangle = (maxL <= minL + medL);   /* do lengths satisfy triangle inequality? */
prop = mean(isTriangle`);             /* proportion of segments that form a triangle */
print prop;
As expected, about 0.25 of the simulations resulted in segments that satisfy the triangle inequality.
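The three theoretical values for k=2 (11/18 for the longest segment, 1/9 for the shortest, and 1/4 for the triangle probability) can also be confirmed with a compact Monte Carlo cross-check in Python. The sample size and seed here are my own choices:

```python
import numpy as np

rng = np.random.default_rng(54321)
n = 1_000_000
pts = np.sort(rng.random((n, 2)), axis=1)                 # two sorted points per stick
lengths = np.diff(pts, axis=1, prepend=0.0, append=1.0)   # three segment lengths

max_len = lengths.max(axis=1)
min_len = lengths.min(axis=1)
mid_len = 1.0 - max_len - min_len
is_triangle = max_len <= min_len + mid_len                # triangle inequality

print(max_len.mean(), min_len.mean(), is_triangle.mean())
```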
In conclusion, this article shows how to use the SAS/IML language to solve several classical problems in probability. By using matrices, you can run the simulation by using vectorized computations such as matrix multiplication and finding the minimum or maximum values of columns. (However, I had to use a loop to sort the points. Bummer!)
If you want to try this simulation yourself in the DATA step, I suggest that you transpose the SAS/IML setup. Use arrays to hold the random points and use the CALL SORTN subroutine to sort the points. Use the LARGEST function and the SMALLEST function to compute the largest and smallest elements in an array. Feel free to post your solution (and any other thoughts) in the comments.
The post Random segments and broken sticks appeared first on The DO Loop.
]]>The post Difference operators as matrices appeared first on The DO Loop.
]]>For example, the following SAS/IML statements define a column vector that has five observations and calls the DIF function to compute the first-order differences between adjacent observations. By convention, the DIF function returns a vector that is the same size as the input vector and inserts a missing value in the first element.
proc iml;
x = {0, 0.1, 0.3, 0.7, 1};
dif = dif(x);   /* by default DIF(x, 1) ==> first-order differences */
print x dif;
The difference operator is a linear operator that can be represented by a matrix. The first nonmissing value of the difference is x[2]-x[1], followed by x[3]-x[2], and so forth. Thus the linear operator can be represented by the matrix that has -1 on the main diagonal and +1 on the super-diagonal (above the diagonal). An efficient way to construct the difference operator is to start with the zero matrix and insert ±1 on the diagonal and super-diagonal elements. You can use the DO function to construct the indices for the diagonal and super-diagonal elements in a matrix:
start DifOp(dim);
   D = j(dim-1, dim, 0);           /* allocate zero matrix */
   n = nrow(D);  m = ncol(D);
   diagIdx  = do(1, n*m, m+1);     /* index diagonal elements */
   superIdx = do(2, n*m, m+1);     /* index superdiagonal elements */
   *subIdx  = do(m+1, n*m, m+1);   /* index subdiagonal elements (optional) */
   D[diagIdx]  = -1;               /* assign -1 to diagonal elements */
   D[superIdx] =  1;               /* assign +1 to super-diagonal elements */
   return D;
finish;

B = DifOp(nrow(x));
d = B*x;
print B, d[L="Difference"];
You can see that the DifOp function constructs an (n-1) x n matrix, which is the correct dimension for transforming an n-dimensional vector into an (n-1)-dimensional vector. Notice that the matrix multiplication omits the element that previously held a missing value.
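The same (n-1) x n difference operator can be assembled in one line in Python with numpy's eye function. This is a sketch of the idea with my own function name, not a translation of the article's program:

```python
import numpy as np

def dif_op(dim):
    """(dim-1) x dim first-difference operator:
       -1 on the main diagonal, +1 on the super-diagonal."""
    return -np.eye(dim - 1, dim) + np.eye(dim - 1, dim, k=1)

x = np.array([0, 0.1, 0.3, 0.7, 1.0])
B = dif_op(len(x))
print(B @ x)   # first-order differences: 0.1, 0.2, 0.4, 0.3
```

Applying the operator agrees with numpy's built-in diff, which makes a convenient unit test for the construction.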
You probably would not use a matrix multiplication in place of the DIF function if you needed the first-order difference for a single time series. However, the matrix formulation makes it possible to use one matrix multiplication to find the difference for many time series.
The following matrix contains three time-series, one in each column. The B matrix computes the first-order difference for all columns by using a single matrix-matrix multiplication. The same SAS/IML code is valid whether the X matrix has three columns or three million columns.
/* The matrix can operate on a matrix where each column is a time series */
x = {0    0    0,
     0.1  0.2  0.3,
     0.3  0.8  0.5,
     0.7  0.9  0.8,
     1    1    1 };
B = DifOp(nrow(x));
d = B*x;   /* apply the difference operator */
print d[L="Difference of Columns"];
Other operators in time series analysis can also be represented by matrices. For example, the first-order lag operator is represented by a matrix that has +1 on the super-diagonal. Moving average operators also have matrix representations.
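As a sketch of that claim in numpy (following the same (n-1) x n convention as the difference operator above; the function name and example are my own illustration):

```python
import numpy as np

def lag_op(dim):
    """(dim-1) x dim matrix with +1 on the super-diagonal; applied to a
       column of values x1..xn it returns x2..xn (a one-step shift)."""
    return np.eye(dim - 1, dim, k=1)

x = np.array([10.0, 20.0, 30.0, 40.0])
print(lag_op(len(x)) @ x)   # the shifted series: 20, 30, 40
```

Note that the difference operator from the previous section is exactly this shift matrix minus the corresponding (n-1) x n identity slice, which is another way to see why its diagonals hold -1 and +1.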
The matrix formulation is efficient for short time series but is not efficient for a time series that contains thousands of elements. If the time series contains n elements, then the dense-matrix representation of the difference operator contains about n² elements, which consumes a lot of RAM when n is large. However, as we have seen, the matrix representation of an operator is advantageous when you want to operate on a large number of short time series, as might arise in a simulation.
The post Difference operators as matrices appeared first on The DO Loop.
]]>The post A quantile definition for skewness appeared first on The DO Loop.
]]>Moment-based statistics are sensitive to extreme outliers. A single extreme observation can radically change the mean, standard deviation, and skewness of data. It is not surprising, therefore, that there are alternative definitions of skewness. One robust definition of skewness that is intuitive and easy to compute is a quantile definition, which is also known as the Bowley skewness or Galton skewness.
The quantile definition of skewness uses Q1 (the lower quartile value), Q2 (the median value), and Q3 (the upper quartile value). You can measure skewness as the difference between the lengths of the upper quartile (Q3-Q2) and the lower quartile (Q2-Q1), normalized by the length of the interquartile range (Q3-Q1). In symbols, the quantile skewness is γ_{Q} = ((Q3-Q2) - (Q2-Q1)) / (Q3-Q1) = (Q3 - 2*Q2 + Q1) / (Q3-Q1).
You can visualize this definition by using the figure to the right. For a symmetric distribution, the quantile skewness is 0 because the length Q3-Q2 is equal to the length Q2-Q1. If the right length (Q3-Q2) is larger than the left length (Q2-Q1), then the quantile skewness is positive. If the left length is larger, then the quantile skewness is negative. For the extreme cases when Q1=Q2 or Q2=Q3, the quantile skewness is ±1. Consequently, whereas the Pearson skewness can be any real value, the quantile skewness is bounded in the interval [-1, 1]. The quantile skewness is not defined if Q1=Q3, just as the Pearson skewness is not defined when the variance of the data is 0.
There is an intuitive interpretation for the quantile skewness formula. Recall that the relative difference between two quantities R and L can be defined as their difference divided by their average value. In symbols, RelDiff = (R - L) / ((R+L)/2). If you choose R to be the length Q3-Q2 and L to be the length Q2-Q1, then quantile skewness is half the relative difference between the lengths.
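The definition translates directly into code. Here is a small Python version (my own helper function, mirroring the formula) applied to a symmetric sample and a right-skewed sample:

```python
import numpy as np

def quantile_skewness(x):
    """Bowley-Galton skewness: ((Q3-Q2) - (Q2-Q1)) / (Q3-Q1)."""
    q1, q2, q3 = np.quantile(x, [0.25, 0.50, 0.75])
    return (q3 - 2*q2 + q1) / (q3 - q1)

rng = np.random.default_rng(12345)
sym  = rng.normal(size=100_000)            # symmetric sample
skew = rng.gamma(shape=4, size=100_000)    # right-skewed Gamma(4) sample
print(quantile_skewness(sym), quantile_skewness(skew))
```

The symmetric sample gives a value near 0 and the Gamma(4) sample gives a small positive value, consistent with the bounded range [-1, 1] discussed above.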
It is instructive to simulate some skewed data and compute the two measures of skewness. The following SAS/IML statements simulate 1000 observations from a Gamma(a=4) distribution. The Pearson skewness of a Gamma(a) distribution is 2/sqrt(a), so the Pearson skewness for a Gamma(4) distribution is 1. For a large sample, the sample skewness should be close to the theoretical value. The QNTL call computes the quantiles of a sample.
/* compute the quantile skewness for data */
proc iml;
call randseed(12345);
x = j(1000, 1);
call randgen(x, "Gamma", 4);

skewPearson = skewness(x);          /* Pearson skewness */
call qntl(q, x, {0.25 0.5 0.75});   /* sample quartiles */
skewQuantile = (q[3] - 2*q[2] + q[1]) / (q[3] - q[1]);
print skewPearson skewQuantile;
For this sample, the Pearson skewness is 1.03 and the quantile skewness is 0.174. If you generate a different random sample from the same Gamma(4) distribution, the statistics will change slightly.
In general, there is no simple relationship between quantile skewness and Pearson skewness for a data distribution. (This is not surprising: there is also no simple relationship between a median and a mean, nor between the interquartile range and the standard deviation.) Nevertheless, it is interesting to compare the Pearson skewness to the quantile skewness for a particular probability distribution.
For many probability distributions, the Pearson skewness is a function of the parameters of the distribution. To compute the quantile skewness for a probability distribution, you can use the quantiles for the distribution. The following SAS/IML statements compute the skewness for the Gamma(a) distribution for varying values of a.
/* For Gamma(a), the Pearson skewness is skewP = 2 / sqrt(a).
   Use the QUANTILE function to compute the quantile skewness for the distribution. */
skewP = do(0.02, 10, 0.02);   /* Pearson skewness for distribution */
a = 4 / skewP##2;             /* invert skewness formula for the Gamma(a) distribution */
skewQ = j(1, ncol(skewP));    /* allocate vector for results */
do i = 1 to ncol(skewP);
   Q1 = quantile("Gamma", 0.25, a[i]);
   Q2 = quantile("Gamma", 0.50, a[i]);
   Q3 = quantile("Gamma", 0.75, a[i]);
   skewQ[i] = (Q3 - 2*Q2 + Q1) / (Q3 - Q1);   /* quantile skewness for distribution */
end;

title "Pearson vs. Quantile Skewness";
title2 "Gamma(a) Distributions";
call series(skewP, skewQ) grid={x y} label={"Pearson Skewness" "Quantile Skewness"};
The graph shows a nonlinear relationship between the two skewness measures. This graph is for the Gamma distribution; other distributions would have a different shape. If a distribution has a parameter value for which the distribution is symmetric, then the graph will go through the point (0,0). For highly skewed distributions, the quantile skewness will approach ±1 as the Pearson skewness approaches ±∞.
Several researchers have noted that there is nothing special about using the first and third quartiles to measure skewness. An alternative formula (sometimes called Kelly's coefficient of skewness) is to use deciles: γ_{Kelly} = ((P90 - P50) - (P50 - P10)) / (P90 - P10). Hinkley (1975) considered the q_th and (1-q)_th quantiles for arbitrary values of q.
The quantile definition of skewness is easy to compute. In fact, you can compute the statistic by hand without a calculator for small data sets. Consequently, the quantile definition provides an easy way to quickly estimate the skewness of data. Since the definition uses only quantiles, the quantile skewness is robust to extreme outliers.
At the same time, the Bowley-Galton quantile definition has several disadvantages. It uses only the central 50% of the data to estimate the skewness. Two different data sets that have the same quartile statistics will have the same quantile skewness, regardless of the shape of the tails of the distribution. And, as mentioned previously, the use of the 25th and 75th percentiles are somewhat arbitrary.
Although the Pearson skewness is widely used in the statistical community, it is worth mentioning that the quantile definition is ideal for use with a box-and-whisker plot. The Q1, Q2, and Q3 quartiles are part of every box plot. Therefore you can visually estimate the quantile skewness as the relative difference between the lengths of the upper and lower boxes.
The post A quantile definition for skewness appeared first on The DO Loop.
]]>The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.
]]>SAS software provides several procedures for building parametric classification models, including the LOGISTIC and DISCRIM procedures. SAS also provides various nonparametric models, such as spline effects, additive models, and neural networks.
For each input, the statistical model predicts an outcome. Thus the model divides the input space into disjoint regions for which the first outcome is the most probable, for which the second outcome is the most probable, and so forth. In many textbooks and papers, the classification problem is illustrated by using a two-dimensional graph that shows the prediction regions overlaid with the training data, as shown in the adjacent image which visualizes a binary outcome and a linear boundary between regions. (Click to enlarge.)
This article shows three ways to visualize prediction regions in SAS: plot polygons that represent the regions, create a contour plot of the predicted probability, and score the model on a dense grid of markers that are colored by the predicted outcome.
This article uses logistic regression to discriminate between two outcomes, but the principles apply to other methods as well. The SAS documentation for the DISCRIM procedure contains some macros that visualize the prediction regions for the output from PROC DISCRIM. I am grateful to my colleague Chris B. for discussions relevant to this topic.
To illustrate the classification problem, consider some simulated data in which the Y variable is a binary outcome and the X1 and X2 variables are continuous explanatory variables. The following call to PROC LOGISTIC fits a logistic model and displays the parameter estimates. The STORE statement creates an item store that enables you to evaluate (score) the model on future observations. The DATA step creates a grid of evenly spaced points in the (x1, x2) coordinates, and the call to PROC PLM scores the model at those locations. In the PRED data set, GX and GY are the coordinates on the regular grid and PREDICTED is the probability that Y=1.
proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;
   store work.LogiModel;   /* save model to item store */
run;

data Grid;   /* create grid in (x1,x2) coords */
do x1 = 0 to 1 by 0.02;
   do x2 = -7.5 to 7.5 by 0.3;
      output;
   end;
end;
run;

proc plm restore=work.LogiModel;   /* use PROC PLM to score model on a grid */
   score data=Grid out=Pred(rename=(x1=gx x2=gy)) / ilink;   /* evaluate the model on new data */
run;
This method is only useful for simple parametric models. Recall that the logistic function is 0.5 when its argument is zero, so the level set for 0 of the linear predictor divides the input space into prediction regions. For the parameter estimates shown to the right, the level set {(x1,x2) | 2.3565 -4.7618*x1 + 0.7959*x2 = 0} is the boundary between the two prediction regions. This level set is the graph of the linear function x2 = (-2.3565 + 4.7618*x1)/0.7959. You can compute two polygons that represent the regions: let x1 vary between [0,1] (the horizontal range of the data) and use the formula to evaluate x2, or assign x2 to be the minimum or maximum vertical value of the data.
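Using the parameter estimates quoted above, the polygon construction is just algebra. The following Python sketch builds the two polygons; the variable names and vertex ordering are my own choices:

```python
# Boundary of the two prediction regions for the fitted logistic model:
# 2.3565 - 4.7618*x1 + 0.7959*x2 = 0  =>  x2 = (-2.3565 + 4.7618*x1) / 0.7959
b0, b1, b2 = 2.3565, -4.7618, 0.7959

def boundary(x1):
    """x2 value on the 0.5-probability level set for a given x1."""
    return -(b0 + b1 * x1) / b2

y_min, y_max = -7.5, 7.5      # vertical range of the data grid
x_left, x_right = 0.0, 1.0    # horizontal range of the data

# the region below the line and the region above it, as four-vertex polygons
lower = [(x_left, y_min), (x_right, y_min),
         (x_right, boundary(x_right)), (x_left, boundary(x_left))]
upper = [(x_left, boundary(x_left)), (x_right, boundary(x_right)),
         (x_right, y_max), (x_left, y_max)]
print(boundary(0.0), boundary(1.0))
```

Each polygon's vertex list could then be written to a data set for the POLYGON statement in PROC SGPLOT.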
After you have computed polygonal regions, you can use the POLYGON statement in PROC SGPLOT to visualize the regions. The graph is shown at the top of this article. The drawbacks of this method are that it requires a parametric model for which one variable is an explicit function of the other. However, it creates a beautiful image!
Given an input value, many statistical models produce probabilities for each outcome. If there are only two outcomes, you can plot a contour plot of the probability of the first outcome. The 0.5 contour divides the feature space into disjoint regions.
There are two ways to create such a contour plot. The easiest way is to use the EFFECTPLOT statement, which is supported in many SAS/STAT regression procedures. The following statements show how to use the EFFECTPLOT statement in PROC LOGISTIC to create a contour plot, as shown to the right:
proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;
   effectplot contour(x=x1 y=x2);   /* 2. contour plot with scatter plot overlay */
run;
Unfortunately, not every SAS procedure supports the EFFECTPLOT statement. An alternative is to score the model on a regular grid of points and use the Graph Template Language (GTL) to create a contour plot of the probability surface. You can read my previous article about how to use the GTL to create a contour plot.
The drawback of this method is that it only applies to binary outcomes. The advantage is that it is easy to implement, especially if the modeling procedure supports the EFFECTPLOT statement.
In this method, you score the model on a grid of points to obtain the predicted outcome at each grid point. You then create a scatter plot of the grid, where the markers are colored by the outcome, as shown in the graph to the right.
When you create this graph, you get to choose how large to make the dots in the background. The image to the right uses small markers, which is the technique used by Hastie, Tibshirani, and Friedman in their book The Elements of Statistical Learning. If you use square markers and increase the size of the markers, eventually the markers tile the entire background, which makes it look like the polygon plot at the beginning of this article. You might need to adjust the vertical and horizontal pixels of the graph to get the background markers to tile without overlapping each other.
This method has several advantages. It is the most general method and can be used for any procedure and for any number of outcome categories. It is easy to implement because it merely uses the model to predict the outcomes on a grid of points. The disadvantage is that choosing the size of the background markers is a matter of trial and error; you might need several attempts before you create a graph that looks good.
This article has shown several techniques for visualizing the predicted outcomes for a model that has two independent variables. The first method is limited to simple parametric models, the second is restricted to binary outcomes, and the third is a general technique that requires scoring the model on a regular grid of inputs. Whichever method you choose, PROC SGPLOT and the Graph Template Language in SAS can help you to visualize different methods for the classification problem in machine learning.
You can download the SAS program that produces the graphs in this article. Which image do you like the best? Do you have a better visualization? Leave a comment!
The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.
]]>