The post Create your own version of Anscombe's quartet: Dissimilar data that have similar statistics appeared first on The DO Loop.

The Wikipedia article states, "It is not known how Anscombe created his data sets." Really? Creating different data sets that have the same statistics sounds like a fun challenge! As a tribute to Anscombe, I decided to generate my own versions of the two data sets shown in the previous scatter plots. The first data set is linear with normal errors. The second is quadratic (without errors) and has the exact same linear fit and correlation coefficient as the first data set.

The Wikipedia article notes that there are "several methods to generate similar data sets with identical statistics and dissimilar graphics," but I did not look at the modern papers. I wanted to try it on my own. If you want to solve the problem on your own, stop reading now!

I used the following approach to construct the first two data sets:

- Use a simulation to create linear data with random normal errors: Y_{1} = 3 + 0.5 X + ε, where ε ~ N(0,1).
- Compute the regression estimates (b0 and b1) and the sample correlation for the linear data. These are the target statistics. They define three equations that the second data set must match.
- The response variable for the second data set is of the form Y_{2} = β_{0} + β_{1} X + β_{2} X^{2}. There are three equations and three unknown parameters, so you can solve a system of nonlinear equations to find β.

From geometric reasoning, there are three different solutions for the β parameter: one with β_{2} > 0 (a parabola that opens up), one with β_{2} = 0 (a straight line), and one with β_{2} < 0 (a parabola that opens down). Since Anscombe used a downward-pointing parabola, I will make the same choice.

You can construct the first data set in a number of ways, but I chose to construct it randomly. The following SAS/IML statements construct the data and define a helper function (LinearReg). The program computes the target values, which are the parameter estimates for a linear regression and the sample correlation for the data:

proc iml;
call randseed(12345);
x = T( do(4, 14, 0.2) );                               /* evenly spaced X */
eps = round( randfun(nrow(x), "Normal", 0, 1), 0.01);  /* normal error */
y = 3 + 0.5*x + eps;                                   /* linear Y + error */

/* Helper function. Return parameter estimates for linear regression.
   Args are col vectors */
start LinearReg(Y, tX);
   X = j(nrow(tX), 1, 1) || tX;
   b = solve(X`*X, X`*Y);        /* solve normal equations */
   return b;
finish;

targetB = LinearReg(y, x);       /* compute regression estimates */
targetCorr = corr(y||x)[2];      /* compute sample correlation */
print (targetB`||targetCorr)[c={'b0' 'b1' 'corr'} F=5.3 L="Target"];

You can use these values as the target values. The next step is to find a parameter vector β such that
Y_{2} = β_{0} + β_{1} X + β_{2} X^{2} has the same regression line and the same sample correlation with X. For uniqueness, require β_{2} < 0.

You can formulate the problem as a system of equations and use the NLPHQN subroutine in SAS/IML to solve it. (SAS supports multiple ways to solve a system of equations.)
The following SAS/IML statements define two functions. Given any value for the β parameter, the first function returns the regression estimates and sample correlation between Y_{2} and X. The second function is the objective function for an optimization. It subtracts the target values from the estimates. The NLPHQN subroutine implements a hybrid quasi-Newton optimization routine that uses least squares techniques to find the β parameter that generates quadratic data that tries to match the target statistics.

/* Define system of simultaneous equations:
   https://blogs.sas.com/content/iml/2018/02/28/solve-system-nonlinear-equations-sas.html */
/* This function returns linear regression estimates (b0, b1) and
   correlation for a choice of beta */
start LinearFitCorr(beta) global(x);
   y2 = beta[1] + beta[2]*x + beta[3]*x##2;   /* construct quadratic Y */
   b = LinearReg(y2, x);     /* linear fit */
   corr = corr(y2||x)[2];    /* sample corr */
   return ( b` || corr );    /* return row vector */
finish;

/* This function returns the vector quantity (estimates - target).
   Find the value that minimizes Sum_i | F_i(beta) - Target_i |^2 */
start Func(beta) global(targetB, targetCorr);
   target = rowvec(targetB) || targetCorr;
   G = LinearFitCorr(beta) - target;
   return( G );              /* return row vector */
finish;

/* now let's solve for quadratic parameters so that the same
   linear fit and correlation occur */
beta0 = {-5 1 -0.1};         /* initial guess */
con = {.  .  . ,             /* constraint matrix */
       0  .  0 };            /* quadratic term is negative */
optn = ncol(beta0) || 0;     /* LS with 3 components || amount of printing */
/* minimize sum( F_i(beta) - Target_i )**2 */
call nlphqn(rc, beta, "Func", beta0, optn) blc=con;   /* LS solution */
print beta[L="Optimal beta"];

/* How nearly does the solution solve the problem?
   Did we match the target values? */
Y2Stats = LinearFitCorr(beta);
print Y2Stats[c={'b0' 'b1' 'corr'} F=5.3];

The first output shows that the linear fit and correlation statistics for the linear and quadratic data are identical (to 3 decimal places). Anscombe would be proud! The second output shows the parameters for the quadratic response: Y_{2} = -4.955 + 2.566*X - 0.118*X^{2}.
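If you want to check the arithmetic outside of SAS, the same three-equation system can be solved with a much simpler method than a quasi-Newton routine. Because least squares fitting is linear in the response, you can choose β_{0} and β_{1} to match the target line exactly for any value of β_{2}, and then bisect on β_{2} < 0 until the correlation matches. The following pure-Python sketch is illustrative only: the function names are my own, and the target values (3.561, 0.447, 0.816) are stand-ins taken from the regression line plotted later in this post and from Anscombe's classic correlation, not output of the SAS program above.

```python
import math

def lin_fit(x, y):
    """Least squares line y ~ b0 + b1*x; return (b0, b1)."""
    n = len(x)
    mx, my = sum(x)/n, sum(y)/n
    sxy = sum((a-mx)*(b-my) for a, b in zip(x, y))
    sxx = sum((a-mx)**2 for a in x)
    b1 = sxy/sxx
    return my - b1*mx, b1

def corr(x, y):
    """Sample (Pearson) correlation of two sequences."""
    n = len(x)
    mx, my = sum(x)/n, sum(y)/n
    sxy = sum((a-mx)*(b-my) for a, b in zip(x, y))
    sxx = sum((a-mx)**2 for a in x)
    syy = sum((b-my)**2 for b in y)
    return sxy / math.sqrt(sxx*syy)

def quadratic_from_targets(x, b0_t, b1_t, r_t):
    """Find (b0, b1, b2) with b2 < 0 so that y2 = b0 + b1*x + b2*x^2
    has linear fit (b0_t, b1_t) and corr(y2, x) = r_t."""
    a2, c2 = lin_fit(x, [xi*xi for xi in x])   # linear fit of x^2 on x
    def corr_for(b2):
        # pick b0, b1 so the linear fit of y2 matches the targets exactly
        beta = (b0_t - b2*a2, b1_t - b2*c2, b2)
        y2 = [beta[0] + beta[1]*xi + beta[2]*xi*xi for xi in x]
        return beta, corr(y2, x)
    lo, hi = -1.0, 0.0          # at b2=0 the data are linear, so corr ~ 1 > r_t
    while corr_for(lo)[1] > r_t:
        lo *= 2                 # march downward until corr falls below target
    for _ in range(100):        # bisect on b2: corr decreases as |b2| grows
        mid = (lo + hi) / 2
        if corr_for(mid)[1] > r_t:
            hi = mid
        else:
            lo = mid
    return corr_for((lo + hi) / 2)[0]

x = [4 + 0.2*i for i in range(51)]             # evenly spaced X, as in the post
beta = quadratic_from_targets(x, 3.561, 0.447, 0.816)
y2 = [beta[0] + beta[1]*xi + beta[2]*xi*xi for xi in x]
print(beta, lin_fit(x, y2), corr(x, y2))
```

The reduction works because the centered quadratic term can be split into a component along x and a component orthogonal to x; only the orthogonal part changes the correlation, and it does so monotonically in |β_{2}|, so the bisection is guaranteed to converge.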
The following statements create scatter plots of the new Anscombe-like data:

y2 = beta[1] + beta[2]*x + beta[3]*x##2;
create Anscombe2 var {x y y2};  append;  close;
QUIT;

ods layout gridded columns=2 advance=table;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=X y=y;
   lineparm x=0 y=3.561 slope=0.447 / clip;
run;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=x y=y2;
   lineparm x=0 y=3.561 slope=0.447 / clip;
run;
ods layout end;

Notice that the construction of the second data set depends only on the statistics for the first data. If you modify the first data set, the second will automatically adapt. For example, you could choose the errors manually instead of randomly, and the statistics for the second data set should still match.

You can create the other data sets similarly. For example:

- For the data set that consists of points along a line except for one outlier, there are three free parameters. Most of the points fall along the line Y_{3} = a + b*X and one point is at the height Y_{3} = c. Therefore, you can run an optimization to solve for the values of (a, b, c) that match the target statistics.
- I'm not sure how to formulate the requirements for the fourth data set. It looks like all but one point have coordinates (X, Y_{4}), where X is a fixed value and the vertical mean of the cluster is c. The outlier has coordinates (a, b). I'm not sure whether the variance of the cluster is important.

In summary, you can solve a system of equations to construct data similar to Anscombe's quartet. By using this technique, you can create your own data sets that share descriptive statistics but look very different graphically.

To be fair, the technique I've presented does not enable you to reproduce Anscombe's quartet in its entirety. My data share a linear fit and sample correlation, whereas Anscombe's data share seven statistics!

Anscombe was a pioneer (along with Tukey) in using computers to assist statistical analyses. He was also enamored with the APL language. He introduced computers into the Yale statistics department in the mid-1960s. Because he published his quartet in 1973, it is possible that he used computers to assist his creation of the Anscombe quartet. However he did it, Anscombe's quartet is truly remarkable!

You can download the SAS program that creates the results in this article.


The post Efficient evaluation of a quadratic form appeared first on The DO Loop.

x` A x

This computation is straightforward in a matrix language such as SAS/IML. However, some computations in statistics require that you evaluate a quadratic form that looks like

x` A^{-1} x

where A is symmetric and positive definite. Typically, you know A, but not the inverse of A. This article shows how to compute both kinds of quadratic forms efficiently.

For multivariate polynomials, you can represent the purely quadratic terms by a symmetric matrix, A. The polynomial is q(x) = x` A x, where x is a d x 1 vector and A is a d x d symmetric matrix. For example,
if A is the matrix {9 -2, -2 5} and x = {x1, x2} is a column vector, the expression x` A x
equals the second degree polynomial
q(x1, x2) = 9*x1^{2} - 4*x1 x2 + 5*x2^{2}. A contour plot of this polynomial is shown below.

Probably the most familiar quadratic form is the squared Euclidean distance. If you let A be the d-dimensional identity matrix, then the squared Euclidean distance of a vector x from the origin is x` I_{d} x = x` x = Σ_{i} x_{i}^{2}. You can obtain a weighted squared distance by using a diagonal matrix that has positive values. For example, if W = diag({2, 4}), then x` W x = 2*x1^{2} + 4*x2^{2}. If you add in off-diagonal elements, you get cross terms in the polynomial.
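To make the definitions concrete, here is a small pure-Python sketch (not part of the original SAS program; the function name is my own) that evaluates x` A x by direct summation and shows how the identity matrix and a diagonal weight matrix give the squared and weighted squared Euclidean distances:

```python
def quad_form(A, x):
    """Evaluate q(x) = x' A x, where A is a square matrix (list of lists)."""
    d = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(d) for j in range(d))

A = [[9, -2], [-2, 5]]     # q(x1,x2) = 9*x1^2 - 4*x1*x2 + 5*x2^2
x = [2.0, 1.0]
print(quad_form(A, x))     # 9*4 - 4*2 + 5*1 = 33.0

I = [[1, 0], [0, 1]]       # identity -> squared Euclidean distance
W = [[2, 0], [0, 4]]       # diagonal -> weighted squared distance
print(quad_form(I, x))     # 2^2 + 1^2 = 5.0
print(quad_form(W, x))     # 2*4 + 4*1 = 12.0
```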

If the matrix A is dense, then you can use matrix multiplication to evaluate the quadratic form. The following symmetric 3 x 3 matrix defines a quadratic polynomial in 3 variables. The multiplication evaluates the polynomial at (x1, x2, x3) = (-1, 2, 0.5).

proc iml;
/* q(x1,x2,x3) = 4*x1**2 + 6*x2**2 + 9*x3**2 +
                 2*3*x1*x2 + 2*2*x1*x3 + 2*1*x2*x3 */
A = {4 3 2,
     3 6 1,
     2 1 9};
x = {-1, 2, 0.5};
q1 = x`*A*x;
print q1;

When you are dealing with large matrices, always remember that you should never explicitly form a large diagonal matrix. Multiplying with a large diagonal matrix is a waste of time and memory. Instead, you can use elementwise multiplication to evaluate the quadratic form more efficiently:

w = {4, 6, 9};
q2 = x`*(w#x);     /* more efficient than q = x`*diag(w)*x */
print q2;

In statistics, the matrix is often symmetric positive definite (SPD). The matrix A might be a covariance matrix (for a nondegenerate system), the inverse of a covariance matrix, or the Hessian evaluated at the minimum of a function.
(Recall that the inverse of a symmetric positive definite (SPD) matrix is SPD.)
An important example is the squared Mahalanobis distance x` Σ^{-1} x, which is a quadratic form.

As I have previously written, you can use a trick in linear algebra to efficiently compute the Mahalanobis distance. The trick is to compute the Cholesky decomposition of the SPD matrix. I'll use a large Toeplitz matrix, which is guaranteed to be symmetric and positive definite, to demonstrate the technique. The function EvalSPDQuadForm evaluates a quadratic form defined by the SPD matrix A at the coordinates given by x:

/* Evaluate quadratic form q = x`*A*x, where A is symmetric positive definite.
   Let A = U`*U be the Cholesky decomposition of A.
   Then q = x`(U`*U)x = (U*x)`(U*x).
   So define z = U*x and compute q = z`*z */
start EvalSPDQuadForm(A, x);
   U = root(A);         /* Cholesky root */
   z = U*x;
   return (z` * z);     /* dot product of vectors */
finish;

/* Run on a large example */
call randseed(123);
N = 1000;
A = toeplitz(N:1);
x = randfun(N, "Normal");

q3 = EvalSPDQuadForm(A, x);   /* efficient */
qCheck = x`*A*x;              /* check computation by using a slower method */
print q3 qCheck;

You can use a similar trick to evaluate the quadratic form x` A^{-1} x.
I previously used this trick to evaluate the Mahalanobis distance efficiently. It combines a Cholesky decomposition (the ROOT function in SAS/IML) and the TRISOLV function for solving triangular systems.

/* Evaluate quadratic form x`*inv(A)*x, where A is symmetric positive definite.
   Let w be the solution of A*w = x and let A = U`*U be the Cholesky
   decomposition. To compute q = x` * inv(A) * x = x` * w, you need to
   solve for w. Because (U`*U)*w = x,
   first solve the triangular system U` z = x for z,
   then solve the triangular system U w = z for w */
start EvalInvQuadForm(A, x);
   U = root(A);              /* Cholesky root */
   z = trisolv(2, U, x);     /* solve triangular system U` z = x */
   w = trisolv(1, U, z);     /* solve triangular system U  w = z */
   return (x` * w);          /* dot product of vectors */
finish;

/* test the function */
q4 = EvalInvQuadForm(A, x);   /* efficient */
qCheck = x`*Solve(A,x);       /* check computation by using a slower method */
print q4 qCheck;
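The same two tricks are easy to express in other languages. The following pure-Python sketch (a toy reimplementation, not the SAS/IML code; a production program would call a linear algebra library) hand-rolls the Cholesky factor and the two triangular solves for a small 3 x 3 SPD matrix:

```python
import math

def cholesky_upper(A):
    """Return upper-triangular U with A = U^T U (A must be SPD)."""
    n = len(A)
    U = [[0.0]*n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = A[i][j] - sum(U[k][i]*U[k][j] for k in range(i))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                U[i][j] = math.sqrt(s)
            else:
                U[i][j] = s / U[i][i]
    return U

def forward_solve(U, b):
    """Solve U^T z = b (U^T is lower triangular)."""
    n = len(b)
    z = [0.0]*n
    for i in range(n):
        z[i] = (b[i] - sum(U[k][i]*z[k] for k in range(i))) / U[i][i]
    return z

def back_solve(U, b):
    """Solve U w = b (U is upper triangular)."""
    n = len(b)
    w = [0.0]*n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(U[i][k]*w[k] for k in range(i+1, n))) / U[i][i]
    return w

def spd_quad_form(A, x):
    """x' A x = (U x)'(U x), where A = U'U."""
    U = cholesky_upper(A)
    z = [sum(U[i][k]*x[k] for k in range(len(x))) for i in range(len(x))]
    return sum(zi*zi for zi in z)

def inv_quad_form(A, x):
    """x' inv(A) x = x' w, where (U'U) w = x: two triangular solves."""
    U = cholesky_upper(A)
    w = back_solve(U, forward_solve(U, x))
    return sum(xi*wi for xi, wi in zip(x, w))

A = [[4.0, 3.0, 2.0], [3.0, 6.0, 1.0], [2.0, 1.0, 9.0]]
x = [-1.0, 2.0, 0.5]
print(spd_quad_form(A, x))   # equals x' A x = 18.25
print(inv_quad_form(A, x))   # equals x' inv(A) x
```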

You might wonder how much time is saved by using the Cholesky root and triangular solvers, rather than by computing the general solution by using the SOLVE function. The following graph compares the average time to evaluate the same quadratic form by using both methods. You can see that for large matrices, the ROOT-TRISOLV method is many times faster than the straightforward method that uses SOLVE.

In summary, you can use several tricks in numerical linear algebra to efficiently evaluate a quadratic form, which is a multivariate quadratic polynomial that contains only second-degree terms. These techniques can greatly improve the speed of your computational programs.


The post 4 ways to compute an SSCP matrix appeared first on The DO Loop.

If you are performing a least squares regression of "long" data (many observations but few variables), forming the SSCP matrix consumes most of the computational effort. In fact, the PROC REG documentation points out that for long data "you can save 99% of the CPU time by reusing the SSCP matrix rather than recomputing it." That is because the SSCP matrix for an n x p data matrix is a symmetric p x p matrix. When n ≫ p, forming the SSCP matrix requires computing with a lot of rows. After it is formed, it is relatively simple to solve a p x p linear system.

Conceptually, the simplest way to compute the SSCP matrix is by multiplying the matrices in an efficient manner. This is what the SAS/IML matrix language does. It recognizes that X`*X is a special kind of matrix multiplication and uses an efficient algorithm to form the product. However, this approach requires that you be able to hold the entire data matrix in RAM, which might not be possible when you have billions of rows.

The following SAS DATA Step creates a data matrix that contains 426 rows and 10 variables. The PROC IML step reads the data into a matrix and forms the SSCP matrix:

/* remove any rows that contain a missing value:
   https://blogs.sas.com/content/iml/2015/02/23/complete-cases.html */
data cars;
   set sashelp.cars;
   if not nmiss(of _numeric_);
run;

proc iml;
use cars;  read all var _NUM_ into X[c=varNames];  close;
/* If you want, you can add an intercept column: X = j(nrow(X),1,1) || X; */
n = nrow(X);  p = ncol(X);

SSCP = X`*X;        /* 1. Matrix multiplication */
print n p, SSCP;

Although the data has 426 rows, it only has 10 variables. Consequently, the SSCP matrix is a small 10 x 10 symmetric matrix. (To simplify the code, I did not add an intercept column, so technically this SSCP matrix would be used for a no-intercept regression.)

The (i,j)th element of the SSCP matrix is the inner product of the i_th column and the j_th column. In general, Level 1 BLAS computations (inner products) are not as efficient as Level 3 computations (matrix products). However, if you have the data variables (columns) in an array of vectors, you can compute the p(p+1)/2 elements of the SSCP matrix by using the following loops over columns. This computation assumes that you can hold at least two columns in memory at the same time:

/* 2. Compute SSCP[i,j] as an inner product of the i_th and j_th columns */
Y = j(p, p, .);
do i = 1 to p;
   u = X[,i];             /* u = i_th column */
   do j = 1 to i;
      v = X[,j];          /* v = j_th column */
      Y[i,j] = u` * v;
      Y[j,i] = Y[i,j];    /* assign symmetric element */
   end;
end;

The third approach is to compute the SSCP matrix as a sum of outer products of rows.
Before I came to SAS, I considered the outer-product method to be inferior to the other two. After all, you need to form n matrices (each p x p) and add them together, which does not seem like an efficient scheme. However, when I came to SAS I learned that this method is extremely efficient for dealing with Big Data because *you never need to keep more than one row of data in memory!* A SAS procedure like PROC REG has to read the data anyway, so as it reads each row, it also forms the outer product and updates the SSCP. When it finishes reading the data, the SSCP is fully formed and ready to solve!
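Here is a minimal Python sketch of the outer-product method on a made-up 4 x 3 data matrix (illustrative only). The point is that the loop touches one row at a time, so only the p x p accumulator must stay in memory:

```python
# toy "data set": n rows, p columns
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 10.0],
     [1.0, 0.0, 2.0]]
n, p = len(X), len(X[0])

# stream one row at a time; in practice each row would be read from disk
SSCP = [[0.0]*p for _ in range(p)]
for row in X:
    for i in range(p):           # add the rank-1 outer product row' * row
        for j in range(p):
            SSCP[i][j] += row[i] * row[j]

# check against the direct computation X^T X
direct = [[sum(X[r][i]*X[r][j] for r in range(n)) for j in range(p)]
          for i in range(p)]
print(SSCP == direct)   # -> True
```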

I've recently been working on parallel processing, and the outer-product SSCP is ideally suited for reading and processing data in parallel. Suppose you have a grid of G computational nodes, each holding part of a massive data set. If you want to perform a linear regression on the data, each node can read its local data and form the corresponding SSCP matrix. To get the full SSCP matrix, you merely need to add the G SSCP matrices together, which are relatively small and thus cheap to pass between nodes. Consequently, any algorithm that uses the SSCP matrix can greatly benefit from a parallel environment when operating on Big Data. You can also use this scheme for streaming data.

For completeness, here is what the outer-product method looks like in SAS/IML:

/* 3. Compute SSCP as a sum of rank-1 outer products of rows */
Z = j(p, p, 0);
do i = 1 to n;
   w = X[i,];   /* Note: you could read the i_th row; no need to have all rows in memory */
   Z = Z + w` * w;
end;

For simplicity, the previous algorithm works on one row at a time. However, it can be more efficient to process multiple rows. You can easily buffer a block of k rows and perform an outer product of the partial data matrix. The value of k depends on the number of variables in the data, but typically the block size, k, is dozens or hundreds. In a procedure that reads a data set (either serially or in parallel), each operation would read k observations except, possibly, the last block, which would read the remaining observations. The following SAS/IML statements loop over blocks of k=10 observations at a time. You can use the FLOOR-MOD trick to find the total number of blocks to process, assuming you know the total number of observations:

/* 4. Compute SSCP as the sum of rank-k outer products of k rows */
/* number of blocks:
   https://blogs.sas.com/content/iml/2019/04/08/floor-mod-trick-items-to-groups.html */
k = 10;                                       /* block size */
numBlocks = floor(n / k) + (mod(n, k) > 0);   /* FLOOR-MOD trick */
W = j(p, p, 0);
do i = 1 to numBlocks;
   idx = (1 + k*(i-1)) : min(k*i, n);   /* indices of the next block of rows */
   A = X[idx,];                         /* partial data matrix: k x p */
   W = W + A` * A;
end;

All computations result in the same SSCP matrix. The following statements compute the sum of squares of the differences between elements of X`*X (as computed by using matrix multiplication) and the other methods. The differences are zero, up to machine precision.

diff = ssq(SSCP - Y) || ssq(SSCP - Z) || ssq(SSCP - W);
if max(diff) < 1e-12 then
   print "The SSCP matrices are equivalent.";
print diff[c={Y Z W}];

In summary, there are several ways to compute a sum of squares and crossproducts (SSCP) matrix. If you can hold the entire data in memory, a matrix multiplication is very efficient. If you can hold two variables of the data at a time, you can use the inner-product method to compute individual cells of the SSCP. Lastly, you can process one row at a time (or a block of rows) and use outer products to form the SSCP matrix without ever having to hold large amounts of data in RAM. This last method is good for Big Data, streaming data, and parallel processing of data.


The post Use the FLOOR-MOD trick to allocate items to groups appeared first on The DO Loop.

The problem of allocating a discrete number of items to groups comes up often in computer programming.
I call the solution *the FLOOR-MOD trick*
because you can use the FLOOR function to compute the base number in each group and use the MOD function to compute the number of remainders.
Although the problem is elementary, this article describes two programming tips that you can use in a vectorized computer language like SAS/IML:

- You can compute all the group sizes in one statement. You do not need to use a loop to assign the remaining items.
- If you enumerate the patients, you can directly assign consecutive numbers to each group. There is no need to deal with the remainders at the end of the process.

Although I use patients and treatment groups for the example, I encountered this problem as part of a parallel programming problem where I wanted to assign B tasks evenly among k computational resources. In my example, there were B = 14 tasks and k = 3 resources.

Most computer languages support the FLOOR function for integer division and the MOD function for computing the remainder. The FLOOR-MOD trick works like this. If you want to divide B items into k groups, let A be the integer part of B / k and let C be the remainder, C = B - A*k. Then B = A*k + C, where C < k. In computer code, A = FLOOR(B/k) and C = MOD(B, k).

There are many ways to distribute the remaining items, but for simplicity let's give an extra item to each of the first C groups. For example, if B = 14 and k = 3, then A = floor(14/3) = 4 and C = 2. To allocate 14 items to 3 groups, give 4 items to all groups, and give an extra item to the first C=2 groups.
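In a general-purpose language, the FLOOR-MOD trick is a one-liner as well. This pure-Python sketch (illustrative; the function name is my own) reproduces the B = 14, k = 3 example:

```python
def assign_items(B, k):
    """Sizes of k groups that split B items as evenly as possible."""
    A, C = B // k, B % k                      # A = FLOOR(B/k), C = MOD(B,k)
    return [A + (i < C) for i in range(k)]    # first C groups get one extra

print(assign_items(14, 3))   # -> [5, 5, 4]
```

The Boolean expression `(i < C)` plays the same role as the vectorized comparison `i <= mod(B, k)` in the SAS/IML version (the index is 0-based here, hence `<` instead of `<=`).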

In a vector programming language, you can assign the items to each group without doing any looping, as shown in the following SAS/IML program:

/* Suppose you have B tasks that you want to divide as evenly as possible
   among k groups. How many tasks should you assign to the i_th group? */
proc iml;
/* B = total number of items or tasks (B >= 0, scalar)
   k = total number of groups or workers (k > 0, scalar)
   i = scalar or vector that specifies the group(s), max(i) <= k
   Return the number of items to allocate to the i_th group */
start AssignItems(B, k, i);
   n = floor(B / k) + (i <= mod(B, k));   /* the FLOOR-MOD trick */
   return(n);
finish;

/* Test the code. Assign 14 tasks to 3 groups. */
nItems = 14;  nGroups = 3;
idx = T(1:nGroups);          /* get number of items for all groups */
Group = char(idx);           /* ID label for each group */
n = AssignItems(nItems, nGroups, idx);
print n[c={"Num Items"} r=Group L="Items per Group"];

The AssignItems function is a "one-liner."
The interesting part of the AssignItems function is the binary expression `i <= mod(B, k)`, which is valid even when `i` is a vector. In this example, the expression evaluates to the vector {1, 1, 0}, which assigns an extra item to each of the first two groups.

A related problem is figuring out which items get sent to which groups. In the example where B=14 and k=3, I want to put items 1-5 in the first group, items 6-10 in the second group, and items 11-14 in the last group. There is a cool programming trick, called *the CUSUM-LAG trick*, which enables you to find these indices. The following function is copied from my article on the CUSUM-LAG trick. After you find the number of items in each group, you can use the ByGroupIndices function to find the item numbers in each group:

/* Return a k x 2 matrix that contains the first and last elements for
   k groups that have sizes s[1], s[2], ..., s[k].
   The i_th row contains the first and last index for the i_th group. */
start ByGroupIndices( s );
   size = colvec(s);             /* make sure you have a column vector */
   endIdx = cusum(size);         /* locations of ending index */
   beginIdx = 1 + lag(endIdx);   /* begin after each ending index ... */
   beginIdx[1] = 1;              /* ... and at 1 */
   return ( beginIdx || endIdx );
finish;

/* apply the CUSUM-LAG trick to the item allocation problem */
GroupInfo = n || ByGroupIndices(n);
print GroupInfo[r=Group c={"NumItems" "FirstItem" "LastItem"}];
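For comparison, here is the CUSUM-LAG computation in pure Python (an illustrative sketch, not the SAS/IML code), where `itertools.accumulate` plays the role of the CUSUM function and a shifted copy of the ending indices plays the role of the LAG function:

```python
from itertools import accumulate

def by_group_indices(sizes):
    """Return 1-based (first, last) index pairs for groups of the given sizes."""
    last = list(accumulate(sizes))               # CUSUM: ending index of each group
    first = [1] + [e + 1 for e in last[:-1]]     # LAG: begin after each ending index
    return list(zip(first, last))

print(by_group_indices([5, 5, 4]))   # -> [(1, 5), (6, 10), (11, 14)]
```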

The table shows the number of items allocated to each group, as well as the item indices in each group. You can use the FLOOR-MOD trick to get the number of items and the CUSUM-LAG trick to get the indices. In a vector language, you can implement the entire computation without using any loops.


The post Convergence in mixed models: When the estimated G matrix is not positive definite appeared first on The DO Loop.

For mixed models, several problems can occur if you have a misspecified model. One issue
results in the following note in the SAS log:

`NOTE: Estimated G matrix is not positive definite.`

This article describes what the note means, why it might occur, and what to do about it. If you encounter this note during a BY-group analysis or simulation study, this article shows how to identify the samples that have the problem.

There are two excellent references that discuss issues related to convergence in the SAS mixed model procedures:

- Kiernan, Tao, and Gibbs (2012), "Tips and Strategies for Mixed Modeling with SAS/STAT Procedures"
- Kiernan (2018), "Insights into Using the GLIMMIX Procedure to Model Categorical Outcomes with Random Effects"

Before we discuss convergence, let's review what the G matrix is and why it needs to be positive definite.
The matrix formulation of a mixed model is

Y = X β + Z γ + ε

where β is a vector of fixed-effect parameters. The random effects are assumed to be random realizations from multivariate normal distributions. In particular, γ ~ MVN(0, G) and ε ~ MVN(0, R), where G and R are covariance matrices. The variance-covariance matrix G is often used to specify subject-specific effects, whereas R specifies residual effects. A goal of mixed models is to specify the structure of the G and/or R matrices and estimate the variance-covariance parameters.

Because G is a covariance matrix, G must be positive semidefinite. A nondegenerate covariance matrix is positive definite. However, estimates of G might not have this property. SAS alerts you if the estimate is not positive definite.
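A standard numerical way to test positive definiteness is to attempt a Cholesky factorization, which succeeds if and only if the matrix is positive definite. The following pure-Python sketch is illustrative only (SAS performs this check internally); it applies the test to a valid 2 x 2 covariance matrix and to a symmetric matrix whose implied correlation exceeds 1:

```python
import math

def is_positive_definite(A, tol=1e-12):
    """Attempt a Cholesky factorization; it fails iff A is not positive definite."""
    n = len(A)
    L = [[0.0]*n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k]*L[j][k] for k in range(j))
            if i == j:
                if s <= tol:          # zero or negative pivot: not PD
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

G1 = [[2.0, 0.5], [0.5, 1.0]]   # a valid covariance matrix
G2 = [[1.0, 2.0], [2.0, 1.0]]   # off-diagonal exceeds the variances: not PD
print(is_positive_definite(G1), is_positive_definite(G2))   # -> True False
```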

As stated in Kiernan (2018, p. ), "It is important that you do not ignore this message."

A SAS Usage Note states that the PROC MIXED message means that "one or more variance components on the RANDOM statement is/are estimated to be zero and could/should be removed from the model." Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) describe several reasons why an estimated G matrix can fail to be positive definite. Two of the more common reasons include:

- There is not enough variation in the response. After controlling for the fixed effects, there isn't any (or much) variation for the random effects. You might want to remove the random effect from the model. (KTG, p. 9)
- The model is misspecified. You might need to modify the model or change the covariance structure to use fewer parameters. (Kiernan, p. 18)

If you encounter the note `"Estimated G matrix is not positive definite"` for real data, you should modify the model or collect more data. In a simulation study, however, there might be simulated samples that do not fit the model *even when the data is generated from the model!* This can happen for very small data sets and for studies in which the variance components are very small. In either case, you might want to identify the samples for which the model fails to converge, either to exclude them from the analysis or to analyze them by using a different model. The same situation can occur for a few BY groups during a traditional BY-group analysis.

The following SAS program illustrates how to detect the samples for which the estimated G matrix is not positive definite. The 'SP' data are from an example in the PROC MIXED documentation. The data set named 'ByData' is constructed for this blog post. It contains four copies of the real data, along with an ID variable with values 1, 2, 3, and 4. For the first copy, the response variable is set to 0, which means that there is no variance in the response. For the third copy, the response variable is simulated from a normal distribution. It has variance, but does not conform to the model. The call to PROC MIXED fits the same random-effects model to all four samples:

/* Example from PROC MIXED documentation: https://bit.ly/2YEmpGw */
data sp;
input Block A B Y @@;
datalines;
1 1 1 56  1 1 2 41
1 2 1 50  1 2 2 36
1 3 1 39  1 3 2 35
2 1 1 30  2 1 2 25
2 2 1 36  2 2 2 28
2 3 1 33  2 3 2 30
3 1 1 32  3 1 2 24
3 2 1 31  3 2 2 27
3 3 1 15  3 3 2 19
4 1 1 30  4 1 2 25
4 2 1 35  4 2 2 30
4 3 1 17  4 3 2 18
;

/* concatenate four copies of the data, but modify Y for two copies */
data ByData;
call streaminit(1);
set sp(in=a1) sp(in=a2) sp(in=a3) sp(in=a4);
SampleID = a1 + 2*a2 + 3*a3 + 4*a4;
if a1 then Y = 0;                        /* response is constant (no variation) */
if a3 then Y = rand("Normal", 31, 10);   /* response is indep of factors */
run;

/* suppress ODS output:
   https://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations.html */
%macro ODSOff(); ods graphics off; ods exclude all; ods noresults; %mend;
%macro ODSOn();  ods graphics on;  ods exclude none; ods results;  %mend;

%ODSOff
proc mixed data=ByData;
   by SampleID;
   class A B Block;
   model Y = A B A*B;
   random Block A*Block;
   ods output ConvergenceStatus=ConvergeStatus;
run;
%ODSOn

The SAS log displays various notes about convergence. The second and fourth samples (the doc example) converged without difficulty. The log shows a WARNING for the first group. The third group displays the note `Estimated G matrix is not positive definite.`

NOTE: An infinite likelihood is assumed in iteration 0 because of a nonpositive
      residual variance estimate.
WARNING: Stopped because of infinite likelihood.
NOTE: The above message was for the following BY group: SampleID=1
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group: SampleID=2
NOTE: Convergence criteria met.
NOTE: Estimated G matrix is not positive definite.
NOTE: The above message was for the following BY group: SampleID=3
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group: SampleID=4

Notice that the ODS OUTPUT statement writes the ConvergenceStatus table to a SAS data set called ConvergeStatus; the table for each BY group is appended to the data set. The result of the PROC PRINT statement shows the columns of the table for each of the four groups:

proc print data=ConvergeStatus; run;

The SampleID column is the value of the BY-group variable. The Reason column explains why the optimization process terminated. The Status column is 0 if the model converged and nonzero otherwise. Note that the third model converged, even though the G matrix was not positive definite!

To detect nonpositive definite matrices, you need to look at the pdG column. The pdG column indicates which models had a positive definite G matrix (pdG=1) or did not (pdG=0). Although I do not discuss it in this article, the pdH column is an indicator variable that has value 0 if the SAS log displays the message
`NOTE: Convergence criteria met but final hessian is not positive definite.`

The pdG column tells you which models did not have a positive definite variance matrix. You can merge the ConvergenceStatus table with the original data and exclude (or keep) the samples that did not converge or that had invalid variance estimates, as shown in the following DATA step:

/* keep samples that converge and for which the G matrix and Hessian
   are positive definite */
data GoodSamples;
   merge ByData ConvergeStatus;
   by SampleID;
   if Status=0 & pdG & pdH;
run;

If you run PROC MIXED on the samples in the GoodSamples data set, all samples will converge without any scary-looking notes.

In summary,
this article discusses the note `Estimated G matrix is not positive definite`, which sometimes occurs when using PROC MIXED or PROC GLIMMIX in SAS. You should not ignore this note. This note can indicate that the model is misspecified. This article shows how you can detect this problem programmatically during a simulation study or a BY-group analysis. You can use the ConvergenceStatus table to identify the BY groups for which the problem occurs.
Kiernan, Tao, and Gibbs (2012) and
Kiernan (2018) provide more insight into this note and other problems that you might encounter when fitting mixed models.

The post Convergence in mixed models: When the estimated G matrix is not positive definite appeared first on The DO Loop.

The post Matrix operations and BY groups appeared first on The DO Loop.

I previously showed that you can perform BY-group processing in SAS/IML by using the UNIQUEBY technique, so this article uses the UNIQUE-LOC technique. The statistical application is simulating clusters of data. If you have a SAS data set that contains the centers and covariance matrices for several groups of observations, you can read that information into SAS/IML and simulate new observations for each group by using a multivariate normal distribution.

A BY-group analysis in SAS/IML usually starts with a SAS data set that contains a bunch of covariance or correlation matrices. A simple example is a correlation analysis of each species of flower in Fisher's iris data set. The BY-group variable is the species of iris: Setosa, Versicolor, or Virginica. The variables are measurements (in mm) of the sepals and petals of 150 flowers, 50 from each species. A panel of scatter plots for the iris data is shown to the right. You can see that the three species appear to be clustered. From the shapes of the clusters, you might decide to model each cluster by using a multivariate normal distribution.

You can use the OUTP= and COV options in PROC CORR to output mean and covariance statistics for each group, as follows:

proc corr data=sashelp.iris outp=CorrOut COV noprint;
   by Species;
   var Petal: Sepal:;
run;

proc print data=CorrOut(where=(_TYPE_ in ("N", "MEAN", "COV"))) noobs;
   where Species="Setosa";          /* just view information for one group */
   by Species _Type_ notsorted;
   var _NAME_ Petal: Sepal:;
run;

The statistics for one of the groups (Species='Setosa') are shown. The number of observations in the group (N) is actually a scalar value, but it was replicated to fit into a rectangular data set.

This section reads the sample size, mean vector, and covariance matrix for all groups. A WHERE clause selects only the observations of interest:

/* Read in N, Mean, and Cov for each species. Use to create a parametric
   bootstrap by simulating N[i] observations from a MVN(Mean[i], Cov[i]) distribution */
proc iml;
varNames = {'PetalLength' 'PetalWidth' 'SepalLength' 'SepalWidth'};
use CorrOut where (_TYPE_="N" & Species^=" ");      /* N */
read all var varNames into mN[rowname=Species];     /* read for all groups */
print mN[c=varNames];

use CorrOut where (_TYPE_="MEAN" & Species^=" ");   /* mean */
read all var varNames into mMean[rowname=Species];  /* read for all groups */
print mMean[c=varNames];

use CorrOut where (_TYPE_="COV" & Species^=" ");    /* covariance */
read all var varNames into mCov[rowname=Species];   /* read for all groups */
close;
print mCov[c=varNames];

The output (not shown) shows that the matrices mN, mMean, and mCov contain the vertical concatenation (for all groups) of the sample size, mean vectors, and covariance matrices, respectively.

The grouping variable is Species.
You can use the UNIQUE function to get the unique (sorted) values of the groups. You can then iterate over the unique values and use the LOC function to extract only the rows of the matrices that correspond to the *i*th group. What you do with that information depends on your application. In the following program, the information for each group is used to create a random sample from a multivariate normal distribution that has the same size, mean, and covariance as the *i*th group:

/* Goal: Write random MVN values in X to a data set */
X = j(1, ncol(varNames), .);           /* X variables will be numeric */
Spec = BlankStr(nleng(Species));       /* and Spec will be character */
create SimOut from X[rowname=Spec colname=varNames];   /* open for writing */

/* The BY-group variable is Species */
call randseed(12345);
u = unique(Species);                   /* get unique values for groups */
do i = 1 to ncol(u);                   /* iterate over unique values */
   idx = loc(Species=u[i]);            /* rows for the i_th group */
   N = mN[i, 1];                       /* extract scalar from i_th group */
   mu = mMean[i,];                     /* extract vector from i_th group */
   Sigma = mCov[idx,];                 /* extract matrix from i_th group */
   /* The parameters for this group are now in N, mu, and Sigma.
      Do something with these values. */
   X = RandNormal(N, mu, Sigma);       /* simulate obs for the i_th group */
   X = round(X);                       /* measure to nearest mm */
   Spec = j(N, 1, u[i]);               /* ID for this sample */
   append from X[rowname=Spec];
end;
close SimOut;
quit;

ods graphics / attrpriority=none;
title "Parametric Bootstrap Simulation of Iris Data";
proc sgscatter data=SimOut(rename=(Spec=Species));
   matrix Petal: Sepal: / group=Species;
run;

The simulation has generated a new set of clustered data. If you compare the simulated data with the original, you will notice many statistical similarities.
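The UNIQUE-LOC idiom itself is not specific to SAS/IML. Here is a pure-Python sketch of the same pattern with hypothetical group labels and values: find the unique group values, locate the rows of each group, and process each group separately.

```python
# BY-group processing with the UNIQUE-LOC pattern:
# 1. find the unique group values (UNIQUE),
# 2. locate the row indices for each group (LOC),
# 3. process each group's rows separately.
species = ["Setosa", "Versicolor", "Setosa", "Virginica", "Versicolor"]
values  = [50, 59, 48, 65, 57]

group_means = {}
for g in sorted(set(species)):                           # UNIQUE: sorted unique values
    idx = [i for i, s in enumerate(species) if s == g]   # LOC: rows for this group
    rows = [values[i] for i in idx]                      # extract the group's data
    group_means[g] = sum(rows) / len(rows)               # "do something" per group
```

In the SAS/IML program, the per-group step simulates multivariate normal data; here it is a simple mean, but the iteration structure is identical.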

Although the main purpose of this article is to discuss BY-group processing in SAS/IML, I want to point out that the simulation in this article is an example of the parametric bootstrap.
*Simulating Data with SAS* (Wicklin, 2013) states that "the parametric bootstrap is nothing more than the process of fitting a model
distribution to the data and simulating data from the fitted model." That is what happens in this program. The sample means and covariance matrices are used as parameters to generate new synthetic observations. Thus, the parametric bootstrap technique is really a form of simulation where the parameters for the simulation are obtained from the data.

In conclusion, sometimes you have many matrices in a SAS data set, each identified by a categorical variable. You can perform "BY-group processing" in SAS/IML by reading in all the matrices into a big matrix and then use the UNIQUE-LOC technique to iterate over each matrix.


The post How to simulate multivariate outliers appeared first on The DO Loop.

When you simulate data, you know the data-generating distribution.
In general, an outlier is an observation that has a small probability of being randomly generated.
Consequently, outliers are usually generated by using a different distribution (called the *contaminating distribution*) or by using knowledge of the data-generating distribution to construct improbable data values.

A canonical example of adding an improbable value is to add an outlier to normal data by creating a data point that is *k* standard deviations from the mean. (For example, *k* = 5, 6, or 10.)
For multivariate data, an outlier does not always have extreme values in all coordinates. However, the idea of an outlier as "far from the center" does generalize to higher dimensions. The following technique generates outliers for multivariate normal data:

- Generate uncorrelated, standardized, normal data.
- Generate outliers by adding points whose distance to the origin is *k* units. Because you are using standardized coordinates, one unit equals one standard deviation.
- Use a linear transformation (called the *Cholesky transformation*) to produce multivariate normal data with a desired correlation structure.

In short, you use the Cholesky transformation to transform standardized uncorrelated data into scaled correlated data.
The remarkable thing is that you can specify the covariance and then *directly* compute the Cholesky transformation that will result in that covariance. This is a special property of the multivariate normal distribution. For a discussion about how to perform similar computations for other multivariate distributions, see Chapter 9 in Simulating Data with SAS (Wicklin 2013).
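Before the SAS/IML example, here is a small pure-Python check of this fact for the 2×2 case used below (standard deviations 4 and 9, correlation 0.9, so Σ = {16 32.4, 32.4 81}). The Cholesky factor is computed directly from Σ and reproduces it exactly. (SAS/IML's ROOT function returns the upper-triangular factor U with U`U = Σ and multiplies row vectors on the right; the lower-triangular convention below is equivalent.)

```python
from math import sqrt, isclose

# Covariance matrix Sigma = D*R*D for std devs (4, 9) and correlation 0.9
s11, s12, s22 = 16.0, 32.4, 81.0

# Cholesky factor: lower-triangular L with L*L^T = Sigma (2x2 case, by hand)
l11 = sqrt(s11)               # 4.0
l21 = s12 / l11               # 8.1
l22 = sqrt(s22 - l21**2)      # sqrt(15.39)

# A standardized point z is transformed to the correlated point L*z,
# so the covariance of the transformed data is L*L^T = Sigma
assert isclose(l11 * l11, s11)
assert isclose(l21 * l11, s12)
assert isclose(l21 * l21 + l22 * l22, s22)
```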

Let's illustrate this method by using a two-dimensional example. The following SAS/IML program generates 200 standardized uncorrelated data points. In the standardized coordinate system, the Euclidean distance and the Mahalanobis distance are the same. It is therefore easy to generate an outlier: you can generate any point whose distance to the origin is larger than some cutoff value. The following program generates outliers that are 5 units from the origin:

ods graphics / width=400px height=400px attrpriority=none;
proc iml;
/* generate standardized uncorrelated data */
call randseed(123);
N = 200;
x = randfun(N//2, "Normal");       /* 2-dimensional data. X ~ MVN(0, I(2)) */

/* covariance matrix is the product of diagonal (scaling) and correlation matrices. See
   https://blogs.sas.com/content/iml/2010/12/10/converting-between-correlation-and-covariance-matrices.html */
R = {1   0.9,                      /* correlation matrix */
     0.9 1  };
D = {4 9};                         /* standard deviations */
Sigma = corr2cov(R, D);            /* covariance matrix Sigma = D*R*D */
print Sigma;

/* U is the Cholesky transformation that scales and correlates the data */
U = root(Sigma);

/* add a few unusual points (outliers) */
pi = constant('pi');
t = T(0:11) * pi / 6;              /* evenly spaced points in [0, 2*pi) */
outliers = 5#cos(t) || 5#sin(t);   /* evenly spaced on circle r=5 */
v = x // outliers;                 /* concatenate MVN data and outliers */
w = v*U;                           /* transform from standardized to data coords */
labl = j(N,1," ") // T('0':'11');
Type = j(N,1,"Normal") // j(12,1,"Outliers");
title "Markers on Circle r=5";
title2 "Evenly spaced angles t[i] = i*pi/6";
call scatter(w[,1], w[,2]) grid={x y} datalabel=labl group=Type;

The program simulates standardized uncorrelated data (X ~ MVN(0, I(2))) and then appends 12 points on the circle of radius 5. The Cholesky transformation correlates the data. The correlated data are shown in the scatter plot. The circle of outliers has been transformed into an ellipse of outliers. The original outliers are 5 Euclidean units from the origin; the transformed outliers are 5 units of Mahalanobis distance from the origin, as shown by the following computation:

center = {0 0};
MD = mahalanobis(w, center, Sigma);   /* compute MD based on MVN(0, Sigma) */
obsNum = 1:nrow(w);
title "Mahalanobis Distance of MVN Data and Outliers";
call scatter(obsNum, MD) grid={x y} group=Type;

The same technique enables you to add outliers at any distance to the origin. For example, the following modification of the program adds outliers that are 5, 6, and 10 units from the origin. The *t* parameter determines the angular position of the outlier on the circle:

MD = {5, 6, 10};                   /* MD for outliers */
t = {0, 3, 10} * pi / 6;           /* angular positions */
outliers = MD#cos(t) || MD#sin(t);
v = x // outliers;                 /* concatenate MVN data and outliers */
w = v*U;                           /* transform from standardized to data coords */
labl = j(N,1," ") // {'p1','p2','p3'};
Type = j(N,1,"Normal") // j(nrow(t),1,"Outlier");
call scatter(w[,1], w[,2]) grid={x y} datalabel=labl group=Type;

If you use the Euclidean distance, the second and third outliers (p2 and p3) are closer to the center of the data than p1. However, if you draw the probability ellipses for these data, you will see that p1 is more probable than p2 and p3. This is why the Mahalanobis distance is used for measuring how extreme an outlier is. The Mahalanobis distances of p1, p2, and p3 are 5, 6, and 10, respectively.

In two dimensions, you can use the formula (x(t), y(t)) = r*(cos(t), sin(t)) to specify an observation that is *r* units from the origin. If *t* is chosen randomly, you can obtain a random point on the circle of radius *r*. In higher dimensions, it is not easy to parameterize a sphere, which is the surface of a *d*-dimensional ball. Nevertheless, you can still generate random points on a *d*-dimensional sphere of radius *r* by doing the following:

- Generate a point from the *d*-dimensional multivariate normal distribution: Y ~ MVN(0, I(d)).
- Project the point onto the surface of the *d*-dimensional sphere of radius *r*: Z = r*Y / ||Y||.
- Use the Cholesky transformation to correlate the variables.

In the SAS/IML language, it would look like this:

/* General technique to generate random outliers at distance r */
d = 2;                               /* use any value of d */
MD = {5, 6, 10};                     /* MD for outliers */
Y = randfun(nrow(MD)//d, "Normal");  /* d-dimensional Y ~ MVN(0, I(d)) */
outliers = MD # Y / sqrt(Y[,##]);    /* on surface of spheres of radii MD */
/* then append to data, use Cholesky transform, etc. */

For these particular random outliers, the largest outlier (MD=10) is in the upper-right corner. The others are in the lower-left corner. If you change the random number seed in the program, the outliers will appear in different locations.
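The projection step is language-agnostic. Here is a pure-Python sketch for a single random point in *d* dimensions (the dimension and radius are arbitrary choices for illustration):

```python
import random
from math import sqrt, isclose

random.seed(1)
d, r = 5, 10.0                      # dimension and target distance from origin

# Step 1: generate a point from MVN(0, I(d))
y = [random.gauss(0.0, 1.0) for _ in range(d)]

# Step 2: project onto the sphere of radius r: z = r * y / ||y||
norm = sqrt(sum(v * v for v in y))
z = [r * v / norm for v in y]

# z is a random point at Euclidean distance r from the origin; after the
# Cholesky transformation, it lies r Mahalanobis units from the center
assert isclose(sqrt(sum(v * v for v in z)), r)
```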

In summary, by understanding the geometry of multivariate normal data and the Cholesky transformation, you can manufacture outliers in specific locations or in random locations. In either case, you can control whether you are adding a "near outlier" or an "extreme outlier" by specifying an arbitrary Mahalanobis distance.

You can download the SAS program that generates the graphs in this article.


The post The geometry of multivariate versus univariate outliers appeared first on The DO Loop.

In classical statistics, a univariate outlier is an observation that is far from the sample mean. (Modern statistics use robust statistics to determine outliers; the mean is not a robust statistic.) You might assume that an observation that is extreme in *every* coordinate is also a multivariate outlier, and that is often true. However, the converse is not true: when variables are correlated, you can have a multivariate outlier that is not extreme in *any* coordinate!

The following schematic diagram gives the geometry of multivariate normal data. The middle of the diagram represents the center of a bivariate sample.

- The orange elliptical region indicates a region that contains most of the observations. Because the variables are correlated, the ellipse is tilted relative to the coordinate axes.
- For observations inside the ellipse, their Mahalanobis distance to the sample mean is smaller than some cutoff value. For observations outside the ellipse, their Mahalanobis distance to the sample mean is larger than the cutoff.
- The green rectangles at the left and right indicate regions where the X1 coordinate is far from the X1 mean.
- The blue rectangles at the top and bottom indicate regions where the X2 coordinate is far from the X2 mean.

The diagram displays two observations, labeled "A" and "B":

- The observation "A" is inside the ellipse. Therefore, the Mahalanobis distance from "A" to the center is "small." Accordingly, "A" is not identified as a multivariate outlier. However, notice that "A" is a *univariate* outlier for both the X1 and X2 coordinates!
- The observation "B" is outside the ellipse. Therefore, the Mahalanobis distance from "B" to the center is relatively large. The observation is classified as a multivariate outlier. However, notice that "B" is not a univariate outlier for either the X1 or X2 coordinate; neither coordinate is far from its univariate mean.

The main point is this: An observation can be a multivariate outlier even though none of its coordinate values are extreme. It is the *combination* of values which makes an outlier unusual. In terms of Mahalanobis distance, the diagram illustrates that an observation "A" can have high univariate *z* scores but not have an extremely high Mahalanobis distance. Similarly, an observation "B" can have a higher Mahalanobis distance than "A" even though its *z* scores are relatively small.
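You can see this numerically with a small pure-Python example. The points and correlation below are hypothetical (they are not the exact 'A' and 'B' from the diagram): with correlation 0.9 between standardized coordinates, a point with z-scores of 2 in *both* coordinates lies along the correlation and has a small Mahalanobis distance, while a point with modest z-scores of ±1.5 that goes *against* the correlation has a much larger one.

```python
from math import sqrt

# For standardized coordinates with correlation matrix R = [[1, rho], [rho, 1]],
# the squared Mahalanobis distance of z = (z1, z2) is z' R^{-1} z,
# which simplifies to (z1^2 - 2*rho*z1*z2 + z2^2) / (1 - rho^2)
rho = 0.9

def mahalanobis(z1, z2):
    return sqrt((z1*z1 - 2*rho*z1*z2 + z2*z2) / (1 - rho*rho))

zA = (2.0, 2.0)     # like "A": extreme in both coordinates, along the correlation
zB = (1.5, -1.5)    # like "B": modest coordinates, against the correlation

mdA = mahalanobis(*zA)   # about 2.05
mdB = mahalanobis(*zB)   # about 6.71
```

Even though both z-scores of the first point exceed those of the second, its Mahalanobis distance is only about 2.05, versus about 6.71 for the second point.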

This article was motivated by a question from a SAS customer. In his data, one observation had a large Mahalanobis distance score but relatively small univariate *z* scores. Another observation had large *z* scores but a smaller Mahalanobis distance. He wondered how that was possible. His data contained four variables, but the following two-variable example illustrates his situation:

The blue markers were simulated from a bivariate normal distribution with μ = (0, 0) and covariance matrix Σ = {16 32.4, 32.4 81}. The red markers were added manually. The observation marked 'B' is a multivariate outlier. The Mahalanobis distance (MD) from 'B' to the center of the sample is about 5 units. (The center is approximately at (0,0).) In contrast, the observation marked 'A' is not a multivariate outlier even though it has extreme values for both coordinates. In fact, the MD from 'A' to the center of the sample is about 2.5, or approximately half the MD of 'B'. The coordinates (x1, x2), standardized coordinates (z1, z2), and MD for both points are shown below:

You can connect the Mahalanobis distance to the probability of a multivariate random normal variable. The squared MD for multivariate normal data is distributed according to a chi-square distribution. For bivariate normal data, the probability that an observation is within *t* MD units of the center is 1 - exp(-*t*^{2}/2). Observations like 'A' are not highly unusual: observations that have MD ≥ 2.5 occur with probability exp(-2.5^{2}/2) ≈ 4.4% for the bivariate normal distribution. In contrast, observations like 'B' are extremely rare: observations that have MD ≥ 5 occur with probability exp(-5^{2}/2) ≈ 3.73E-6.
Yes, if you measure in Euclidean distance, 'A' is farther from the center than 'B' is, but the correlation between the variables makes 'A' much more probable. The Mahalanobis distance incorporates the correlation into the calculation of "distance."
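You can verify these probabilities directly from the formula. For bivariate normal data, the squared MD follows a chi-square distribution with 2 degrees of freedom, whose survival function is exp(-*t*²/2):

```python
from math import exp

def prob_md_at_least(t):
    """P(MD >= t) for bivariate normal data: survival function of
    a chi-square(2) distribution evaluated at t**2."""
    return exp(-t**2 / 2)

pA = prob_md_at_least(2.5)   # about 0.044: not very unusual
pB = prob_md_at_least(5.0)   # about 3.7e-6: extremely rare
```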

In summary, things are not always as they appear. For univariate data, an outlier is an extreme observation. It is far from the center of the data. In higher dimensions, we need to account for correlations among variables when we measure distance. The Mahalanobis distance does that, and the examples in this post show that an observation can be "far from the center" (as measured by the Mahalanobis distance) even if none of its individual coordinates are extreme.

The following articles provide more information about Mahalanobis distance and multivariate outliers:

- What is Mahalanobis distance?
- The Cholesky transformation and the geometry of multivariate normal variables
- The curse of dimensionality: How to define outliers in high-dimensional data?
- How to compute Mahalanobis distance in SAS

You can download the SAS program that generates the examples and images in this article.


The post Truncate response surfaces appeared first on The DO Loop.

This article shows how you can truncate a surface or a contour plot so that negative values are not displayed. You could do something similar to truncate unreasonably high values in a surface plot.

Before showing how to truncate the surface plot, let's figure out why the model predicts negative values when all the observed responses are positive. The following DATA step is a simplified version of the real data. The RSREG procedure uses least squares regression to fit a quadratic response surface. If you use the PLOTS=SURFACE option, the procedure automatically displays a contour plot and surface plot for the predicted response:

data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7  24 78 14  24 76 10
25 63  5  26 62 10  26 94 20  26 63 15  27 94 16
27 95 14  29 66  7  30 69  8  30 74  8
;

ods graphics / width=400px height=400px ANTIALIASMAX=10000;
proc rsreg data=Sample plots=surface(fill=pred overlaypairs);
   model Z = Y X;
run;

proc rsreg data=Sample plots=surface(3d fill=Pred gridsize=80);
   model Z = Y X;
   ods select Surface;
   ods output Surface=Surface;  /* use ODS OUTPUT to save surface data to a data set */
run;

The contour plot overlays a scatter plot of the data. You can see that the data are observed only in the upper-right portion of the plot (the red regions) and that no data are in the lower-left portion of the plot. The RSREG procedure fits a quadratic model to the data. The predicted values near the observed data are all positive. Some of the predicted values that are far from the observed data are negative.

I previously wrote about this phenomenon and showed how to compute the convex hull for these bivariate data. When you evaluate the model inside the convex hull, you are interpolating. When you evaluate the model outside the convex hull, you are extrapolating. It is well known that polynomial regression models can give nonsensical results if you extrapolate far from the data.

The RSREG procedure is not aware that the response variable should be positive. A quadratic surface will eventually get arbitrarily big in the positive and/or negative directions. You can see this on the contour and surface plots, which show the predictions of the model on a regular grid of (X, Y) values.
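A one-variable analogue shows why this happens. The coefficients below are hypothetical (not the fitted coefficients from PROC RSREG); they are chosen so that a concave quadratic is positive over an observed range like X in [10, 30] but becomes negative when you extrapolate toward zero:

```python
# Hypothetical concave quadratic "fit": positive on the observed X range,
# negative when extrapolated far outside it
def pred(x):
    return -0.1 * x**2 + 4 * x - 20

# Predictions over the observed range [10, 30] are all positive ...
observed = [pred(x) for x in range(10, 31)]
assert all(p > 0 for p in observed)

# ... but extrapolating to x=0 gives a nonsensical negative prediction
extrapolated = pred(0)
```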

If you want to display only the positive portion of the prediction surface, you can replace each negative predicted value with a missing value. The first step is to obtain the predicted values on a regular grid. You can use the "missing value trick" to score the quadratic model on a grid, or you can use the ODS OUTPUT statement to obtain the gridded values that are used in the surface plot. I chose the latter option. In the previous section, I used the ODS OUTPUT statement to write the gridded predicted values for the surface plot to a SAS data set named Surface.

As Warren Kuhfeld points out in his article about processing ODS OUTPUT data sets, the names in an ODS data object can be "long and hard to type." Therefore, I rename the variables. I also combine the gridded values with the original data so that I can optionally overlay the data and the predicted values.

/* rename vars and set negative responses to missing */
data Surf2;
set Surface(rename=(
       Predicted0_1_0_0 = Pred   /* rename the long ODS names */
       Factor1_0_1_0_0  = GY     /* 'G' for 'gridded' */
       Factor2_0_1_0_0  = GX))
    Sample(in=theData);          /* combine with original data */
if theData then Type = "Data   ";
else            Type = "Gridded";
if Pred < 0 then Pred = .;       /* replace negative predictions with missing values */
label GX = 'X' GY = 'Y';
run;

You can use the Graph Template Language (GTL) to generate graphs that are similar to those produced by PROC RSREG. You can then use PROC SGRENDER to create the graph. Because the negative response values were set to missing, the contour plot displays a missing value color (black, for this ODS style) in the lower-left and upper-right portions of the plot. Similarly, the missing values cause the surface plot to be truncated. By using the GRIDSIZE= option, you can make the jagged edges small.

Notice that the colors in the graphs are now based on the range [0, 50], whereas previously the colors were based on the range [-60, 50]. I've added a continuous legend to the plots so that the range of the response variable is obvious.

I'd like to stress that sometimes "nonsensical values" indicate an inappropriate model. If you notice nonsensical values, you should always ask yourself why the model is predicting those values. You shouldn't modify the prediction surface without a good reason. But if you do have a good reason, the techniques in this article should help you.

You can download the complete SAS program that analyzes the data and generates the truncated graphs.


The post Interpolation vs extrapolation: the convex hull of multivariate data appeared first on The DO Loop.

The same dangers exist for multivariate regression models, but they are emphasized less often. Perhaps the reason is that it is much harder to know when you are extrapolating a multivariate model. Interpolation occurs when you evaluate the model inside the convex hull of the training data. Anything else is an extrapolation. In particular, you might be extrapolating even if you score the model at a point inside the bounding box of the training data. This differs from the univariate case in which the convex hull equals the bounding box (range) of the data. In general, the convex hull of a set of points is smaller than the bounding box.

You can use a bivariate example to illustrate the difference between the convex hull of the data and the bounding box for the data, which is the rectangle [X_{min}, X_{max}] x [Y_{min}, Y_{max}].
The following SAS DATA step defines two explanatory variables (X and Y) and one response variable (Z). The SGPLOT procedure shows the distribution of the (X, Y) variables and colors each marker according to the response value:

data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7  24 78 14  24 76 10
25 63  5  26 62 10  26 94 20  26 63 15  27 94 16
27 95 14  29 66  7  30 69  8  30 74  8
;

title "Response Values for Bivariate Data";
proc sgplot data=Sample;
   scatter x=x y=y / markerattrs=(size=12 symbol=CircleFilled)
                     colorresponse=Z colormodel=AltThreeColorRamp;
   xaxis grid;
   yaxis grid;
run;

The data are observed in a region that is approximately triangular. No observations are near the lower-left corner of the plot. If you fit a response surface to this data, it is likely that you would visualize the model by using a contour plot or a surface plot on the rectangular domain [10, 30] x [62, 95]. For such a model, predicted values near the lower-left corner are not very reliable because the corner is far from the data.

In general, you should expect less accuracy when you predict the model "outside" the data (for example, (10, 60)) as opposed to points that are "inside" the data (for example, (25, 70)). This concept is sometimes discussed in courses about the design of experiments. For a nice exposition, see the course notes of Professor Rafi Haftka (2012, p. 49–59) at the University of Florida.

You can use SAS to visualize the convex hull of the bivariate observations. The convex hull is the smallest convex set that contains the observations. The SAS/IML language supports the CVEXHULL function, which computes the convex hull for a set of planar points.

You can represent the points by using an *N* x 2 matrix, where each row is a 2-D point.
When you call the CVEXHULL function, you obtain a vector of *N* integers. The first few integers are positive and represent the rows of the matrix that comprise the convex hull. The absolute values of the negative integers represent the rows that are interior to the convex hull. This is illustrated for the sample data:

proc iml;
use Sample;
read all var {x y} into points;
close;

/* get indices of points in the convex hull in counter-clockwise order */
indices = cvexhull( points );
print (indices`)[L="indices"];  /* positive indices are boundary; negative indices are inside */

The output shows that the observation numbers (indices) that form the convex hull are {1, 6, 7, 12, 13, 14, 11}. The other observations are in the interior. You can visualize the interior and boundary points by forming a binary indicator vector that has the value 1 for points on the boundary and 0 for points in the interior. To get the indicator vector in the order of the data, you need to use the SORTNDX subroutine to compute the anti-rank of the indices, as follows:
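The same boundary observations can be recovered outside SAS. The following pure-Python sketch uses Andrew's monotone chain algorithm (not necessarily the algorithm that CVEXHULL uses internally) on the same 14 points and finds the same seven hull observations:

```python
# Andrew's monotone chain convex hull: returns the 0-based row indices of
# the hull vertices in counterclockwise order
def convex_hull(points):
    idx = sorted(range(len(points)), key=lambda i: points[i])
    def cross(o, a, b):   # z-component of (a - o) x (b - o)
        return ((points[a][0] - points[o][0]) * (points[b][1] - points[o][1])
              - (points[a][1] - points[o][1]) * (points[b][0] - points[o][0]))
    hull = []
    for seq in (idx, idx[::-1]):          # build lower chain, then upper chain
        chain = []
        for i in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], i) <= 0:
                chain.pop()
            chain.append(i)
        hull += chain[:-1]                # endpoint repeats in the next chain
    return hull

# The 14 (X, Y) points from the Sample data set
points = [(10,90), (22,76), (22,75), (24,78), (24,76), (25,63), (26,62),
          (26,94), (26,63), (27,94), (27,95), (29,66), (30,69), (30,74)]
hull = convex_hull(points)   # 0-based; observations 1,6,7,11,12,13,14 in 1-based terms
```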

b = (indices > 0);                /* binary indicator variable for sorted vertices */
call sortndx(ndx, abs(indices));  /* get anti-rank: sort index that "reverses" the order */
onBoundary = b[ndx];              /* binary indicator data in original order */

title "Convex Hull of Bivariate Data";
call scatter(points[,1], points[,2]) group=onBoundary
     option="markerattrs=(size=12 symbol=CircleFilled)";

The blue points are the boundary of the convex hull whereas the red points are in the interior.

You can visualize the convex hull by forming the polygon that connects the first, sixth, seventh, ..., eleventh observations.
You can do this manually by using the POLYGON statement in PROC SGPLOT, which I show in the Appendix section. However, there is an easier way to visualize the convex hull. I previously wrote about SAS/IML packages and showed how to install the `polygon` package. The `polygon` package contains a module called PolyDraw, which enables you to draw polygons and overlay a scatter plot.

The following SAS/IML statements extract the positive indices and use them to get the points on the boundary of the convex hull. If the `polygon` package is installed, you can load the `polygon` package and visualize the convex hull and data:

hullNdx = indices[loc(b)];        /* get positive indices */
convexHull = points[hullNdx, ];   /* extract the convex hull, in CC order */

/* In SAS/IML 14.1, you can use the polygon package to visualize the convex hull:
   https://blogs.sas.com/content/iml/2016/04/27/packages-share-sas-iml-programs-html */
package load polygon;             /* assumes package is installed */
run PolyDraw(convexHull, points||onBoundary) grid={x y}
    markerattrs="size=12 symbol=CircleFilled";

The graph shows the convex hull of the data. You can see that it primarily occupies the upper-right portion of the rectangle. The convex hull shows the interpolation region for regression models. If you evaluate a model outside the convex hull, you are extrapolating. In particular, even though points in the lower left corner of the plot are within the bounding box of the data, they are far from the data.

Of course, if you have 5, 10 or 100 explanatory variables, you will not be able to visualize the convex hull of the data. Nevertheless, the same lesson applies. Namely, when you evaluate the model inside the bounding box of the data, you might be extrapolating rather than interpolating. Just as in the univariate case, the model might predict nonsensical data when you extrapolate far from the data.

Packages are supported in SAS/IML 14.1. If you are running an earlier version of SAS, you can create the same graph by writing the polygon data and the binary indicator variable to a SAS data set, as follows:

hullNdx = indices[loc(b)];        /* get positive indices */
convexHull = points[hullNdx, ];   /* extract the convex hull, in CC order */

/* Write the data and polygon to SAS data sets.
   Use the POLYGON statement in PROC SGPLOT. */
p = points || onBoundary;
poly = j(nrow(convexHull), 1, 1) || convexHull;
create TheData from p[colname={x y "onBoundary"}];
append from p;   close;
create Hull from poly[colname={ID cX cY}];
append from poly;   close;
quit;

/* combine the data and convex hull polygon */
data All;
set TheData Hull;
run;

proc sgplot data=All noautolegend;
   polygon x=cX y=cY ID=id / fill;
   scatter x=x y=y / group=onBoundary
           markerattrs=(size=12 symbol=CircleFilled);
   xaxis grid;
   yaxis grid;
run;

The resulting graph is similar to the one produced by the PolyDraw modules and is not shown.

