The post Distances between observations in two groups appeared first on The DO Loop.

The answer is yes, and this problem occurs frequently in applications. Typically the first group contains the locations of randomly placed individuals and the second contains the locations of resources. For each individual, you want to find the closest resources. Examples include:

- The first group contains the positions of houses. The second group contains the locations of stores. A retailer might want to identify the stores that are closest to customers, so that shipping costs are minimized.
- The first group contains the locations of vehicle accidents on a highway. The second group contains the locations of hospitals that can treat the injured victims.
- The first group contains the location of cell phone users. The second group contains the locations of cell towers.

This article solves the problem in a straightforward way: compute all the pairwise distances between the points in each group. Distances within groups are not needed, so they are not computed.


I previously showed how to compute the pairwise distances between points in different sets. Let S be a set of *n* d-dimensional points and let R be another set of *m* points. The PairwiseDist function in SAS/IML (shown below) returns an *n* x *m* matrix, D, of distances such that D[i,j] is the distance from the i_th point in S to the j_th point in R. The PairwiseNearestNbr function uses the same algorithm that was used to find nearest neighbors, so I will not re-explain how the function works. Suffice it to say that it returns two matrices: a matrix of row numbers and a matrix of distances.

```
proc iml;
/* http://blogs.sas.com/content/iml/2013/03/27/compute-distance.html */
/* Compute Euclidean distance between points in S and points in R.
   S is an n x d matrix, where each row is a point in d dimensions.
   R is an m x d matrix.
   The function returns the n x m matrix of distances, D, such that
   D[i,j] is the distance between S[i,] and R[j,]. */
start PairwiseDist(S, R);
   if ncol(S)^=ncol(R) then return (.);   /* different dimensions */
   n = nrow(S);  m = nrow(R);
   idx = T(repeat(1:n, m));               /* index matrix for S */
   jdx = shape(repeat(1:m, n), n);        /* index matrix for R */
   diff = S[idx,] - R[jdx,];
   return( shape( sqrt(diff[,##]), n ) ); /* sqrt(sum of squares) */
finish;

/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  S    an (n x d) data matrix
           R    an (m x d) matrix of reference points
           k    specifies the number of nearest neighbors (k>=1)
   OUTPUT: idx  an (n x k) matrix of row numbers. idx[,j] contains
                the row numbers (in R) of the j_th closest elements to S
           dist an (n x k) matrix. dist[,j] contains the distances
                between S and the j_th closest elements in R
*/
start PairwiseNearestNbr(idx, dist, S, R, k=1);
   n = nrow(S);
   idx = j(n, k, .);  dist = j(n, k, .);
   D = PairwiseDist(S, R);       /* n x m */
   do j = 1 to k;
      dist[,j] = D[ ,><];        /* smallest distance in each row */
      idx[,j] = D[ ,>:<];        /* column of smallest distance in each row */
      if j < k then do;          /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:n)||idx[,j]);
         D[ndx] = .;             /* set elements to missing */
      end;
   end;
finish;
```
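If you prefer to prototype the idea outside SAS, the pairwise-distance computation is a one-line broadcast in NumPy. This sketch is not part of the original program, and the two small point sets are hypothetical:

```python
import numpy as np

def pairwise_dist(S, R):
    """Return the n x m matrix D with D[i,j] = Euclidean distance
    from row i of S to row j of R (the same quantity as PairwiseDist)."""
    S = np.asarray(S, dtype=float)           # n x d
    R = np.asarray(R, dtype=float)           # m x d
    diff = S[:, None, :] - R[None, :, :]     # n x m x d array of differences
    return np.sqrt((diff**2).sum(axis=2))    # n x m matrix of distances

S = [[0.0, 0.0], [3.0, 4.0]]                 # two points
R = [[0.0, 0.0], [0.0, 4.0], [3.0, 0.0]]     # three reference points
D = pairwise_dist(S, R)
print(D)    # D[1,0] = 5 because the distance from (3,4) to (0,0) is 5
```

The broadcast builds the full n x m x d array of coordinate differences, which mirrors the `S[idx,] - R[jdx,]` trick in the SAS/IML module.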

To illustrate how you can use these functions, consider two sets S and R. The set R is the set of resource locations (stores, hospitals, towers, ...). For this example, R contains four points that are the vertices of a square of side length 2 centered at the origin. For each point in S, the following SAS/IML statements compute the closest vertex in R and the second-closest vertex in R:

```
/* R = vertices of a square */
R = {-1 -1,      /* Pt 1: lower left corner  */
      1 -1,      /* Pt 2: lower right corner */
      1  1,      /* Pt 3: upper right corner */
     -1  1};     /* Pt 4: upper left corner  */
S = {-0.3 -0.2,  /* points inside square */
      0.9  0.6,
     -0.5  0.8};
k = 2;           /* find nearest and second nearest points */
run PairwiseNearestNbr(idx, dist, S, R, k);
print idx[c={"Closest" "2nd Closest"} r=("S1":"S3")];
```

The first column in the table shows the indices (row numbers) of the vertices that are nearest to each point of S. The second column shows the second-closest vertices. For example, Pt 1 (the lower left corner) is the nearest vertex to the first row of S, and Pt 4 (the upper left corner) is the second nearest. In a similar way, Pt 3 is the closest vertex to the second row of S, and Pt 4 is the closest vertex to the third row of S.
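You can verify the table by hand, or with a few lines of NumPy (a cross-check, not part of the original post), using the same vertices R and points S:

```python
import numpy as np

R = np.array([[-1., -1.], [1., -1.], [1., 1.], [-1., 1.]])  # Pt 1..Pt 4
S = np.array([[-0.3, -0.2], [0.9, 0.6], [-0.5, 0.8]])

D = np.sqrt(((S[:, None, :] - R[None, :, :])**2).sum(axis=2))  # 3 x 4 distances
order = np.argsort(D, axis=1)       # columns sorted by distance, per row
closest = order[:, 0] + 1           # 1-based indices, to match the SAS/IML output
second  = order[:, 1] + 1
print(closest.tolist(), second.tolist())   # [1, 3, 4] [4, 2, 3]
```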

The PairwiseNearestNbr module also returns the `dist` matrix, which contains the corresponding distances.
The i_th row of `dist` contains the distances from the i_th row of S to the two closest vertices.

In a similar way, you can generate random points in the square and color each point in a scatter plot according to the nearest point in the reference set. The ODS GRAPHICS statement with ATTRPRIORITY=NONE forces the ODS style to cycle through colors and symbols.

```
ods graphics / attrpriority=NONE width=400px height=400px; /* cycle symbols and colors */
/* Generate 500 random points in square [-1,1] x [-1,1] */
call randseed(12345);
S = R // randfun({500 2}, "Uniform", -1, 1);
k = 2;
run PairwiseNearestNbr(idx, dist, S, R, k);
Corner = idx[,1];    /* index of nearest vertex */
title "Points colored by nearest corner";
call scatter(S[,1], S[,2]) group=Corner grid={x y} procopt="aspect=1";
```

The scatter plot shows what you already knew: points in the first quadrant (green Xs) are closest to the vertex at (1,1). Points in the second quadrant (brown triangles) are closest to the vertex at (-1, 1), and so forth. Any points that are equidistant from two or more vertices (such as the origin) are assigned one of the colors arbitrarily. The points were sorted (not shown) so that the order in the legend agrees with the order of the points in R.

The call to the PairwiseNearestNbr subroutine used *k*=2, which requests the closest and the second closest points. The second-closest indices are located in the second column of the `idx` matrix. If you change the title and set `Corner = idx[,2]`, you can create a scatter plot in which each point is colored by the *second-closest* vertex. That coloration is shown in the scatter plot below.
See if you can understand why each marker in the following scatter plot is colored the way it is. For example, for points in the first quadrant, the second-closest vertex is either the second vertex (1, -1) or the fourth vertex (-1, 1).

In summary, last week I showed how you can find the nearest neighbors among a single set of observations. This article solves a similar problem in which observations are split into two different groups. The functions in this article find the nearest observation in a "reference" or "resource" group for each observation in another group.

I'll conclude by mentioning that you can use these computations to compute the nearest distance between two groups of observations. Simply use *k*=1 and take the minimum distance that is computed by the PairwiseNearestNbr subroutine. This gives the smallest distance between any point in the first group and any point in the second group.

tags: Data Analysis


The post Create an ogive in SAS appeared first on The DO Loop.


An ogive (rhymes with "slow jive") is also called a cumulative histogram. You can create an ogive from a histogram by accumulating the frequencies (or relative frequencies) in each histogram bin. The height of an ogive curve at *x* is found by summing the heights of the histogram bins to the left of *x*.

A histogram estimates the density of a distribution; the ogive estimates the cumulative distribution. Both are easy to construct by hand. Both are coarse estimates that depend on your choice of bin width and anchor position.

Ogives are not used much by professional statisticians because modern computers make it easy to compute and visualize the exact empirical cumulative distribution function (ECDF). However, if you are a student learning to analyze data by hand, an ogive is an easy way to approximate the ECDF from binned data. They are also important if you do not have access to the original data, but you have a histogram that appeared in a published report. (See "How to approximate a distribution from published quantiles.")

To demonstrate the construction of an ogive, let's consider the distribution of the MPG_CITY variable in the Sashelp.Cars data set. This variable contains the reported fuel efficiency (in miles per gallon) for 428 vehicle models. The following call to PROC UNIVARIATE in Base SAS uses the OUTHIST= option in the HISTOGRAM statement to create a data set that contains the frequencies and relative frequencies of each bin. By default, the frequencies are reported for the midpoints of the intervals. To create an ogive you need the *endpoints* of each bin, so use the ENDPOINTS option as follows:

```
proc univariate data=sashelp.cars;
   var mpg_city;
   histogram mpg_city / grid vscale=proportion ENDPOINTS OUTHIST=OutHist;
   /* cdfplot mpg_city / vscale=proportion; */  /* optional: create an ECDF plot */
run;
```

The histogram shows that most vehicles get between 15 and 25 mpg in the city. The distribution is skewed to the right, with a few vehicles getting as much as 59 or 60 mpg. A few gas-guzzling vehicles get less than 15 mpg.

You can construct an ogive from the relative frequencies in the 11 histogram bins.
The height of the ogive at *x*=10 (the leftmost endpoint in the histogram) is zero. The height at *x*=15 is the height of the first bar. The height at *x*=20 is the sum of the heights of the first two histogram bars, and so on.

Each row in the OutHist data set contains a left-hand endpoint and the relative frequency (height) of the bar.
However, to construct an ogive, you need to associate the bar height with the *right-hand* endpoints. This is because at the left-hand endpoint none of the density for the bin has accumulated, and for the right-hand endpoint all of the density has accumulated.

Consequently, to construct an ogive from the OUTHIST= data set, you can do the following:

- Associate zero with the leftmost endpoint of the bins.
- Adjust the counts and proportions in the OutHist data so that they are associated with the right-hand endpoint of each bin. You can use the LAG function to do this.
- Accumulate the relative frequencies in each bin to form the cumulative frequencies.
- Add a new observation to the OutHist data that contains the rightmost endpoint of the bins.
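The same construction is easy to sketch outside SAS. The following Python/NumPy snippet (not part of the original post; the data are a hypothetical stand-in for the MPG_City values) bins the data, accumulates the relative frequencies, and associates zero with the leftmost endpoint:

```python
import numpy as np

rng = np.random.default_rng(1)
mpg = rng.normal(20, 5, size=428)     # hypothetical stand-in for MPG_City

counts, edges = np.histogram(mpg, bins=11)   # 11 bin counts, 12 endpoints
cumprop = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))

# cumprop[i] is the proportion of data to the LEFT of edges[i]:
# 0 at the leftmost endpoint, 1 at the rightmost endpoint, and the
# bar heights attach to the right-hand endpoint of each bin.
for x, p in zip(edges, cumprop):
    print(f"{x:7.2f}  {p:.3f}")
```

Plotting `cumprop` against `edges` (with markers at each endpoint) gives the ogive.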

The following SAS DATA step carries out these adjustments:

```
data Ogive;
set outhist end=EOF;
ogiveX = _MinPt_;        /* left endpoint of bin */
dx = dif(ogiveX);        /* compute bin width */
prop = lag(_OBSPCT_);    /* move relative frequency to RIGHT endpoint */
if _N_=1 then prop = 0;  /* replace missing value by 0 for first obs */
cumProp + prop/100;      /* accumulate proportions */
output;
if EOF then do;          /* append RIGHT endpoint of final bin */
   ogiveX = ogiveX + dx;
   cumProp = 1;
   output;
end;
drop dx _:;              /* drop variables that begin with underscore */
run;
```

The Ogive data set contains all the information that you need to graph an ogive. The following call to PROC SGPLOT uses a VLINE statement, which treats the endpoints of the bins as discrete values. You could also use the SERIES statement, which treats the endpoints as a continuous variable, but might not put a tick mark at each bin endpoint.

```
title "Cumulative Distribution of Binned Values (Ogive)";
proc sgplot data=Ogive;
   vline OgiveX / response=cumProp markers;
   /* series x=_minpt_ y=cumProp / markers; */
   xaxis grid label="Miles per Gallon (City)";
   yaxis grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
run;
```

You can use the graph to estimate the percentiles of the data. For example:

- The 20th percentile is approximately 17 because the curve appears to pass through the point (17, 0.20). In other words, about 20% of the vehicles get 17 mpg or less.
- The 50th percentile is approximately 19 because the curve appears to pass through the point (19, 0.50).
- The 90th percentile is approximately 27 because the curve appears to pass through the point (27, 0.90). Only 10% of the vehicles have a fuel efficiency greater than 27 mpg.

You might wonder how well the ogive approximates the empirical CDF. The following graph overlays the ogive and the ECDF for these data. You can see that the two curves agree closely at the ogive values, shown by the markers. However, there is some deviation because the ogive assumes a linear accumulation (a uniform distribution) of data values within each histogram bin. Nevertheless, this coarse piecewise linear curve that is based on binned data does a good job of showing the basic shape of the empirical cumulative distribution.

tags: Data Analysis


The post Simulate data from a generalized Gaussian distribution appeared first on The DO Loop.

The generalized Gaussian distribution has a standardized probability density of the form f(*x*) = B exp( -|A*x*|^{α} ), where A(α) and B(α) are known functions of the exponent parameter α > 0. When 0 < α < 2, the generalized Gaussian distribution (GGD) is a heavy-tailed distribution that has finite moments. The distribution has applications in finance and signal processing.
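As a sanity check on the formulas for A(α) and B(α), you can verify numerically that the density integrates to 1 for any α, and that α=2 gives the standard normal density. Here is a sketch in Python/NumPy (not part of the original SAS program):

```python
import numpy as np
from math import gamma, sqrt, pi

def ggd_pdf(x, alpha):
    """Standardized GGD density f(x) = B*exp(-|A*x|**alpha), mu=0, sigma=1."""
    A = sqrt(gamma(3/alpha) / gamma(1/alpha))
    B = (alpha/2) / gamma(1/alpha) * A
    return B * np.exp(-(A * np.abs(x))**alpha)

x = np.linspace(-20, 20, 400001)
totals = {}
for alpha in (0.5, 1, 2, 5):
    f = ggd_pdf(x, alpha)
    totals[alpha] = float(np.sum(0.5*(f[1:] + f[:-1]) * np.diff(x)))  # trapezoid rule
print(totals)   # each total is very close to 1

# alpha=2 reduces to the standard normal density: f(0) = 1/sqrt(2*pi)
print(abs(ggd_pdf(0.0, 2) - 1/sqrt(2*pi)))
```

For α=2 the constants reduce to A = 1/√2 and B = 1/√(2π), which is exactly the standard normal density.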

The following SAS statements evaluate the GGD density function for four values of the shape parameter α. The SGPLOT procedure graphs the density functions (adapted from Nardon and Pianca (2009)):

```
data GenGauss;
mu=0; sigma=1;    /* location=mu and scale=sigma */
do alpha = 0.5, 1, 2, 5;
   A = sqrt( Gamma(3/alpha) / Gamma(1/alpha) ) / sigma; /* A(alpha, sigma) */
   B = (alpha/2) / Gamma(1/alpha) * A;  /* B(alpha, sigma) = normalizing constant */
   do x = -3 to 3 by 0.05;
      pdf = B*exp(-(A*abs(x-mu))**alpha);
      output;
   end;
end;
run;

title "Generalized Gaussian Densities";
title2 "(*ESC*){unicode mu}=0; (*ESC*){unicode sigma}=1";
proc sgplot data=GenGauss;
   series x=x y=pdf / group=alpha;
   yaxis grid label="Density of Generalized Gaussian";
   keylegend / across=1 opaque location=inside position=right;
run;
```

The density for α=1/2 is sharply peaked and has heavy tails. The distribution for α=1 is the Laplace distribution, which also has a sharp peak at the mean. The distribution for α=2 is the standard normal distribution. For α=5, the distribution is platykurtic: it has a broad flat central region and thin tails.

This article uses only the standardized distribution that has zero mean and unit variance. However, the functions are all written to support a location parameter μ and a scale parameter σ.

Nardon and Pianca (2009) describe an algorithm for simulating random variates from the generalized Gaussian distribution: simulate from a gamma distribution, raise that variate to a power, and then randomly multiply by ±1. You can implement the simulation in the SAS DATA step or in the SAS/IML language. The following SAS/IML program defines a function for generating random GGD values:

```
proc iml;
start Rand_GenGauss(N, mu=0, sigma=1, alpha=0.5);
   n1 = N[1];  n2 = 1;         /* default dimensions for returned matrix */
   if nrow(N)>1 | ncol(N)>1 then n2 = N[2];  /* support n1 x n2 matrices */
   x = j(n1,n2);  sgn = j(n1,n2);
   /* simulate for mu=0 and sigma=1, then scale and translate */
   A = sqrt( Gamma(3/alpha) / Gamma(1/alpha) );
   b = A##alpha;
   call randgen(x, "Gamma", 1/alpha, 1/b);   /* X ~ Gamma(1/alpha, 1/b) */
   call randgen(sgn, "Bernoulli", 0.5);
   sgn = -1 + 2*sgn;                         /* random {-1, 1} */
   return mu + sigma * sgn # x##(1/alpha);
finish;

/* try it out: simulate 10,000 random variates */
call randseed(12345);
x = Rand_GenGauss(10000, 0, 1, 1/2);  /* X ~ GGD(mu=0, sigma=1, alpha=1/2) */
title "Generalized Gaussian Distribution";
title2 "mu=0, sigma=1, alpha=1/2";
call histogram(x);
```

For α=1/2, most of the simulated values are near 0, which is the mean and mode. The bulk of the remaining values are in the interval [-5,5]. However, a very small percentage (about 0.5%) are more extreme than these values. For this sample, the range of the simulated data is about [-14 , 11], so you can see that this distribution is useful for simulations in which the magnitude of errors can be more extreme than normal errors.
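The three-step algorithm (gamma draw, power transform, random sign) is easy to check in another language. The following Python/NumPy sketch (an illustration, not the original SAS code) simulates a large sample and confirms that the standardized GGD has mean 0 and unit variance:

```python
import numpy as np
from math import gamma, sqrt

def rand_gen_gauss(n, alpha=0.5, mu=0.0, sigma=1.0, seed=None):
    """Gamma draw -> power transform -> random sign (the algorithm above)."""
    rng = np.random.default_rng(seed)
    A = sqrt(gamma(3/alpha) / gamma(1/alpha))
    b = A**alpha
    g = rng.gamma(shape=1/alpha, scale=1/b, size=n)  # G ~ Gamma(1/alpha, 1/b)
    sgn = rng.choice([-1.0, 1.0], size=n)            # random +/- 1
    return mu + sigma * sgn * g**(1/alpha)

x = rand_gen_gauss(200_000, alpha=0.5, seed=12345)
print(round(x.mean(), 3), round(x.var(), 3))   # near 0 and 1
```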

The cumulative distribution function for the generalized Gaussian distribution does not have a closed-form solution in terms of elementary functions. However, there are a few values of α for which the equations simplify, including α=1/2. Chapeau-Blondeau and Monir (2002) note that the generalized Gaussian distribution with α=1/2 is especially useful in practice because its quantile function (inverse CDF) can be explicitly expressed in terms of the Lambert W function.

The CDF for the GGD with α=1/2 can be computed separately for *x* ≤ 0 and for *x* > 0, as shown below:

```
/* CDF for generalized Gaussian distribution with alpha=1/2 */
start CDF_GenGaussHalf(x);
   y = sqrt(abs(2*sqrt(30)*x));
   cdf = 0.5*(1+y)#exp(-y);   /* CDF for x <= 0 */
   idx = loc(x>0);            /* if x > 0, reflect: CDF(x) = 1 - CDF(-x) */
   if ncol(idx)>0 then do;
      cdf[idx] = 1 - cdf[idx];
   end;
   return cdf;
finish;

p = CDF_GenGaussHalf(-5);   /* P(X < -5) for X ~ GGD(0,1, alpha=1/2) */
print p;                    /* 0.0025654 */
```

You can use the CDF function to evaluate the probability that a random GGD observation is less than -5. The answer is 0.0026. By symmetry, the probability that a random variate is outside of [-5,5] is 0.0052, which agrees with the empirical proportion in the random sample in the previous section.
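A quick cross-check of the closed-form CDF outside SAS (a Python sketch, using the same formula with A = 2√30):

```python
from math import sqrt, exp

def cdf_gen_gauss_half(x):
    """CDF of the standardized GGD with alpha=1/2, where A = 2*sqrt(30)."""
    y = sqrt(abs(2 * sqrt(30) * x))
    left = 0.5 * (1 + y) * exp(-y)        # CDF value for x <= 0
    return left if x <= 0 else 1 - left   # reflect for x > 0

p = cdf_gen_gauss_half(-5)
print(p)                            # ~0.0025654, as in the text
print(cdf_gen_gauss_half(5) + p)    # symmetry: exactly 1
```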

The inverse CDF for the GGD with α=1/2 can be implemented by calling the Lambert W function. (Technically, the lower branch of the Lambert W function.) You can download functions that implement the Lambert W function in SAS/IML. The following SAS/IML statements assume that you have downloaded the Lambert W functions and stored the modules, so that they are available by using the LOAD statement. Just as the CDF was defined in two parts, the inverse CDF is defined in two parts.

```
/* Use Lambert W function to compute the quantile function (inverse CDF)
   for generalized Gaussian distribution with alpha=1/2 */
load module=(LambertWm);   /* load function for the Lambert W function */
start Quantile_GenGaussHalf(x);
   q = j(nrow(x), ncol(x), 0);
   idx = loc(x<0.5);          /* invert CDF for x < 0.5 */
   if ncol(idx)>0 then do;
      y = -2 * x[idx] / exp(1);
      q[idx] = -1/(2*sqrt(30)) * (1 + LambertWm(y))##2;
   end;
   if ncol(idx)=nrow(x)*ncol(x) then return q;   /* all found */
   idx = loc(x=0.5);          /* quantile for x=0.5 */
   if ncol(idx)>0 then q[idx] = 0;
   idx = loc(x>0.5);          /* invert CDF for x > 0.5 */
   if ncol(idx)>0 then do;
      y = -2 * (1-x[idx]) / exp(1);
      q[idx] = 1/(2*sqrt(30)) * (1 + LambertWm(y))##2;
   end;
   return q;
finish;

q = Quantile_GenGaussHalf(0.975);   /* find q such that P(X < q) = 0.975 */
print q;                            /* 2.0543476 */
```

The output (not shown) is q = 2.05, which means that the interval [-2.05, 2.05] contains 95% of the probability for the standardized GGD when α=1/2. Compare this interval with the familiar interval [-1.96, 1.96], which contains 95% of the probability for the standard normal distribution.
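If you don't have the SAS/IML Lambert W modules handy, the lower branch W₋₁ is easy to approximate with a few Newton iterations. The following Python sketch (a self-contained illustration with a hand-rolled Lambert W solver, not the downloadable SAS module) reproduces the quantile:

```python
from math import sqrt, exp, log, e

def lambert_w_lower(y):
    """Lower branch W_{-1}(y): solve w*exp(w) = y for w <= -1, -1/e < y < 0."""
    w = log(-y) - log(-log(-y))    # asymptotic starting guess
    for _ in range(50):            # Newton iterations on f(w) = w*exp(w) - y
        ew = exp(w)
        step = (w * ew - y) / ((w + 1) * ew)
        w -= step
        if abs(step) < 1e-14:
            break
    return w

def quantile_gen_gauss_half(prob):
    """Quantile function of the standardized GGD with alpha = 1/2."""
    if prob == 0.5:
        return 0.0
    tail = min(prob, 1 - prob)
    q = (1 + lambert_w_lower(-2 * tail / e))**2 / (2 * sqrt(30))
    return q if prob > 0.5 else -q

q = quantile_gen_gauss_half(0.975)
print(q)   # ~2.054, so [-2.05, 2.05] holds 95% of the probability
```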

To learn more about the generalized Gaussian distribution, with an emphasis on the parameter value α=1/2, see the following well-written papers, which expand on the discussion in this blog post:

- F. Chapeau-Blondeau and A. Monir (2002), "Numerical Evaluation of the Lambert W Function and Application to Generation of Generalized Gaussian Noise With Exponent 1/2," *IEEE Transactions on Signal Processing*.
- Nardon and Pianca (2009; preprint 2006), "Simulation techniques for generalized Gaussian densities," *Journal of Statistical Computation and Simulation*.

tags: Simulation


The post The distribution of nearest neighbor distances appeared first on The DO Loop.

The distribution of the nearest-neighbor distances provides information about the process that generated the spatial data. Let's look at two examples: clustered data for trees in a forest and uniformly distributed (simulated) data.

The Sashelp.Bei data contains the locations for 3,604 trees in a forest. A scatter plot of the data is shown to the left. The trees appear to be clustered in certain regions, whereas other regions contain no trees. This indicates that the trees are not uniformly distributed throughout the region.

Let's use the NearestNbr module from my previous blog post to compute the nearest neighbor distances for these tree locations. The following program saves the distances to a SAS data set named Dist_Trees. For comparison, the program also generates 3,604 random (uniformly distributed) points in the same rectangular region. The nearest neighbor distances for these simulated points are written to a data set named Dist_Unif.

```
proc iml;
use Sashelp.Bei where(Trees=1);
read all var {x y} into T;
close Sashelp.Bei;

/* See http://blogs.sas.com/content/iml/2016/09/14/nearest-neighbors-sas.html */
start NearestNbr(idx, dist, X, k=1);
   N = nrow(X);  p = ncol(X);
   idx = j(N, k, .);  dist = j(N, k, .);
   D = distance(X);
   D[do(1, N*N, N+1)] = .;    /* set diagonal elements to missing */
   do j = 1 to k;
      dist[,j] = D[ ,><];     /* smallest distance in each row */
      idx[,j] = D[ ,>:<];     /* column of smallest distance in each row */
      if j < k then do;       /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:N)||idx[,j]);
         D[ndx] = .;          /* set elements to missing */
      end;
   end;
finish;

run NearestNbr(nbrIdx, dist, T);
create Dist_Trees var "Dist";  append;  close;   /* tree distances */

/* compare with uniform */
call randseed(12345);
N = nrow(T);
/* generate random uniform points in [0,1000] x [0,500] */
X = 1000*randfun(N, "uniform") || 500*randfun(N, "uniform");
run NearestNbr(nbrIdx, dist, X);
create Dist_Unif var "Dist";  append;  close;    /* random uniform distances */
QUIT;
```

You can use PROC UNIVARIATE to compare the two distributions of distances. The following DATA step concatenates the distances and adds an indicator variable that has the value "Uniform" or "Trees." PROC UNIVARIATE creates a graph of the empirical cumulative distribution function for each set of nearest-neighbor distances:

```
data All;
set Dist_Unif(in=inUnif) Dist_Trees;
if inUnif then Group = "Uniform";
else Group = "Trees ";
run;

proc univariate data=All;
   where dist<=20;   /* excludes 41 trees and 1 random observation */
   class Group;
   cdfplot dist / vscale=proportion
                  odstitle="Compare ECDF for Uniform and Tree Data" overlay;
   ods select cdfplot;
run;
```

The empirical CDFs for the two data sets look quite different. For the tree data, the distribution function rises rapidly, which indicates that many observations have nearest neighbors that are very close. You can see that about 25% of trees have a nearest neighbor within about 1.5 units, and 50% have a nearest neighbor within 3.2 units.

In contrast, the empirical distribution of the uniform data rises less steeply. For the simulated observations, only 5% have a nearest neighbor within 1.5 units. Only 21% have a nearest neighbor within 3.5 units. The graph indicates clumping in the tree data: the typical tree has a neighbor that is closer than would be expected for random uniform locations.
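You can reproduce the qualitative effect with synthetic data. The following Python/NumPy sketch (an illustration with simulated clusters, not the Sashelp.Bei trees) compares nearest-neighbor distances for clustered and uniform point patterns on the same rectangle:

```python
import numpy as np

rng = np.random.default_rng(2016)

def nearest_dist(X):
    """Nearest-neighbor distance for each row of X (brute force)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=2))
    np.fill_diagonal(D, np.inf)   # a point is not its own neighbor
    return D.min(axis=1)

# 500 uniform points in [0,1000] x [0,500]
unif = rng.uniform([0, 0], [1000, 500], size=(500, 2))

# 500 clustered points: 50 points scattered (sd=10) around each of 10 centers
centers = rng.uniform([50, 50], [950, 450], size=(10, 2))
clust = np.repeat(centers, 50, axis=0) + rng.normal(0, 10, size=(500, 2))

# The typical clustered point has a much closer neighbor
print(np.median(nearest_dist(clust)), np.median(nearest_dist(unif)))
```

Plotting the two empirical CDFs of these distances shows the same pattern as the tree data: the clustered curve rises much more steeply.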

In SAS/STAT 13.2 and beyond you can use the new SPP procedure to perform comparisons like this automatically. The SPP procedure supports several statistical tests for "complete spatial randomness," which enable you to compare the observed distribution to the expected distribution for uniformly distributed data.

The SPP procedure supports tests that are based on nearest-neighbor distances, as well as other tests. The example in this article is essentially a visual representation of the "G function test," which you can carry out by using the following statements:

```
proc spp data=sashelp.bei plots=(G(unpack));
   process trees = (x, y / area=(0,0,1000,500) Event=Trees) / G;
run;
```

The SPP procedure creates a graph of the "edge-corrected G function." The upper curve is the empirical distribution of nearest neighbor distances for the trees, adjusted for edge effects caused by a finite domain. The lower curve shows the expected distribution for random uniform data of the same size on the same domain. The light-blue band is a 95% confidence envelope, which gives you a feeling for the variation due to random sampling. The envelope is formed by 100 simulations of random uniform data, similar to the simulation in the previous section.

The graph indicates that there is structure in the trees data beyond what you would expect from complete spatial randomness. You can use other features of PROC SPP to model the trees data. For an overview of the spatial analyses that you can perform with PROC SPP, see Mohan and Tobias (2015), "Analyzing Spatial Point Patterns Using the New SPP Procedure".

In summary, nearest-neighbor distances are useful for analyzing properties of spatial point patterns. This article compared tree data to a single instance of random uniform data. By using the SPP procedure, you can run a more complete analysis and obtain graphs and related statistics with minimal effort.

tags: Data Analysis


The post Compute nearest neighbors in SAS appeared first on The DO Loop.

Let's create a sample data set. The following DATA step simulates 100 observations that are uniformly distributed in the unit square [0,1] x [0,1]:

```
data Unif;
call streaminit(12345);
do i = 1 to 100;
   x = rand("Uniform");
   y = rand("Uniform");
   output;
end;
run;
```

I have previously shown how to compute the distance between observations in SAS by using PROC DISTANCE or the DISTANCE function in SAS/IML. The following statements read the data into a SAS/IML matrix and computes the pairwise distances between all observations:

```
proc iml;
use Unif;  read all var {x, y} into X;  close;
D = distance(X);    /* N x N distance matrix */
```

The D matrix is a symmetric 100 x 100 matrix. The value `D[i,j]` is the Euclidean distance between the *i*th and *j*th rows of X. An easy way to find the nearest neighbor of observation *i* is to search the *i*th row for the column that contains the smallest distance. Because the diagonal elements of D are all zero, a useful trick is to change the diagonal elements to missing values. Then the smallest value in each row of D corresponds to the nearest neighbor.

The following statements assign missing values to the diagonal elements of the D matrix. You can then use the SAS/IML subscript reduction operators to find the minimum distance in each row:

```
diagIdx = do(1, nrow(D)*ncol(D), ncol(D)+1);  /* index diagonal elements */
D[diagIdx] = .;           /* set diagonal elements to missing */
dist = D[ ,><];           /* smallest distance in each row */
nbrIdx = D[ ,>:<];        /* column of smallest distance in each row */
Neighbor = X[nbrIdx, ];   /* coords of closest neighbor */
```

If you write the nearest neighbors and distances to a SAS data set, you can use the VECTOR statement in PROC SGPLOT to draw a vector that connects each observation to its nearest neighbor. The graph indicates the nearest neighbor for each observation.

```
Z = X || Neighbor || dist;
create NN_Unif from Z[c={"x" "y" "xc" "yc" "dist"}];
append from Z;  close;
QUIT;

title "Nearest Neighbors for 100 Uniformly Distributed Points";
proc sgplot data=NN_Unif aspect=1;
   label dist = "Distance to Nearest Neighbor";
   scatter x=x y=y / colorresponse=dist markerattrs=(symbol=CircleFilled);
   vector x=xc y=yc / xorigin=x yorigin=y transparency=0.5
                      colorresponse=dist;  /* COLORRESPONSE= requires 9.4m3 for VECTOR */
   xaxis display=(nolabel);
   yaxis display=(nolabel);
run;
```

If you don't have SAS/IML but still want to compute nearest neighbors, you can use PROC MODECLUS. The NEIGHBOR option on the PROC MODECLUS statement produces a table that gives the observation number (or ID value) of nearest neighbors. For example, the following statements produce the observation numbers for the nearest neighbors:

```
ods select neighbor;
/* Use K=p to find nearest p-1 neighbors */
proc modeclus data=Unif method=1 k=2 Neighbor;   /* k=2 for nearest nbrs */
   var x y;
run;
```

To get the coordinates of the nearest neighbor, you can create a variable that contains the observation numbers and then use an ID statement to include that ID variable in the PROC MODECLUS output. You can then look up the coordinates. I omit the details.

You can use the ideas in the earlier SAS/IML program to write a program that returns the indices (observation numbers) of the *k* closest neighbors for *k* ≥ 1. The trick is to replace the smallest distance in each row with a missing value and then repeat the process of finding the smallest value (and column) in each row. The following SAS/IML module implements this computation:

```
proc iml;
/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  X    an (N x p) data matrix
           k    specifies the number of nearest neighbors (k>=1)
   OUTPUT: idx  an (N x k) matrix of row numbers. idx[,j] contains
                the row numbers (in X) of the j_th closest neighbors
           dist an (N x k) matrix. dist[,j] contains the distances
                between X and the j_th closest neighbors
*/
start NearestNbr(idx, dist, X, k=1);
   N = nrow(X);  p = ncol(X);
   idx = j(N, k, .);    /* j_th col contains obs numbers for j_th closest obs */
   dist = j(N, k, .);   /* j_th col contains distance for j_th closest obs */
   D = distance(X);
   D[do(1, N*N, N+1)] = .;   /* set diagonal elements to missing */
   do j = 1 to k;
      dist[,j] = D[ ,><];    /* smallest distance in each row */
      idx[,j] = D[ ,>:<];    /* column of smallest distance in each row */
      if j < k then do;      /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:N)||idx[,j]);
         D[ndx] = .;         /* set elements to missing */
      end;
   end;
finish;
```

You can use this module to compute the *k* closest neighbors for each observation (row) in a data matrix. For example, the following statements compute the two closest neighbors for each observation. The output shows a few rows of the original data and the coordinates of the closest and next closest observations:

```
use Unif;  read all var {x, y} into X;  close;
k = 2;
run NearestNbr(nbrIdx, dist, X, k);
Near1 = X[nbrIdx[,1], ];   /* 1st nearest neighbors */
Near2 = X[nbrIdx[,2], ];   /* 2nd nearest neighbors */
```
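If you want to check the logic in another language, the same k-nearest-neighbor computation is a few lines of NumPy (an illustration with a tiny hypothetical data matrix, not part of the original post):

```python
import numpy as np

def nearest_nbr(X, k=1):
    """Indices (0-based row numbers) and distances of the k nearest neighbors."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=2))
    np.fill_diagonal(D, np.inf)           # exclude self-distances
    idx = np.argsort(D, axis=1)[:, :k]    # k smallest distances per row
    dist = np.take_along_axis(D, idx, axis=1)
    return idx, dist

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.1, 0.0]])
idx, dist = nearest_nbr(X, k=2)
print(idx)   # row 0's two closest rows are 3 and 1
```

Sorting each row replaces the iterate-and-mask trick, but it produces the same neighbors.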

In summary, this article defines a short module in the SAS/IML language that you can use to compute the *k* nearest neighbors for a set of *N* numerical observations. Notice that the computation builds an *N* x *N* distance matrix in RAM, so this matrix might consume a lot of memory for large data sets. For example, the distance matrix for a data set with 16,000 observations requires about 1.91 GB of memory. For larger data sets, you might prefer to use PROC DISTANCE or PROC MODECLUS.
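That memory estimate is simple arithmetic: an *N* x *N* matrix of 8-byte doubles requires *N*²·8 bytes. A one-liner confirms it:

```python
n = 16_000
gib = n * n * 8 / 2**30   # one 8-byte double per matrix entry
print(round(gib, 2))      # 1.91
```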

tags: 9.4, Data Analysis, Statistical Programming


The post Overlay a curve on a bar chart in SAS appeared first on The DO Loop.

However, if you try to overlay incompatible plot types, you will get an error message that says

`ERROR: Attempting to overlay incompatible plot or chart types.`

For example, a histogram and a series plot are not compatible in PROC SGPLOT, so you need to use the Graphics Template Language (GTL) to overlay a custom density estimate on a histogram.

A similar limitation exists for bar charts in PROC SGPLOT: you cannot specify the VBAR and SERIES statements in a single call. However, in SAS 9.3 and beyond you can use the VBARPARM statement to overlay a curve and a bar chart.

In SAS 9.4m3 there is yet another VBAR-like statement that enables you to combine bar charts with one or more other plots. The new VBARBASIC and HBARBASIC statements create a bar chart that is compatible with basic plots such as scatter plots, series plots, and box plots. These statements can summarize raw data like the VBAR statement can. In other words, if you use the VBARBASIC and HBARBASIC statements on raw data, the counts are computed automatically. However, you can also use the statements on pre-summarized data: specify the height of the bars by using the RESPONSE= option.

In most situations it doesn't make sense to overlay a continuous curve on a discrete bar chart, which is why the SG routines have the concept of compatible plot types. However, there is a canonical example in elementary statistics that combines continuous and discrete data: the normal approximation to the binomial distribution.

Recall that if X is the number of successes in *n* independent trials for which the probability of success is *p*, then X is binomially distributed: X ~ Binom(*n, p*). A well-known rule says that if *np* > 5 and *n*(1-*p*) > 5, then the binomial distribution is approximated by a normal distribution with mean *np* and standard deviation sqrt(*np*(1-*p*)).

This rule is often illustrated by overlaying the continuous normal PDF on a bar chart that shows the binomial distribution, as shown to the left.
To create this plot, I used the VBARBASIC statement to create the bar chart. Because the VBARBASIC statement creates a "basic plot," you can combine it with another basic plot, such as the line plot created by using a SERIES statement. For fun, I used an INSET statement to overlay a box of parameter values for the graph.
The graph shows that the binomial probability at *j* is approximated by the area under the normal density curve on the interval [*j*-0.5, *j*+0.5].
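You can verify that approximation numerically. The following Python sketch (my own cross-check, using SciPy rather than SAS) compares the binomial probability at an integer *j* with the normal area on [*j*-0.5, *j*+0.5]:

```python
from scipy.stats import binom, norm

n, p = 25, 0.25
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5   # 6.25 and about 2.17

j = 6                                          # an arbitrary integer near the mean
exact = binom.pmf(j, n, p)                     # binomial probability at j
approx = norm.cdf(j + 0.5, mu, sigma) - norm.cdf(j - 0.5, mu, sigma)
print(exact, approx)                           # the two values agree to about 0.001
```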

The following SAS statements use the PDF function to evaluate the binomial probabilities and the normal density for the graph. The values for μ and σ are stored in macro variables for later use.

%let p = 0.25;                    /* probability of success */
%let n = 25;                      /* number of trials */
data Binom;
n = &n;  p = &p;  q = 1 - p;
mu = n*p;  sigma = sqrt(n*p*q);   /* parameters for the normal approximation */
Lower = mu - 3.5*sigma;           /* evaluate normal density on [Lower, Upper] */
Upper = mu + 3.5*sigma;
/* PDF of normal distribution */
do t = Lower to Upper by sigma/20;
   Normal = pdf("normal", t, mu, sigma);
   output;
end;
/* PMF of binomial distribution */
t = .;  Normal = .;               /* these variables are not used for the bar chart */
do j = max(0, floor(Lower)) to ceil(Upper);
   Binomial = pdf("Binomial", j, &p, &n);
   output;
end;
call symput("mu", strip(mu));     /* store mu and sigma in macro variables */
call symput("sigma", strip(round(sigma,0.01)));
label Binomial="Binomial Probability"  Normal="Normal Density";
keep t Normal j Binomial;
run;

The preceding DATA step evaluates the Binom(25, 0.25) probability for the integers *j*=0, 1, ..., 14.
It evaluates the N(6.25, 2.17) PDF on the interval [-1.3, 13.8]. The following call to PROC SGPLOT uses the VBARBASIC statement to overlay the bar chart and the density curve:

title "Binomial Probability and Normal Approximation";
proc sgplot data=Binom;
   vbarbasic j / response=Binomial barwidth=1;   /* requires SAS 9.4M3 */
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "n = &n"  "p = &p"  "q = %sysevalf(1-&p)"
         "(*ESC*){unicode mu} = np = &mu"        /* use Greek letters */
         "(*ESC*){unicode sigma} = sqrt(npq) = &sigma" / position=topright border;
   yaxis label="Probability";
   xaxis label="x" integer type=linear;          /* force TYPE=LINEAR */
run;

The TYPE=LINEAR option on the XAXIS statement tells the horizontal axis to use interval tick marks. The BARWIDTH=1 option on the VBARBASIC statement makes the bar chart look more like a histogram by eliminating the gaps between bars. The graph is shown at the top of this section.

If you are content to show only the height of the binomial probability mass function (PMF), you can use
an alternative visualization. The following graph shows a needle plot (the binomial PMF) overlaid with a normal PDF.
This visualization does not require 9.4M3. The SGPLOT statements are the same as before, except the binomial probabilities are represented by using the NEEDLE statement: `needle x=j y=Binomial / markers;`

tags: 14.1, 9.4, Statistical Graphics


The post Coverage probability of confidence intervals: A simulation approach appeared first on The DO Loop.

Recall that a confidence interval (CI) is an interval estimate that contains the population parameter with a specified probability.
Because the CI is an *estimate*, it is computed from a sample. A confidence interval for a parameter is derived by knowing (or approximating) the sampling distribution of a statistic. For symmetric sampling distributions, the CI often has the form *m* ± w(α, *n*), where *m* is an unbiased estimate of the parameter and w(α, *n*) is a width that depends on the significance level α, the sample size *n*, and the standard error of the estimate.

Due to sampling variation, the confidence interval for a particular sample might not contain the parameter. A 95% confidence interval means that if you generate a large number of samples and construct the corresponding confidence intervals, then about 95% of the intervals will contain (or "cover") the parameter.

For example, a well-known formula is the confidence interval of the mean. If the population is normally distributed, then a 95% confidence interval for the population mean, computed from a sample of size *n*, is

[ *xbar* – *t*_{c} *s* / sqrt(*n*),
*xbar* + *t*_{c} *s* / sqrt(*n*) ]

where

- *xbar* is the sample mean
- *t*_{c} = *t*_{1-α/2, n-1} is the critical value of the *t* statistic with significance α and *n*-1 degrees of freedom
- *s* / sqrt(*n*) is the standard error of the mean, where *s* is the sample standard deviation.
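For readers who want to experiment outside SAS, here is a small Python sketch of the same formula (the helper name and the tiny data set are mine, chosen for illustration):

```python
import numpy as np
from scipy import stats

def t_confidence_interval(x, alpha=0.05):
    """CI of the mean: xbar +/- t_{1-alpha/2, n-1} * s / sqrt(n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)            # standard error of the mean
    tc = stats.t.ppf(1 - alpha / 2, df=n - 1)  # critical t value
    return xbar - tc * se, xbar + tc * se

lo, hi = t_confidence_interval([1, 2, 3, 4, 5])
print(lo, hi)   # roughly (1.04, 4.96)
```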


The preceding discussion leads to the simulation method for estimating the coverage probability of a confidence interval. The simulation method has three steps:

- Simulate many samples of size *n* from the population.
- Compute the confidence interval for each sample.
- Compute the proportion of samples for which the (known) population parameter is contained in the confidence interval. That proportion is an estimate for the empirical coverage probability for the CI.
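The three steps translate directly into a few lines of Python (a sketch under my own naming; the post itself uses SAS):

```python
import numpy as np
from scipy import stats

def estimate_coverage(sample_fn, true_mean, n=50, num_samples=10000,
                      alpha=0.05, seed=123):
    """Fraction of t-based CIs, computed from simulated samples, that cover true_mean."""
    rng = np.random.default_rng(seed)
    tc = stats.t.ppf(1 - alpha / 2, df=n - 1)
    hits = 0
    for _ in range(num_samples):
        x = sample_fn(rng, n)                             # step 1: simulate a sample
        xbar = x.mean()
        se = x.std(ddof=1) / np.sqrt(n)
        lower, upper = xbar - tc * se, xbar + tc * se     # step 2: compute the CI
        hits += (lower <= true_mean <= upper)             # step 3: does the CI cover?
    return hits / num_samples

cov = estimate_coverage(lambda rng, n: rng.normal(size=n), true_mean=0.0)
print(cov)   # close to the nominal 0.95 for normal data
```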

You might wonder why this is necessary. Isn't the coverage probability always (1-α) = 0.95? No, that is only true when the population is normally distributed (which is never true in practice) or the sample sizes are large enough that you can invoke the Central Limit Theorem. Simulation enables you to estimate the coverage probability for small samples when the population is not normal. You can simulate from skewed or heavy-tailed distributions to see how skewness and kurtosis affect the coverage probability. (See Chapter 16 of *Simulating Data with SAS*.)

Let's use simulation
to verify that the formula for a CI of the mean is valid when you draw samples from a standard normal population.
The following DATA step simulates 10,000 samples of size *n*=50:

%let N = 50;                /* size of each sample */
%let NumSamples = 10000;    /* number of samples */
/* 1. Simulate samples from N(0,1) */
data Normal(keep=SampleID x);
call streaminit(123);
do SampleID = 1 to &NumSamples;   /* simulation loop */
   do i = 1 to &N;                /* N obs in each sample */
      x = rand("Normal");         /* x ~ N(0,1) */
      output;
   end;
end;
run;

The second step is to compute the confidence interval for each sample. You can use PROC MEANS to compute the confidence limits. The LCLM= and UCLM= options write the lower and upper endpoints of the confidence interval to a SAS data set. I also output the sample mean for each sample. Notice that the BY statement is an efficient way to analyze all samples in a simulation study.

/* 2. Compute statistics for each sample */
proc means data=Normal noprint;
   by SampleID;
   var x;
   output out=OutStats mean=SampleMean lclm=Lower uclm=Upper;
run;

The third step is to count the proportion of samples for which the confidence interval contains the value of the parameter. For this simulation study, the value of the population mean is 0. The following DATA step creates an indicator variable that has the value 1 if 0 is within the confidence interval for a sample, and 0 otherwise. You can then use PROC FREQ to compute the proportion of intervals that contain the mean. This is the empirical coverage probability. If you want to get fancy, you can even use the BINOMIAL option to compute a confidence interval for the proportion.

/* 3a. How many CIs include parameter? */
data OutStats;
set OutStats;
label ParamInCI = "Parameter in CI";
ParamInCI = (Lower<0 & Upper>0);   /* indicator variable */
run;

/* 3b. Nominal coverage probability is 95%. Estimate true coverage. */
proc freq data=OutStats;
   tables ParamInCI / nocum binomial(level='1' p=0.95);
run;

The output from PROC FREQ tells you that the empirical coverage (based on 10,000 samples) is 94.66%, which is very close to the theoretical value of 95%. The output from the BINOMIAL option estimates that the true coverage is in the interval [0.9422,0.951], which includes 0.95. Thus the simulation supports the assertion that the standard CI of the mean has 95% coverage when a sample is drawn from a normal population.

You can draw a graph that shows how the confidence intervals depend on the random samples. The following graph shows the confidence intervals for 100 samples. The center of each CI is the sample mean.

proc format;                 /* display 0/1 as "No"/"Yes" */
   value YorN 0="No" 1="Yes";
run;

ods graphics / width=6.5in height=4in;
proc sgplot data=OutStats(obs=100);
   format ParamInCI YorN.;
   title "95% Confidence Intervals for the Mean";
   title2 "Normal Data";
   scatter x=SampleID y=SampleMean / group=ParamInCI markerattrs=(symbol=CircleFilled);
   highlow x=SampleID low=Lower high=Upper / group=ParamInCI legendlabel="95% CI";
   refline 0 / axis=y;
   yaxis display=(nolabel);
run;

The reference line shows the mean of the population. Samples for which the population mean is inside the confidence interval are shown in blue. Samples for which the population mean is not inside the confidence interval are shown in red.

You can see how sample variability affects the confidence intervals. In four random samples (shown in red) the values in the sample are so extreme that the confidence interval does not include the population mean. Thus the estimate of the coverage probability is 96/100 = 96% for these 100 samples. This graph shows why the term "coverage probability" is used: it is the probability that one of the vertical lines in the graph will "cover" the population mean.

The previous simulation confirms that the empirical coverage probability of the CI is 95% for normally distributed data. You can use simulation to understand how that probability changes if you sample from nonnormal data. For example, in the DATA step that simulates the samples, replace the call to the RAND function with the following line:

x = rand("Expo") - 1;   /* x + 1 ~ Exp(1) */

You can then rerun the simulation study. This time the samples are drawn from a (shifted) exponential distribution that has mean 0 and unit variance. The skewness for this distribution is 2 and the excess kurtosis is 6. The result from PROC FREQ is that only about 93.5% of the confidence intervals (using the standard formula) cover the true population mean. Consequently, the formula for the CI, which has 95% coverage for normal data, only has about 93.5% coverage for this exponential data.
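The same experiment is easy to replicate outside SAS. In this Python sketch (my own cross-check, not the post's code), the only change from the normal case is the sampling line, and the estimated coverage drops below the nominal 95%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n, reps = 50, 10000
tc = stats.t.ppf(0.975, df=n - 1)
hits = 0
for _ in range(reps):
    x = rng.exponential(size=n) - 1.0    # shifted Exp(1): mean 0, unit variance
    xbar = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)
    hits += (xbar - tc * se <= 0.0 <= xbar + tc * se)
print(hits / reps)   # typically about 0.93-0.94, below the nominal 0.95
```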

You can create a graph that visualizes the confidence intervals for the exponential data. Again, only the first 100 samples are shown. In this graph, the CIs for nine samples do not contain the population mean, which implies a 91% empirical coverage.

You can also write a SAS/IML program. An example of using SAS/IML to estimate the coverage probability of a confidence interval is posted on the SAS/IML File Exchange.

In summary, you can use simulation to estimate the empirical coverage probability for a confidence interval. In many cases the formula for a CI is based on an assumption about the population distribution, which determines the sampling distribution of the statistic. Simulation enables you to explore how the coverage probability changes when the population does not satisfy the theoretical assumptions.

tags: Simulation, Statistical Thinking


The post Graph a step function in SAS appeared first on The DO Loop.

In statistics, the canonical step function is the empirical cumulative distribution function. Given a set of data values
*x*_{1} ≤ *x*_{2} ≤ ... ≤ *x*_{n}
the empirical distribution function (ECDF) is the step function defined by

F(t) = (number of data values ≤ t) / *n*.

Notice that an ECDF is constant on each half-open interval
[*x*_{i}, *x*_{i+1}).
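As a concrete illustration, the definition of F(t) is one line in Python (a sketch of mine, using the nine data values from the example in this post; the helper name is hypothetical):

```python
import numpy as np

def ecdf(data, t):
    """F(t) = (number of data values <= t) / n."""
    data = np.sort(np.asarray(data, dtype=float))
    return np.searchsorted(data, t, side="right") / len(data)

x = [3.7, 1.0, 2.2, 4.1, 5.0, 1.9, 3.0, 2.2, 4.1]
print(ecdf(x, 2.2))   # 4/9, because four of the nine values are <= 2.2
```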

The easiest way to visualize the ECDF in SAS is to use the CDFPLOT statement in PROC UNIVARIATE. The following DATA step creates a set of nine observations. Two values are repeated. The call to PROC UNIVARIATE creates a graph of the empirical CDF:

data A;
input x @@;
datalines;
3.7 1.0 2.2 4.1 5.0 1.9 3.0 2.2 4.1
;
ods select cdfplot;
proc univariate data=A;
   cdfplot x / vscale=proportion
               odstitle="Empirical CDF" odstitle2="PROC UNIVARIATE";
   ods output cdfplot=outCDF;   /* data set contains ECDF values */
run;

The ECDF jumps by 1/*n* = 1/9 at each sorted data value. Because the values 2.2 and 4.1 appear twice in the data, the ECDF jumps by 2/9 at those data values. The ECDF is 0 for any point less than the minimum data value; it is 1 for any point greater than or equal to the maximum data value.

In the previous call to PROC UNIVARIATE, the ODS OUTPUT statement writes a SAS data set that contains the data values in sorted order and the value of the ECDF at each data value. You can use this output data set and the STEP statement in PROC SGPLOT to create your own graph of the ECDF. This gives you complete control over colors, labels, background grids, and other graphical attributes. You can also overlay other plots on the ECDF. For example, the following call to PROC SGPLOT creates an ECDF, adds a background grid, and overlays a fringe plot that shows individual data values:

title "Empirical CDF";
title2 "STEP and FRINGE Statements";
proc sgplot data=outCDF noautolegend;
   step x=ECDFX y=ECDFY;   /* variable names created by PROC UNIVARIATE */
   fringe ECDFX;
   xaxis grid label="x" offsetmin=0.05 offsetmax=0.05;
   yaxis grid min=0 label="Cumulative Proportion";
run;

For the ECDF, we used PROC UNIVARIATE to create a data set that contains the (X,Y) coordinates of each "corner" in the plot. For a general discontinuous function, you need to create a similar data set manually. If you can create the data set, you can use the STEP statement to visualize an arbitrary piecewise constant function.

Some users might want to omit the vertical lines in the graph in order to emphasize the discontinuous nature of the function. For example, the adjacent graph is an alternative approach to visualizing the ECDF. This graph emphasizes that the function is constant on intervals that are closed on the left and open on the right.

You can use the VECTOR statement in PROC SGPLOT to generate this graph. The VECTOR statement draws a line between two arbitrary points. The output from PROC UNIVARIATE provides the X and Y values for the plot, but you need to modify the data slightly because the VECTOR statement needs four variables: the starting and ending coordinates of each line segment.

The adjacent table shows the shape of the data that is suitable for graphing with the VECTOR statement. You can see that the LAG function is useful for generating the xL column from the xR column. The yL and yR columns are identical except for the first observation because the ECDF is piecewise constant.
The VECTOR statement connects the points (*x*_{L}, *y*_{L}) and (*x*_{R}, *y*_{R}), where the "L" subscript refers to left-hand endpoints and the "R" subscript refers to right-hand endpoints.

/* CDF is a step function. Each interval [x_i, x_{i+1}) is closed
   on the left and open on the right */
title "Empirical CDF";
title2 "VECTOR Statement";
proc sgplot data=ECDF noautolegend;
   vector x=xR y=yR / xorigin=xL yorigin=yL noarrowheads;
   scatter x=xL y=yL / markerattrs=(symbol=CircleFilled color=black);       /* closed */
   scatter x=xR y=yR / filledoutlinedmarkers markerfillattrs=(color=white)  /* open */
                       markerattrs=(symbol=CircleFilled color=black);
   xaxis grid label="x";
   yaxis grid min=0 max=1 label="Quantile";
run;

In conclusion, there are several ways to visualize an empirical CDF in SAS. You can also visualize the graph of a general piecewise constant function. You can choose either the STEP statement or the VECTOR statement in PROC SGPLOT to visualize the graph.


The post The Lambert W function in SAS appeared first on The DO Loop.

This article describes how you can evaluate the Lambert W function in SAS/IML software. The Lambert W function is defined implicitly: given a real value x, the function's value w = W(x) is the value of w that satisfies the equation w exp(w) = x. Thus W is the inverse of the function *g*(w) = w exp(w).

Because the function *g* has a minimum value at w=-1, there are two branches to the Lambert W function. Both branches are shown in the adjacent graph. The top branch is called the principal (or positive) branch and is denoted W_{p}. The principal branch has domain x ≥ -1/e and range W(x) ≥ -1. The lower branch, denoted by W_{m}, is defined on -1/e ≤ x < 0 and has range W(x) ≤ -1. The subscript "m" stands for "minus" since the range contains only negative values. Some authors use W_{0} for the upper branch and W_{-1} for the lower branch.
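If you want to check values against an independent implementation, SciPy exposes both branches through the `k` argument of `scipy.special.lambertw` (k=0 is the principal branch W_{p}; k=-1 is the lower branch W_{m}). This snippet is my own cross-check, not part of the original post:

```python
import numpy as np
from scipy.special import lambertw

x = -0.2                       # a point in [-1/e, 0), where both branches exist
wp = lambertw(x, k=0).real     # principal branch, W_p(x) >= -1
wm = lambertw(x, k=-1).real    # lower branch,     W_m(x) <= -1
print(wp, wm)                  # both values satisfy w * exp(w) = x
```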

Both branches are used in applications, but the lower branch W_{m} is useful for the statistical programmer because you can use it to simulate data from heavy-tailed distributions. Briefly: for a class of heavy-tailed distributions, you can simulate data by using the inverse CDF method, which requires evaluating the Lambert W function.

You can download the SAS/IML functions that evaluate each branch of the Lambert W function. Both functions use an analytical approximation for the Lambert W function, followed by one or two iterations of Halley's method to ensure maximum precision:

- The LambertWp function implements the principal branch W_{p}. The algorithm constructs a direct global approximation on [-1/e, ∞) based on Boyd (1998).
- The LambertWm function implements the second branch W_{m}. The algorithm follows Chapeau-Blondeau and Monir (2002) (hereafter CBM). The CBM algorithm approximates W_{m}(x) on three different domains. The approximation uses a series expansion on [-1/e, -0.333), a rational-function approximation on [-0.333, -0.033], and an asymptotic series expansion on (-0.033, 0).
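Halley's method applied to f(w) = w·exp(w) - x has a simple closed-form update. The following Python sketch is my own minimal version of that refinement step (it is not the downloadable SAS/IML code), using f'(w) = e^w(w+1) and f''(w) = e^w(w+2):

```python
import math

def halley_lambertw(x, w0, iters=4):
    """Refine a starting guess w0 for W(x) by Halley's method on f(w) = w*e^w - x."""
    w = w0
    for _ in range(iters):
        ew = math.exp(w)
        f = w * ew - x
        # Halley update: w - f / (f' - f*f''/(2*f')), which simplifies to:
        w = w - f / (ew * (w + 1) - (w + 2) * f / (2 * w + 2))
    return w

w = halley_lambertw(10.0, math.log(10.0))   # crude guess log(x), then refine
print(w * math.exp(w))                      # recovers 10 to near machine precision
```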

Download the SAS/IML program for this article and save it to a file called LambertW.sas. Then the following SAS/IML program shows how to call the functions for the upper and lower branches and plot the graphs:

%include "C:\MyPath\LambertW.sas";   /* specify path to file */
proc iml;
load module=(LambertWp LambertWm);

title "Lambert W Function: Principal Branch";
x = do(-1/exp(1), 1, 0.01);
Wp = LambertWp(x);
call series(x, Wp) grid={x y} other="refline 0/axis=x; refline 0/axis=y";

title "Lambert W Function: Lower Branch";
x = do(-1/exp(1), -0.01, 0.001);
Wm = LambertWm(x);
call series(x, Wm) grid={x y};

The graphs are not shown because they are indistinguishable from the graph at the beginning of the article.

You can use a simple error analysis to investigate the accuracy of these functions. After computing w = W(x), apply the inverse function *g*(w) = w exp(w) and compare the result with x:

x = do(-1/exp(1), -0.01, 0.001);
Wm = LambertWm(x);      /* default precision: 1 Halley iteration */
g = Wm # exp(Wm);       /* apply inverse function */
maxError1 = max(abs(x - g));   /* maximum error */

Wm = LambertWm(x, 2);   /* higher precision: 2 Halley iterations */
g = Wm # exp(Wm);
maxError2 = max(abs(x - g));
print maxError1 maxError2;

In summary, the SAS/IML language provides an excellent environment for implementing new functions or statistical procedures that are not built into SAS. In this article I implemented methods from journal articles for evaluating the upper and lower branches of the Lambert W function, which is the inverse function of *g*(w) = w exp(w). Although the Lambert W function does not have a closed-form solution, you can implement an approximation to the function and then use one or two Halley iterations to quickly converge within machine precision.

tags: Numerical Analysis


The post Weighted percentiles appeared first on The DO Loop.

I do not discuss survey data in this article. Survey statisticians use weights to make valid inferences in survey data, and you can see the SAS documentation to learn about how to use weights to estimate variance in complex survey designs.


Before we calculate a weighted statistic, let's remember that a weight variable is not the same as a frequency variable. A frequency variable, which associates a positive integer with each observation, specifies that each observation is replicated a certain number of times. There is nothing unintuitive about the statistics that arise from including a frequency variable. They are the same statistics that you would obtain by duplicating each record according to the value of the frequency variable.

Weights are not frequencies. Weights can be fractional values. When comparing weighted and unweighted analyses, the key idea is this: an unweighted analysis is equivalent to a weighted analysis for which the weights are all 1. An "unweighted analysis" is really a misnomer; it should be called an "equally weighted" analysis!

In the computational formulas that SAS uses for weighted percentiles, the weights are divided by the sum of the weights. Therefore only relative weights are important, and the formulas simplify if you choose weights that sum to 1. For the remainder of this article, assume that the weights sum to unity and that an unweighted analysis has weights equal to 1/*n*.

To understand how weights change the computation of percentiles, let's review the standard unweighted computation of the empirical percentiles (or quantiles) of a set of *n* numbers. First, sort the data values from smallest to largest. Then construct the empirical cumulative distribution function (ECDF). Recall that the ECDF is a piecewise-constant step function that increases by 1/*n* at each data point. The quantity 1/*n* represents the fact that each observation is weighted equally in this analysis.

The quantile function is derived from the CDF function, and
the quantile function for a discrete distribution is also a step function.
You can use the graph of the ECDF to compute the quantiles. For example, suppose your data are

{ 1, 1.9, 2.2, 3, 3.7, 4.1, 5 }

The following graph shows the ECDF for these seven values:

The data values are indicated by tick marks along the horizontal axis. Notice that the ECDF jumps by 1/7 at each data value because there are seven unique values.

I should really show you the graph of the quantile function (an "inverse function" to the CDF), but you can visualize the graph of the quantile function if you rotate your head clockwise by 90 degrees. To find a quantile, start at some value on the Y axis, move across until you hit a vertical line, and then drop down to the X axis to find the datum value. For example, to find the 0.2 quantile (=20th percentile), start at Y=0.2 and move right horizontally until you hit the vertical line over the datum 1.9. Thus 1.9 is the 20th percentile. Similarly, the 0.6 quantile is the data value 3.7. (I omit details about what to do if you hit a horizontal line.)

Of course, SAS can speed up this process. The following call to PROC MEANS displays the 20th, 40th, 60th, and 80th percentiles:

data A;
input x wt;
datalines;
1   0.25
1.9 0.05
2.2 0.15
3.0 0.25
3.7 0.15
4.1 0.10
5   0.05
;
proc means data=A p20 p40 p60 p80;
   var x;   /* unweighted analysis: data only */
run;

The previous section shows the relationship between percentile values and the graph of the ECDF. This section describes how the ECDF changes if you specify unequal weights for the data. The change is that the weighted ECDF will jump by the (standardized) weight at each data value. Because the weights sum to unity, the CDF is still a step function that rises from 0 to 1, but now the steps are not uniform in height. Instead, data that have relatively large weights produce a large step in the graph of the ECDF function.

In the previous section, the DATA step defined a weight variable. The weights for this example are

{ 0.25, 0.05, 0.15, 0.25, 0.15, 0.10, 0.05 }

The following graph shows the weighted ECDF for these weights:

By using this weighted ECDF, you can read off the weighted quantiles. For example, to find the 0.2 weighted quantile, start at Y=0.2 and move horizontally until you hit a vertical line, which is over the datum 1.0. Thus 1.0 is the 20th weighted percentile. Similarly, the 0.6 quantile is the data value 3.0. You can confirm this by calling PROC MEANS with a WEIGHT statement, as shown below:
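In Python, this way of reading quantiles off the weighted ECDF takes only a few lines. This sketch is mine (the helper name is hypothetical), and it uses the left-continuous "smallest x whose cumulative weight reaches p" convention described above; note that SAS supports several percentile definitions, so other conventions exist:

```python
import numpy as np

def weighted_percentile(x, w, p):
    """Smallest x whose cumulative standardized weight is >= p."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    cumw = np.cumsum(w) / w.sum()            # heights of the weighted ECDF
    return x[np.searchsorted(cumw, p, side="left")]

x = [1, 1.9, 2.2, 3.0, 3.7, 4.1, 5]
w = [0.25, 0.05, 0.15, 0.25, 0.15, 0.10, 0.05]
print(weighted_percentile(x, w, 0.2))   # 1.0
print(weighted_percentile(x, w, 0.6))   # 3.0
```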

proc means data=A p20 p40 p60 p80;
   weight wt;   /* weighted analysis */
   var x;
run;

You can use a physical model to intuitively understand weighted percentiles. The model is the same as I used to
visualize a weighted mean. Namely, imagine a point-mass of *w*_{i} concentrated at position *x*_{i} along a massless rod. Finding a weighted percentile *p* is equivalent to finding the first location along the rod (moving from left to right) at which the proportion of the weight is greater than *p*. (I omit how to handle special percentiles for which the proportion is equal to *p*.)

The physical model looks like the following:

From the figure you can see that *x*_{1} is the *p*^{th} quantile for *p* < 0.25. Similarly, *x*_{2} is the *p*^{th} quantile for 0.25 < *p* < 0.30, and so forth.

If you want to apply these concepts to your own data, you can download the SAS program that generates the CDF graphs and computes the weighted percentiles.

