The post The jackknife method to estimate standard errors in SAS appeared first on The DO Loop.
One way to assess the precision of a statistic (a point estimate) is to compute the standard error, which is the standard deviation of the statistic's sampling distribution. A relatively large standard error indicates that the point estimate should be viewed with skepticism, either because the sample size is small or because the data themselves have a large variance. The jackknife method is one way to estimate the standard error of a statistic.
Some simple statistics have explicit formulas for the standard error, but the formulas often assume normality of the data or a very large sample size. When your data do not satisfy the assumptions or when no formula exists, you can use resampling techniques to estimate the standard error. Bootstrap resampling is one choice, and the jackknife method is another. Unlike the bootstrap, which uses random samples, the jackknife is a deterministic method.
This article explains the jackknife method and describes how to compute jackknife estimates in SAS/IML software. This approach works best when the statistic that you need is also implemented in SAS/IML. If the statistic is computed by a SAS procedure, you might prefer to download and use the %JACK macro, which does not require SAS/IML.
The jackknife method estimates the standard error (and bias) of statistics without making any parametric assumptions about the population that generated the data. It uses only the sample data.
The jackknife method manufactures jackknife samples from the data.
A jackknife sample is a "leave-one-out" resample of the data. If there are n observations, then there are n jackknife samples, each of size n-1. If the original data are
{x_{1}, x_{2},..., x_{n}},
then the i-th jackknife sample is
{x_{1},..., x_{i-1},x_{i+1},..., x_{n}}
You then compute n jackknife replicates. A jackknife replicate is the statistic of interest computed on a jackknife sample. You can obtain an estimate of the standard error from the variance of the jackknife replicates. The jackknife method is summarized by the following steps:
1. Compute the statistic of interest, T, on the observed data.
2. For i = 1 to n, compute the i-th jackknife replicate, T_{i}, on the i-th leave-one-out sample.
3. Compute the average of the replicates, T_{Avg}.
4. The jackknife estimate of the bias is (n-1)(T_{Avg} - T).
5. The jackknife estimate of the standard error is sqrt( (n-1)/n Σ_{i} (T_{i} - T_{Avg})^2 ).
Resampling methods are not hard, but the notation in some books can be confusing. To clarify the method, let's choose a particular statistic and look at example data. The following example is from Martinez and Martinez (2001, 1st Ed, p. 241), which is also the source for this article. The data are the LSAT scores and grade-point averages (GPAs) for 15 randomly chosen students who applied to law school.
data law;
input lsat gpa @@;
datalines;
653 3.12  576 3.39  635 3.30  661 3.43  605 3.13
578 3.03  572 2.88  545 2.76  651 3.36  555 3.00
580 3.07  594 2.96  666 3.44  558 2.81  575 2.74
;
The statistic of interest (T) will be the correlation coefficient between the LSAT and the GPA variables for the n=15 observations. The observed correlation is T_{Data} = 0.776. The standard error of T helps us understand how much T would change if we took a different random sample of 15 students. The next sections show how to implement the jackknife analysis in the SAS/IML language.
The SAS/IML matrix language is the simplest way to perform a general jackknife analysis. If X is an n x p data matrix, you can obtain the i-th jackknife sample by excluding the i-th row of X. The following two helper functions encapsulate some of the computations. The SeqExclude function returns the index vector {1, 2, ..., i-1, i+1, ..., n}. The JackSamp function returns the data matrix without the i-th row:
proc iml;
/* return the vector {1,2,...,i-1, i+1,...,n}, which excludes the scalar value i */
start SeqExclude(n, i);
   if i=1 then return 2:n;
   if i=n then return 1:n-1;
   return (1:i-1) || (i+1:n);
finish;

/* return the i-th jackknife sample for the (n x p) matrix X */
start JackSamp(X, i);
   n = nrow(X);
   return X[ SeqExclude(n, i), ];   /* return data without the i-th row */
finish;
By using the helper functions, you can carry out each step of the jackknife method. To make the method easy to modify for other statistics, I've written a function called EvalStat that computes the correlation coefficient. This function is called on the original data and on each jackknife sample.
/* compute the statistic in this function */
start EvalStat(X);
   return corr(X)[2,1];   /* <== Example: return correlation between two variables */
finish;

/* read the data into an (n x 2) data matrix */
use law;  read all var {"gpa" "lsat"} into X;  close;

/* 1. compute statistic on observed data */
T = EvalStat(X);

/* 2. compute same statistic on each jackknife sample */
n = nrow(X);
T_LOO = j(n,1,.);                /* LOO = "Leave One Out" */
do i = 1 to n;
   Y = JackSamp(X, i);
   T_LOO[i] = EvalStat(Y);
end;

/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO );

/* 4 & 5. compute jackknife estimates of bias and standard error */
biasJack = (n-1)*(T_Avg - T);
stdErrJack = sqrt( (n-1)/n * ssq(T_LOO - T_Avg) );
result = T || T_Avg || biasJack || stdErrJack;
print result[c={"Estimate" "Mean Jackknife Estimate" "Bias" "Std Error"}];
The output shows that the estimate of bias for the correlation coefficient is very small. The standard error of the correlation coefficient is estimated as 0.14, which is about 18% of the estimate.
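To double-check the arithmetic, the same leave-one-out steps can be reproduced in any language. The following Python sketch (an independent cross-check, not part of the SAS program) applies the jackknife formulas to the law-school data:

```python
import math

lsat = [653, 576, 635, 661, 605, 578, 572, 545, 651, 555, 580, 594, 666, 558, 575]
gpa  = [3.12, 3.39, 3.30, 3.43, 3.13, 3.03, 2.88, 2.76, 3.36, 3.00, 3.07, 2.96, 3.44, 2.81, 2.74]

def corr(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u)/n, sum(v)/n
    suv = sum((a-mu)*(b-mv) for a, b in zip(u, v))
    su  = math.sqrt(sum((a-mu)**2 for a in u))
    sv  = math.sqrt(sum((b-mv)**2 for b in v))
    return suv/(su*sv)

n = len(lsat)
T = corr(lsat, gpa)                  # statistic on the observed data (about 0.776)
# leave-one-out replicates
T_loo = [corr(lsat[:i]+lsat[i+1:], gpa[:i]+gpa[i+1:]) for i in range(n)]
T_avg = sum(T_loo)/n
bias   = (n-1)*(T_avg - T)           # jackknife bias estimate (very small)
stderr = math.sqrt((n-1)/n * sum((t-T_avg)**2 for t in T_loo))
print(round(T, 3), round(stderr, 2))
```

The computed standard error agrees with the value reported by the SAS/IML program.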
To use this code yourself, simply modify the EvalStat function. The remainder of the program does not need to change.
When the data are univariate, you can sometimes eliminate the loop that computes jackknife samples and evaluates the jackknife replicates. If X is a column vector, you can compute the (n-1) x n matrix whose i-th column represents the i-th jackknife sample. (To prevent huge matrices, this method is best for n < 20000.) Because many statistical functions in SAS/IML operate on the columns of a matrix, you can often compute the jackknife replicates in a vectorized manner.
In the following program, the JackSampMat function returns the matrix of jackknife samples for univariate data. The function calls the REMOVE function in SAS/IML, which deletes specified elements of a matrix and returns the results in a row vector. The EvalStatMat function takes the matrix of jackknife samples and returns a row vector of statistics, one for each column. In this example, the function returns the sample standard deviation.
/* If x is univariate, you can construct a matrix where each column
   contains a jackknife sample. Use for a univariate column vector x when n < 20000 */
start JackSampMat(x);
   n = nrow(x);
   B = j(n-1, n, 0);
   do i = 1 to n;
      B[,i] = remove(x, i)`;   /* transpose to column vector */
   end;
   return B;
finish;

/* Input: matrix where each column of x is a jackknife sample.
   Return a row vector of statistics, one for each column. */
start EvalStatMat(x);
   return std(x);   /* <== Example: return std dev of each sample */
finish;
Let's use these functions to get a jackknife estimate of the standard error for the statistic (the standard deviation). The data (from Martinez and Martinez, p. 246) have been studied by many researchers and represent the weight gain in grams for 10 rats who were fed a low-protein diet of cereal:
x = {58,67,74,74,80,89,95,97,98,107};   /* weight gain (g) for 10 rats */
n = nrow(x);

/* optional: visualize the matrix of jackknife samples */
*M = JackSampMat(x);
*print M[c=("S1":"S10") r=("1":"9")];

/* Jackknife method for univariate data */
/* 1. compute observed statistic */
T = EvalStatMat(x);
/* 2. compute same statistic on each jackknife sample */
T_LOO = EvalStatMat( JackSampMat(x) );   /* LOO = "Leave One Out" */
/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO` );                  /* transpose T_LOO */
/* 4 & 5. compute jackknife estimates of bias and standard error */
biasJack = (n-1)*(T_Avg - T);
stdErrJack = sqrt( (n-1)/n * ssq(T_LOO - T_Avg) );
result = T || T_Avg || biasJack || stdErrJack;
print result[c={"Estimate" "Mean Jackknife Estimate" "Bias" "Std Error"}];
The output shows that the standard deviation of these data is about 15.7 grams. The jackknife method estimates that the standard error for this statistic is about 2.9 grams, which is about 18% of the estimate.
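As a cross-check of this result (again outside SAS), the same jackknife formulas applied to the weight-gain data take only a few lines of Python:

```python
import math

x = [58, 67, 74, 74, 80, 89, 95, 97, 98, 107]   # weight gain (g) for 10 rats

def std(v):
    """Sample standard deviation (denominator n-1)."""
    n = len(v)
    m = sum(v)/n
    return math.sqrt(sum((a-m)**2 for a in v)/(n-1))

n = len(x)
T = std(x)                                        # observed statistic: about 15.7
T_loo = [std(x[:i] + x[i+1:]) for i in range(n)]  # leave-one-out replicates
T_avg = sum(T_loo)/n
stderr = math.sqrt((n-1)/n * sum((t-T_avg)**2 for t in T_loo))
print(round(T, 2), round(stderr, 2))
```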
In summary, jackknife estimates are straightforward to implement in SAS/IML. This article shows a general implementation that works for all data and a specialized implementation that works for univariate data. In both cases, you can adapt the code for your use by modifying the function that computes the statistic on a data set. This approach is useful and efficient when the statistic is implemented in SAS/IML.
The post How to find a feasible point for a constrained optimization in SAS appeared first on The DO Loop.
The NLPFEA routine returns a point in the feasible region from an arbitrary starting guess. Suppose the problem has p (possibly bounded) parameters and the feasible region is formed by k > 0 additional linear constraints. Then you can represent the feasible region by a (k+2) x (p+2) matrix, which is the representation used for linear programming and constrained nonlinear optimization. The first row specifies the lower bounds for the parameters and the second row specifies the upper bounds. (Missing values in the first p columns indicate that a parameter is not bounded.) The remaining k rows specify the linear constraints. For example, the following matrix defines a pentagonal region for two parameters. You can call the NLPFEA subroutine to generate a point that satisfies all constraints:
proc iml;
con = { 0   0   .   .,   /* parameter lower bounds */
       10  10   .   .,   /* parameter upper bounds */
        3  -2  -1  10,   /* 3*x1 -  2*x2 LE 10 */
        5  10  -1  56,   /* 5*x1 + 10*x2 LE 56 */
        4   2   1   7 }; /* 4*x1 +  2*x2 GE  7 */
guess = {0 0};                 /* arbitrary p-dimensional point */
call nlpfea(z, guess, con);    /* z is a feasible point */
print guess[c={"x" "y"}], z[c={"Tx" "Ty"}];
The output shows that the guess (x,y) = (0,0) was not feasible, but the NLPFEA routine generated the transformed point T(x,y) = (1.2, 1.1), which is feasible.
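The constraint matrix is just a compact encoding of bounds and linear inequalities. As an illustration (in Python rather than SAS/IML, with a hypothetical helper function), a small feasibility checker confirms that the original guess violates the constraints while the transformed point satisfies them, with (1.2, 1.1) sitting exactly on the boundary of the GE constraint:

```python
def is_feasible(pt, lb, ub, constraints, tol=1e-9):
    """Check bounds and linear constraints given as (coeffs, op, rhs),
    where op = -1 means LE and op = +1 means GE (the sign column of the matrix)."""
    if not all(l <= v <= u for v, l, u in zip(pt, lb, ub)):
        return False
    for coeffs, op, rhs in constraints:
        val = sum(c*v for c, v in zip(coeffs, pt))
        if op == -1 and val > rhs + tol:    # "less than or equal" row violated
            return False
        if op == 1 and val < rhs - tol:     # "greater than or equal" row violated
            return False
    return True

lb, ub = (0, 0), (10, 10)
cons = [((3, -2), -1, 10),   # 3*x1 -  2*x2 <= 10
        ((5, 10), -1, 56),   # 5*x1 + 10*x2 <= 56
        ((4,  2),  1,  7)]   # 4*x1 +  2*x2 >=  7

print(is_feasible((0, 0), lb, ub, cons))      # False: the guess is not feasible
print(is_feasible((1.2, 1.1), lb, ub, cons))  # True: the transformed point is feasible
```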
It is interesting to visualize the NLPFEA subroutine. The following SAS/IML statements create 36 initial guesses that are distributed uniformly on a circle around the feasible region. For each guess, the program transforms the guess into the feasible region by calling the NLPFEA subroutine. The initial and transformed points are saved to a SAS data set and visualized by using PROC SGPLOT:
NumPts = 36;
twopi = 2*constant('pi');
x=.; y=.; Tx=.; Ty=.;
create feasible var {"x" "y" "Tx" "Ty"};
do i = 1 to NumPts;
   x = 2.5 + 5*cos((i-1)/NumPts * twopi);   /* guess on circle */
   y = 2.5 + 5*sin((i-1)/NumPts * twopi);
   call nlpfea(feasPt, x||y, con);          /* transform into feasible region */
   Tx = feasPt[1];  Ty = feasPt[2];
   append;
end;
close;
The graph visualizes the result. The graph shows how each point on the circle is transformed into the feasible region. Some points are transformed into the interior, but many are transformed onto the boundary. You can see that the transformed point always satisfies the linear constraints.
Before I finish, I want to point out that the nonlinear programming (NLP) subroutines in SAS/IML software rarely require you to call the NLPFEA subroutine explicitly. When you call an NLP routine for a linearly constrained optimization and provide a nonfeasible initial guess, the NLP routine internally calls the NLPFEA routine. Consequently, you might see the following NOTE displayed in the SAS log: NOTE: Initial point was changed to be feasible for boundary and linear constraints. For example, run the following program, which provides a nonfeasible initial guess to the NLPNRA (Newton-Raphson) subroutine.
start func(x);
   x1 = x[,1];  x2 = x[,2];
   return ( -(x1-3)##4 + -(x2-2)##2 + 0.1*sin(x1#x2) );
finish;

opt = {1,    /* find maximum of function */
       3};   /* print a little bit of output */
x0 = {0 0};
call nlpnra(rc, x_Opt, "func", x0, opt) blc=con;
In this program, the NLPNRA subroutine detects that the guess is not feasible. It internally calls NLPFEA to obtain a feasible guess and then computes an optimal solution. This is very convenient for programmers. The only drawback is that you don't know the feasible initial guess that produced the optimal solution. However, you can call the NLPFEA subroutine directly if you want to obtain that information.
In optimization problems that have linear and boundary constraints, most optimization routines require an initial guess that is in the feasible region. The NLPFEA subroutine enables you to obtain a feasible point from an arbitrary initial guess. You can then use that feasible point as an initial guess in a built-in or user-defined optimization routine. However, for the built-in NLP subroutines, you can actually skip the NLPFEA call because the NLP subroutines internally call NLPFEA when you supply a nonfeasible initial guess.
The post Two ways to compute maximum likelihood estimates in SAS appeared first on The DO Loop.
I previously wrote a step-by-step description of how to compute maximum likelihood estimates in SAS/IML. SAS/IML contains many algorithms for nonlinear optimization, including the NLPNRA subroutine, which implements the Newton-Raphson method.
In my previous article I used the LOGPDF function to define the log-likelihood function for the binomial data. The following statements define bounds for the parameter (0 < p < 1) and provide an initial guess of p_{0}=0.5:
/* Before running this program, create the Binomial and LN data sets from the previous post */
/* Example 1: MLE for binomial data */
/* Method 1: Use SAS/IML optimization routines */
proc iml;
/* log-likelihood function for binomial data */
start Binom_LL(p) global(x, NTrials);
   LL = sum( logpdf("Binomial", x, p, NTrials) );
   return( LL );
finish;

NTrials = 10;   /* number of trials (fixed) */
use Binomial;  read all var "x";  close;

/* set constraint matrix, options, and initial guess for optimization */
con = { 0,     /* lower bound: 0 < p     */
        1};    /* upper bound:     p < 1 */
opt = {1,      /* find maximum of function */
       2};     /* print some output        */
p0 = 0.5;      /* initial guess for solution */
call nlpnra(rc, p_MLE, "Binom_LL", p0, opt, con);
print p_MLE;
The NLPNRA subroutine computes that the maximum of the log-likelihood function occurs for p=0.56, which agrees with the graph in the previous article. We conclude that the parameter p=0.56 (with NTrials=10) is "most likely" to be the binomial distribution parameter that generated the data.
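For this model, the MLE also has a closed form: the estimate is the total number of heads divided by the total number of tosses, p̂ = Σx / (n·NTrials). The following Python sketch (with hypothetical counts, since the Binomial data set comes from the previous post) confirms that a numerical maximization of the log likelihood agrees with the closed form:

```python
import math

NTrials = 10
x = [6, 5, 7, 4, 6, 5, 8, 6, 5, 4]   # hypothetical counts of heads in 10 tosses each

def binom_ll(p):
    """Binomial log likelihood, dropping the constant log-binomial-coefficient term."""
    return sum(xi*math.log(p) + (NTrials - xi)*math.log(1 - p) for xi in x)

p_hat = sum(x) / (len(x) * NTrials)    # closed-form MLE
grid  = [i/1000 for i in range(1, 1000)]
p_num = max(grid, key=binom_ll)        # numerical maximization over a fine grid
print(round(p_hat, 3), round(p_num, 3))
```

For these hypothetical data the two estimates coincide at p = 0.56.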
If you've never used PROC NLMIXED before, you might wonder why I am using that procedure, since this problem is not a mixed modeling regression. However, you can use the NLMIXED procedure for general maximum likelihood estimation. In fact, I sometimes joke that SAS could have named the procedure "PROC MLE" because it is so useful for solving maximum likelihood problems.
PROC NLMIXED has built-in support for computing maximum likelihood estimates of data that follow the Bernoulli (binary), binomial, Poisson, negative binomial, normal, and gamma distributions. (You can also use PROC GENMOD to fit these distributions; I have shown an example of fitting Poisson data.)
The syntax for PROC NLMIXED is very simple for the binomial data. On the MODEL statement, you declare that you want to model the X variable as Binom(p) where NTrials=10, and you use the PARMS statement to supply an initial guess for the parameter p. Be sure to check the documentation for the correct syntax for the binomial distribution. Some functions (like PDF, CDF, and RAND) use the p parameter as the first argument: Binom(p, NTrials). Some procedures (like MCMC and NLMIXED) use the p parameter as the second argument: Binom(NTrials, p).
/* Method 2: Use PROC NLMIXED and its built-in modeling syntax */
proc nlmixed data=Binomial;
   parms p = 0.5;      * initial value for parameter;
   bounds 0 < p < 1;
   NTrials = 10;
   model x ~ binomial(NTrials, p);
run;
Notice that the output from PROC NLMIXED contains the parameter estimate, standard error, and 95% confidence intervals. The parameter estimate is the same value (0.56) as was found by the NLPNRA routine in SAS/IML. The confidence interval confirms what we previously saw in the graph of the log-likelihood function: the function is somewhat flat near the optimum, so a 95% confidence interval is wide: [0.49, 0.63].
By the way, both PROC IML and PROC NLMIXED enable you to choose your favorite optimization solver. In this example, I used Newton-Raphson (CALL NLPNRA) for the SAS/IML solver, but you could also use a quasi-Newton solver (the NLPQN subroutine), which is the solver used by PROC NLMIXED. Or you can tell PROC NLMIXED to use Newton-Raphson by adding TECHNIQUE=NEWRAP to the PROC NLMIXED statement.
You can use similar syntax to compute MLEs for lognormal data. The SAS/IML syntax is similar to the binomial example, so it is omitted. To view it, download the complete SAS program that computes these maximum likelihood estimates.
PROC NLMIXED does not support the lognormal distribution as a built-in distribution, which means that you need to explicitly write out the log-likelihood function and specify it in the GENERAL function on the MODEL statement. Whereas in SAS/IML you have to use the SUM function to sum the log-likelihood over all observations, the syntax for PROC NLMIXED is simpler. Just as the DATA step has an implicit loop over all observations, the NLMIXED procedure implicitly sums the log-likelihood over all observations. You can use the LOGPDF function, or you can explicitly write the log-density formula for each observation.
If you look up the lognormal distribution in the list of "Standard Definition" in the PROC MCMC documentation, you will see that one parameterization of the lognormal PDF in terms of the log-mean μ and log-standard-deviation σ is
f(x; μ, σ) = 1/(x σ sqrt(2π)) exp( -(log(x)-μ)**2 / (2σ**2) )
When you take the logarithm of this quantity, you get two terms, or three if you use the rules of logarithms to isolate quantities that do not depend on the parameters:
proc nlmixed data=LN;
   parms mu 1 sigma 1;    * initial values of parameters;
   bounds 0 < sigma;      * bounds on parameters;
   sqrt2pi = sqrt(2*constant('pi'));
   LL = -log(sigma)
        - log(sqrt2pi*x)                 /* this term is constant w/r/t (mu, sigma) */
        - (log(x)-mu)**2 / (2*sigma**2);
   /* Alternative: LL = logpdf("Lognormal", x, mu, sigma); */
   model x ~ general(LL);
run;
The parameter estimates are shown, along with standard errors and 95% confidence intervals. The maximum likelihood estimates for the lognormal data are (μ, σ) = (1.97, 0.50). You will get the same answer if you use the LOGPDF function (inside the comment) instead of the "manual calculation." You will also get the same estimates if you omit the term log(sqrt2pi*x) because that term does not depend on the MLE parameters.
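Incidentally, the lognormal MLE also has a closed form: if y_i = log(x_i), then μ̂ is the mean of the y values and σ̂ is their standard deviation computed with denominator n (not n-1). The following Python sketch (using simulated data, as an illustration) verifies that the closed-form estimates maximize the exact log likelihood:

```python
import math, random

random.seed(1)
x = [random.lognormvariate(2, 0.5) for _ in range(200)]   # simulated Lognormal(2, 0.5) data

def lognormal_ll(mu, sigma):
    """Exact lognormal log likelihood."""
    return sum(-math.log(xi*sigma*math.sqrt(2*math.pi))
               - (math.log(xi) - mu)**2 / (2*sigma**2) for xi in x)

y = [math.log(xi) for xi in x]
n = len(y)
mu_hat = sum(y)/n                                           # closed-form MLE of mu
sigma_hat = math.sqrt(sum((yi - mu_hat)**2 for yi in y)/n)  # denominator n, not n-1

# the closed-form estimates should beat any nearby parameter values
best = lognormal_ll(mu_hat, sigma_hat)
for dmu in (-0.01, 0.01):
    for ds in (-0.01, 0.01):
        assert best >= lognormal_ll(mu_hat + dmu, sigma_hat + ds)
print(round(mu_hat, 2), round(sigma_hat, 2))
```

With 200 observations, the estimates land close to the true parameters (2, 0.5), in line with the estimates reported in the text.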
In conclusion, you can use nonlinear optimization in the SAS/IML language to compute MLEs. This approach is especially useful when the computation is part of a larger computational program in SAS/IML. Alternatively, the NLMIXED procedure makes it easy to compute MLEs for discrete and continuous distributions. For some simple distributions, the log-likelihood functions are built into PROC NLMIXED. For others, you can specify the log likelihood yourself and find the maximum likelihood estimates by using the GENERAL function.
The post Two simple ways to construct a log-likelihood function in SAS appeared first on The DO Loop.
Although the method is known as maximum likelihood estimation, in practice you should optimize the log-likelihood function, which is numerically better behaved. For an introduction to MLE, including the definitions of the likelihood and log-likelihood functions, see the Penn State Online Statistics Course, which is a wonderful reference.
MLE assumes that the observed data x={x_{1}, x_{2}, ..., x_{n}} are independently drawn from some population. You want to find the most likely parameters θ = (θ_{1},...,θ_{k}) such that the data are fit by the probability density function (PDF), f(x; θ). Since the data are independent, the probability of observing the data is the product Π_{i} f(x_{i}, θ), which is the likelihood function L(θ | x). If you take the logarithm, the product becomes a sum. The log-likelihood function is
LL(θ | x) = Σ_{i} log( f(x_{i}, θ) )
This formula is the key. It says that the log-likelihood function is simply the sum of the log-PDF function evaluated at the data values. Always use this formula. Do not ever compute the likelihood function (the product) and then take the log, because the product is prone to numerical errors, including overflow and underflow.
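The numerical danger is easy to demonstrate. In double precision, a product of many small density values underflows to zero, whereas the sum of logarithms stays well within range. A quick Python illustration:

```python
import math

densities = [0.1] * 400      # e.g., 400 observations, each with density value 0.1

# naive likelihood: the product underflows to exactly 0.0 in double precision
L = 1.0
for d in densities:
    L *= d
print(L)                     # 0.0; the smallest positive double is about 5e-324

# log likelihood: the sum of logs is perfectly representable
LL = sum(math.log(d) for d in densities)
print(LL)                    # about -921; no numerical problem
```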
There are two simple ways to construct the log-likelihood function in SAS: you can sum the values of the LOGPDF function evaluated at the data, or you can apply the LOG function to the formula for the PDF and sum the resulting terms.
A coin was tossed 10 times and the number of heads was recorded. This was repeated 20 times to get a sample. A student wants to fit the binomial model X ~ Binom(p, 10) to estimate the probability p of the coin landing on heads. For this problem, the vector of MLE parameters θ is merely the one parameter p.
Recall that if you are using SAS/IML to optimize an objective function, the parameter that you are trying to optimize should be the only argument to the function, and all other parameters should be specified on the GLOBAL statement. Thus one way to write a SAS/IML function for the binomial log-likelihood function is as follows:
proc iml;
/* Method 1: Use LOGPDF. This method works in the DATA step as well */
start Binom_LL1(p) global(x, NTrials);
   LL = sum( logpdf("Binomial", x, p, NTrials) );
   return( LL );
finish;

/* visualize the log-likelihood function, which is a function of p */
NTrials = 10;   /* number of trials (fixed) */
use Binomial;  read all var "x";  close;

p = do(0.01, 0.99, 0.01);      /* vector of parameter values */
LL = j(1, ncol(p), .);
do i = 1 to ncol(LL);
   LL[i] = Binom_LL1( p[i] );  /* evaluate LL for a sequence of p */
end;

title "Graph of Log-Likelihood Function";
title2 "Binomial Distribution, NTrials=10";
call series(p, LL) grid={x y} xvalues=do(0,1,0.1)
     label={"Probability of Success (p)", "Log Likelihood"};
Notice that the data are fixed and do not change; the log likelihood is considered to be a function of the parameter p. Therefore you can graph the function for representative values of p, as shown. The graph clearly shows that the log likelihood is maximal near p=0.56, which is the maximum likelihood estimate. The graph is fairly flat near its optimal value, which indicates that the estimate has a wide standard error; a 95% confidence interval for the parameter is correspondingly wide. If the sample contained 100 observations instead of only 20, the log-likelihood function would have a narrower peak.
Notice also that the LOGPDF function made this computation very easy. You do not need to worry about the actual formula for the binomial density. All you have to do is sum the log-density at the data values.
In contrast, the second method requires a little more work, but can handle any distribution for which you can compute the density function. If you look up the formula for the binomial PDF in the MCMC documentation, you see that
PDF(x; p, NTrials) = comb(NTrials, x) * p**x * (1-p)**(NTrials-x)
where the COMB function computes the binomial coefficient "NTrials choose x." There are three terms in the PDF that are multiplied together. Therefore when you apply the LOG function, you get the sum of three terms. You can use the LCOMB function in SAS to evaluate the logarithm of the binomial coefficients in an efficient manner, as follows:
/* Method 2: Manually compute the log likelihood by using the formula */
start Binom_LL2(p) global(x, NTrials);
   LL = sum(lcomb(NTrials, x))
        + log(p)*sum(x)
        + log(1-p)*sum(NTrials-x);
   return( LL );
finish;

LL2 = Binom_LL2(p);   /* vectorized function, so no need to loop */
The second formulation has an advantage in a vector language such as SAS/IML because you can write the function so that it can evaluate a vector of values with one call, as shown. It also has the advantage that you can modify the function to eliminate terms that do not depend on the parameter p. For example, if your only goal is maximize the log-likelihood function, you can omit the term sum(lcomb(NTrials, x)) because that term is a constant with respect to p. That reduces the computational burden. Of course, if you omit the term then you are no longer computing the exact binomial log likelihood.
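The identity behind Method 2 is that the log of the three-term product becomes a sum of three terms, and the log binomial coefficient can be computed stably from log-gamma values, which is essentially what the LCOMB function does. A Python check of the algebra (using math.lgamma as a stand-in for LCOMB):

```python
import math

def lcomb(n, k):
    """Log of the binomial coefficient 'n choose k', via log-gamma."""
    return math.lgamma(n+1) - math.lgamma(k+1) - math.lgamma(n-k+1)

def log_binom_pdf(x, p, NTrials):
    """Sum of the three log terms of the binomial PDF."""
    return lcomb(NTrials, x) + x*math.log(p) + (NTrials - x)*math.log(1 - p)

# compare with the log of the PDF computed directly from the product formula
x, p, NTrials = 6, 0.56, 10
direct = math.log(math.comb(NTrials, x) * p**x * (1-p)**(NTrials - x))
print(abs(log_binom_pdf(x, p, NTrials) - direct) < 1e-10)   # the terms agree
```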
In a similar way, you can use the LOGPDF or the formula for the PDF to define the log-likelihood function for the lognormal distribution. For brevity, I will only show the SAS/IML functions, but you can download the complete SAS program that defines the log-likelihood function and computes the graph.
The following SAS/IML modules show two ways to define the log-likelihood function for the lognormal distribution. For the lognormal distribution, the vector of parameters θ = (μ, σ) contains two parameters.
/* Method 1: use LOGPDF */
start LogNormal_LL1(param) global(x);
   mu = param[1];  sigma = param[2];
   LL = sum( logpdf("Lognormal", x, mu, sigma) );
   return( LL );
finish;

/* Method 2: Manually compute the log likelihood by using the formula
   PDF(x; mu, sigma) = exp( -(log(x)-mu)##2/(2*sigma##2) ) / (x # sigma # sqrt(2*pi)) */
start LogNormal_LL2(param) global(x);
   mu = param[1];  sigma = param[2];
   twopi = 2*constant('pi');
   LL = -nrow(x)/2*log(twopi*sigma##2)
        - sum( (log(x)-mu)##2 )/(2*sigma##2)
        - sum(log(x));   /* this term is constant w/r/t (mu, sigma) */
   return( LL );
finish;
The function that uses the LOGPDF function is simple to write. The second method is more complicated because the lognormal PDF is more complicated than the binomial PDF. Nevertheless, the complete log-likelihood function only requires a few SAS/IML statements.
For completeness, the contour plot on this page shows the log-likelihood function for 200 simulated observations from the Lognormal(2, 0.5) distribution. The parameter estimates are (μ, σ) = (1.97, 0.5).
This article has shown two simple ways to define a log-likelihood function in SAS. You can sum the values of the LOGPDF function evaluated at the observations, or you can manually apply the LOG function to the formula for the PDF function. The log likelihood is regarded as a function of the parameters of the distribution, even though it also depends on the data. For distributions that have one or two parameters, you can graph the log-likelihood function and visually estimate the value of the parameters that maximize the log likelihood.
Of course, SAS enables you to numerically optimize the log-likelihood function, thereby obtaining the maximum likelihood estimates. My next blog post shows two ways to obtain maximum likelihood estimates in SAS.
The post The IFN function versus the IF-THEN/ELSE statement in SAS appeared first on The DO Loop.
data Log;
input x @@;
if x > 0 then logX = log(x);
else          logX = .;
datalines;
-1 1 2 . 0 10
;
proc print; run;
On SAS discussion forums, I sometimes see questions from people who try to use the IFN function to accomplish the same logic. That is, in place of the IF-THEN/ELSE logic, they try to use the following one-liner in the DATA step:
logX = ifn(x>0, log(x), .);
Although this looks like the same logic, there is a subtle difference. All three arguments to the IFN function are evaluated BEFORE the function is called, and the results of the evaluation are then passed to the function. For example, if x = -1, then the SAS DATA step does the following:
1. Evaluates the Boolean expression x>0.
2. Evaluates the expression log(x).
3. Evaluates the missing value.
4. Passes the three results to the IFN function.
In Step 2, SAS evaluates log(x) unconditionally for every value of x, which leads to out-of-domain errors when x is not positive. This is exactly the situation that the programmer was trying to avoid! In contrast, the IF-THEN/ELSE logic only evaluates log(x) when x is positive. Consequently, the SAS log is clean when you use the IF-THEN/ELSE statement.
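The same distinction exists in most languages: arguments to an ordinary function are evaluated eagerly, whereas the branches of a conditional expression are not. A Python analogue (with a hypothetical ifn helper written for this illustration) makes the difference concrete:

```python
import math

def ifn(cond, true_val, false_val):
    """Analogue of the IFN function: by the time this body runs, BOTH
    branch expressions have already been evaluated by the caller."""
    return true_val if cond else false_val

x = -1.0

# conditional expression: log(x) is evaluated only when x > 0
logX = math.log(x) if x > 0 else None
print(logX)                            # None; no error occurs

# function call: math.log(x) is evaluated BEFORE ifn is called
try:
    logX = ifn(x > 0, math.log(x), None)
except ValueError as e:
    print("out-of-domain error:", e)   # math.log(-1) raises ValueError
```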
There are plenty of situations for which the IFN function (and its cousin, the IFC function) are useful, but for testing out-of-domain conditions, use IF-THEN/ELSE logic instead.
The post Runs in coin tosses; patterns in random seating appeared first on The DO Loop.
This question popped into my head last weekend as I attended my son's graduation ceremony. As the students marched in, I noticed that the men were dressed in green caps and gowns, whereas the women were dressed in white. They entered in alphabetical order, which randomized the men and women. They filed into 12 rows that each contained 28 seats. Thus each row is like an independent toss of a coin, with green and white representing heads and tails, respectively.
When the students entered the ninth row from the left (fourth from the right), I noticed a sequence of eight consecutive "little green men," which is highlighted in red in the picture on this page. (Click to enlarge.) I wish I had a photo of the students seated in their chairs because the effect is more dramatic when the green mortarboards are all aligned. But take my word for it: the long sequence of green was very noticeable.
The picture shows that there was actually a row to the extreme left that was partially filled. For the purpose of this article, ignore the partial row. In the 12 full rows, the number of men in each row is (from left to right) {15, 15, 14, 11, 16, 16, 15, 10, 20, 9, 14, 13}. Remarkably, this adds to 168, so the proportion of men is exactly 0.5 of the 12 x 28 = 336 students.
You can simulate the students by generating 336 random binary values arranged on a 12 x 28 grid. Since this was the graduating class of 2017, I used 2017 as the random number seed in the following DATA step:
%let NumRows  = 12;
%let NumInRow = 28;
data Graduation;
call streaminit(2017);
do row = 1 to &NumRows;
   do seat = 1 to &NumInRow;
      Male = rand("Bernoulli", 0.5);
      output;
   end;
end;
run;

title "One Simulated Seating Arrangement";
proc sgplot data=Graduation;
   styleattrs wallcolor=grey datacontrastcolors=(white green);
   scatter x=Row y=Seat / group=Male markerattrs=(symbol=SquareFilled);
   xaxis integer values=(1 to 12);
run;
If you look at row 5 in the image, you will see a sequence of nine consecutive green markers. The fact that a simulated data set reproduced the graduation scenario on the very first attempt makes me think that this situation is not very rare. However, changing the seed a few times shows that the situation does not always occur.
There are 12 rows, each containing 28 students. The event of interest is a row with eight or more consecutive males. The easiest way to compute the probability of this happening is to first compute the probability for one row. Since the rows are assumed to be independent, you can then compute the probability of seeing the event in any of the 12 rows.
A sequence of consecutive events is also called a "run" of events. If you do an internet search for "probability of k heads in a row" or "probability of runs in coin toss", you will find many solutions to this problem. The source I used is a question that was asked on StackExchange about "blocks of events." Whereas many people approach this problem by using a simulation or an explicit recursive mathematical formula, "Neil G" and "COOLSerdash" compute the probability by using a Markov transition matrix, which is easy to create in the SAS/IML matrix language.
The following statements define a function that creates the Markov transition matrix and iterates it to compute the probability that a coin will show k consecutive heads in N tosses. The program works for any probability of heads, not merely p=0.5. See the StackExchange article for the explanation:
proc iml;
k = 8;     * desired number of correct trials in a row;
p = 1/2;   * probability of getting a correct trial;
N = 28;    * total number of trials;

/* Iterate a Markov transition matrix to compute the probability of
   k consecutive heads in N tosses of a coin that has probability p
   of showing heads */
start ProbConsec(N, p, k);
   M = j(k+1, k+1, 0);        * set up the transition matrix M;
   M[1, 1:k] = (1-p);         * first row, except for last column;
   M[k+1, k+1] = 1;           * lower right corner;
   do i = 2 to (k+1);
      M[i, i-1] = p;          * subdiagonal elements;
   end;
   Mn = M**N;                 * calculate M^N;
   /* probability that starting in State 1 ends in State (k+1) */
   return( Mn[(k+1), 1] );
finish;

prob = ProbConsec(N, p, k);
print prob;
The result shows that the probability of seeing eight or more consecutive heads in 28 tosses is 0.0426. This is the same as the probability of observing eight or more consecutive men in one of the rows at graduation, assuming that alphabetical ordering effectively randomizes men and women. However, remember that there were 12 rows at graduation, so the probability of observing this event in ANY row is higher, as shown below:
ProbSee0 = (1-prob)##12;     * P(Not in Row1 AND ... AND Not in Row 12);
ProbSeeAny = 1 - ProbSee0;   * P(In Row1 OR ... OR in Row 12);
print ProbSeeAny ProbSee0;
The chance of observing eight or more consecutive men in any of the 12 rows is about 41%. And why stop there? By symmetry, you can extend this analysis to include a sequence of consecutive women! If you call a row that contains eight or more consecutive students of the same gender an "extreme arrangement," the probability of observing such an arrangement somewhere in the 12 rows is about 0.64.
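If you want to double-check the SAS/IML computation outside of SAS, the same Markov chain is easy to iterate with NumPy. This is a sketch of the identical algorithm, not a new method: state i holds the current run length, and state k is absorbing.

```python
import numpy as np

def prob_consec(N, p, k):
    """Probability of a run of at least k heads in N tosses, P(heads) = p."""
    M = np.zeros((k + 1, k + 1))
    M[0, :k] = 1 - p                 # a tail resets the run length to 0
    M[k, k] = 1.0                    # absorbing state: a run of k occurred
    for i in range(k):
        M[i + 1, i] = p              # a head extends the run by one
    Mn = np.linalg.matrix_power(M, N)
    return Mn[k, 0]                  # start with run length 0; end absorbed

prob = prob_consec(28, 0.5, 8)
print(round(prob, 4))                 # 0.0426, matching the SAS result
print(round(1 - (1 - prob)**12, 2))   # 0.41: the event occurs in some row
```

The matrix here is the transpose convention of the SAS program (columns sum to 1), which is why the absorbed probability is read from `Mn[k, 0]`.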
In summary, graduation events can be long, but computing the probabilities of interesting arrangements of the students can help make the time go faster! I wasn't able to compute the probabilities in my head while at the graduation, but it didn't take long to research the problem and solve it with SAS after I got home. I conclude that observing a long sequence of same-gender students in a randomized seating arrangement that has 12 rows of 28 seats is not a rare event. In fact, the chance of observing a run of eight or more men or women is about 64%.
The real lesson for all of us is that we should keep our eyes open and look around. Math and statistics are everywhere!
The post Runs in coin tosses; patterns in random seating appeared first on The DO Loop.
The post How to choose a seed for generating random numbers in SAS appeared first on The DO Loop.
To be clear, I am talking about using a seed value to initialize a modern, high-quality, pseudorandom number generator (RNG). For example, in SAS you can use the STREAMINIT subroutine to initialize the Mersenne twister algorithm that is used by the RAND function. If you are still using the old-style RANUNI or RANNOR functions in SAS, please read the article "Six reasons you should stop using the RANUNI function to generate random numbers."
A seed value specifies a particular stream from a set of possible random number streams. When you specify a seed, SAS generates the same set of pseudorandom numbers every time you run the program. However, there is no intrinsic reason to prefer one stream over another. The stream for seed=12345 is just as random as the stream for the nine-digit prime number 937162211.
Some people see the number 937162211 and think that it looks "more random" than 12345. They then assume that the random number stream that follows from CALL STREAMINIT(937162211) is "more random" than the random number stream for CALL STREAMINIT(12345). Nope, random means random. In modern pseudorandom generators, the streams for different seeds should have similar statistical properties. Furthermore, many RNGs use the base-2 representation of the seed for initialization and (12345)_{10} = (11000000111001)_{2} looks pretty random! In fact, if you avoid powers of 2, the base-2 representations of most base-10 numbers "look random."
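You can illustrate both claims with a few lines of Python, whose random module also uses the Mersenne twister. The sample means below are an informal eyeball check, not a statistical test:

```python
import random
from statistics import mean

# The base-2 representation of the "boring" seed 12345 looks scrambled:
print(bin(12345)[2:])          # 11000000111001

# A plain seed and a "random-looking" seed produce equally unremarkable
# uniform streams: both sample means are close to 0.5
for seed in (12345, 937162211):
    rng = random.Random(seed)
    u = [rng.random() for _ in range(10_000)]
    print(seed, round(mean(u), 3))
```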
Researchers who specialize in random number generators might criticize what I've said as overly simplistic. There have been many research papers written about how to take a 32-bit integer and use that information to initialize an RNG whose internal state contains more than 32 bits. There have been cases where an RNG was published and the authors later modified the initialization routine because certain seeds did not result in streams that were sufficiently random. There have been many discussions about how to create a seed initialization algorithm that is easy to call and that almost always results in a high-quality stream of random numbers.
These are hard problems, but fortunately researchers have developed ways to initialize a stream from a seed so that there is a high probability that the stream will have excellent statistical properties. The relevant question for many SAS programmers is "can I use 12345 or my telephone number as seed values, or do I always need to type a crazy-looking nine-digit sequence?" My response is that there is no reason to prefer the crazy-looking seed over an easy-to-type sequence such as your phone number, your birthday, or the first few digits of pi.
If you absolutely insist on using a "random seed," SAS can help. If you call the STREAMINIT subroutine with the value 0, then SAS will use the date, time of day, and possibly other information to manufacture a seed when you call the RAND function. SAS puts the seed value into the SYSRANDOM system macro variable. That means you can use %PUT to display the seed that SAS created, as follows:
data _null_;
   call streaminit(0);    /* generate seed from system clock */
   x = rand("uniform");
run;
%put &=SYSRANDOM;
SYSRANDOM=1971603567
Every time you run this program, you will get a different seed value that you can use as the seed for a future program.
A second method is to use the RAND function to generate a random integer between 1 and 2^{31}-1, which is the range of valid seed values for the Mersenne twister generator in SAS 9.4m4. The following program generates a random seed value:
data _null_;
   call streaminit(0);
   seed = ceil( (2**31 - 1)*rand("uniform") );
   put seed=;
run;
seed=1734176512
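For comparison, here is the analogous computation in Python. The standard library's secrets module draws from the operating system's entropy source, so no generator needs to be seeded first; the range matches the Mersenne twister seed range quoted above.

```python
import secrets

# Draw a seed uniformly from 1 to 2**31 - 1
seed = secrets.randbelow(2**31 - 1) + 1
print(seed)
```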
Both of these methods will generate a seed for you. However, the randomly generated seed does not provide any benefit. For a modern, high-quality, pseudorandom number generator, the stream should have good statistical properties regardless of the seed value. Using a random seed value does not make a stream "more random" than a seed that is easier to type.
The post On the SMOOTHCONNECT option in the SERIES statement appeared first on The DO Loop.
However, in a second post, Sanjay shows how the SMOOTHCONNECT option can result in curves that are less ideal. Here is a SAS program that reproduces the gist of Sanjay's example:
data Ex1;
input segment $ x y;
datalines;
A 0  0
A 1  0
A 2 -3
A 3 -1
B 0  0
B 1  0
B 2  2
B 3  1
;

proc sgplot data=Ex1;
   series x=x y=y / group=segment markers smoothconnect;
run;
As you can see, groups A and B have identical values for the first two observations. However, the remaining Y values for group A are negative, whereas the remaining Y values for group B are positive. Notice that the interpolating curve for group A rises up before it dives down. Similarly, the curve for group B dips down before it rises up.
This visualization could give the wrong impression to the viewer of the graph, especially if the markers are not displayed. We only have data for four values of X. However, the SMOOTHCONNECT option makes it appear that the graph for group A was higher than group B on the interval (0, 1). That may or may not be true. The curves also give the impression that the curve for group A began to decrease prior to x=1, whereas in fact we have no idea when the curve reached a maximum and began to decrease.
The SMOOTHCONNECT option can be useful for visualizing time series data, but as with any tool, it is important to know how to use it responsibly. In thinking about these issues, I compiled a list of five ways that the SMOOTHCONNECT option might give a misleading view of the data. By understanding these issues, you can use the SMOOTHCONNECT option wisely. Here are some potential pitfalls of connecting points by using a smooth curve.
Sanjay's example demonstrates the first three issues. Notice that the maximum Y value for group A in the data is Y=0, but the interpolating spline exceeds that value.
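SAS does not document exactly which interpolant SMOOTHCONNECT uses, but an ordinary natural cubic spline reproduces the overshoot. The following NumPy sketch fits such a spline to the group-A points and shows that the curve rises to about 0.44 on [0, 1], even though no data value exceeds 0. (The formulas assume unit spacing of the X values, which holds here.)

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, -3.0, -1.0])   # the group-A points

# Natural cubic spline with unit spacing: the interior second derivatives
# satisfy M[i-1] + 4*M[i] + M[i+1] = 6*(y[i-1] - 2*y[i] + y[i+1]),
# with M0 = M3 = 0 at the endpoints.
A = np.array([[4.0, 1.0],
              [1.0, 4.0]])
b = 6.0 * np.array([y[0] - 2*y[1] + y[2],
                    y[1] - 2*y[2] + y[3]])
M1, M2 = np.linalg.solve(A, b)

# Evaluate the spline on the first segment [0, 1], where y = 0 at both knots
t = np.linspace(0.0, 1.0, 201)
S = M1*(t**3 - t)/6.0 + y[0]*(1.0 - t) + y[1]*t
print(S.max())   # about 0.436: the curve overshoots the data maximum of 0
```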
To demonstrate the fourth and fifth items requires an artificial example and data that are not evenly spaced along the X variable:
data Ex2;
input x y;
datalines;
0    0
0.95 0.95
1    1
2   -2
;

proc sgplot data=Ex2;
   series x=x y=y / markers smoothconnect;
   yaxis max=1.5;
run;
There are only four points in the data set, but the SMOOTHCONNECT curve for these data exhibits a sharp turn (in fact, a loop). I intentionally created data that would create this behavior by making the data linear for X < 2. I also made the second data point very close to the third data point so that the interpolating curve would have to gyrate widely to pass through the points before sloping down to pass through the fourth point.
It is well known among applied mathematicians that interpolation can lead to issues like these. These issues are not caused by a bug in SAS, but by the fact that you are trying to force a smooth curve to pass through every data point. I point them out so that SAS users who are using SGPLOT can be aware of them.
If your data are evenly spaced in the horizontal direction, the first three issues are usually small and innocuous. The last two issues will not occur. Therefore, it is safe to use the SMOOTHCONNECT option for evenly spaced data.
If your data are not evenly spaced, here are a few guidelines to help you avoid wild interpolating curves:
In summary, if you use the SMOOTHCONNECT option in the SERIES statement, SAS will pass a smooth curve through the points. When the points are unevenly spaced in X, the curve might need to quickly change direction to pass through all the data. The resulting curve might not be a good representation of the data. In this case, you might need to relax the requirements of a smooth interpolating curve, either by using a piecewise linear interpolation or by using a smooth curve that is not constrained to pass through the data.
The post Sample quantiles: A comparison of 9 definitions appeared first on The DO Loop.
Suppose that a sample has N observations that are sorted so that x[1] ≤ x[2] ≤ ... ≤ x[N], and suppose that you are interested in estimating the p_th quantile (0 ≤ p ≤ 1) for the population. Intuitively, the data values near x[j], where j = floor(Np) are reasonable values to use to estimate the quantile. For example, if N=10 and you want to estimate the quantile for p=0.64, then j = floor(Np) = 6, so you can use the sixth ordered value (x[6]) and maybe other nearby values to estimate the quantile.
Hyndman and Fan (henceforth H&F) note that the quantile definitions in statistical software have three properties in common:
Thus a general formula for quantile estimates is q = (1 - λ) x[j] + λ x[j+1], where λ and j depend on the values of p, N, and a method-specific parameter m.
You can read Hyndman and Fan (1996) for details or see the Wikipedia article about quantiles for a summary. The Wikipedia article points out a practical consideration: for values of p that are very close to 0 or 1, some definitions need to be slightly modified. For example, if p < 1/N, the quantity Np < 1 and so j = floor(Np) equals 0, which is an invalid index. The convention is to return x[1] when p is very small and return x[N] when p is very close to 1.
SAS has built-in support for five of the quantile definitions, notably in PROC UNIVARIATE, PROC MEANS, and in the QNTL subroutine in SAS/IML. You can use the QNTLDEF= option to choose from the five definitions. The following table associates the five QNTLDEF= definitions in SAS to the corresponding definitions from H&F, which are also used by R. In R you choose the definition by using the type parameter in the quantile function.
It is straightforward to write a SAS/IML function to compute the other four definitions in H&F. In fact, H&F present the quantile interpolation functions as specific instances of one general formula that contains a parameter, which they call m. As mentioned above, you can also define a small value c (which depends on the method) such that the method returns x[1] if p < c, and the method returns x[N] if p ≥ 1 - c.
The following table presents the parameters for computing the four sample quantile definitions that are not natively supported in SAS:
You can download the SAS program that shows how to compute sample quantiles and graphs for any of the nine definitions in H&F. The differences between the definitions are most evident for small data sets and when there is a large "gap" between one or more adjacent data values. The following panel of graphs shows the nine sample quantile methods for a data set that has 10 observations, {0 1 1 1 2 2 2 4 5 8}. Each cell in the panel shows the quantiles for p = 0.001, 0.002, ..., 0.999. The bottom of each cell is a fringe plot that shows the six unique data values.
In these graphs, the horizontal axis represents the data and quantiles. For any value of x, the graph estimates the cumulative proportion of the population that is less than or equal to x. Notice that if you turn your head sideways, you can see the quantile function, which is the inverse function that estimates the quantile for each value of the cumulative probability.
You can see that although the nine quantile functions have the same basic shape, the first three methods estimate quantiles by using a discrete rounding scheme, whereas the other methods use a continuous interpolation scheme.
You can use the same data to compare methods. Instead of plotting each quantile definition in its own cell, you can overlay two or more methods. For example, by default, SAS computes sample quantiles by using the type=2 method, whereas R uses type=7 by default. The following graph overlays the sample quantiles to compare the default methods in SAS and R on this tiny data set. The default method in SAS always returns a data value or the average of adjacent data values; the default method in R can return any value in the range of the data.
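You can reproduce this comparison without SAS or R: NumPy (version 1.22 and later) implements all nine Hyndman-Fan definitions through the method= option of numpy.quantile. The following sketch evaluates the two default methods on the same ten observations; the large disagreement at p=0.901 occurs because of the gap between the two largest data values.

```python
import numpy as np

x = np.array([0, 1, 1, 1, 2, 2, 2, 4, 5, 8])

# H&F type 2 is NumPy's "averaged_inverted_cdf" (the SAS default);
# H&F type 7 is NumPy's "linear" (the R default)
for p in (0.35, 0.45, 0.901):
    q2 = np.quantile(x, p, method="averaged_inverted_cdf")
    q7 = np.quantile(x, p, method="linear")
    print(f"p={p}: type 2 -> {q2}, type 7 -> {q7:.3f}")
```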
As shown above, different software packages use different defaults for sample quantiles. Consequently, when you report quantiles for a small data set, it is important to report how the quantiles were computed.
However, in practice analysts don't worry too much about which definition they are using because the difference between methods is typically small for larger data sets (100 or more observations). The biggest differences are often between the discrete methods, which always report a data value or the average between two adjacent data values, and the interpolation methods, which can return any value in the range of the data. Extreme quantiles can also differ between the methods because the tails of the data often have fewer observations and wider gaps.
The following graph shows the sample quantiles for 100 observations that were generated from a random uniform distribution. As before, the two sample quantiles are type=2 (the SAS default) and type=7 (the R default). At this scale, you can barely detect any differences between the estimates. The red dots (type=7) are on top of the corresponding blue dots (type=2), so few blue dots are visible.
So does the definition of the sample quantile matter? Yes and no. Theoretically, the different methods compute different estimates and have different properties. If you want to use an estimator that is unbiased or one that is based on distribution-free computations, feel free to read Hyndman and Fan and choose the definition that suits your needs. The differences are evident for tiny data sets. On the other hand, the previous graph shows that there is little difference between the methods for moderately sized samples and for quantiles that are not near gaps. In practice, most data analysts just accept the default method for whichever software they are using.
In closing, I will mention that there are other quantile estimation methods that are not simple formulas. In SAS, the QUANTREG procedure solves a minimization problem to estimate the quantiles. The QUANTREG procedure enables you to not only estimate quantiles, but also estimate confidence intervals, weighted quantiles, the difference between quantiles, conditional quantiles, and more.
The post Quantile definitions in SAS appeared first on The DO Loop.
That is not unusual. The data only had 20 unique values, and there are many different formulas that you can use to compute sample percentiles (generally called quantiles). Because different software packages use different default formulas for sample quantiles, it is not uncommon for researchers to report different quantiles for small data sets. This article discusses the five percentile definitions that are supported in SAS software.
You might wonder why there are multiple definitions. Recall that a sample quantile is an estimate of a population quantile. Statisticians have proposed many quantile estimators, some of which are based on the empirical cumulative distribution (ECDF) of the sample, which approximates the cumulative distribution function (CDF) for the population. The ECDF is a step function that has a jump discontinuity at each unique data value. Consequently, the inverse ECDF does not exist and the quantiles are not uniquely defined.
In SAS, you can use the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures to control the method used to estimate quantiles. A sample quantile does not have to be an observed data value because you are trying to estimate an unknown population parameter.
For convenience, assume that the sample data are listed in sorted order. In high school, you probably learned that if a sorted sample has an even number of observations, then the median value is the average of the middle observations. The default quantile definition in SAS (QNTLDEF=5) extends this familiar rule to other quantiles. Specifically, if the sample size is N and you ask for the q_th quantile, then when Nq is an integer, the quantile is the average of the data values x[Nq] and x[Nq+1]. However, when Nq is not an integer, the quantile is the next ordered value, x[j+1], where j = floor(Nq). For example, if N=10 and you want the q=0.17 quantile, then Nq=1.7, so j=1 and the 17th percentile is reported as the ordered value x[2].
Taking the next-larger data value is not the only choice you can make when Nq is not an integer. The other percentile definitions correspond to making different choices. For example, you could round Nq to the nearest integer (QNTLDEF=2), or you could use the empirical CDF without any averaging (QNTLDEF=3). Or you could use linear interpolation (QNTLDEF=1 and QNTLDEF=4) between the data values whose (sorted) indices are closest to Nq. In the example where N=10 and q=0.17, the QNTLDEF=1 interpolated quantile is 0.3 x[1] + 0.7 x[2].
The SAS documentation contains the formulas used for the five percentile definitions, but sometimes a visual comparison is easier than slogging through mathematical equations. The differences between the definitions are most apparent on small data sets that contain integer values, so let's create a tiny data set and apply the five definitions to it. The following example has 10 observations and six unique values.
data Q;
input x @@;
datalines;
0 1 1 1 2 2 2 4 5 8
;
You can use PROC UNIVARIATE or other methods to plot the empirical cumulative proportions, as shown. Because the ECDF is a step function, most cumulative proportions values (such as 0.45) are "in a gap." By this I mean that there is no observation t in the data for which the cumulative proportion P(X ≤ t) equals 0.45. Depending on how you define the sample quantiles, the 0.45 quantile might be reported as 1, 1.5, 1.95, or 2.
Since the default definition is QNTLDEF=5, let's visualize the sample quantiles for that definition. You can use the PCTLPTS= option on the OUTPUT statement in PROC UNIVARIATE to declare the percentiles that you want to compute. Equivalently, you can use the QNTL function in PROC IML, as below. Regardless, you can ask SAS to find the quantiles for a set of probabilities on a fine grid of points such as {0.001, 0.002, ..., 0.998, 0.999}. You can then graph the probabilities against the quantiles to visualize how the percentile definition computes quantiles for the sample data.
proc iml;
use Q;  read all var "x";  close;   /* read data */
prob = T(1:999) / 1000;             /* fine grid of prob values */
call qntl(quantile, x, prob, 5);    /* use method=5 definition */
create Pctls var {"quantile" "prob" "x"};  append;  close;
quit;

title "Sample Percentiles";
title2 "QNTLDEF = 5";
proc sgplot data=Pctls noautolegend;
   scatter x=quantile y=prob / markerattrs=(size=5 symbol=CircleFilled);
   fringe x / lineattrs=GraphData2(thickness=3);
   xaxis display=(nolabel) values=(0 to 8);
   yaxis offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportions";
   refline 0 1 / axis=y;
run;
For each probability value (Y axis), the graph shows the corresponding sample quantile (X axis) for the default definition in SAS, which is QNTLDEF=5. The X axis also displays red tick marks at the location of the data. You can use this graph to find any quantile. For example, to find the 0.45 quantile, you start at 0.45 on the Y axis, move to the right until you hit a blue marker, and then drop down to the X axis to discover that the 0.45 quantile estimate is 2.
If you prefer to think of the quantiles (the X values) as a function of the probabilities, just interchange the X= and Y= arguments in the SCATTER statement (or turn your head sideways!). Then the quantile function is a step function.
It is easy to put a loop around the SAS/IML computation to compute the sample quantiles for the five different definitions that are supported in SAS. The following SAS/IML program writes a data set that contains the sample quantiles. You can use the WHERE statement in PROC PRINT to compare the same quantile across the different definitions. For example, the following displays the 0.45 quantile (45th percentile) for the five definitions:
/* Compare all SAS methods */
proc iml;
use Q;  read all var "x";  close;   /* read data */
prob = T(1:999) / 1000;             /* fine grid of prob values */
create Pctls var {"Qntldef" "quantile" "prob" "x"};
do def = 1 to 5;
   call qntl(quantile, x, prob, def);   /* qntldef=1,2,3,4,5 */
   Qntldef = j(nrow(prob), 1, def);     /* ID variable */
   append;
end;
close;
quit;

proc print data=Pctls noobs;
   where prob = 0.45;   /* compare 0.45 quantile for different definitions */
   var Qntldef quantile;
run;
You can see that the different definitions lead to different sample quantiles. How do the quantile functions compare? Let's plot them and see:
ods graphics / antialiasmax=10000;
title "Sample Percentiles in SAS";
proc sgpanel data=Pctls noautolegend;
   panelby Qntldef / onepanel rows=2;
   scatter x=quantile y=prob / markerattrs=(size=3 symbol=CircleFilled);
   fringe x;
   rowaxis offsetmax=0 offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
   refline 0 1 / axis=y;
   colaxis display=(nolabel);
run;
The graphs (click to enlarge) show that QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whereas QNTLDEF=2, 3, and 5 are discrete rounding methods. The default method (QNTLDEF=5) is the same as QNTLDEF=3 except for certain averaged values. For the discrete definitions, SAS returns either a data value or the average of adjacent data values. The interpolation methods do not have that property: they can return quantile values anywhere between observed data values.
If you have a small data set, as in this blog post, it is easy to see how the percentile definitions are different. For larger data sets (say, 100 or more unique values), the five quantile functions look quite similar.
The differences between definitions are most apparent when there are large gaps between adjacent data values. For example, the sample data has a large gap between the ninth and tenth observations, which have the values 5 and 8, respectively. If you compute the 0.901 quantile, you will discover that the "round down" method (QNTLDEF=2) gives 5 as the sample quantile, whereas the "round up" method (QNTLDEF=3) gives the value 8. Similarly, the "backward interpolation method" (QNTLDEF=1) gives 5.03, whereas the "forward interpolation method" (QNTLDEF=4) gives 7.733.
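For readers who want to verify these numbers, here is a small Python translation of the two interpolation definitions. It is my own re-implementation of the documented formulas, written with Python's 0-based indexing; the helper names are hypothetical, not SAS functions.

```python
import math

x = sorted([0, 1, 1, 1, 2, 2, 2, 4, 5, 8])
N = len(x)

def qntldef1(p):
    """Weighted average at Np: q = (1-g)*x[j] + g*x[j+1], j = floor(Np)."""
    g, j = math.modf(N * p)
    j = int(j)
    return (1 - g) * x[j - 1] + g * x[j]   # x[j-1] is the j_th ordered value

def qntldef4(p):
    """Weighted average at (N+1)p."""
    g, j = math.modf((N + 1) * p)
    j = int(j)
    return (1 - g) * x[j - 1] + g * x[j]

print(round(qntldef1(0.901), 3))   # 5.03
print(round(qntldef4(0.901), 3))   # 7.733
```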
In summary, this article shows how the (somewhat obscure) QNTLDEF= option results in different quantile estimates. Most people just accept the default definition (QNTLDEF=5), but if you are looking for a method that interpolates between data values, rather than a method that rounds and averages, I recommend QNTLDEF=1, which performs linear interpolations of the ECDF. The differences between the definitions are most apparent for small samples and when there are large gaps between adjacent data values.
For more information about sample quantiles, including a mathematical discussion of the various formulas, see
Hyndman, R. J. and Fan, Y. (1996) "Sample quantiles in statistical packages", American Statistician, 50, 361–365.