This article shows the second method: it uses the SAS DATA step to scan through the observations and remember certain values. If a condition is met, it uses those values to append new observations to the end of the data by using end-of-file processing.
Often SAS programmers need to implement complicated data-dependent logic. A simple example is "If the XYZ variable contains a certain value but doesn't contain a different value, then do something." On the SAS discussion forums, the experts often suggest scanning through the data with a DATA step and keeping one or more "flag variables" that indicate which conditions have been satisfied. At the end of the DATA step, you can look at the values of the flag variables to determine what action to take.
Last week I encountered a situation where I needed to conditionally append observations to input data. Although the solution is easy if you use the SAS/IML language (which enables you to scan an entire vector of values), I needed to solve the problem by using the DATA step, which processes only one observation at a time. The problem had the following form:
The goal is to create an output data set that always contains the four values 'Min', 'Max', 'LowVal', and 'HighVal', as summarized by the figure to the right. The following list describes how to generate the 'LowVal' and 'HighVal' observations if they do not exist.
The figure shows the four situations that can occur. The input data set always contains the 'Min' and 'Max' values but can contain none, one, or two of the other values. The goal is to produce the data set on the right, which always contains all four values. The next section presents a solution, so stop reading here if you want to solve the problem on your own!
To solve the problem, I used two facts about the SAS DATA step:
The following program scans the input data. It remembers the values of the 'Min' and 'Max' observations, in case it needs them. It uses indicator variables to determine whether the data contains the 'LowVal' and 'HighVal' observations. After the input data are read, the program uses an end-of-file indicator variable (EOF) to determine whether or not to add observations for 'LowVal' and 'HighVal'.
Because the program uses an OUTPUT statement to (conditionally) create new observations, you must also put an OUTPUT statement after the SET statement to ensure that all of the original observations are written.
/* Include all 4 test cases. Use WHERE clause to test each case. */
data Test;
length Type $7;
input Group Type $ Value;
datalines;
1 Min -3
1 Max 3
1 LowVal -2
1 HighVal 2
2 Min -3
2 Max 3
2 HighVal 2
3 Min -3
3 Max 3
3 LowVal -2
4 Min -3
4 Max 3
;

/* Input order is always 'Min' and 'Max' optionally followed by
   'LowVal' and 'HighVal', if they exist. */
%let dsname = Test(where=(Group=4));   /* use 1,2,3,4 to test all cases */

data Want;
drop HighFound LowFound Min Max;   /* temporary variables */
retain HighFound LowFound 0        /* binary indicator variables: initialize to 0 (false) */
       Min Max .;                  /* VALUE of 'Min' and 'Max' obs: initialize to missing */
set &dsname end=EOF;               /* EOF is temporary indicator variable */
output;                            /* need OUTPUT because of EOF processing */
if      Type = 'Min' then Min = Value;        /* remember the Min value */
else if Type = 'Max' then Max = Value;        /* remember the Max value */
else if Type = 'LowVal'  then LowFound = 1;   /* Low value found; no need to create it */
else if Type = 'HighVal' then HighFound = 1;  /* High value found; no need to create it */
/* end-of-file processing: conditionally append new observations */
if EOF then do;
   if ^LowFound then do;          /* Low value not found. Add it. */
      Type = "LowVal";  Value = Min;  output;
   end;
   if ^HighFound then do;         /* High value not found. Add it. */
      Type = "HighVal"; Value = Max;  output;
   end;
end;
run;

proc print data=Want;
   var Group Type Value;
run;
The result is shown for input data that contains only the 'Min' and 'Max' observations but not the 'LowVal' or 'HighVal' observations. The output shows that the 'LowVal' and 'HighVal' observations were correctly appended to the input data, and that the values for the VALUE column were copied from the 'Min' and 'Max' observations, respectively. You can verify that the other three input data sets are also correctly handled.
When performing end-of-file processing, be careful if you use a DELETE statement or a subsetting IF statement. For details and examples, see "The Perils of End-of-File Processing when Subsetting Data" (Landry, 2009). Landry summarizes the problem as follows (p. 2): "The problem occurs only when the last record in the dataset is deleted.... The reason this happens is that when a record is deleted..., SAS stops processing and returns to the next iteration of the DATA step. Thus, any executable statements placed after the [DELETE or subsetting IF statements] do not get executed."
Do you have a favorite way to conditionally append data? Do you know of other potential pitfalls with end-of-file processing? Leave a comment.
The post Conditionally append observations to a SAS data set appeared first on The DO Loop.
In SAS, you can fit these three-parameter distributions by using PROC UNIVARIATE. However, it is instructive to fit the lognormal distribution "manually" because it illustrates the general problem, which is how to handle optimization for which parameters depend on the data. This article provides two tips. First, you can use boundary constraints to ensure that the optimal solution enforces the condition θ < min(x). Second, you can define the objective function so that it does not attempt to evaluate expressions like log(x_{i} - θ) if the argument to the LOG function is not positive.
Before discussing the details of optimization, let's look at some data and the MLE parameter estimates for the three-parameter lognormal distribution. The following example is from the PROC UNIVARIATE documentation. The data are measurements of 50 diameters of steel rods that were produced in a factory. The company wants to model the distribution of rod diameters.
data Measures;
input Diameter @@;
label Diameter = 'Diameter (mm)';
datalines;
5.501 5.251 5.404 5.366 5.445 5.576 5.607 5.200 5.977 5.177
5.332 5.399 5.661 5.512 5.252 5.404 5.739 5.525 5.160 5.410
5.823 5.376 5.202 5.470 5.410 5.394 5.146 5.244 5.309 5.480
5.388 5.399 5.360 5.368 5.394 5.248 5.409 5.304 6.239 5.781
5.247 5.907 5.208 5.143 5.304 5.603 5.164 5.209 5.475 5.223
;

title 'Lognormal Fit for Diameters';
proc univariate data=Measures noprint;
   histogram Diameter / lognormal(scale=est shape=est threshold=est) odstitle=title;
   ods select ParameterEstimates Histogram;
run;
The UNIVARIATE procedure overlays the MLE fit on a histogram of the data. The ParameterEstimates table shows the estimates. The threshold parameter (θ) is the troublesome parameter when you fit models like this. The other two parameters ("Zeta," which I will call μ in the subsequent sections, and σ) are easier to fit.
An important thing to note is that the estimate for the threshold parameter is 5.069. If you look at the data (or use PROC MEANS), you will see that the minimum data value is 5.143. To be feasible, the value of the threshold parameter must always be less than 5.143 during the optimization process. The log-likelihood function is undefined if θ ≥ min(x).
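For reference, the log-likelihood that is maximized in the next section can be written in a standard form. The three terms below correspond to the quantities LL1, LL2, and LL3 in the SAS/IML module that follows:

```latex
\ell(\mu, \sigma, \theta)
  = -n \log\!\left(\sigma\sqrt{2\pi}\right)
    \;-\; \sum_{i=1}^{n} \log(x_i - \theta)
    \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( \log(x_i - \theta) - \mu \bigr)^2
```

The expression is defined only when θ < min(x), which is why the optimization must be constrained.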
Although PROC UNIVARIATE automatically fits the three-parameter lognormal distribution, it is instructive to explicitly write the log-likelihood function and solve the optimization problem "manually." Given N observations {x_{1}, x_{2}, ..., x_{N}}, you can use maximum likelihood estimation (MLE) to fit a three-parameter lognormal distribution to the data. I've previously written about how to use the optimization functions in SAS/IML software to carry out maximum likelihood estimation. The following SAS/IML statements define the log-likelihood function:
proc iml;
start LN_LL1(parm) global (x);
   mu = parm[1];  sigma = parm[2];  theta = parm[3];
   n = nrow(x);
   LL1 = -n*log(sigma) - n*log(sqrt(2*constant('pi')));
   LL2 = -sum( log(x-theta) );       /* not defined if theta >= min(x) */
   LL3 = -sum( (log(x-theta)-mu)##2 ) / (2*sigma##2);
   return( LL1 + LL2 + LL3 );
finish;
Recall that for MLE problems, the data are constant values that are known before the optimization begins. The previous function uses the GLOBAL statement to access the data vector, x. The log-likelihood function contains three terms, two of which (LL2 and LL3) are undefined for θ ≥ min(x). Most optimization software enables you to specify constraints on the parameters. In the SAS/IML matrix language, you can define a two-row matrix where the first row indicates lower bounds on the parameters and the second row indicates upper bounds. For this problem, the upper bound for the θ parameter is set to min(x) – ε, where ε is a small number so that the quantity (x_{i} - θ) is always bounded away from 0:
use Measures;  read all var "Diameter" into x;  close;   /* read data */

/* Parameters of the lognormal(mu, sigma, theta):  mu,  sigma>0,  theta<min(x) */
con = {.  1e-6  .,    /* lower bounds: sigma > 0 */
       .  .     .};   /* upper bounds */
con[2,3] = min(x) - 1e-6;   /* replace upper bound for theta */

/* optimize the log-likelihood function */
x0  = {0 0.5 4};    /* initial guess */
opt = {1,           /* find maximum of function */
       0};          /* do not display information about optimization */
call nlpnra(rc, result, "LN_LL1", x0, opt) BLC=con;   /* Newton's method with boundary constraints */
print result[c={'mu' 'sigma' 'theta'} L="Optimum"];
The results are very close to the estimates that were produced by PROC UNIVARIATE. Because of the boundary constraints, the optimization process never evaluates the objective function outside of the feasible domain, as determined by the boundary constraints in the CON matrix.
Ideally, the log-likelihood function should be robust, meaning that it should be able to handle inputs that are outside of the domain of the function. This enables you to use the objective function in many ways, including visualizing the feasible region and solving problems that have nonlinear constraints. When you have nonlinear constraints, some algorithms might evaluate the function in the infeasible region, so it is important that the objective function does not experience a domain error.
The best way to make the log-likelihood function robust is to trap domain errors. But what value should the software return if an input value is outside the domain of the function?
The SAS/IML optimization routines support missing values, so the following module redefines the objective function to return a missing value when θ ≥ min(x):
/* Better: Trap out-of-domain errors to ensure that the objective function NEVER fails */
start LN_LL(parm) global (x);
   mu = parm[1];  sigma = parm[2];  theta = parm[3];
   n = nrow(x);
   if theta >= min(x) then return( . );   /* trap out-of-domain errors */
   LL1 = -n*log(sigma) - n*log(sqrt(2*constant('pi')));
   LL2 = -sum( log(x-theta) );
   LL3 = -sum( (log(x-theta)-mu)##2 ) / (2*sigma##2);
   return( LL1 + LL2 + LL3 );
finish;
This modified function can be used for many purposes, including optimization with nonlinear constraints, visualization, and so forth.
In summary, this article provides two tips for optimizing a function that has a restricted domain. Often this situation occurs when one of the parameters depends on data, such as the threshold parameter in the lognormal and Weibull distributions. Because the data are constant values that are known before the optimization begins, you can write constraints that depend on the data. You can also access the data during the optimization process to ensure that the objective function does not ever evaluate input values that are outside its domain. If your optimization routine supports missing values, the objective function can return a missing value for invalid input values. Otherwise, the function can return a very large (positive or negative) value for out-of-domain inputs.
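If your optimization routine does not support missing return values, the trap at the top of the module can return a large penalty instead. A minimal sketch, assuming a maximization problem like the one above (the value -1E300 is an arbitrary choice):

```
/* sketch: replace the missing-value trap with a penalty value */
if theta >= min(x) then
   return( -1E300 );   /* huge negative value: never optimal when maximizing */
```

For a minimization problem, return a huge positive value instead.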
The post Two tips for optimizing a function that has a restricted domain appeared first on The DO Loop.
I was thinking about TINSTAAFL recently when I was calling a Base SAS function from the SAS/IML matrix language. The SAS/IML language supports hundreds of built-in functions that operate on vectors and matrices. However, you can also call hundreds of functions in Base SAS and pass in vectors for the parameters. It is awesome and convenient to be able to call the virtual smorgasbord of functions in Base SAS, such as probability functions, string-matching functions, trig functions, financial functions, and more. Of course, there is no such thing as a free lunch, so I wondered about the overhead costs associated with calling a Base SAS function from SAS/IML. Base SAS functions typically are designed to operate on scalar values, so the IML language has to call the underlying function many times, once for each value of the parameter vector. It is more expensive to call a function a million times (each time passing in a scalar parameter) than it is to call a function one time and pass in a vector that contains a million parameters.
To determine the overhead costs, I decided to test the cost of calling the MISSING function in Base SAS. The IML language has a built-in syntax (b = (X=.)) for creating a binary variable that indicates which elements of a vector are missing. The call to the MISSING function (b = missing(X)) is equivalent, but requires calling a Base SAS function many times, once for each element of X. The native SAS/IML syntax will be faster than calling a Base SAS function (TINSTAAFL!), but how much faster?
The following program incorporates many of my tips for measuring the performance of a SAS computation. The test is run on large vectors of various sizes. Each computation (which is very fast, even on large vectors) is repeated 50 times. The results are presented in a graph. The following program measures the performance for a character vector that contains all missing values.
/* Compare performance of IML syntax                     b = (X = " ");
   to performance of calling Base SAS MISSING function   b = missing(X);  */
proc iml;
numRep = 50;                              /* repeat each computation 50 times */
sizes = {1E4, 2.5E4, 5E4, 10E4, 20E4};    /* number of elements in vector */
labl = {"Size" "T_IML" "T_Missing"};
Results = j(nrow(sizes), 3);
Results[,1] = sizes;

/* measure performance for character data */
do i = 1 to nrow(sizes);
   A = j(sizes[i], 1, " ");      /* every element is missing */
   t0 = time();
   do k = 1 to numRep;
      b = (A = " ");             /* use built-in IML syntax */
   end;
   Results[i, 2] = (time() - t0) / numRep;

   t0 = time();
   do k = 1 to numRep;
      b = missing(A);            /* call Base SAS function */
   end;
   Results[i, 3] = (time() - t0) / numRep;
end;

title "Timing Results for (X=' ') vs missing(X) in SAS/IML";
title2 "Character Data";
/* convert from wide to long for graphing */
long = (sizes // sizes) || (Results[,2] // Results[,3]);
Group = j(nrow(sizes), 1, "T_IML") // j(nrow(sizes), 1, "T_Missing");
call series(long[,1], long[,2]) group=Group grid={x y}
     label={"Size" "Time (s)"} option="markers curvelabel"
     other="format X comma8.;";
The graph shows that the absolute times for creating a binary indicator variable are very fast for both methods. Even for 200,000 observations, creating a binary indicator variable takes less than five milliseconds. However, on a relative scale, the built-in SAS/IML syntax is more than twice as fast as calling the Base SAS MISSING function.
You can run a similar test for numeric values. For numeric values, the SAS/IML syntax is about 10-20 times faster than the call to the MISSING function, but, again, the absolute times are less than five milliseconds.
So, what's the cost of calling a Base SAS function from SAS/IML? It's not free, but it's very cheap in absolute terms! Of course, the cost depends on the number of elements that you are sending to the Base SAS function. However, in general, there is hardly any cost associated with calling a Base SAS function from SAS/IML. So enjoy the lunch buffet! Not only is it convenient and plentiful, but it's also very cheap!
The post Timing performance in SAS/IML: Built-in functions versus Base SAS functions appeared first on The DO Loop.
data _null_;
set A end=eof;
if x>0 & y>0 then    /* Y is evaluated only if X>0 */
   count + 1;
if eof then put count=;
run;
SAS programmers can optimize their IF-THEN and WHERE clauses if they can estimate the probability of each separate condition in a logical expression:
The SAS documentation does not discuss the conditions for which a logical expression does or does not short circuit. Polzin (1994, p. 1579) points out that when you put function calls in the logical expression, SAS evaluates certain function calls that produce side effects. Common functions that have side effects include random number functions and user-defined functions (via PROC FCMP) that have output arguments. The LAG and DIF functions can also produce side effects, but it appears that expressions that involve the LAG and DIF functions are short-circuited. You can force a function evaluation by calling the function prior to an IF-THEN statement. You can use nested IF-THEN/ELSE statements to ensure that functions are not evaluated unless prior conditions are satisfied.
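A hypothetical sketch of that last technique (the data set name Have and the variables x and u are invented for illustration): call the side-effect function before the IF-THEN statement so that it executes for every observation, even when short-circuit evaluation would otherwise skip it:

```
data _null_;
set Have end=eof;
u = rand("Uniform");       /* always evaluated: the random-number stream
                              advances on every observation */
if x > 0 & u < 0.5 then    /* u was already computed, even when x <= 0 */
   count + 1;
if eof then put count=;
run;
```

If the call to RAND were placed inside the IF condition instead, the number of times the stream advances would depend on the data.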
The SAS/IML language does not support short-circuiting in IF-THEN statements, but it performs several similar optimizations that are designed to speed up your code execution. One optimization involves the ANY and ALL functions, which test whether any or all (respectively) elements of a matrix satisfy a logical condition. A common usage is to test whether a missing value appears anywhere in a vector, as shown in the following SAS/IML statement:
bAny = any(y = .);    /* TRUE if any element of Y is missing */
/* Related: bAll = all(y ^= .) is TRUE if no element of Y is missing */
The SAS/IML language treats simple logical expressions like these as a single function call, not as two operations. I call this a logical ligature because two operations are combined into one. (A computer scientist might just call this a parser optimization.)
You might assume that the expression ANY(Y=.) is evaluated by using a two-step process. In the first step, the Boolean expression y=. is evaluated and the result is assigned to a temporary binary matrix, which is the same size as Y. In the second step, the temporary matrix is sent to the ANY function, which evaluates the binary matrix and returns TRUE if any element is nonzero. However, it turns out that SAS/IML does not use a temporary matrix. The SAS/IML parser recognizes that the expression inside the ANY function is a simple logical expression. The program can evaluate the function by looking at each element of Y and returning TRUE as soon it finds a missing value. In other words, it short-circuits the computation. If no value is missing, the expression evaluates to FALSE.
Short circuiting an operation can save considerable time. In the following SAS/IML program, the vector X contains 100 million elements, all equal to 1. The vector Y also contains 100 million elements, but the first element of the vector is a missing value. Consequently, the computation for Y is essentially instantaneous whereas the computation for X takes a tenth of a second:
proc iml;
numReps = 10;      /* run computations 10 times and report average */
N = 1E8;           /* create vector with 100 million elements */
x = j(N, 1, 1);    /* all elements of x equal 1 */
y = x;  y[1] = .;  /* the first element of y is missing */

/* The ALL and ANY functions short-circuit when the argument is a
   simple logical expression. These calls examine only the first element: */
t0 = time();
do i = 1 to numReps;
   bAny = any(y = .);     /* TRUE;  short-circuits at y[1] */
   bAll = all(y ^= .);    /* FALSE; short-circuits at y[1] */
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];

/* these function calls must examine all elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(x = .);
   bAll = all(x ^= .);
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];
Although the evaluation of X does not short circuit, it still uses the logical ligature to evaluate the expression. Consequently, the evaluation is much faster than the naive two-step process that is shown explicitly by the following statements, which require about 0.3 seconds and twice as much memory:
/* two-step process: slower */
b1 = (y = .);      /* form the binary vector */
bAny = any(b1);    /* TRUE */
In summary, the SAS DATA step uses short-circuit evaluation in IF-THEN statements and WHERE clauses that use simple logical expressions. If the expression contains several subexpressions, you can optimize your program by estimating the probability that each subexpression is true. In the SAS/IML language, the ANY and ALL functions not only short circuit, but when their argument is a simple Boolean expression, the language treats the function call as a logical ligature and evaluates the call in an efficient manner that does not require any temporary memory.
Short circuits can be perplexing if you don't expect them. Equally confusing is expecting a statement to short circuit, but it doesn't. If you have a funny story about short-circuit operators, in any language, leave a comment.
The post Short-circuit evaluation and logical ligatures in SAS appeared first on The DO Loop.
I am a professional applied mathematician, yet many of the mathematical and statistical techniques that I use every day are not from advanced university courses but are based on simple ideas taught in high school or even in grade school. I've written many blog posts in which the solution to an interesting problem requires little more than high-school math. Even when the solution requires advanced techniques, high-school math often provides the basis for solving the problem.
In celebration of the upcoming school year, here are 12 articles that show connections between advanced topics and mathematical ideas that are introduced in high-school or earlier. If you have a school-age child, read some of the articles to prepare yourself for the inevitable mid-year whine, "But, Mom/Dad, when will I ever use this stuff?!"
Obviously, most adults use basic arithmetic, fractions, decimals, and percents, but here are some less obvious uses of grade-school math:
Algebra, linear transformations, geometry, and trigonometry are the main topics in high school mathematics. These topics are the bread-and-butter of applied mathematics:
Many high schools offer a unit on probability and statistics, and some students take AP Statistics.
Einstein famously said, "everything should be made as simple as possible, but not simpler." It is surprising to me how often an advanced technique can be simplified and explained by using elementary math. I don't claim that "everything I needed to know about math I learned in kindergarten," but I often return to elementary techniques when I describe how to solve non-elementary problems.
What about you? What are some elementary math or statistics concepts that you use regularly in your professional life? Are there fundamental topics that you learned in high school that are deeper and more useful than you realized at the time? Leave a comment.
The post The math you learned in school: Yes, it's useful! appeared first on The DO Loop.
Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.
SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.
The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.
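The bin index itself is easy to compute in a DATA step. The following sketch is hypothetical (the input data set Have, the variable x, and the macro variables Min, Max, and k are assumed to be defined elsewhere):

```
/* sketch: assign equal-width bin numbers 1..k */
data Binned;
set Have;
Bin = ceil(&k * (x - &Min) / (&Max - &Min));   /* width of each bin is (Max-Min)/k */
if Bin < 1 then Bin = 1;   /* x = Min would otherwise map to bin 0 */
run;
```

Note that the adjustment for x = Min ensures that the minimum value falls into the first bin rather than a zeroth bin.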
In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints. Here are several resources for quantile binning:
Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.
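One standard way to implement custom cutpoints in SAS is a user-defined format. The following sketch (the format name and category labels are invented for illustration) encodes the Saffir-Simpson wind-speed thresholds in miles per hour:

```
/* sketch: custom bins via a user-defined format */
proc format;
value HurrCat low -< 74  = 'Not a hurricane'
              74  -< 96  = 'Category 1'
              96  -< 111 = 'Category 2'
              111 -< 130 = 'Category 3'
              130 -< 157 = 'Category 4'
              157 - high = 'Category 5';
run;
```

You can then associate the format with a wind-speed variable (for example, format WindSpeed HurrCat.;) or use the PUT function to create a new categorical variable that contains the bin labels.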
A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).
In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.
The post The essential guide to binning in SAS appeared first on The DO Loop.
There are two popular ways to choose the cut points. You can use evenly spaced points, or you can use quantiles of the data. If you use evenly spaced cut points (as in a histogram), the number of observations in each bin will usually vary. Using evenly spaced cut points is called the "bucket binning" method. Alternatively, if you use quantiles as cut points (such as tertiles, quartiles, or deciles), the number of observations in each bin tend to be more balanced. This article shows how to use PROC HPBIN in SAS to perform bucket binning and quantile binning. It also shows how to use the CODE statement in PROC HPBIN to create a DATA step that uses the same cut points to bin future observations.
The following statements create sample data from the Sashelp.Heart data. An ID variable is added to the data so that you can identify each observation. A call to PROC MEANS produces descriptive statistics about two variables: Cholesterol and Systolic blood pressure.
data Heart;
format PatientID Z5.;
set Sashelp.Heart(keep=Sex Cholesterol Systolic);
PatientID = 10000 + _N_;
run;

proc means data=Heart nolabels N NMISS Min Max Skewness;
   var Cholesterol Systolic;
run;
The output shows the range of the data for each variable. It also shows that the Cholesterol variable has 152 missing values. If your analysis requires nonmissing observations, you can use PROC HPIMPUTE to replace the missing values. For this article, I will not replace the missing values so that you can see how PROC HPBIN handles missing values.
Each variable has a skewed distribution, as indicated by the values of the skewness statistic. This usually indicates that equal-length binning will result in bins in the tail of the distribution that have only a few observations.
A histogram divides the range of the data by using k evenly spaced cutpoints. The width of each bin is (Max – Min) / k. PROC HPBIN enables you to create new variables that indicate to which bin each observation belongs. You can use the global NUMBIN= option on the PROC HPBIN statement to set the default number of bins for each variable. You can use the INPUT statement to specify which variables to bin. You can override the default number of bins by using the NUMBIN= option on any INPUT statement.
Suppose that you want to bin the Cholesterol data into five bins and the remaining variables into three bins.
The following call to PROC HPBIN bins the variables. The output data set, HeartBin, contains the bin numbers for each observation.
/* equally divide the range of each variable (bucket binning) */
proc hpbin data=Heart output=HeartBin numbin=3;   /* global NUMBIN= option */
   input Cholesterol / numbin=5;   /* override global NUMBIN= option */
   input Systolic;
   id PatientID Sex;
run;
Part of the output from PROC HPBIN is shown. (You can suppress the output by using the NOPRINT option.) The first table shows that PROC HPBIN used four threads on my PC to compute the results in parallel. The second table summarizes the transformation that bins the data. For each variable, the second column gives the names of the binned variables in the OUTPUT= data set. The third column shows the cutpoints for each bin. The Frequency and Proportion column show the frequency and proportion (respectively) of observations in each bin. As expected for these skewed variables, bins in the tail of each variable contain very few observations (less than 1% of the total).
The OUTPUT= option creates an output data set that contains the indicator variables for the bins. You can use PROC FREQ to enumerate the bin values and (again) count the number of observations in each bin:
proc freq data=HeartBin;
   tables BIN_Cholesterol BIN_Systolic / nocum;
run;
Notice that the Cholesterol variable was split into six bins even though the syntax specified NUMBIN=5. If a variable contains missing values, a separate bin is created for them. In this case, the zeroth bin contains the 152 missing values for the Cholesterol variable.
Bucket binning divides the range of the variables into equal-width intervals. For long-tailed data, the number of observations in each bin might vary widely, as for these data. The next section shows an alternative binning strategy in which the width of the bins vary and each bin contains approximately the same number of observations.
You can use evenly-spaced quantiles as cutpoints in an attempt to balance the number of observations in the bins. However, if the data are rounded or have duplicate values, the number of observations in each bin can still vary. PROC HPBIN has two methods for quantile binning. The slower method (the QUANTILE option) computes cutpoints based on the sample quantiles and then bins the observations. The faster method (the PSEUDO_QUANTILE option) uses approximate quantiles to bin the data. The following call uses the PSEUDO_QUANTILE option to bin the data into approximately equal groups:
/* bin by quantiles of each variable */
proc hpbin data=Heart output=HeartBin numbin=3 pseudo_quantile;
   input Cholesterol / numbin=5;      /* override global NUMBIN= option */
   input Systolic;                    /* use global NUMBIN= option */
   id PatientID Sex;
   code file='C:/Temp/BinCode.sas';   /* generate scoring code */
run;
The output shows that the number of observations in each bin is more balanced. For the Systolic variable, each bin has between 1,697 and 1,773 observations. For the Cholesterol variable, each bin contains between 975 and 1,056 observations. Although not shown in the table, the BIN_Cholesterol variable also contains a bin for the 152 missing values for the Cholesterol variable.
In the previous section, I used the CODE statement to specify a file that contains SAS DATA step code that can be used to bin future observations. The statements in the BinCode.sas file are shown below:
*****************  BIN_Systolic  ********************;
length BIN_Systolic 8;
if missing(Systolic) then do;
   BIN_Systolic = 0;
end;
else if Systolic < 124.0086 then do;
   BIN_Systolic = 1;
end;
else if 124.0086 <= Systolic < 140.0098 then do;
   BIN_Systolic = 2;
end;
else if 140.0098 <= Systolic then do;
   BIN_Systolic = 3;
end;

*****************  BIN_Cholesterol  ********************;
length BIN_Cholesterol 8;
if missing(Cholesterol) then do;
   BIN_Cholesterol = 0;
end;
else if Cholesterol < 190.0224 then do;
   BIN_Cholesterol = 1;
end;
else if 190.0224 <= Cholesterol < 213.0088 then do;
   BIN_Cholesterol = 2;
end;
else if 213.0088 <= Cholesterol < 234.0128 then do;
   BIN_Cholesterol = 3;
end;
else if 234.0128 <= Cholesterol < 263.0408 then do;
   BIN_Cholesterol = 4;
end;
else if 263.0408 <= Cholesterol then do;
   BIN_Cholesterol = 5;
end;
You can see from these statements that the zeroth bin is reserved for missing values. Nonmissing values will be split into bins according to the approximate tertiles (NUMBIN=3) or quintiles (NUMBIN=5) of the training data.
The following statements show how to use the file that was created by the CODE statement. New data are contained in the Patients data set. Simply use the SET statement and the %INCLUDE statement to bin the new data, as follows:
data Patients;
   length Sex $6;
   input PatientID Sex Systolic Cholesterol;
   datalines;
13021  Male    96    .
13022  Male   148  242
13023  Female 144  217
13024  Female 164  376
13025  Female   .  248
13026  Male   190  238
13027  Female 158  326
13028  Female 188  266
;

data MakeBins;
   set Patients;
   %include 'C:/Temp/BinCode.sas';  /* include the binning statements */
run;

/* Note: HPBIN puts missing values into bin 0 */
proc print data=MakeBins; run;
The input data can contain other variables (PatientID, Sex) that are not binned. However, the data should contain the Systolic and Cholesterol variables because the statements in the BinCode.sas file refer to those variables.
In summary, you can use PROC HPBIN in SAS to create a new discrete variable by binning a continuous variable. This transformation is common in machine learning algorithms. Two common binning methods include bucket binning (equal-length bins) and quantile binning (unequal-length bins). Missing values are put into their own bin (the zeroth bin). The CODE statement in PROC HPBIN enables you to write DATA step code that you can use to bin future observations.
The post How to use PROC HPBIN to bin numerical variables appeared first on The DO Loop.
The post Use numeric values for column headers when printing a matrix appeared first on The DO Loop.
A common situation in which I want to use numeric values as a column header is when I am using the TABULATE function to compute the frequencies for a discrete numerical variable. For example, the Cylinders variable in the Sashelp.Cars data set indicates the number of cylinders in each vehicle. You might want to know how many vehicles in the data are four-cylinder, six-cylinder, eight-cylinder, and so forth. In SAS/IML 15.1, you can use a numeric variable (such as the levels of the Cylinders variable) directly in the COLNAME= option of the PRINT statement, as follows:
proc iml;
use Sashelp.Cars;
read all var "Cylinders" into X;
close;

call tabulate(level, freq, X);   /* count number of obs for each level of X */

/* New in SAS/IML 15.1: use numeric values as COLNAME= option to PRINT statement */
print freq[colname=level];
Prior to SAS/IML 15.1, you had to convert numbers to character values by applying a format. This is commonly done by using the CHAR function, which applies a W.D format. This technique is still useful, but optional. You can also use the PUTN function to apply any format you want, such as COMMA., DATE., TIME., and so forth. For example, if you like Roman numerals, you could apply the ROMAN. format!
labels = char(level, 2);         /* apply w.d format */
print freq[colname=labels];

labels = putn(level, "ROMAN.");  /* second arg is name of format */
print freq[colname=labels];
You can also use numeric values for the ROWNAME= option to obtain row headers in SAS/IML 15.1. Sure, it's a small enhancement, but I like it!
The post Vectorize the computation of the Mandelbrot set in a matrix language appeared first on The DO Loop.
Although fractals are no longer in vogue, the desire for fast computations never goes out of style. In a matrix-vector language such as SAS/IML, you can achieve fast computations by vectorizing the computations, which means operating on large quantities of data and using matrix-vector computations. Sometimes this paradigm requires rethinking an algorithm. This article shows how to vectorize the Mandelbrot set computation. The example provides important lessons for statistical computations, and, yes, you get to see a pretty picture at the end!
As a reminder, the Mandelbrot set is a visualization of a portion of the complex plane. If c is a complex number, you determine the color for c by iterating a complex quadratic map z <- z^{2} + c, starting with z=0. You keep track of how many iterations it takes before the iteration diverges to infinity, which in practice means that |z| exceeds some radius, such as 5. The parameter values for which the iteration remains bounded belong to the Mandelbrot set. Other points are colored according to the number of iterations before the trajectory exceeded the specified radius.
The classical computation of the Mandelbrot set iterates over the parameter values in a grid, as follows:
For each value c in a grid:
   Set z = 0
   For i = 1 to MaxIter:
      Apply the quadratic map, f, to form the i_th iteration, z_i = f^i(z; c).
      If z_i > radius, stop and remember the number of iterations.
   End;
   If the trajectory did not diverge, assume the parameter value is in the Mandelbrot set.
End;
An advantage of this algorithm is that it does not require much memory, so it was great for the PCs of the 1980s! For each parameter, the color is determined (and often plotted), and then the parameter is never needed again.
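The classical scalar algorithm above is easy to sketch in a few lines. The following Python function is my own illustration (not part of the SAS program in this article); it uses Python's built-in complex type and the same radius (5) and iteration limit (100) that the article uses.

```python
def escape_time(c, max_iter=100, radius=5.0):
    """Iterate z <- z^2 + c starting from z=0. Return the iteration count at
    which |z| first exceeds the radius, or max_iter if the orbit stays bounded."""
    z = 0 + 0j
    for i in range(1, max_iter + 1):
        z = z * z + c
        if abs(z) > radius:
            return i
    return max_iter  # never diverged: treat c as a member of the Mandelbrot set

# c = 0 stays bounded forever; c = 2 escapes on the second iteration
low_mem_results = [escape_time(0 + 0j), escape_time(2 + 0j)]
```

Calling this function once per grid point reproduces the low-memory algorithm: each parameter is processed and then never needed again.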
In the classical algorithm, all computations are scalar computations. The outer loop is typically over a large number, K, of grid points. The maximum number of iterations is typically 100 or more. Thus, the algorithm performs 100*K scalar computations in order to obtain the colors for the image. For a large grid that contains K=250,000 parameters, that's 25 million scalar computations for the low-memory algorithm.
A vectorized version of this algorithm inverts the two loops. Instead of looping over the parameters and iterating each associated quadratic map, you can store the parameter values in a grid and iterate all K trajectories at once by applying the quadratic map in a vectorized manner. The vectorized algorithm is:
Define c to be a large grid of parameters.
Initialize a large grid z = 0, which will hold the current state.
For i = 1 to MaxIter:
   For all points that have not yet diverged:
      Apply the quadratic map, f, to z to update the current state.
      If any z > radius, assign the number of iterations for those parameters.
   End;
End;
If a trajectory did not diverge, assume the parameter value is in the Mandelbrot set.
The vectorized algorithm performs the same computations as the scalar algorithm, but each iteration of the quadratic map operates on a huge number of current states (z). Also, all states are checked for divergence by using a single call to a vector operation. There are many fewer function calls, which reduces overhead costs, but the vectorized program uses a lot of memory to store all the parameters and states.
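The same inversion of loops works in any matrix-vector language. As a point of comparison, here is a NumPy sketch of the vectorized algorithm (my own illustration, not the SAS/IML program in this article; the function and variable names are mine). It iterates every grid point at once and, on each pass, updates only the states that have not yet diverged.

```python
import numpy as np

def mandelbrot_grid(re, im, max_iter=100, radius=5.0):
    """Vectorized escape-time computation: one map step updates all
    not-yet-diverged grid points simultaneously."""
    c = re[np.newaxis, :] + 1j * im[:, np.newaxis]  # grid of complex parameters
    z = np.zeros_like(c)                            # current states, all start at 0
    iters = np.full(c.shape, max_iter)              # default: never diverged
    active = np.ones(c.shape, dtype=bool)           # True = not yet diverged
    for i in range(1, max_iter + 1):
        z[active] = z[active] ** 2 + c[active]      # one vectorized map step
        diverged = active & (np.abs(z) > radius)    # which points left the disk?
        iters[diverged] = i                         # remember the iteration number
        active &= ~diverged                         # freeze the diverged states
        if not active.any():
            break
    return iters

re = np.linspace(-2.1, 0.6, 55)   # same rectangle as the article, coarser grid
im = np.linspace(-1.3, 1.3, 53)
iters = mandelbrot_grid(re, im)
```

Freezing diverged states serves the same purpose as the LOC-based subsetting in the SAS/IML program: the map is applied only where it is still needed, and the states cannot overflow after they escape.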
Let's see how the algorithm might be implemented in the SAS/IML language. First, define vectorized functions for performing the complex quadratic map and the computation of the complex magnitude (absolute value). Because SAS/IML does not provide built-in support for complex numbers, each complex number is represented by a 2-D vector, where the first column is the real part and the second column is the imaginary part.
ods graphics / width=720px height=640px NXYBINSMAX=1200000;  /* enable large heat maps */
proc iml;
/* Complex computations: z and c are (N x 2) matrices that represent complex numbers */
start Magnitude(z);
   return ( z[,##] );      /* |z|^2 = x##2 + y##2 */
finish;

/* complex iteration of quadratic mapping z -> z^2 + c
   For complex numbers z = x + iy and c = a + ib,
   w = z^2 + c is the complex number (x^2 - y^2 + a) + i(2*x*y + b) */
start Map(z, c);
   w = j(nrow(z), 2, 0);
   w[,1] = z[,1]#z[,1] - z[,2]#z[,2] + c[,1];
   w[,2] = 2*z[,1]#z[,2] + c[,2];
   return ( w );
finish;
The next block of statements defines a grid of values for the parameter c:
/* Set parameters: initial range for grid of points and number of grid divisions.
   Radius for divergence and max number of iterations */
nRe = 541;                     /* num pts in Real (horiz) direction */
nIm = 521;                     /* num pts in Imag (vert) direction */
re_min = -2.1;  re_max = 0.6;  /* range in Real direction */
im_min = -1.3;  im_max = 1.3;  /* range in Imag direction */
radius = 5;                    /* when |z| > radius, iteration has diverged */
MaxIter = 100;                 /* maximum number of iterations */

d_Re = (Re_max - Re_min) / (nRe-1);  /* horiz step size */
d_Im = (Im_max - Im_min) / (nIm-1);  /* vert step size */
Re = do(re_min, re_max, d_Re);       /* evenly spaced array of horiz values */
Im = do(im_min, im_max, d_Im);       /* evenly spaced array of vert values */
c = expandgrid( re, im );            /* make into 2D grid */

z = j(nrow(c), 2, 0);      /* initialize z = 0 for grid of trajectories */
iters = j(nrow(c), 1, .);  /* for each parameter, how many iterations until diverge */
In this example, the parameters for the mapping are chosen in the rectangle with real part [-2.1, 0.6] and imaginary part [-1.3, 1.3]. The parameter grid contains 541 horizontal values and 521 vertical values, for a total of almost 282,000 parameters.
The vectorized program will iterate all 282,000 mappings by using a single call to the Map function. After each iteration, all trajectories are checked to see which ones have left the disk of radius 5 (diverged). The iters vector stores the number of iterations until leaving the disk. During the iterations, a missing value indicates that the trajectory has not yet left the disk. At the end of the program, all trajectories that have not left the disk are set to the maximum number of iterations.
There are two efficiencies you can implement. First, you can pre-process some of the parameters that are known to be inside the "head" or "body" of the bug-shaped Mandelbrot set. Second, for each iteration, you only need to apply the quadratic map to the points that have not yet diverged (that is, iters is missing for that parameter). This is implemented in the following program statements:
/* Pre-process parameters that are known to be in the Mandelbrot set.
   Params in "head": inside circle of radius 1/8 centered at (-1, 0)
   Params in "body": inside rect [-0.5, 0.2] x [-0.5, 0.5] */
idxHead = loc( (c - {-1 0})[,##] < 0.0625 );
if ncol(idxHead) > 0 then iters[idxHead] = MaxIter;
idxBody = loc(c[,1]> -0.5 & c[,1]< 0.2 & c[,2]> -0.5 & c[,2]< 0.5);
if ncol(idxBody) > 0 then iters[idxBody] = MaxIter;

/* compute the Mandelbrot set */
done = 0;                /* have all points diverged? */
do i = 1 to MaxIter until(done);
   j = loc(iters=.);     /* find the points that remain bounded */
   w = z[j,]; a = c[j,]; /* keep only states and parameters that have not diverged */
   w = Map(w, a);        /* map those points forward one iteration */
   z[j,] = w;            /* update current state */
   mag = Magnitude(w);   /* compute magnitude of current states */
   diverged = (mag >= radius) & (iters[j] = .);  /* which points diverged on this iteration? */
   idx = loc(diverged);  /* indices of the diverged points */
   if ncol(idx) > 0 then /* if diverged, remember the iteration number */
      iters[j[idx]] = i;
   if all(diverged > 0) then done = 1;
end;

/* Points that never diverged are part of the Mandelbrot set. Assign them MaxIter */
idx = loc(iters=.);
if ncol(idx)>0 then iters[idx] = MaxIter;
The last step is to assign colors that visualize the Mandelbrot set. Whereas Robert Allison used a scatter plot, I will use a heat map. The iters vector contains the number of iterations before divergence, or MaxIters if the trajectory never diverged. You can assign a color ramp to the number of iterations, or, as I've done below, to a log-transformation of the iteration number. I've previously discussed several ways to assign colors to a heat map.
/* Color parameter space by using LOG10(number of iterations) */
numIters = log10( shape(iters, nRe, nIm) );  /* shape iterations into matrix */
mandelbrot = numIters`;   /* plot num iterations for each parameter */
call heatmapcont(mandelbrot) xValues=Re yValues=Im
     ColorRamp="Temperature" displayoutlines=0 showlegend=0
     title="Mandelbrot Set from Vector Computations";
On my PC, the Mandelbrot computation takes about a second. This is not blazingly fast, but it is faster than the low-memory, non-vectorized, computation. If you care about the fastest speeds, use the DATA step, as shown in Robert's blog post.
I'm certainly not the first person to compute the Mandelbrot set, so what's the point of this exercise? The main purpose of this article is to show how to vectorize an iterative algorithm that performs the same computation many times for different parameter values. Rather than iterate over the set of parameter values and perform millions of scalar computations, you create an array (or matrix) of parameter values and perform a small number of vectorized computations. In most matrix-vector programming languages, a vectorized computation is more efficient than looping over each parameter value. There are few function calls and each call performs high-level computations on large matrices and vectors.
Statistical programmers don't usually have a need to compute fractals, but the ideas in this program apply to time series computations, root-finding, optimization, and any computation where you compute the same quantity for many parameter values. In short, although the Mandelbrot set is fun to compute, the ability to vectorize computations in a matrix language is a skill that rewards you with more than pretty pictures. And fast computations never go out of style.
The post Implement the Gumbel distribution in SAS appeared first on The DO Loop.
SAS supports more than 25 common probability distributions for the PDF, CDF, QUANTILE, and RAND functions. Of course, there are infinitely many distributions, so not every possible distribution is supported. If you need a less-common distribution, I've shown how to extend the functionality of Base SAS (by using PROC FCMP) or the SAS/IML language by defining your own functions. This article shows how to implement the Gumbel distribution in SAS. The density functions for four parameter values of the Gumbel distribution are shown to the right. As I recently discussed, the Gumbel distribution is used to model the maximum (or minimum) in a sample of size n.
If you are going to work with a probability distribution, you need at least four essential functions: The CDF (cumulative distribution), the PDF (probability density), the QUANTILE function (inverse CDF), and a way to generate random values from the distribution. Because the RAND function in SAS supports the Gumbel distribution, you only need to define the CDF, PDF, and QUANTILE functions.
These functions are trivial to implement for the Gumbel distribution because each has an explicit formula. The following list defines each function. To match the documentation for the RAND function and PROC UNIVARIATE (which both support the Gumbel distribution), μ is used for the location parameter and σ is used for the scale parameter. In each formula, z = (x-μ)/σ is the standardized value:
- CDF: F(x) = exp(-exp(-z))
- PDF: f(x) = exp(-z - exp(-z)) / σ
- QUANTILE: F⁻¹(p) = μ - σ log(-log(p)), for 0 < p < 1
The RAND function in Base SAS and the RANDGEN function in SAS/IML support generating random Gumbel variates. Alternatively, you could generate u ~ U(0,1) and then use the QUANTILE formula to transform u into a random Gumbel variate.
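Because the formulas are explicit, you can sanity-check them in any language. The following pure-Python sketch (my own illustration, not SAS code; the function names are mine) implements the same three formulas that appear in the SAS/IML program below and verifies that the quantile function inverts the CDF.

```python
import math

def gumbel_cdf(x, mu, sigma):
    """F(x) = exp(-exp(-z)), where z = (x-mu)/sigma."""
    z = (x - mu) / sigma
    return math.exp(-math.exp(-z))

def gumbel_pdf(x, mu, sigma):
    """f(x) = exp(-z - exp(-z)) / sigma."""
    z = (x - mu) / sigma
    return math.exp(-z - math.exp(-z)) / sigma

def gumbel_quantile(p, mu, sigma):
    """Inverse CDF: mu - sigma * log(-log(p)), for 0 < p < 1."""
    return mu - sigma * math.log(-math.log(p))

# the quantile function inverts the CDF; note F(mu) = exp(-1), not 1/2
mu, sigma = 3.09, 0.286
x75 = gumbel_quantile(0.75, mu, sigma)
p75 = gumbel_cdf(x75, mu, sigma)   # recovers 0.75
```

The round trip QUANTILE→CDF is also the basis of inverse-CDF sampling: transform u ~ U(0,1) by the quantile function to obtain a random Gumbel variate, as mentioned above.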
Using these definitions, you can implement the formulas as inline expressions, or you can define reusable functions, such as in the following SAS/IML program:
proc iml;
start CDFGumbel(x, mu, sigma);
   z = (x-mu)/sigma;
   return exp(-exp(-z));            /* CDF of Gumbel distribution */
finish;

start PDFGumbel(x, mu, sigma);
   z = (x-mu)/sigma;
   return exp(-z-exp(-z)) / sigma;
finish;

start QuantileGumbel(p, mu, sigma);
   return mu - sigma # log(-log(p));
finish;

/* Example: Call the PDFGumbel function */
mu = 3.09;      /* params for max value of a sample of size */
sigma = 0.286;  /* n=1000 from the std normal distrib */
x = do(2, 5, 0.025);
PDF = PDFGumbel(x, mu, sigma);

title "Gumbel PDF";
call series(x, PDF) grid={x y};
The graph shows the density for the Gumbel(3.09, 0.286) distribution, which models the distribution of the maximum value in a sample of size n=1000 drawn from the standard normal distribution. You can see that the maximum value is typically between 2.5 and 4.5, with values near 3 being the most likely.
I recently discussed the best way to draw a family of curves in SAS by using PROC SGPLOT. The following DATA step defines the PDF and CDF for a family of Gumbel distributions. You can use this data set to display several curves that indicate how the distribution changes for various values of the parameters.
data Gumbel;
array n[4]    _temporary_ (10   100  1000 1E6);   /* sample size */
array mu[4]   _temporary_ (1.28 2.33 3.09 4.75);  /* Gumbel location parameter */
array beta[4] _temporary_ (0.5  0.36 0.29 0.2);   /* Gumbel scale parameter */
do i = 1 to dim(beta);                 /* parameters in the outer loop */
   m = mu[i];  b = beta[i];
   Params = catt("mu=", m, "; beta=", b);  /* concatenate parameters */
   do x = 0 to 6 by 0.01;
      z = (x-m)/b;
      CDF = exp( -exp(-z) );           /* CDF for Gumbel */
      PDF = exp(-z - exp(-z)) / b;     /* PDF for Gumbel */
      output;
   end;
end;
drop z;
run;

/* Use ODS GRAPHICS / ATTRPRIORITY=NONE if you want to force the
   line attributes to vary in the HTML destination. */
title "The Gumbel(mu, beta) Cumulative Distribution";
proc sgplot data=Gumbel;
   label cdf="Probability";
   series x=x y=cdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid; yaxis grid;
run;

title "The Gumbel(mu, beta) Probability Density";
proc sgplot data=Gumbel;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid offsetmin=0 offsetmax=0; yaxis grid;
run;

title "The Gumbel(mu, beta) Quantile Function";
proc sgplot data=Gumbel;
   label cdf="Probability";
   series x=cdf y=x / group=Params lineattrs=(thickness=2);
   keylegend / position=right sortorder=reverseauto;
   xaxis grid; yaxis grid;
run;
The graph for the CDF is shown. A graph for the PDF is shown at the top of this article. I augmented the PDF curve with labels that show the sample size for which the Gumbel distribution is associated. The leftmost curve is the distribution of the maximum in a sample of size n=10; the rightmost curve is for a sample of size one million.
For completeness, I will mention that you can use the RAND function to generate random values and use PROC UNIVARIATE to fit Gumbel parameters to data. For example, the following DATA step generates 5,000 random variates from Gumbel(1.28, 0.5). The call to PROC UNIVARIATE fits a Gumbel model to the data. The estimates for (μ, σ) are (1.29, 0.51).
data GumbelRand;
call streaminit(12345);
mu = 1.28;  beta = 0.5;
do i = 1 to 5000;
   x = rand("Gumbel", mu, beta);
   output;
end;
keep x;
run;

proc univariate data=GumbelRand;
   histogram x / Gumbel;
run;
In summary, although the PDF, CDF, and QUANTILE functions in SAS do not support the Gumbel distribution, it is trivial to implement these functions. This article visualizes the PDF and CDF of the Gumbel distributions for several parameter values. It also shows how to simulate random values from a Gumbel distribution and how to fit the Gumbel model to data.