The DO Loop

The length of the string of lights that you need to wrap around your Christmas tree

Rick Wicklin — Mon, 15 Dec 2025 10:40:28 +0000

Rockin' around the Christmas tree,
Have a happy holiday
– Brenda Lee

An internet search for "How many lights do I need for my Christmas tree?" returns many opinions. A popular answer is that you need "one strand of lights for each foot of your tree." Another common answer is "100 mini lights per vertical foot." However, strands vary in length. Furthermore, these easy-to-remember rules consider only the height of the tree. They ignore a fundamental fact of Christmas tree geometry: The tree is approximately a cone. If the base of the tree has a wide radius, you will need more lights than if the base of the tree has a small radius. Thus, when calculating the length of a string of lights for a Christmas tree, you should account for the shape of the tree, which includes both the height and the radius of the lowest branches. (Technically, the full tree height doesn't matter, only the height from the lowest branches to the top.)

I recently saw a post on social media that used math to compute the length of a helical string of lights wrapped on a perfectly conical Christmas tree that has height H and radius of the base R. Again, H measures only the "green" part of the tree, from the bottom of the first set of branches to the top of the tree. The formula is given in the next section. It is very complicated.

I think there is a better mathematical model that is simpler and for which the length of the lights is easier to compute. I call it the n-rings model. For the n-rings model, you first wrap the lights around the lowest set of branches. This makes a horizontal ring around the tree. You then pass the string up to the next set of branches and wrap the lights horizontally around the tree on that set of branches. You continue this process a total of n times until you wrap the lights around the last set of branches near the top of the tree. The length of the string of lights needed for this process is simple to evaluate by using a handheld calculator (or the calculator app on your phone).

You can use any units in these formulas. This article uses "feet" because that is the US measurement used for Christmas tree heights and for strings of lights.

The helical model of stringing lights

The helical model for stringing lights looks like the images below. In these images, a string of lights wraps around a perfectly conical tree with base radius R and branch-height H. The branch-height measures the height from the lowest branch to the top of the tree. It excludes the trunk and tree stand. In the images, n=10, R=2, and H=5.

The string of lights progresses up the tree at a constant rate. The lights wrap around the tree a total of n times. In the images, n=10. For any value of n, the formula for the length of the lights is given by the following complicated expression:
$L_\mbox{Helix} = \frac{1}{2} \sqrt{R^2 + H^2 + (2\pi n R)^2} + \frac{R^2 + H^2}{4\pi n R} \sinh^{-1} \left(\frac{2\pi n R}{\sqrt{R^2 + H^2}}\right)$

Yuck! Not only is this formula complicated, but it involves the function " $\sinh^{-1}$ ," which is called the "inverse hyperbolic sine" function. Some simple calculators do not even contain a key for this function, although most scientific calculators can compute this function by combining a "2nd" or "inverse" key with the "sinh" key. If your calculator doesn't support this function, you can use the identity $\sinh^{-1}(x) = \log\left(x + \sqrt{x^2 + 1}\right)$ . No matter how you express it, the formula is complicated to use. The formula can be derived by parameterizing the helix and applying the arclength formula from multivariable calculus.
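For readers who want to see where the closed form comes from, here is a sketch of the derivation. The parameter t and the shorthand symbols s and v are notation introduced for this sketch; s and v match the variable names in the SAS program later in this article.

```latex
% Helix on a cone: the radius shrinks linearly while the angle and height grow linearly
\begin{align*}
r(t) &= R(1-t), \quad \theta(t) = 2\pi n t, \quad z(t) = H t, \qquad 0 \le t \le 1 \\
L_{\mathrm{Helix}} &= \int_0^1 \sqrt{r'(t)^2 + r(t)^2\,\theta'(t)^2 + z'(t)^2}\; dt
   = \int_0^1 \sqrt{s^2 + v^2 (1-t)^2}\; dt \\
 &= \frac{1}{2}\sqrt{s^2 + v^2} + \frac{s^2}{2v}\,\sinh^{-1}\!\left(\frac{v}{s}\right),
 \qquad s = \sqrt{R^2 + H^2}, \quad v = 2\pi n R
\end{align*}
```

The last line is the expression that the SAS program evaluates for L_Helix.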

In the images, the length of the helical string of lights wrapped n=10 times is 63.33 feet.
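If you want to convince yourself that the logarithm identity can stand in for the inverse hyperbolic sine, a short DATA step can evaluate the helix length both ways. This is a minimal sketch for the example in the images (n=10, R=2, H=5); the variable names x, asinh1, and asinh2 are chosen here for illustration.

```sas
/* Compare ARSINH with the logarithm identity for the example in the images */
data _null_;
   R = 2;  H = 5;  n = 10;
   s = sqrt(R**2 + H**2);             /* slant length of the cone */
   v = 2*constant('pi')*n*R;
   x = v / s;                         /* argument of the inverse hyperbolic sine */
   asinh1 = arsinh(x);                /* built-in function */
   asinh2 = log(x + sqrt(x**2 + 1));  /* logarithm identity */
   L_Helix = 0.5*sqrt(s**2 + v**2) + (s**2/(2*v)) * asinh2;
   put asinh1= 10.6 asinh2= 10.6 L_Helix= 8.2;
run;
```

The two values of the inverse hyperbolic sine should agree, and L_Helix should be about 63.33 feet, which is the value quoted above.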

The n-rings model of stringing lights

The helical model is not always an accurate model of how people string lights on their tree. I use the n-rings approach, perhaps because I have an artificial tree for which the branches are more or less aligned in rows. My tree has a total of 12 rings of branches that are about 6 inches apart, so it is natural to wrap each set of branches with lights. The n-rings model for stringing lights looks like the images below, which again use R=2 and H=5.

In the images, n=10 because there are 10 horizontal rings of lights. A slanted segment connects each horizontal ring to the next one above it. (I put these connecting segments on the back of the tree.) For any value of n, the formula for the length of the lights is given by the following simple expression:
$L_\mbox{Rings} = \sqrt{R^2 + H^2} + (n+1) \pi R$

Ah! This formula is much simpler! In the images, the length of the string of lights in the n-rings model is 74.50 feet when n=10. Notice that this length is longer than the length for the helical model. This is always the case. Therefore, you can always use the simpler n-rings formula if you are trying to estimate how many boxes of lights you'll need for a conical tree of a given height and base radius. For the math nerds, the formula is derived in the Appendix.
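As a quick check of the arithmetic, you can substitute the values from the images (R=2, H=5, n=10) into the n-rings formula:

```latex
\begin{align*}
L_{\mathrm{Rings}} &= \sqrt{2^2 + 5^2} + (10+1)\,\pi\,(2) \\
                   &= \sqrt{29} + 22\pi \\
                   &\approx 5.385 + 69.115 = 74.50 \text{ feet}
\end{align*}
```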

A SAS program to compute the length of Christmas tree lights

You can write a program to compute the length of lights for the two models. I think that placing lights about 6 inches (0.5 feet) apart looks nice, so for a tree that has height H, I suggest winding the lights around the tree n = ceil(2*H) times. (Math comment: The formulas don't care whether n is an integer. They still make sense if you want to wrap the lights 10.5 times!)

Trees come in many shapes and sizes, from small "tabletop trees" to massive 9-foot giants in a living room with a vaulted ceiling. However, the most common trees in the US are between 5 and 7 feet tall, which includes the trunk length. Thus, the branch-height (which excludes the trunk) is 0.5 to 1.0 feet less than the tree height. Accordingly, the following SAS program varies the branch-height between 3.5 and 6 feet. The base radius of trees also varies according to species and design, such as slender "pencil trees" that are designed for a hallway or a small apartment. However, many trees have a height-to-base-radius ratio (H/R) that is between 2 and 3. That is, the branch-height of the tree is 2-3 times taller than the base radius. In the previous images, the ratio H/R is 5/2 = 2.5. The following choices of H and R all have a ratio between 2 and 3.

/* SAS DATA step to compute the length of strings of lights for two 
   strategies for winding them onto a conical Christmas tree: 
   helical model and the n-rings model */
data LengthOfLights;
keep H R n L_Helix L_Rings;
label H = "Height of Branches (ft)"
      R = "Radius of Lower Branches (ft)"
      n = "Number of Rings"
      L_Helix = "Length of Helical Model"
      L_Rings = "Length of Rings Model";
format L_Rings 6.1 L_Helix 6.1;
input H R;
 
pi = constant('pi');
n = ceil(2*H);     /* wind the lights twice per vertical foot */
s2 = R**2 + H**2;
s = sqrt(s2);      /* slant length of tree (Pythagorean Theorem) */
 
/* 1. formula for arc length of helix */
w = 2*pi*n;
v = R * w;
L_Helix = 0.5*sqrt(s2 + v**2) + (s2/(2*v)) * arsinh(v/s);
 
/* 2. formula for rings with a vertical connection */
sum_Rings = (n+1)*pi*R;
L_Rings = s + sum_Rings; 
datalines;
3.5 1.5
4   1.5
4.5 1.5
5   2
5.5 2
5.5 2.5
6   2.5
6   3
;
 
proc print noobs data=LengthOfLights label; 
run;

For the record, my family's (artificial) tree has H=5 feet and R=2 feet. The table says that the n-Rings method with n=10 requires 74.5 feet of lights. Accordingly, I use 8 boxes of 10-foot strings, which gives me 80 feet of lights. Because this is more length than I need for 10 rings, I let the lights undulate up and down as I wind them.

If you already have boxes of lights at home, you can invert the n-rings formula to solve for n. This tells you how many rings you can form if you have a string of length L. The inverse formula is
$n = (L - s)/(\pi R) - 1, \,\mbox{where}\, s = \sqrt{R^2 + H^2}$
For example, I have L=80 feet of lights, so for R=2 and H=5 I could wrap the tree 10.88 times by using equidistant rings.
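The inverse formula is easy to evaluate in a DATA step. The following is a minimal sketch that uses the values from my tree (L=80, R=2, H=5); the variable nWhole is introduced here for readers who prefer a whole number of rings.

```sas
/* Solve the n-rings formula for n, given the length L of lights on hand */
data _null_;
   L = 80;  R = 2;  H = 5;        /* values from the example in the text */
   pi = constant('pi');
   s = sqrt(R**2 + H**2);         /* slant length of the tree */
   n = (L - s)/(pi*R) - 1;        /* number of equally spaced rings */
   nWhole = floor(n);             /* whole number of rings */
   put n= 6.2 nWhole=;
run;
```

For these values, the log should show n=10.88 and nWhole=10. The leftover length is what lets the lights undulate a little between the rings.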

Summary

This article considers the question, "How many lights do I need to wrap around my Christmas tree?" Common online answers use only the height of the tree, not the radius of the base.

A better answer uses the base radius. One online formula is for a mathematical model in which the light strings are wrapped in a helix. The helical formula is very complicated, and for some trees, a helical winding might not be practical.

This article presents a simple mathematical model, which is the n-rings model. This model is appropriate for artificial trees or a natural tree for which the branches look like a stack of horizontal disks. You can wrap the lights around the lowest branches, transition to the next level, wrap the lights around the branches at that level, and continue until you reach the top of the tree. For any conical-shaped tree, measure the following:

R: The radius of the lowest branches (from the trunk to the outer branches)
H: The vertical distance up the trunk from the lowest branch to the top of the tree.

If you want to wind the lights n times around the Christmas tree, you will need a string of lights that is at least

$L = \sqrt{R^2 + H^2} + (n+1) \pi R$ long. This formula works for feet, inches, meters, centimeters, or any other units of length.

Appendix: Derivation of the n-rings formula

We want to find the total sum of circumferences for n circles that are equidistant on a right circular cone of height H and base radius R. The height between adjacent circles is dz = H/n. As shown in the image at the top of this article, if the first circle is at z=0, the radius of the i_th circle is
$r_i = \frac{R}{H}(H - z_i) \,\mbox{where}\, z_i = (i - 1)H/n$

Each circle has circumference $L_i = 2 \pi r_i$ , so the total length of the circles is
$L = 2 \pi \frac{R}{n} \sum_{i=1}^n \left( n - i + 1 \right)$

There are some convenient formulas that simplify this formula:

$\sum_{i=1}^n 1 = n$
$\sum_{i=1}^n i = n (n+1)/2$
$\sum_{i=1}^n n = n \sum_{i=1}^n 1 = n^2$

By applying these formulas, the summation signs vanish and we are left with the sum of the circumferences of the n circles:
$L = \pi R (n + 1)$

This is the length of the n equi-spaced circles. We can add in the sum of the lengths of the slanted segments that connect the circles. The union of those segments is the hypotenuse of the right triangle, which, by the Pythagorean theorem, has the length $\sqrt{R^2 + H^2}$ . Consequently, the total length of lights that are wrapped n times around a right circular cone of height H and radius R is
$L_{\mbox{Rings}} = \sqrt{R^2 + H^2} + (n + 1) \pi R$
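If you prefer a numerical check to algebra, you can sum the circumferences directly and compare the result with the closed-form expression. The following SAS IML statements are a minimal sketch for R=2, H=5, and n=10; the names ringIdx, z, radius, sumCirc, and closedForm are chosen here for illustration. (The radius vector is not named r because IML names are not case sensitive, so r and R would be the same symbol.)

```sas
proc iml;
R = 2;  H = 5;  n = 10;                /* example values used in this article */
pi = constant('pi');
ringIdx = T(1:n);                      /* ring index 1, 2, ..., n */
z = (ringIdx - 1) * H / n;             /* height of each ring */
radius = (R/H) * (H - z);              /* radius of the cone at each height */
sumCirc = sum(2*pi*radius);            /* direct sum of the n circumferences */
closedForm = pi*R*(n + 1);             /* result of the derivation */
L_Rings = sqrt(R**2 + H**2) + closedForm;
print sumCirc closedForm L_Rings;
quit;
```

The first two values should agree (about 69.1), and L_Rings should be about 74.5.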

The post The length of the string of lights that you need to wrap around your Christmas tree appeared first on The DO Loop.

What is a medoid?

Rick Wicklin — Mon, 08 Dec 2025 10:21:35 +0000

In univariate data analysis, the median is often used as an alternative to the mean because the mean is sensitive to outliers in the data, whereas the median is a robust statistic. For higher-dimensional data, the mean and the centroid are both used to represent the "center" of a cloud of high-dimensional points. However, these statistics, too, are sensitive to outliers.

One robust high-dimensional estimate of a center is the geometric median. Unfortunately, the geometric median is not simple to compute. You can compute the geometric median in SAS by using nonlinear optimization or iterative matrix computations. A simpler "center" is the medoid.

What is a medoid?

The medoid (pronounced MEH-doyd, which rhymes with "red void") is a robust statistic that is conceptually easy to understand. Given a set of points, X = {x₁, x₂, ..., x_k}, the medoid is a point, m, in the set that minimizes the sum of the distances from m to every other point in X. Notice from this definition that the medoid is always one of the data points. Because the medoid minimizes the distance to other points, the medoid can be used to represent the center of a high-dimensional point cloud. The medoid tends to be more robust to outliers than means or centroids.
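In symbols, the definition can be written as follows. The function d is notation introduced here for whichever distance you choose (Euclidean by default in this article).

```latex
m = \operatorname*{arg\,min}_{y \in X} \; \sum_{i=1}^{k} d(y, x_i),
\qquad X = \{x_1, x_2, \ldots, x_k\}
```

The minimization is over the points in X only. That restriction is what distinguishes the medoid from the geometric median, for which y is allowed to be any point in the space.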

Compute the medoid in SAS

Calculating the medoid is straightforward, but it requires calculating all pairwise distances. You first load the data into a matrix, X, that has n rows. You compute the matrix of pairwise distances by using the DISTANCE function in SAS IML (or by using PROC DISTANCE in Base SAS). The distance matrix is symmetric, so you can sum either the columns or the rows to find the sum of the distances. (I'll sum the rows.) If the k_th row has the smallest sum, the k_th point in the data is the medoid.

The SAS IML language enables you to vectorize this process and avoid an explicit loop over the points in the set. Here are the three main steps:

Use the DISTANCE function in SAS IML to compute the full n x n distance matrix, D, where D[i, j] is the distance between row i and row j in the X matrix. By default, the DISTANCE function computes the Euclidean distance.
Use the "summation" subscript reduction operator ([ ,+]) to sum across the columns of the distance matrix. The resulting vector, sumDist, holds the sum of the distances from each point i to all others.
Use the "minimum index" subscript reduction operator ([>:<]) to find the index corresponding to the minimum value in the sumDist vector.

Compute the medoid in two dimensions

The following example uses 10 points in two dimensions. The computation shows that the medoid is the point (7, 5), which is the ninth row of the matrix of points:

proc iml;
X = {7 3,
     8 7,
     4 9,
     9 6,
     8 4,
     5 8,
     8 5,
     3 7,
     7 5,   /* This is the 9th row */
     4 5};
 
/* The medoid (https://en.wikipedia.org/wiki/Medoid) is the data value (say, y=x[j]) that 
   minimizes the sum of the distances from y to every other point */
D = distance(X);           /* D[i,] is the distance from X[i,] to every other point */
sumDist = D[ ,+];          /* sum of distances from X[i,] to every other point */
rowIdx = sumDist[>:<];     /* row that has the smallest sum of distances */
medoid = X[rowIdx, ];      /* coordinates of the medoid */
 
print sumDist[r=(1:nrow(X))];
print medoid[r=rowIdx c={'x' 'y'}];

In this example, the sumDist vector contains the sum of the distances from each point to all others. The 9th point, with coordinates (7, 5), has the smallest total distance (24.96), so it is the medoid. You can use a scatter plot to visualize the data and the medoid:

/* visualize the data and the medoid */
title "Medoid in 2-D";
isMedoid = j(nrow(X), 1, 0);  /* binary indicator variable */
isMedoid[rowIdx] = 1;
call scatter(X[,1], X[,2]) group=isMedoid grid={x y} option="markerattrs=(symbol=CircleFilled size=12)";

The point (7, 5) is displayed by using a red marker. The sum of the distances from (7, 5) to the other points is smaller than the sum of distances for any other point. From among the points, it is the best choice for the "center" of the point cloud.

Compute the medoid in higher dimensions

The beauty of this matrix/vector implementation is that the code does not change when the points are in higher dimensions. The following program performs the same computations on Fisher's Iris dataset, which has 150 observations in a four-dimensional space. So that you can reuse the computations, I've encapsulated the computations into two IML functions, sumDist and MedoidIndex.

proc iml;
/* Helper function: return the sum of the distances from each row of X to all others */
start sumDist(X);
   D = distance(X);           /* D[i,] is the distance from X[i,] to every other point */
   sumDist = D[ ,+];          /* sum of distances from X[i,] to every other point */
   return sumDist;
finish;
 
/* Find the medoid of a point set in d-dimensions.
   X is an (n x d) matrix where each row is a point in R^d
   Return the row index, k, such that the sum of the distances from X[k,]
   to the other rows is smallest.
*/
start MedoidIndex(X);
   sumDist = sumDist(X);      /* sum of distances from X[i,] to every other point */
   rowIdx = sumDist[>:<];     /* row that has the smallest sum of distances */
   return rowIdx;
finish;
 
/* Example for points in higher dimension (4-D) */
use sashelp.iris;
read all var _NUM_ into X[c=varNames];
close;
 
/* call the MedoidIndex function and use the result to extract the medoid from X */
rowIdx = MedoidIndex(X);
medoid = X[rowIdx, ];   /* coordinates of the medoid */
print medoid[c=varNames];

For these data, the 83rd row is the one that minimizes the sum of distances to all other points. The 4-dimensional coordinates of the medoid are shown in the output.

As before, you can use scatter plots to visualize the medoid. Since the points are in 4-dimensional space, you can project the points onto pairs of coordinate planes. It can be informative to color the markers in the scatter plots by the sum of distances from each point to all other points. The following statements write this information to a SAS data set, then write the medoid to a separate data set. These two data sets are merged and PROC SGPLOT creates two scatter plots in which the medoid is represented by using a different marker shape.

/* write to SAS data set for visualization. Include the sum of distances for each point */
sumDist = sumDist(X);
Y = sumDist || X;
create Out from Y[c=('sumDist' || varNames)];
   append from Y;
close;
create Out2 from medoid[c=('X1':'X4')];
   append from medoid;
close;
QUIT;
 
data All;
label sumDist = "Sum of Distances";
set Out Out2;
run;
 
ods graphics / width=480px height=480px;
title "Visualization of the Medoid in Fisher's Iris Data";
/* color palette from https://blogs.sas.com/content/iml/2016/07/18/color-markers-third-variable-sas.html */
%let attrs = markerattrs=(symbol=CircleFilled) transparency=0.25 colormodel=(CX3288BD CX99D594 CXE6F598 CXFEE08B CXFC8D59 CXD53E4F);
proc sgplot data=All aspect=1 ;
   scatter x=SepalLength y=SepalWidth / colorresponse=sumDist &attrs;
   scatter x=x1 y=x2 / markerattrs=(symbol=StarFilled size=10 color=black) legendlabel="medoid";
   xaxis grid;
   yaxis grid;
run;
 
proc sgplot data=All aspect=1 ;
   scatter x=PetalLength y=PetalWidth / colorresponse=sumDist &attrs;
   scatter x=x3 y=x4 / markerattrs=(symbol=StarFilled size=10 color=black) legendlabel="medoid";
   xaxis grid;
   yaxis grid;
run;

The scatter plots show that the medoid (marked with a star) is a data value that is near the "center" of the point cloud. The key insight is that even in 4-D, the medoid still represents the most centrally located, robust observation in the set.

Remarks on the medoid computation

Choice of distance: By default, the DISTANCE function computes the Euclidean distance. You can use the L1 or "Manhattan" distance instead by adding "L1" as a second argument. In theory, you can use any distance or dissimilarity measure, such as the cosine distance or the Mahalanobis distance.
Robustness: The medoid algorithm tends to be robust to outliers. I added some extreme values to the Iris data and noticed that the medoid did not change for my experiments.
Standardization: For the Iris data, all variables are measured on the same scale (millimeters). As is common for distance-based algorithms, if your data contains variables that have different units or scales (age, length, weight, rates, etc.), then you should standardize the data first. If you don't, the distance doesn't really make sense, and the computation will be dominated by the variable with the largest variance.
Memory: A drawback of this method is that it computes the full (n x n) matrix of pairwise distances. This limits the number of data points that you can process. For example, I've previously calculated the RAM required by a square matrix. Processing 16,000 points requires just under 2GB of memory. Processing 40,000 observations requires 12GB, which might be more than your SAS administrator permits. It is not practical to use the algorithm in this article to process 100,000 points, which would require allocating 75GB of RAM. Instead, you would need to process the data by using block matrices to compute the smallest sum of distances.
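The remarks about distance, standardization, and memory can be sketched in a few IML statements. This is a minimal sketch, not a tuned implementation. The module name MedoidIndexLowMem is hypothetical; it handles one row at a time (the simplest possible "block") and supports only the Euclidean distance.

```sas
proc iml;
use sashelp.iris;
read all var _NUM_ into X;
close;

/* Hypothetical low-memory module: handle one row at a time so that
   the (n x n) distance matrix is never formed. Euclidean distance only. */
start MedoidIndexLowMem(X);
   n = nrow(X);
   sumDist = j(n, 1, .);
   do i = 1 to n;
      diff = X - X[i, ];                     /* subtract the i_th row from every row */
      sumDist[i] = sum( sqrt(diff[ ,##]) );  /* sum of distances from X[i,] to all rows */
   end;
   idx = sumDist[>:<];                       /* row that has the smallest sum */
   return idx;
finish;

/* 1. Choice of distance: L1 (Manhattan) instead of Euclidean */
D1 = distance(X, "L1");
sumL1 = D1[ ,+];
idxL1 = sumL1[>:<];

/* 2. Standardization: center and scale each column, then use Euclidean distance */
Z = (X - mean(X)) / std(X);
DZ = distance(Z);
sumZ = DZ[ ,+];
idxZ = sumZ[>:<];

/* 3. Memory: compare the full-matrix answer with the row-at-a-time answer */
D = distance(X);
sumD = D[ ,+];
idxFull = sumD[>:<];
idxLowMem = MedoidIndexLowMem(X);
print idxL1 idxZ idxFull idxLowMem;
quit;
```

The last two indices should agree with each other and with the row reported earlier for the Iris data. The loop gives up the speed of the vectorized computation in exchange for memory; a block version would process several rows in each iteration.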

Summary

The medoid is a robust choice for the "center" of a multivariate point cloud. Unlike means, centroids, and geometric medians, the medoid is always a point in the data. Mathematically, it is the point for which the sum of the distances to the other points is the smallest. This article used the standard Euclidean formula to measure the distance between points, but you can substitute other distance metrics. The SAS IML language enables you to compute the medoid in a vectorized manner independent of the number of points or their dimension.

The post What is a medoid? appeared first on The DO Loop.

Two debugging tips for SAS IML programmers

Rick Wicklin — Mon, 01 Dec 2025 10:28:31 +0000

Recently, a colleague struggled to find the source of a run-time error happening somewhere within a very large library of SAS IML function modules. Since the error happens at run time, I told my colleague about how to find the location of a run-time error by reading the traceback messages in the SAS log. The log identifies the call stack at the time of the error. The call stack is the sequence of module calls that led to the error. For example, perhaps the error happens in module 'ONE', which was called by module 'TWO', which was called by module 'THREE', and so forth.

I empathize with my colleague's frustrations. It can be hard to debug a long program that has many levels of nested function calls. The best prevention is a comprehensive set of test programs for the low-level routines. However, in spite of our best development and testing efforts, sometimes we must debug a program. This article presents an IML function and a SAS macro that I sometimes use during the development and debugging of large IML programs. The first uses the built-in MODULESTACK function to print the name of the current module and the call stack. The second uses the "old-school method" of printing local variables to investigate their values as the program runs.

The MODULESTACK function

The MODULESTACK function can be a life-saver for debugging. It returns a character vector that contains the call stack: the name of the current module and the chain of modules that resulted in the current call. That is, if s = ModuleStack(), then s[1] is the name of the current module, s[2] is the name of the parent module, and so forth. By using the MODULESTACK function, you can print that information at any time during the execution of any module. You don't have to wait for a run-time error to find that information in the log.

Let's run an example that shows how the MODULESTACK function works. The following two modules are taken from a previous article about how to find the location of a run-time error. I have inserted a call to the MODULESTACK function into each function, and the functions now print the call stack:

proc iml;
/* example from https://blogs.sas.com/content/iml/2025/03/17/traceback-error-sas.html */
start mod1(x);
  stack = ModuleStack();
  print "In the " (stack[1]) " module. The full stack is ", stack;
  y=1; 
  w=mod2(x);     /* call MOD2 from MOD1 */
  sum = x + y + w;
  return sum;
finish;
 
/* this function produces an ERROR if x <= 0 */
start mod2(x);
  stack = ModuleStack();
  print "In the " (stack[1]) " module. The full stack is ", stack;
  a=1;          
  b=2; 
  c=log(x);
  d=4;
  return a+b+c+d;
finish;
 
q = mod1(1);

The output shows the logical flow of the program when you call the MOD1 function:

The program enters the MOD1 module, which prints "In the MOD1 module." It also prints the call stack, which reveals that MOD1 was called from the main program.
Inside the MOD1 function is a call to the MOD2 function. If the program runs without error, it enters the MOD2 function, which prints "In the MOD2 module." It also prints the call stack: MOD2 was called by MOD1, which was called from the main program.

Use the MODULESTACK function to debug a program

After you understand how the MODULESTACK function works, you can create a small helper function. By default, the following function prints the name of the current module. You can pass in 0 as an argument to print the full call stack.

/* Debugging function to unconditionally print the name of the 
   module that calls the PrintLoc function. 
   Syntax:
   run PrintLoc();     * prints the module name ;
   run PrintLoc(0);    * prints the full module call stack ;
   run PrintLoc(2:3);  * prints the module and the calling module ;
*/
start PrintLoc(level=2);  /* default is level=2 b/c stack[1] = "PrintLoc" */
   stack = ModuleStack();
   if level=0 then 
      levels = 2:nrow(stack);
   else 
      levels = level;
   location = stack[levels];
   print location[L="Module"];
finish;

With this new PrintLoc module, let's revise the definitions of the MOD1 and MOD2 functions. For MOD1, I want to print the name of the module. However, I suspect that the source of my bug is in MOD2, so in that module I will call PrintLoc(0) to print the full call stack. Finally, I will call MOD1:

start mod1(x);
  run PrintLoc();
  y=1; 
  w=mod2(x);     /* call MOD2 from MOD1 */
  sum = x + y + w;
  return sum;
finish;
 
/* this function produces an ERROR if x <= 0 */
start mod2(x);
  run PrintLoc(0);
  a=1;          
  b=2; 
  c=log(x);
  d=4;
  return a+b+c+d;
finish;
 
z = mod1(1);

The program generates the following output. This output can help you to understand the flow of the program.

Print the value of variables if a condition is true

One of the simplest (and oldest!) techniques for debugging run-time errors is to strategically insert PRINT statements into a program. To avoid massive amounts of output, these are often conditional PRINT statements. That is, printing happens only when a certain condition is true. For example, in the previous section, you could add a PRINT statement in MOD2 that checks whether X (the argument to the LOG function) is not positive. If the condition is true, then the program can print the call stack and the values of local variables to help you understand why X is not positive.

Programmers like to use functions or macros to reduce the typing in tasks that are repeated several times in a program. If you plan to insert conditional PRINT statements throughout your IML program, you might want to define a SAS macro to encapsulate the logic. The following macro uses the PARMBUFF option, which enables the macro to take an arbitrary list of arguments. The first argument is the name of an IML symbol. The macro generates an IF/THEN statement that checks whether the value of the symbol is true (nonzero) or false (zero). Any additional arguments are IML symbols that you want to print if the first argument is true.

/* Debugging macro to conditionally print IML symbols if a specified variable is nonzero.
   Use the PARMBUFF option on the macro definition. See
   https://blogs.sas.com/content/sgf/2017/06/16/using-parameters-within-macro-facility/ 
   Syntax:
     proc iml;
     x = {1 2, 3 4};
     y = {5 6};
     G_DEBUG = 1;
     %DEBUGPRINT(G_DEBUG, x, y);  * print symbols ;
     G_DEBUG = 0;                 * or use FREE G_DEBUG ;
     %DEBUGPRINT(G_DEBUG, x, y);  * nothing printed ;
*/
%macro DEBUGPRINT / parmbuff;
DO;
   IF (%scan(&syspbuff,1)) THEN DO;   /* check the 1st argument */
   %local i cnt;
   %let cnt = %sysfunc(countw(&syspbuff));
   %do i= 2 %to &cnt;                /* loop over the argument list */
      print %scan(&syspbuff,&i);     /* print the i_th argument */
   %end;
   END;
END;
%mend DEBUGPRINT;
 
/* test the macro at the main scope */
b = 1:5;
c = {"Hello" "World!"};
 
G_DEBUG = 1;
%DEBUGPRINT(G_DEBUG, b, c);      /* prints the symbols */
G_DEBUG = 0;
%DEBUGPRINT(G_DEBUG, b, c);      /* does not print the symbols */

The %DEBUGPRINT macro (like all SAS macros) generates SAS statements. You can use OPTIONS MPRINT to see the code generated by any SAS macro. For this macro, the code is very simple. It generates an IF-THEN statement. If the first argument is true, it prints every subsequent argument. In the first call, the G_DEBUG symbol is true, so the program prints the symbols b and c. In the second call, the value is false, so the program does not print anything.
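To see the generated statements for yourself, you can turn on MPRINT and call the macro again. The following sketch continues the same PROC IML session (it uses the symbols b, c, and G_DEBUG that were defined earlier). The statements in the comment are my reading of the macro definition, so the exact layout in your log might differ.

```sas
options mprint;                   /* write macro-generated statements to the log */
G_DEBUG = 1;
%DEBUGPRINT(G_DEBUG, b, c);
/* With MPRINT on, the log should show statements roughly like these:
      DO;
      IF (G_DEBUG) THEN DO;
      print b;
      print c;
      END;
      END;
*/
options nomprint;                 /* turn the option off again */
```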

The first argument can be a local variable or a global variable. If you use the GLOBAL option on the START statement to enable modules to access a global variable, you can easily toggle "debug mode" in which the modules print the values of local variables. For example, the following function uses the GLOBAL option to define a function that can access the G_DEBUG variable. When the variable is true, the function prints the name of the module and the arguments to the module.

start MyFunction(a, b) global(G_DEBUG);
  if G_DEBUG then run PrintLoc();
  %DEBUGPRINT(G_DEBUG, a, b)
  return 1;
finish;
 
x = {1 2, 3 4};
y = {5 6};
G_DEBUG = 1;          /* turn on global debugging flag */
z = MyFunction(x, y); /* printed output */
G_DEBUG = 0;          /* turn off global flag */
z = MyFunction(x, y); /* nothing printed */
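The conditional check that was described earlier for MOD2 (print diagnostic information only when the argument to the LOG function is not positive) can be sketched as a standalone program. This is one possible variation, not the only way to do it. Returning a missing value for an invalid argument is a choice made for this sketch so that the program continues to run.

```sas
proc iml;
/* variation of MOD2 that reports a nonpositive argument before calling LOG */
start mod2(x);
  if any(x <= 0) then do;            /* the condition that would cause the error */
     stack = ModuleStack();          /* stack[1] is the name of the current module */
     print "Nonpositive argument to LOG", x, stack;
     return .;                       /* return a missing value instead of failing */
  end;
  return 1 + 2 + log(x) + 4;         /* same as a+b+c+d in the original module */
finish;

start mod1(x);
  y = 1;
  w = mod2(x);                       /* call MOD2 from MOD1 */
  return x + y + w;
finish;

q1 = mod1(1);     /* valid argument: no diagnostic output */
q2 = mod1(-1);    /* invalid argument: prints x and the call stack */
print q1 q2;
quit;
```

Only the second call triggers the PRINT statement, so the output stays small even if the module is called many times with valid arguments.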

Summary

This article demonstrates two useful tips for debugging large SAS IML libraries in which there are many levels of nested module calls. The first technique uses the MODULESTACK function to print the name of a module or the complete call stack. The second technique is a macro that conditionally prints the values of local variables when a certain condition is true. You can combine the macro with a GLOBAL variable to quickly turn on and turn off printing as a debugging tool. The macro uses the PARMBUFF option, which enables the macro to take an arbitrary list of arguments.

The post Two debugging tips for SAS IML programmers appeared first on The DO Loop.

Permute the order of rows in a matrix

Rick Wicklin — Mon, 24 Nov 2025 10:30:01 +0000

In some applications, it is useful to permute the rows or the columns of a matrix. A previous article discusses how random permutations of columns (within each row) are useful in constructing permutation tests. This article shows a simpler situation: Permuting the rows of a matrix to change their order. One application is to sort the rows according to a statistic, but there are other applications, too.

Let's start with an example matrix that has seven rows. To make the permutations easier to see, the first element in each row is the row number:

proc iml;
M = {
1  2  3  4  5 ,
2  7  8  9 10 ,
3 12 13 14 15 ,
4 17 18 19 20 ,
5 22 23 24 25 ,
6 27 28 29 30 ,
7 32 33 34 35 
};
nc = ncol(M);      /* nc = number of columns */
nr = nrow(M);      /* nr = number of rows */

Random permutations of rows

Some applications require a random permutation of the rows. In the SAS IML language, the RANPERM function returns a random permutation of the set {1,2,3,...,n}. If you generate a random permutation, you can use it as the row indices in a subscript operation to obtain a permuted matrix, as follows:

/* random permutation of rows */
call randseed(12345);
kr = ranperm(nr);     /* random permutation of {1,2,3,..., nr} */
R1 = M[kr, ];
print kr, R1;

Cyclic permutation of rows

A special kind of permutation is called a cyclic permutation. In a cyclic permutation, all values are shifted to the right by some number of positions. Those values that "fall off the end" are wrapped around and appear in the leftmost positions. For example, a cyclic permutation that shifts the values in the vector v={1,2,3,4,5} by two positions is the vector w={4,5,1,2,3}.

For computer languages that use 0-based indexing, it is easy to program a cyclic permutation by using the MOD operator. If d is the length of a vector, v, and you want to shift the elements to the right by s positions, then the new vector, w, is defined by w[mod(i+s, d)] = v[i], i=0,1,...,d-1. Again, this assumes 0-based indexing. Equivalently, you can write the transformation as w[i] = v[mod(i-s, d)] if the MOD function always returns nonnegative values. Unfortunately, the MOD function in SAS might return a negative value, so you must add d to the formula to ensure proper indexing: w[i] = v[mod(i+d-s, d)]. Lastly, indexing in the IML language is 1-based, not 0-based, so you need to subtract 1 before applying the MOD function and then add 1 to the result. The following statements show how to generate a cyclic permutation and use it to shift the rows of a matrix:

/* shift indices 1:d by k (mod d).
   For example, if d=5 and k=2, v=1:5 shifted by 2 (mod 5) is {4 5 1 2 3} */
start CyclicShift(k, d);
   idx = (1:d) + (d - k);          /* assume k < d */
   shiftIdx = mod(idx-1, d) + 1;   /* for 1-based indices: subtract 1 before applying MOD, then add it back */
   return shiftIdx;
finish;
 
k = 2;
newIdx = CyclicShift(k, nrow(M));
R2 = M[newidx, ];
print R2;
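
The index arithmetic is easy to get wrong, so here is a small Python sketch of the same 1-based cyclic shift. The function name cyclic_shift is illustrative. In Python, the % operator returns a nonnegative result for a positive modulus, so this sketch does not need to add d before taking the remainder.

```python
def cyclic_shift(k, d):
    """Return 1-based indices idx such that w[i] = v[idx[i]] shifts v right by k (mod d)."""
    # Subtract 1 to get 0-based positions, shift, wrap with %, then add 1 back.
    return [((i - 1 - k) % d) + 1 for i in range(1, d + 1)]

v = [1, 2, 3, 4, 5]
idx = cyclic_shift(2, len(v))
w = [v[j - 1] for j in idx]
print(idx)   # [4, 5, 1, 2, 3]
print(w)     # [4, 5, 1, 2, 3]
```

Because the remainder is always in the range 0 to d-1, the same function also handles shifts that are larger than d.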

Perform cyclic or specific permutation of rows

You can combine the ideas in the previous two sections. The following program contains a SAS IML function that takes two arguments. The first is the matrix, M, whose rows you want to permute. The second argument specifies the permutation. If the second argument, k, is a scalar, you can treat it as the cyclic permutation i → i+k (mod d), where d is the number of rows in M. Otherwise, assume that k is a vector that is a permutation of 1:d and apply that permutation. The program shows two examples of calling the function. The first call uses a vector for the second argument; the second call passes in a scalar, which indicates a cyclic permutation:

/* M is a matrix with d rows. 
   If k is a scalar 1 <= k < d, perform a cyclic permutation of rows of M by shifting rows by k positions
   If k is a vector that has d elements, use k as the indices for the permutation
*/
start PermuteRows(M, _k);
   k = colvec(_k);
   if nrow(k)=nrow(M) then 
      R = M[k, ];      /* user-specified permutation */
   else if nrow(k)=1 then do;
      newIdx = CyclicShift(k, nrow(M));
      R = M[newidx, ]; /* cyclic permutation */
      end;
   else 
      STOP "ERROR: In PermuteRows, k must be a scalar or nrow(k)=nrow(M)";
   return R;
finish;
 
k = {7,6,3,2,1,5,4};
R1 = PermuteRows(M, k);     /* permute according to k */
R2 = PermuteRows(M, 4);     /* cyclic permutation: shift rows by 4 */
print R1[L='Rand Perm of Rows'], R2[L='Cyclic Perm of Rows'];
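
As a cross-check of the dispatch logic (scalar means cyclic shift, vector means explicit permutation), here is a hedged Python sketch. The names permute_rows and cyclic_shift are illustrative, and the sketch uses a one-column matrix so that the row numbers are easy to read.

```python
def cyclic_shift(k, d):
    """1-based indices for a cyclic shift of d items by k positions."""
    return [((i - 1 - k) % d) + 1 for i in range(1, d + 1)]

def permute_rows(M, k):
    """Permute rows of M. An integer k means a cyclic shift; a sequence means explicit 1-based indices."""
    d = len(M)
    if isinstance(k, int):
        idx = cyclic_shift(k, d)          # cyclic permutation
    elif len(k) == d:
        idx = list(k)                     # user-specified permutation
    else:
        raise ValueError("k must be an integer or a sequence with len(k) == len(M)")
    return [M[i - 1] for i in idx]

M = [[r] for r in range(1, 8)]            # rows labeled 1..7
print(permute_rows(M, [7, 6, 3, 2, 1, 5, 4]))
print(permute_rows(M, 4))                 # rows move down by 4; the last 4 rows wrap to the top
```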

Summary

This article shows how to permute the rows of a matrix by specifying a permutation vector. In addition, it shows how to create a permutation vector for a cyclic permutation where each row moves down by k positions and the last k rows get moved to the top.

The post Permute the order of rows in a matrix appeared first on The DO Loop.

Quasi-Monte Carlo integration in SAS

Rick Wicklin — Mon, 17 Nov 2025 10:26:15 +0000

It is difficult to evaluate high-dimensional integrals. One numerical technique that can be useful is quasi-Monte Carlo integration. In this article, I show how you can generate quasirandom points in SAS and use them to evaluate a definite integral on a compact region. For simplicity, the example in this article integrates a function of two variables over a rectangular region.

What is quasi-Monte Carlo integration?

Multivariate calculus provides ways to integrate simple functions over simple 2-D and 3-D regions. However, analytical techniques often fail for high-dimensional regions and for complicated functions. Monte Carlo and quasi-Monte Carlo integration enable you to numerically estimate the definite integral of a function over an arbitrary high-dimensional region.

The standard Monte Carlo (MC) method evaluates a function at N points that are distributed uniformly at random in a region, then estimates the definite integral by using the average value of the function over those points. Previous articles show how to perform Monte Carlo integration in SAS for one-dimensional integrals and for two-dimensional integrals over rectangular and nonrectangular domains. For simplicity, this article discusses only rectangular regions.

The standard Monte Carlo method requires that you generate N points uniformly at random in the region. Unfortunately, Monte Carlo integration has slow convergence: The convergence is O(1/sqrt(N)). To halve the error, you must quadruple the number of points! This is because random points tend to have gaps and clusters, so some regions of the function's domain are sampled multiple times whereas other regions are sampled rarely.

In the quasi-Monte Carlo (QMC) method, you replace the pseudorandom points with quasirandom points. You evaluate the function at each point and compute the average value. As discussed in the Wikipedia article about quasi-Monte Carlo, the estimate of the integral converges at a rate close to O(1/N) (more precisely, O((log N)^d / N) for a d-dimensional integral), which is much faster than for traditional Monte Carlo integration. This is because quasirandom points do not have gaps or clusters, which means the domain of the function is sampled more evenly.

The property of "no gaps or clusters" is accomplished by using deterministic low-discrepancy sequences to generate each coordinate of the points. This article generates quasirandom points by using the Halton sequence, but you can use other sequences. The Sobol sequence is one popular alternative.

The following sections use the SAS IML language to generate both pseudorandom and quasirandom points. The program compares the convergence for the MC and QMC methods for an integral of two variables on a rectangular domain.

Visualize pseudorandom and quasirandom points

A previous article shows how to create a library of SAS IML functions that generate quasirandom points by using Halton sequences. This deterministic sequence, derived from converting integers to different bases and reflecting the digits, produces points that fill the unit interval [0, 1] much more uniformly than pseudorandom numbers.

A GitHub repo contains all functions that generate the graphs and the table in this article. You can run the program in the link to store the modules, then use the LOAD statement to use the functions in your own programs.

The following program compares pseudorandom and quasirandom points in a rectangular region in 2-D. In the DATA step, you can generate N uniform random variates in [0,1] by using the RAND("Uniform") function in a loop. In PROC IML, the RANDGEN subroutine enables you to generate many random variates in a single call. The variates on [0,1] can be transformed into any other interval [a,b] by using the affine transformation x → a + (b-a)*x. The following program generates N=1E4 points in the rectangle [0,π/2]x[0,1]:

proc iml;
N = 1E4;                 /* generate N points in D */
 
/* Domain of integration is D = [a,b] x [c,d] */
pi = constant('pi');
a = 0;  b = pi/2;          /* 0 < x < pi/2 */
c = 0;  d = 1;             /* 0 < y < 1    */
x0 = a || c;               /* offset vector */
s = (b-a) || (d-c);        /* scaling vector */
 
title "Pseudorandom Evaluation Points";
call randseed(1234);
X = j(N,2);                      
call randgen(X, "uniform", 0, 1);  /* X ~ U[0,1]^2 */
X = x0 + s # X;                    /* X ~ U(D) */
call scatter(X[,1], X[,2]) grid={x y} 
     option="markerattrs=(symbol=circlefilled size=3) transparency=0.5";

The scatter plot shows the gaps and clusters that are typical of points that are distributed uniformly at random. If you evaluate a function at these points, the function is sampled more often in some regions than in other regions.
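
The affine transformation x → a + (b-a)*x is applied separately to each coordinate. The following Python sketch (not SAS; the function name to_rect is illustrative) shows the mapping from the unit square to a rectangle [a,b]x[c,d] and applies it to pseudorandom points. The seed value is arbitrary and the random stream differs from the one in SAS.

```python
import math
import random

def to_rect(points, a, b, c, d):
    """Map points in the unit square to [a,b] x [c,d] by an affine transformation of each coordinate."""
    return [(a + (b - a) * u, c + (d - c) * v) for (u, v) in points]

# A few hand-checkable points: corners and center of the unit square
print(to_rect([(0, 0), (1, 1), (0.5, 0.5)], 0, 2, 1, 3))   # [(0, 1), (2, 3), (1.0, 2.0)]

# Pseudorandom points in D = [0, pi/2] x [0, 1]
rng = random.Random(1234)
N = 10_000
U = [(rng.random(), rng.random()) for _ in range(N)]
X = to_rect(U, 0.0, math.pi / 2, 0.0, 1.0)
```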

Let's compare this situation to a quasirandom set of points. The following statements load the Halton functions, then generate 1E4 quasirandom points. Be sure to run the program mentioned in the Appendix to store the functions. The call to QRand_Halton(N,2) generates an Nx2 matrix, Q, where each row of Q is a quasirandom point in the unit square [0,1]x[0,1].

load module=_ALL_;    /* first run the program in the Appendix to store the Halton and quasirandom functions */
 
title "Quasirandom Evaluation Points";
Q = QRand_Halton(N, 2);            /* Q is quasirandom in unit square */
Q = x0 + s # Q;                    /* Q is quasirandom in D */
call scatter(Q[,1], Q[,2]) grid={x y} 
     option="markerattrs=(symbol=circlefilled size=3) transparency=0.5";

The scatter plot of the quasirandom points shows no gaps or clusters. Because of the way a Halton sequence is constructed, each new point is generated in an empty space among the previous set of points. As you add more and more points from the Halton sequence, the empty regions get smaller and smaller.

You could achieve a "no gaps, no clusters" point pattern by using points on a regular grid, but the grid method lacks an important feature of the low-discrepancy sequences. Namely, in a low-discrepancy sequence, a sequence of length N contains all the points of the sequence of length N-1, plus one additional point. A regularly spaced grid does not have that property. For example, a set of nine evenly spaced points in [0,1] is 1/10, 2/10, ..., 9/10. But that set does not contain any of the eight evenly spaced points 1/9, 2/9, ..., 8/9, so you cannot reuse earlier function evaluations when you refine the grid. In integration, you must evaluate a function at every point of a quasirandom point set. A low-discrepancy sequence enables you to generate points sequentially and monitor the convergence of the integral. You can generate points until the integral converges and then stop.
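
The nesting property can be checked with exact rational arithmetic. The following Python sketch (the names grid and radical_inverse are illustrative) shows that the grids of eight and nine evenly spaced points share no points, whereas the first eight terms of the base-2 Halton sequence are a prefix of the first nine terms.

```python
from fractions import Fraction

def grid(m):
    """The m evenly spaced interior points of [0,1]: 1/(m+1), ..., m/(m+1)."""
    return {Fraction(i, m + 1) for i in range(1, m + 1)}

def radical_inverse(i, base):
    """i-th term of the Halton sequence: reflect the base-b digits of i across the radix point."""
    x, denom = Fraction(0), base
    while i > 0:
        i, digit = divmod(i, base)
        x += Fraction(digit, denom)
        denom *= base
    return x

shared = grid(8) & grid(9)
print(shared)                      # set(): the two grids have no points in common

h8 = [radical_inverse(i, 2) for i in range(1, 9)]
h9 = [radical_inverse(i, 2) for i in range(1, 10)]
print(h9[:8] == h8)                # True: the longer sequence extends the shorter one
```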

Monte Carlo and quasi-Monte Carlo Integration

A previous article shows how to estimate the definite integral of a function of two variables by using the traditional Monte Carlo integration in SAS. In that article, the function is f(x,y) = cos(x)*exp(y), and the domain of integration is D = [0,π/2]x[0,1]. The previous section generated both pseudorandom points (X) and quasirandom points (Q) in this domain. To estimate the integral, you evaluate the function on the set of points, take the average, and scale the average to account for the area of the domain. The following statements estimate the integral first by using the pseudorandom points and next by using the quasirandom points.

/***********************************************/
/* estimate the double integral in https://blogs.sas.com/content/iml/2021/04/07/double-integral-monte-carlo.html */
/***********************************************/
/* define integrand: f(x,y) = cos(x)*exp(y) */
start func(x);
   return cos(x[,1]) # exp(x[,2]);
finish;
 
/* the double integral is separable; solve exactly */
Exact = (sin(b)-sin(a))*(exp(d)-exp(c));  
Area = (b-a)*(d-c);                /* area of rectangular region */
 
/* MC = Monte Carlo */
W = func(X);                       /* f(X1,X2) for columns of X */
MCEst = Area * mean(W);            /* MC estimate of double integral */
DiffMC = Exact - MCEst;
/* QMC = quasi-Monte Carlo */
W = func(Q);                       /* f(Q1,Q2) for columns of Q */
QMCEst = Area * mean(W);           /* QMC estimate of double integral */
DiffQMC = Exact - QMCEst;
print Exact MCEst QMCEst DiffMC DiffQMC;

I chose this function because the definite integral has an analytical solution. On the domain, the exact integral is e – 1 ≈ 1.71828. With this random number seed, the traditional Monte Carlo estimate is accurate to about 0.5% of the true value with N=10,000 points. The first wrong digit is in the second decimal place. In contrast, the quasi-Monte Carlo estimate is accurate to about 0.008% with the same number of points. The first wrong digit is in the fourth decimal place.
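
If you do not have the stored SAS modules at hand, you can reproduce the comparison in spirit with the following Python sketch. It assumes Halton bases 2 and 3 and starts the sequence at index 1; the function radical_inverse is an illustrative floating-point implementation, not the SAS library. The pseudorandom stream differs from the one in SAS, so the MC estimate will not match the SAS output digit for digit.

```python
import math
import random

def radical_inverse(i, base):
    """i-th Halton term in the given base, computed in floating point."""
    x, f = 0.0, 1.0 / base
    while i > 0:
        i, digit = divmod(i, base)
        x += digit * f
        f /= base
    return x

def func(x, y):
    return math.cos(x) * math.exp(y)          # integrand f(x,y) = cos(x)*exp(y)

a, b = 0.0, math.pi / 2
c, d = 0.0, 1.0
area = (b - a) * (d - c)
exact = (math.sin(b) - math.sin(a)) * (math.exp(d) - math.exp(c))   # separable integral: e - 1

N = 10_000
# QMC: average f over Halton points mapped to D
qmc = area * sum(func(a + (b - a) * radical_inverse(i, 2),
                      c + (d - c) * radical_inverse(i, 3)) for i in range(1, N + 1)) / N
# MC: average f over pseudorandom points in D
rng = random.Random(1234)
mc = area * sum(func(a + (b - a) * rng.random(),
                     c + (d - c) * rng.random()) for _ in range(N)) / N

print(exact, mc, qmc, exact - mc, exact - qmc)
```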

The fact that the Monte Carlo estimate depends on the random number seed can be annoying. It introduces random variation into the estimate. It makes the estimate hard to reproduce for users of other software. It means that the Monte Carlo estimate has a large standard error. In contrast, there is nothing random about the quasi-Monte Carlo estimate. Any other implementation that uses Halton sequences and the same choice for Halton bases will produce exactly the same estimate.

The convergence of MC and QMC integration

Let's take a closer look at the convergence of the integral estimates for both the MC and QMC methods. As I have discussed previously, the convergence for the traditional MC method is fairly slow, and there is considerable variation in the estimate. The estimate for the QMC method is much tamer. The following statements show the convergence by plotting the estimate of the integral for N=100, 200, ..., 10000 points.

/* How does the estimate depend on the sample size? 
   We have already computed W = func(Q) */
size = do(1E2, N, 1E2);
QMCest = j(1,ncol(size),.);
do i = 1 to ncol(size);
   Y = W[1:size[i],];              /* Y = f(Q) for the first size[i] points */
   QMCEst[i] = Area*mean(Y);       /* estimate integral */
end;
 
title "Quasi-Monte Carlo Estimates as a Function of Sample Size";
stmt = "refline " + char(Exact,6,4) + "/ axis=y; format size comma10.;";
call scatter(size, QMCEst) grid={x y} other=stmt label={"Sample Size"};

The scatter plot shows the convergence of the estimate for the integral as N increases. You can see that the estimate converges quickly. The estimate is already good after using only 2,000 quasirandom points. The estimate generally improves as N increases, and there are no wild swings in the estimate as can happen with the traditional Monte Carlo method. As a rule of thumb, QMC enables you to use fewer points than MC to achieve the same accuracy. For that reason, QMC is often preferred over MC for estimating definite integrals in low to moderate dimensions (such as d ≤ 10).
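
Because a low-discrepancy sequence is nested, you can monitor convergence from a single pass over the function values. The following Python sketch assumes the same integrand, domain, and Halton bases (2 and 3) as the example; it uses a running sum rather than recomputing each mean, which is a small variation on the SAS loop. All names are illustrative.

```python
import math
from itertools import accumulate

def radical_inverse(i, base):
    """i-th Halton term in the given base, computed in floating point."""
    x, f = 0.0, 1.0 / base
    while i > 0:
        i, digit = divmod(i, base)
        x += digit * f
        f /= base
    return x

a, b = 0.0, math.pi / 2
c, d = 0.0, 1.0
area = (b - a) * (d - c)
exact = math.e - 1

N = 10_000
values = [math.cos(a + (b - a) * radical_inverse(i, 2)) *
          math.exp(c + (d - c) * radical_inverse(i, 3)) for i in range(1, N + 1)]
running = list(accumulate(values))                 # running[n-1] = sum of the first n values

sizes = list(range(100, N + 1, 100))               # N = 100, 200, ..., 10000
estimates = [area * running[n - 1] / n for n in sizes]

for n, est in zip(sizes[::20], estimates[::20]):   # print every 20th estimate
    print(n, est, est - exact)
```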

Summary

This article demonstrates quasi-Monte Carlo integration in SAS by using SAS IML functions that implement Halton sequences. An example of a 2-D integral shows that the QMC estimate converges quickly. The traditional MC converges like O(1/sqrt(N)), whereas the QMC estimate converges as O(1/N). In addition, QMC is deterministic, which means that the answer does not depend on a random number seed.

Before concluding, I mention two drawbacks of the quasi-Monte Carlo method. First, not every integral of interest has a finite domain. There are plenty of integrals in probability theory that require integration over an infinite domain. Second, in this implementation, I use consecutive prime numbers (2, 3, 5, ...) as the bases to generate a Halton sequence in each coordinate direction. This technique works well for small primes, so quasi-Monte Carlo integration with Halton sequences is often used for six or fewer dimensions. It is known that for larger primes (for example, 17 and 19), the Halton sequences might be strongly correlated. The Wikipedia article mentions some ways to overcome this deficiency.

Appendix: SAS IML functions for generating quasirandom numbers in d dimensions

The SAS IML functions that are used to create all graphs and the table in this article are available on GitHub. They enable you to generate quasirandom ordered tuples in up to d=168 dimensions. Almost all of the functions are explained and demonstrated in a previous article about how to generate a Halton sequence. The two new functions are

HaltonSeqD : Create a matrix with N rows where each column is a Halton sequence for a specified base
QRand_Halton : Create a matrix with N rows and d columns, where each column is a Halton sequence for a base chosen from the first d prime numbers: 2, 3, 5, 7, 11, ...

The post Quasi-Monte Carlo integration in SAS appeared first on The DO Loop.

Create a Halton sequence in SAS

Rick Wicklin — Mon, 10 Nov 2025 10:28:13 +0000

A previous article shows how to convert a positive integer from base 10 to any other arbitrary base. For example, 15 (base 10) = 120 (base 3) because 15 = 1*3² + 2*3¹ + 0*3⁰. Representing integers is probably familiar to many readers. But did you know that you can do the same with fractions? This article shows how to convert a decimal value such as (0.021) in base 3 into a fraction. In addition, by using this technique, you can construct a Halton sequence, which can be used to create quasirandom numbers. In a subsequent article, I will show how to construct quasirandom numbers and use them in quasi-Monte Carlo techniques to estimate an integral.

The word "decimal" implies base 10, so radix point is the general name given to the period or dot that separates the integer part of a number from the fractional part of a number. Similarly, a "decimal value" is more properly called a radix fraction. I'll use these fancy terms, but feel free to replace "radix point" with "decimal point" and "radix fraction" with "decimal value" if you prefer.

Represent decimals as fractions in any base

In base 10, a decimal is really a fraction in disguise. A decimal such as 0.456 is a representation of a fraction that uses powers of 10 in the denominator. That is, the decimal 0.456 (base 10) equals 4*10^-1 + 5*10^-2 + 6*10^-3. More generally, a base-10 decimal is a sum of terms $\sum_{i} c_{i} 10^{-i}$, where $c_{i}$ are the digits to the right of the decimal point.

In a similar way, radix fractions in other bases are the sum of fractions that have powers of the base in the denominator. For example:

In base 2, the radix fraction (0.1101)₂ represents the fraction 1*2^-1 + 1*2^-2 + 0*2^-3 + 1*2^-4, or 1/2 + 1/4 + 0/8 + 1/16.
In base 3, the radix fraction (0.0121)₃ represents the fraction 0*3^-1 + 1*3^-2 + 2*3^-3 + 1*3^-4, or 0/3 + 1/9 + 2/27 + 1/81.
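
You can verify these two examples with exact rational arithmetic. The following Python sketch (the function name radix_fraction is illustrative) sums the digits times negative powers of the base and shows that the two radix fractions equal 13/16 and 16/81, respectively.

```python
from fractions import Fraction

def radix_fraction(digits, base):
    """Value of the radix fraction (0.c1 c2 c3 ...) in the given base: sum of c_i * base^(-i)."""
    return sum(Fraction(c, base**i) for i, c in enumerate(digits, start=1))

print(radix_fraction([1, 1, 0, 1], 2))   # 13/16 = 1/2 + 1/4 + 0/8 + 1/16
print(radix_fraction([0, 1, 2, 1], 3))   # 16/81 = 0/3 + 1/9 + 2/27 + 1/81
```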

The following SAS IML function (called ConvertFracFromBase) converts a row vector of coefficients (the digits after the radix point) into a fraction in a specified base. In the function, the expression base##pow computes the vector of powers b^-1, b^-2, and so forth. The elementwise multiplication (#) with the digit vector (c) and the subsequent summation ([,+]) calculates the value of the fraction.

After defining the function, the program shows examples of calling the function for radix values in base 2 and base 3:

proc iml;
/* Convert row vectors from any base to a fraction in base 10.
   The row vector represents coefficients for a radix fraction in base b.
   For example:
   Base 2:  c = {1 1 0 1}   represents 1/2 + 1/4 + 0/8 + 1/16
   Base 3:  c = {0 1 2 1}   represents 0/3 + 1/9 + 2/27 + 1/81.
   The input, c, can also be a matrix for which each row c[i,] is a vector of coefficients.
*/
start ConvertFracFromBase(c, base);
   pow = -( 1:ncol(c) );         /* decreasing powers */
   factor = base##pow;           /* b^{-1}, b^{-2}, ... */
   fract10 = (c # factor)[ ,+];  /* fractions */
   return fract10;
finish;
 
/* test the function */
s2 = {1 1 0 1,     /* 1/2 + 1/4 + 0/8 + 1/16 */
      0 0 1 1};    /* 0/2 + 0/4 + 1/8 + 1/16 */
f2 = ConvertFracFromBase(s2, 2);
print s2[L="" c={'1/2' '1/4' '1/8' '1/16'}], f2[F=FRACT32.];
 
s3 = {1 2 0 1,   /* 1/3 + 2/9 + 0/27 + 1/81 */
      0 2 1 0};  /* 0/3 + 2/9 + 1/27 + 0/81 */
f3 = ConvertFracFromBase(s3, 3);
print s3[L="" c={'1/3' '1/9' '1/27' '1/81'}], f3[F=FRACT32.];

The tables show the coefficients that represent the radix fraction in each base, along with the fractional value. Notice that I used the FRACT. format in SAS to display the floating-point result as a fraction.

The ConvertFracFromBase function is efficient because it uses matrix operations. Instead of looping over digits, it computes the entire set of denominators in a single vector (factor), multiplies them by the corresponding digits, and then sums the resulting products. This vectorization is key to generating the Halton sequence efficiently.

The Halton sequence

If you can convert an integer to any base (from the previous article) and evaluate a radix fraction (from the previous section), then you can construct a Halton sequence. You first specify the base, b, which is typically a prime number. Then, as discussed in the Wikipedia article about the Halton sequence, you construct the Halton sequence of length n as follows.

Base conversion: Represent the integers 1 through n as numbers in base b.
Digit reflection: Place a radix point at the end of each number and flip or reflect the digits across the radix point to create a set of radix fraction values.
Convert to base 10: Evaluate these radix fraction values.

For example, in base 2, here is how to construct the first n=7 terms of the Halton sequence:

The integers in base 2 are represented as 1, 10, 11, 100, 101, 110, 111, ...
Reverse the digits by reflecting them across the radix point: 0.1, 0.01, 0.11, 0.001, 0.101, 0.011, 0.111, ...
Express these radix fraction values as fractions: 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, ... You can evaluate these fractions to obtain the base-10 values.
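
The three steps (base conversion, digit reflection, evaluation) can be sketched in a few lines of Python with exact fractions. This is only an illustration of the algorithm; the names to_base and halton are not part of any SAS library.

```python
from fractions import Fraction

def to_base(n, base):
    """Step 1: digits of the positive integer n in the given base, most significant digit first."""
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(r)
    return digits[::-1]

def halton(n, base):
    """n-th term of the Halton sequence for the given base."""
    reflected = to_base(n, base)[::-1]                                    # Step 2: reflect the digits
    return sum(Fraction(c, base**k) for k, c in enumerate(reflected, 1))  # Step 3: evaluate

print([str(halton(n, 2)) for n in range(1, 8)])
# ['1/2', '1/4', '3/4', '1/8', '5/8', '3/8', '7/8']
print([str(halton(n, 3)) for n in range(1, 7)])
# ['1/3', '2/3', '1/9', '4/9', '7/9', '2/9']
```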

The following SAS IML function carries out this algorithm. The utility function, reverseCols, performs the reversal of the digits from integers to radix fractions. You must first load the ConvertToBase function from a previous article. For convenience, that and all the functions in this article are provided in the Appendix.

/* load ConvertToBase and other functions from first blog post */
load module = _ALL_;
 
/* reverse the order of the columns of a matrix */
start reverseCols(M);
   return M[, ncol(M):1];
finish;
 
/* generate a Halton sequence for one specified base
   INPUT: _n : a column vector of integers
          base : a positive scalar integer, often chosen to be a prime number 
   OUTPUT: a vector of length nrow(_n) that contains a Halton sequence
*/
start HaltonSeq1( _n, base );
   n = colvec(_n);
   C1 = ConvertToBase(n, base);          /* convert integers to base-b digits */
   C = reverseCols(C1);                  /* reverse digits to get fraction coefficients */
   return ConvertFracFromBase(C, base);  /* convert base-b fraction to base-10 decimal value */
finish;
 
/* test the functions */
n = T(1:10);
s2 = HaltonSeq1( n, 2);
s3 = HaltonSeq1( n, 3);
print n s2[F=FRACT32.] s3[F=FRACT32.];

The table shows an interesting property of the fractions in the Halton sequence. At each "level," the fractions uniformly fill the interval [0, 1]. For example:

In base 2, the first fraction is 1/2, which evenly divides the interval into two subintervals. Then come 1/4 and 3/4, which evenly divide the subintervals from the previous fraction. Then come 1/8, 5/8, 3/8 and 7/8, which evenly subdivide the intervals again.
In base 3, the first two fractions are 1/3 and 2/3, which evenly divide the interval into three subintervals. Then come the fractions with 9 in the denominator, which evenly divide the subintervals from the previous fractions.

The Halton fractions are deterministic (not random), but their distribution on the unit interval is highly regular. As you increase the number of terms, the points fill the interval more uniformly than truly random numbers do. If n is a power of the base, there are no clusters or gaps with the Halton fractions.

Quasirandom points from the Halton sequence

You can use the Halton sequence to obtain n quasirandom points in the unit cube in d-dimensional space. For each dimension, you choose a (different) prime number as the base. You then generate a sequence of n values for each dimension and use the i_th value of each sequence as the coordinates of the i_th point. (This pairing gives n points; it is not the full Cartesian product of the sequences, which would contain n^d points.) For example, to generate points in the unit square in 2-D, you can use b=2 as the base for the first dimension and b=3 as the base for the second dimension.

The following statements generate n=100 values in base 2 and base 3, then display a scatter plot of the resulting ordered pairs in 2-D:

/* 2-D quasirandom points  */
n = T(1:100);
s2 = HaltonSeq1( n, 2);
s3 = HaltonSeq1( n, 3);
 
title "Halton QuasiRandom Values";
call scatter(s2,s3) grid={x y} label={'x' 'y'} procopt="aspect=1";

The scatter plot shows an important property of the ordered pairs in the unit square. As discussed previously, the Halton points are evenly spaced. As you add more and more points, the new points fill one of the larger gaps between points. This property makes them a popular choice for numerical quasi-Monte Carlo integration, which I will discuss in a future article.
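
The scatter plot pairs the i-th term of the base-2 sequence with the i-th term of the base-3 sequence, which gives n points in total. The following Python sketch (illustrative names, exact fractions) makes the pairing explicit and lists the first few points.

```python
from fractions import Fraction

def radical_inverse(i, base):
    """i-th term of the Halton sequence for the given base."""
    x, denom = Fraction(0), base
    while i > 0:
        i, digit = divmod(i, base)
        x += Fraction(digit, denom)
        denom *= base
    return x

def halton_points_2d(n, base_x=2, base_y=3):
    """First n quasirandom points in the unit square: the i-th point is (h_2(i), h_3(i))."""
    return [(radical_inverse(i, base_x), radical_inverse(i, base_y)) for i in range(1, n + 1)]

pts = halton_points_2d(100)
print(len(pts))                                  # 100 points, one for each index i
print([(str(x), str(y)) for x, y in pts[:3]])    # [('1/2', '1/3'), ('1/4', '2/3'), ('3/4', '1/9')]
```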

Summary

This article discussed some functions that enable you to generate Halton sequences. These sequences are useful for quasi-Monte Carlo integration. The Appendix contains the complete set of SAS IML functions that are used for this purpose. A future article shows how to use these functions to estimate a definite integral by using quasi-Monte Carlo integration.

Appendix: SAS IML functions for base conversion and Halton sequences

The following IML statements define and store the functions used in this and a previous article. You can use the LOAD statement to use these functions in your own programs.

/* APPENDIX: A set of SAS IML functions that generate quasirandom points 
   in d dimensions by using Halton sequences */
proc iml;
/* Compute LOG_b(x) for a vector of x values and for an integer base, b > 0.
   If an input value is not positive, this function returns a missing value.
   Computers represent numbers internally in base b=2, so use the change-of-base formula
   log_b(x) = log2(x) / log2(b)
   to compute the logarithm in any base, b.
*/
start logbase(x, base=10);
   if all(x>0) then 
      return log2(x) / log2(base);
   /* otherwise, return missing values for x <= 0 */
   y = j(nrow(x), ncol(x), .);
   idx = loc(x>0);
   if ncol(idx)>0 then 
      y[idx] = log2(x[idx]) / log2(base);
   return y;
finish;
 
/* This function returns the number of digits in the base-b representation of an integer, x.
   If n >= 0 is a base-10 integer, it has
      k = ceil(log_b(n+1))
   digits when represented in base b. See
   https://blogs.sas.com/content/iml/2015/08/31/digits-in-integer.html
*/
start numDigBase(x, base=10);
   n = round(x);              /* ensure argument is an integer */
   return ceil( logbase(n+1,base) );
finish;
 
/* Convert integer x > 0 to a row vector in base b.
   If x is a column vector, return a matrix where each row is the base-b representation of x[i].
   The most significant digit is to the left; the least significant digit is to the right.
   For example, n=15 and base=3 gives (120)_3 because 15 = 1*3##2 + 2*3##1 + 0*3##0.
*/
start ConvertToBase(x, base);
   n = round(x);     /* ensure inputs are integers */
   numDig = numDigBase(max(n), base);
   /* For an explanation, see https://blogs.sas.com/content/iml/2015/08/31/digits-in-integer.html */
   c = j(nrow(n), numDig, 0);
   do i = 1 to numDig;
      a = mod(n, base);
      n = floor( n/base );
      c[ , numDig - i + 1] = a;
   end;
   return c;
finish;
 
/* Convert row vectors from any base to a fraction in base 10.
   The row vector represents coefficients for a fraction in base b.
   For example:
   Base 2:  c = {1 1 0 1}   represents 1/2 + 1/4 + 0/8 + 1/16
   Base 3:  c = {0 1 2 1}   represents 0/3 + 1/9 + 2/27 + 1/81.
   The input, c, can also be a matrix for which each row c[i,] is a vector of coefficients.
*/
start ConvertFracFromBase(c, base);
   pow = -( 1:ncol(c) );         /* decreasing powers */
   factor = base##pow;           /* b^{-1}, b^{-2}, ... */
   fract10 = (c # factor)[ ,+];  /* fractions */
   return fract10;
finish;
 
/* reverse the order of the columns of a matrix */
start reverseCols(M);
   return M[, ncol(M):1];
finish;
 
/* generate a Halton sequence for a specified base
   INPUT: _n : a column vector of integers
          base : a positive scalar integer, often chosen to be a prime number 
   OUTPUT: a vector of length nrow(_n) that contains a Halton sequence
*/
start HaltonSeq1( _n, base );
   n = colvec(_n);
   C1 = ConvertToBase(n, base);
   C = reverseCols(C1);
   return ConvertFracFromBase(C, base);
finish;
 
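/* Convert a row vector from any base to a number in base 10.
   If c is a matrix, return a base-10 number for each row.
   This module is from the previous article about converting integers between bases;
   it must be defined before it can be stored. */
start ConvertFromBase(c, base);
   pow = (ncol(c)-1):0;
   factor = base##pow;
   base10 = (c # factor)[ ,+];
   return base10;
finish;
 
store module=(
  logbase numDigBase ConvertToBase ConvertFromBase
  reverseCols ConvertFracFromBase HaltonSeq1
);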
QUIT;

The post Create a Halton sequence in SAS appeared first on The DO Loop.

Convert integers between different bases in SAS

Rick Wicklin — Mon, 03 Nov 2025 10:29:15 +0000

While many applications of Monte Carlo techniques use pseudorandom numbers, some applications that involve integrals are more accurate when you use quasirandom numbers, which, despite their name, are not random but are deterministic sequences of numbers. Many of these sequences are constructed by representing base-10 numbers in a different base. Familiar bases include b=2 (binary), b=3 (ternary), and b=16 (hexadecimal). The following tasks are important when converting numbers to and from bases:

Compute how many digits a base 10 number has when represented in an arbitrary base. This requires computing logarithms in an arbitrary base.
Convert a number from base 10 to an arbitrary base.
Convert a number from an arbitrary base to base 10.

This article shows how to compute these quantities by using the SAS IML language. I have previously presented DATA step implementations for a few of these tasks. However, since my ultimate goal is to perform quasi-Monte Carlo computations, the SAS IML language is a better choice.

The logarithm (base b) of a number

In working with arbitrary bases, it is useful to be able to evaluate the logarithm function, x = log_b(n). Many computer languages, including SAS, support built-in functions only for log₁₀ (LOG10), log₂ (LOG2), and the natural logarithm log_e (LOG).

To compute the logarithm for an arbitrary base, b, you can use the "change-of-base formula,"
$\log_b(n) = \log_k(n) / \log_k(b)$
where k can be any convenient base. For details, see a previous article on this topic.

Because computers natively operate in binary, choosing k=2 makes the computation more efficient. The following SAS IML function implements the formula for any base, b. You can pass in a vector of x values as the input. If any element is not positive, the function returns a missing value for that element.

proc iml;
/* Compute LOG_b(x) for a vector of x values and for an integer base, b > 0.
   If an input value is not positive, this function returns a missing value.
*/
start logbase(x, base=10);
   if all(x>0) then 
      return log2(x) / log2(base);
   /* otherwise, return missing values for x <= 0 */
   y = j(nrow(x), ncol(x), .);
   idx = loc(x>0);
   if ncol(idx)>0 then 
      y[idx] = log2(x[idx]) / log2(base);
   return y;
finish;
 
/* test the function for base b=2 and 3 */
t = {2,3,4,5,8,9,15,16,48};
log2 = logbase(t, 2);
log3 = logbase(t, 3);
print t log2 log3;
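
For comparison, here is a minimal Python sketch of the change-of-base formula with k=2. The function name logbase mirrors the SAS module but is only illustrative; it returns NaN for nonpositive inputs, which plays the role of the SAS missing value.

```python
import math

def logbase(x, base=10):
    """log_b(x) by the change-of-base formula log2(x) / log2(b); NaN if x is not positive."""
    if x <= 0:
        return float("nan")
    return math.log2(x) / math.log2(base)

for t in [2, 3, 4, 5, 8, 9, 15, 16, 48]:
    print(t, logbase(t, 2), logbase(t, 3))
```

Because the result is a floating-point quotient, values such as logbase(9, 3) should be compared to 2 by using a small tolerance rather than by testing for exact equality.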

The number of digits in a base-b representation of an integer

You can use the base-b logarithm to compute the number of digits required to represent an integer n in an arbitrary base b. As explained in a previous article, the number of digits, k, is given by the formula
k = ceil(log_b(n+1))

The following function calls the logbase function in the previous section:

/* This function returns the number of digits in the base-b representation of an integer, x.
   If n >= 0 is a base-10 integer, it has
      k = ceil(log_b(n+1))
   digits when represented in base b. See
   https://blogs.sas.com/content/iml/2015/08/31/digits-in-integer.html
*/
start numDigBase(x, base=10);
   n = round(x);              /* ensure argument is an integer */
   return ceil( logbase(n+1,base) );
finish;
 
/* test the function for base b=2 and 3 */
t = {2,3,4,5,8,9,15,16,48};
numDig2 = numDigBase(t, 2);
numDig3 = numDigBase(t, 3);
print t numDig2 numDig3;
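
As a cross-check of the digit counts, the following Python sketch counts digits by repeated integer division instead of by using the logarithm formula. This is a different technique from the SAS function: it avoids any dependence on how a floating-point logarithm is rounded when n+1 is an exact power of the base. The function name num_digits is illustrative.

```python
def num_digits(n, base=10):
    """Number of digits of the positive integer n in the given base, by repeated integer division."""
    count = 0
    while n > 0:
        n //= base
        count += 1
    return count

t = [2, 3, 4, 5, 8, 9, 15, 16, 48]
print([num_digits(x, 2) for x in t])   # [2, 2, 3, 3, 4, 4, 4, 5, 6]
print([num_digits(x, 3) for x in t])   # [1, 2, 2, 2, 2, 3, 3, 3, 4]
```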

Convert an integer from base 10 to another base

I previously wrote an article that explains how to convert an integer from base 10 to another base. The technique uses repeated division: you divide n by the base b, store the remainder as the next digit, and continue the division until the quotient is zero. You can use the MOD and FLOOR functions to perform these operations.

For example, if you want to convert 15 (base 10) to base 3:

15 / 3 = 5 remainder 0, so store 0 as the least significant digit and use 5 as the next value of 'n'. The least significant digit is the rightmost digit.
5 / 3 = 1 remainder 2, so store 2 as the next digit and use 1 as the next value of 'n'.
1 / 3 = 0 remainder 1, so store 1 as the next digit. The method ends.

Thus, 15 (base 10) = 120 (base 3). The previous article includes a DATA step that implements the technique. But you can vectorize the computation in PROC IML to make it more efficient. The following function accepts a vector of inputs, which are positive base-10 integers. It returns a matrix of digits where the i_th row represents the i_th integer in the specified base. The number of columns in the matrix is determined by using the numDigBase function on the largest input value.

/* Convert integer x > 0 to a row vector in base b.
   If x is a column vector, return a matrix where each row is the base-b representation of x[i].
   The most significant digit is to the left; the least significant digit is to the right.
   For example, n=15 and base=3 gives (120)_3 because 15 = 1*3##2 + 2*3##1 + 0*3##0.
*/
start convertToBase(x, base);
   n = round(x);     /* ensure inputs are integers */
   numDig = numDigBase(max(n), base);
   /* For an explanation, see https://blogs.sas.com/content/iml/2015/08/31/digits-in-integer.html */
   c = j(nrow(n), numDig, 0);
   do i = 1 to numDig;
      a = mod(n, base);
      n = floor( n/base );
      c[ , numDig - i + 1] = a;
   end;
   return c;
finish;
 
/* test the function for base b=2 and 3 */
t = {2,3,4,5,8,9,15,16,48};
base2 = ConvertToBase(t, 2);
base3 = ConvertToBase(t, 3);
print base2[r=t c=('c5':'c0')];
print base3[r=t c=('c3':'c0')];
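
For comparison, the repeated-division steps for a single integer can be sketched in a DATA step. This is a minimal illustration, not the DATA step from the earlier article: the variable names are chosen here, and building the digits as a character string assumes that the base is 10 or less.

data _null_;
   n = 15;  base = 3;
   length digits $32;
   digits = ' ';
   do until (n = 0);
      r = mod(n, base);            /* remainder is the next digit */
      digits = cats(r, digits);    /* prepend the digit on the left */
      n = floor(n / base);         /* quotient is the next value of n */
   end;
   put digits=;                    /* writes digits=120 to the log */
run;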

Convert to base 10 from an arbitrary base

I have previously shown how to convert a number from base 2 to base 10. The process is similar for other bases. The input is a row vector of digits (0 through b-1) in the specified base. The output is a base-10 integer. You can vectorize the computations so that the input can be a matrix of digits, where each row is a base-b representation of an integer. Then the output is a vector of integers.

/* Convert a row vector from any base to a number in base 10.
   If c is a k x m matrix, then return a base-10 number for each row.
   For example, in base 3, define:
   c = {0 1 2 1,   
        1 2 1 0 };
   The first row represents 16. The second row represents 48.
   See https://blogs.sas.com/content/iml/2011/11/16/converting-from-base-2-to-base-10.html 
*/
start ConvertFromBase(c, base);
   pow = (ncol(c)-1):0;
   factor = base##pow;
   base10 = (c # factor)[ ,+];  /* c[k-1]*b##(k-1) + ... + c[1]*b##1 + c[0]*b##0 */ 
   return base10;
finish;
 
/* test the function for base b=2 and 3 */
n2 = ConvertFromBase(base2, 2);
n3 = ConvertFromBase(base3, 3);
print n2 n3;
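
An alternative way to evaluate the same sum is Horner's method, which processes the digits from left to right and avoids computing the powers of the base. The following sketch is not from the original function; it uses the example matrix from the comment above and should give the same values.

proc iml;
c = {0 1 2 1,
     1 2 1 0};                      /* two integers in base 3 */
base = 3;
base10 = j(nrow(c), 1, 0);
do j = 1 to ncol(c);
   base10 = base*base10 + c[ ,j];   /* shift left one digit, then add the next digit */
end;
print base10;                       /* 16 and 48 */
quit;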

Summary

This article introduces four SAS IML functions that are useful for converting an integer, n, between base 10 and an arbitrary base, b. The first enables you to compute the logarithm of n in base b. The others enable you to count the number of digits required to represent n in base b, to convert n from base 10 into base b, and to convert a number in base b to base 10. Examples are given for b=2 and b=3. In quasirandom Monte Carlo applications, the base is often a prime number. These functions are useful for working with integer sequences in arbitrary bases. In a future article, I show how to use these functions to generate quasirandom numbers.

The post Convert integers between different bases in SAS appeared first on The DO Loop.

Visualize an ordinal response regression model

Rick Wicklin — Mon, 27 Oct 2025 09:24:42 +0000

Many data analysts are familiar with logistic regression, where the response variable, Y, has two observed values, often represented as Y=0 and Y=1. The case Y=0 encodes that an event did not happen. For example, a patient did not experience some disease or did not die. The opposite case (Y=1) means that the event did happen. These binary responses are common in many areas but are especially useful in medicine.

However, sometimes the response variable encodes multiple levels of severity. The response variable can have k states, where k > 2. This is called an ordinal response model. For example, the spread of cancer in a patient is often described by using k=5 stages: Stage 0 indicates abnormal cells that have not spread, whereas Stage 4 indicates advanced cancer that has metastasized through the body. Since each stage is more severe than the ones before, the stages of cancer represent an ordinal variable. As in this example, usually k is a small integer.

PROC LOGISTIC in SAS enables you to fit a regression model for an ordinal response variable. Given values for the explanatory variables in the model, the model predicts the probability that Y=0, Y=1, ..., and Y=k-1. (You can also use levels 1-k.) The References section contains several references that explain the mathematics of ordinal regression models. Unfortunately, most references do not visualize these ordinal models. High (2023) and the documentation for PROC LOGISTIC include graphs for a model that contains only categorical explanatory variables. Derr (2013) draws graphs for models that involve either categorical or continuous covariates. This article focuses on how to visualize an ordinal regression model that has a continuous covariate. It shows two techniques: use the built-in EFFECTPLOT statement in PROC LOGISTIC or create your own graphs by evaluating the model on a set of scoring data.

Example data

The Appendix shows some example data for which the response variable, YCat, has k=4 levels with values 0-3. The explanatory variable, X, is continuous. A scatter plot of the data is shown to the right. The graph shows a negative association between YCat and X: high values of YCat are associated with low values of the X variable, whereas low values of YCat are associated with high values of X.

An ordinal response model tries to predict the classes (YCat=0, 1, 2, or 3) based on the X variable. From the graph, it looks like the model should predict YCat=3 when X is less than 2800, then predict either YCat=2 or YCat=1 as X increases further. After X exceeds 4500, the model should predict YCat=0. From the graph, it is not clear how to assign probabilities for YCat when X is in the range 3000-4500.

The cumulative logit model

There are several kinds of ordinal regression models, but this article examines only the cumulative logit model, which is the default model in PROC LOGISTIC when the response variable has more than two values. The cumulative logit model is used to predict the probability P(Y ≤ j | x) for j=0, 1, ..., k-1. For simple models, you can use the EFFECTPLOT statement in PROC LOGISTIC to visualize the model. When there is one continuous covariate, use the SLICEFIT option on the EFFECTPLOT statement, as follows:

proc logistic data=Have;
   model YCat = X / link=CLOGIT;
   ods select ResponseProfile  ParameterEstimates SliceFitPlot; 
   effectplot slicefit(x=X sliceby=YCat) / noobs;
   /* to see the linear predictors, use 
      effectplot slicefit(x=X sliceby=YCat) / noobs link; */
run;

The tabular output includes the ResponseProfile table and the ParameterEstimates table. The ResponseProfile table shows the number of observations in each level of the ordinal response variable. For these data, each category has more than 50 observations. If a category has very few observations, you should probably combine it with one of the adjacent levels. The ParameterEstimates table gives the parameter estimates for the model. Derr (2013, p. 2-3) explains that the model predicts the probabilities Pr(Y ≤ j) for each level j=0,1,2,3. The linear predictors have the form α_j + X`*β for j=0,1,2; no predictor is needed for j=3 because Pr(Y ≤ 3) = 1. For these data, α₀ = -23.3, α₁ = -18.8, α₂ = -16.1, and β = 0.0054.

When you apply the cumulative logit link to these linear predictors, you get the predicted cumulative probabilities on the following graph:

Here is how to interpret this graph:

The brown horizontal line is the predicted cumulative probability for Pr(YCat ≤ 3| x). Because YCat=3 is the largest ordinal value, the probability is always 1 that YCat is 3 or less.
The green curve is the leftmost sigmoid-shaped curve. This is the predicted cumulative probability for Pr(YCat ≤ 2| x). When X exceeds 2500, the probability that YCat is 2 or less begins to increase. By the time X reaches 4000, there is almost a 100% chance that YCat is 2 or less.
The red curve is the middle sigmoid-shaped curve. This is the predicted cumulative probability for Pr(YCat ≤ 1| x). When X exceeds 3000, the probability that YCat is 1 or less begins to increase. By the time X reaches 4500, there is almost a 100% chance that YCat is 1 or less.
The blue curve is the rightmost sigmoid-shaped curve. This is the predicted cumulative probability for Pr(YCat ≤ 0| x), which means Pr(YCat = 0| x). When X exceeds 3800, the probability that YCat is 0 begins to increase. By the time X reaches 5000, there is almost a 100% chance that YCat is 0. This agrees with the scatter plot that was shown earlier.
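
These readings can be checked approximately by applying the inverse logit function to the linear predictors. The following sketch uses the rounded estimates quoted above (intercepts -23.3, -18.8, and -16.1, and slope 0.0054), so the results only approximate what PROC LOGISTIC computes from the full-precision estimates. The data set name CumProb and the variable names are chosen here for illustration.

data CumProb;
   array alpha[3] _temporary_ (-23.3, -18.8, -16.1);   /* rounded intercepts for j=0,1,2 */
   beta = 0.0054;                                       /* rounded slope */
   do X = 2500 to 5000 by 500;
      CumP0 = 1 / (1 + exp(-(alpha[1] + beta*X)));      /* Pr(YCat <= 0 | x) */
      CumP1 = 1 / (1 + exp(-(alpha[2] + beta*X)));      /* Pr(YCat <= 1 | x) */
      CumP2 = 1 / (1 + exp(-(alpha[3] + beta*X)));      /* Pr(YCat <= 2 | x) */
      output;
   end;
run;

proc print data=CumProb noobs;
   var X CumP0 CumP1 CumP2;
   format CumP0 CumP1 CumP2 6.3;
run;

For example, at X=4000 the linear predictor for Pr(YCat ≤ 2| x) is -16.1 + 0.0054*4000 = 5.5, which corresponds to a probability of about 0.996.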

I have written several articles about how to use the EFFECTPLOT statement in SAS procedures to visualize regression models. For ordinal regression models, Derr (2013) provides other examples that use the EFFECTPLOT statement in PROC LOGISTIC.

A visualization that shows predicted probabilities

The EFFECTPLOT statement can visualize many simple regression models. However, for more complicated models (for example, models with spline effects) and for regression procedures that do not support the EFFECTPLOT statement, I have shown how to create a sliced fit plot manually.

Notice that the EFFECTPLOT statement visualizes the cumulative probabilities (the CDF) for the cumulative logit model. This makes sense, but sometimes it is more useful to display the predicted probability for each response level. The SCORE statement in PROC LOGISTIC outputs the predicted probability for each level.

As discussed in a previous article about the sliced fit plot, there are three steps:

Create scoring data that contains a grid of values for the continuous covariate and sets the other explanatory variables to convenient reference values.
Use the SCORE statement in the regression procedure to evaluate the fitted model on the scoring data. Output the predictions.
Use PROC SGPLOT to graph the predicted values against the continuous covariate.

/* Visualize the density of P(YCat=y) vs X for y=0, 1, 2, 3. 
   Step 1: Create scoring data  */
data ScoreData;
do X = 1850 to 7190 by 10;
   output;
end;
run;
 
/* Step 2: Evaluate the fitted model on the scoring data. Output predictions. */
proc logistic data=Have;
   model YCat = X / link=CLOGIT;
   ods select ResponseProfile ParameterEstimates; 
   score data=ScoreData out=ScoreWide;   /* output contains 4 columns: P_0, P_1, P_2, and P_3 */ 
run;
 
/* Step 3: Graph the predicted values against the continuous covariate. */
title "Visualize the Cumulative Logit Model";
title2 "Predicted Probabilities for YCat = 0, 1, 2, and 3";
proc sgplot data=ScoreWide;
   series x=X y=P_0 / lineattrs=(thickness=2);
   series x=X y=P_1 / lineattrs=(thickness=2);
   series x=X y=P_2 / lineattrs=(thickness=2);
   series x=X y=P_3 / lineattrs=(thickness=2);
run;

I like this visualization better than the one that shows cumulative probabilities. This graph shows the intervals of the X axis on which each level of YCat is most likely. Here is how to interpret this graph:

The brown curve is the predicted probability for Pr(YCat = 3| x). For small values of X, the probability is quite high that the response is 3. For X ≥ 4000, the probability is essentially zero.
The green bell-shaped curve is the predicted probability for Pr(YCat = 2| x). This level is most probable when X is in the interval [2500, 4000].
The red bell-shaped curve is the predicted probability for Pr(YCat = 1| x). This level is most probable when X is in the interval [3000, 5000].
The blue curve is the predicted probability for Pr(YCat = 0| x). For small values of X, the probability is almost 0. For X ≥ 4000, the probability rises. For X ≥ 5000, the probability is essentially 1.
For every fixed value of X, the sum of the probabilities is 1. That is, at every value of X, P_0 + P_1 + P_2 + P_3 = 1.
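
The last property can be verified numerically on the scored data. The following sketch assumes that the ScoreWide data set from Step 2 exists and contains the variables P_0, P_1, P_2, and P_3; the variable maxDiff is introduced here only for the check.

data _null_;
   set ScoreWide end=EOF;
   retain maxDiff 0;
   /* track the largest deviation of the row sum from 1 */
   maxDiff = max(maxDiff, abs(sum(P_0, P_1, P_2, P_3) - 1));
   if EOF then put "Largest deviation of the row sums from 1: " maxDiff;
run;

The value written to the log should be zero up to floating-point rounding.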

Summary

This article shows two ways to visualize a regression model for an ordinal response variable. In SAS, you can use the EFFECTPLOT statement in PROC LOGISTIC to automatically create a visualization. The graph shows the cumulative probabilities, such as Pr(Y ≤ j | x) for j=0, 1, ..., k-1. You can also create a graph of the individual probabilities, Pr(Y = j | x). One way to do that is to create a scoring data set that specifies values for the explanatory variables. You can then use the SCORE statement to evaluate the model at these points and plot the predicted probabilities for each response level.

References

Derr, B. (2013) "Ordinal Response Modeling with the LOGISTIC Procedure." Proceedings of SAS Global Forum 2013.
High, R. (2013) "Models for Ordinal Response Data." Proceedings of SAS Global Forum 2013.
High, R. (2023) "The Analysis of Ordinal Data with Graphs and Odds Ratios." Conference presentation at the 2023 Iowa SAS User Group.
Wicklin, R. (2017a) "Visualize multivariate regression models by slicing continuous variables." The DO Loop blog.
Wicklin, R. (2017b) "How to create a sliced fit plot in SAS." The DO Loop blog.
Wicklin, R. (2024) "Visualize a multivariate regression model when using spline effects." The DO Loop blog.

Appendix: Create the example data

The following data set is used in this article. The scatter plot for the data is shown at the top of this article.

data Have;
input YCat X Group $ @@;
datalines;
3 1850 A 3 2035 A 3 2055 A 3 2085 A 3 2195 A 3 2255 A 3 2290 A 3 2339 A 
3 2340 A 3 2348 C 3 2370 C 3 2387 A 3 2387 A 3 2403 A 3 2425 A 3 2432 A 
3 2447 A 3 2458 A 3 2500 A 3 2500 A 3 2502 A 3 2513 A 3 2513 A 3 2524 A 
3 2524 A 3 2581 C 3 2581 A 3 2601 A 3 2606 C 3 2606 C 3 2612 C 3 2617 C 
3 2617 C 3 2626 C 3 2635 A 3 2635 A 3 2656 A 3 2661 A 3 2676 C 3 2676 A 
3 2676 A 3 2679 A 3 2686 A 3 2691 C 3 2692 C 3 2692 C 3 2692 C 3 2696 A 
3 2697 A 3 2698 A 3 2701 C 3 2701 A 3 2702 C 3 2732 A 3 2744 A 2 2750 C 
2 2750 A 3 2751 C 3 2751 C 2 2756 A 3 2761 A 3 2762 A 3 2771 C 3 2771 C 
3 2778 A 3 2782 A 3 2795 A 2 2835 A 1 2866 C 3 2890 A 3 2932 A 3 2946 C 
3 2960 A 2 2965 A 3 2994 A 3 3020 A 1 3020 A 1 3023 A 3 3028 C 1 3029 A 
2 3039 A 3 3042 A 3 3047 A 1 3053 A 1 3060 C 2 3085 C 2 3085 A 3 3086 A 
2 3090 A 2 3091 A 2 3101 C 2 3105 C 3 3109 C 2 3118 C 2 3119 A 1 3153 A 
2 3173 C 3 3174 C 2 3175 C 3 3175 A 2 3182 C 2 3188 A 2 3197 A 2 3197 C 
2 3217 C 1 3217 A 1 3217 A 2 3222 C 2 3230 A 2 3240 A 2 3241 A 1 3246 C 
1 3248 C 1 3255 A 2 3258 A 1 3263 A 1 3263 A 2 3279 A 3 3281 A 1 3285 A 
2 3285 A 2 3290 C 2 3294 A 2 3296 A 2 3296 A 2 3297 C 2 3306 C 1 3306 A 
2 3308 C 1 3313 C 3 3315 C 1 3315 C 1 3336 A 2 3340 C 1 3346 C 1 3347 C 
2 3349 A 1 3351 C 3 3351 A 2 3353 C 2 3357 C 2 3362 A 2 3380 A 2 3381 C 
2 3395 A 0 3410 C 1 3410 A 1 3416 A 2 3417 A 2 3417 A 2 3428 A 2 3430 A 
1 3434 C 2 3439 A 2 3439 A 2 3448 C 2 3458 C 2 3460 A 2 3461 C 2 3465 C 
2 3468 A 2 3469 C 2 3473 A 2 3476 C 2 3476 A 2 3477 C 2 3479 C 2 3484 C 
2 3485 A 1 3487 C 1 3488 C 2 3495 A 1 3497 C 1 3536 C 1 3548 C 2 3549 A 
1 3549 A 2 3567 C 0 3571 A 2 3575 A 0 3575 C 1 3581 C 2 3591 C 1 3606 C 
1 3610 A 1 3623 C 1 3630 A 1 3647 C 1 3649 A 1 3649 A 1 3650 C 1 3651 A 
1 3651 A 1 3677 A 2 3681 C 2 3681 C 1 3682 A 1 3694 C 1 3699 C 0 3714 C 
1 3715 A 0 3725 C 1 3760 A 1 3768 C 1 3768 C 2 3778 C 1 3779 C 1 3780 C 
0 3790 C 1 3790 C 2 3801 A 1 3803 C 1 3812 A 2 3826 C 0 3829 C 1 3836 A 
1 3840 A 1 3851 A 2 3862 C 0 3871 A 1 3880 A 1 3893 A 1 3909 C 0 3925 A 
1 3932 A 1 3935 A 1 3948 C 1 3977 A 1 3984 C 1 3990 A 1 3992 C 1 4012 A 
1 4021 A 1 4024 C 1 4035 A 1 4044 C 1 4052 C 1 4052 C 1 4052 C 0 4056 A 
1 4057 C 1 4057 C 1 4057 C 1 4065 A 1 4068 C 0 4083 C 0 4112 A 1 4120 A 
1 4134 A 0 4142 C 1 4165 A 1 4175 A 1 4195 C 1 4275 C 0 4302 C 0 4309 C 
0 4309 A 1 4310 A 1 4331 C 0 4340 C 1 4365 A 1 4369 C 1 4369 C 0 4374 C 
1 4387 A 0 4425 C 1 4431 C 0 4435 A 1 4440 C 1 4451 A 0 4463 C 1 4474 C 
0 4542 C 1 4548 C 0 4600 C 0 4605 C 1 4675 C 0 4718 A 0 4740 A 0 4760 C 
0 4788 C 0 4802 A 0 4804 C 0 4834 C 0 4945 C 0 4947 C 0 4967 A 0 4987 C 
0 5000 C 0 5013 A 0 5042 C 0 5050 C 0 5270 A 0 5287 A 0 5367 C 0 5390 A 
0 5440 C 0 5464 C 0 5590 A 0 5678 C 0 5879 C 0 5969 C 0 6133 C 0 6400 C 
0 7190 C
;

I didn't make up these data. They are derived from the Sashelp.cars data set. I used PROC RANK to bin the MPG_City variable into four groups (0-3). I then renamed the Weight variable to X.
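
The derivation might look something like the following sketch. It is an assumption-laden illustration, not the code that created the Appendix data: the text does not say which observations were kept or how the Group variable was defined, so the result will not match the Appendix data exactly. The data set names Ranked and HaveSketch are hypothetical.

/* Sketch only: the subset of observations and the Group variable are not reproduced */
proc rank data=Sashelp.Cars groups=4 out=Ranked;
   var MPG_City;
   ranks YCat;             /* group numbers 0, 1, 2, 3 */
run;

data HaveSketch;           /* hypothetical name, to avoid overwriting Have */
   set Ranked(rename=(Weight=X));
   keep YCat X;
run;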

The post Visualize an ordinal response regression model appeared first on The DO Loop.

Overlay multiple custom density curves on a histogram in SAS

Rick Wicklin — Mon, 20 Oct 2025 09:28:51 +0000

A previous article discusses various ways to overlay a density curve on a histogram in SAS. SAS provides several procedures that handle this task for common univariate probability distributions such as normal, lognormal, and gamma. If you define and use a less common distribution, you can write a GTL template that overlays the histogram and a custom density estimate, or you can use a HIGHLOW statement to display the histogram and use the SERIES statement to overlay a density curve. Today's article extends the previous articles by showing how to overlay multiple density curves on a histogram, as shown in the graph to the right.

An overview of the technique

To overlay multiple custom curves on a histogram, use the following steps:

Use the %EmulateHistogram macro to create a data set that contains the bins for the histogram. The heights of the bins can be on the count scale or on the percentage scale.
Create a data set that contains the density curves you wish to overlay. These are usually created by using the PDF function in SAS. However, you must rescale the PDF to match the scale of the histogram.
Merge the two data sets.
Use the HIGHLOW statement in PROC SGPLOT to emulate a histogram. Use the SERIES statement to overlay the curves.

The following sections demonstrate each step.

Call the %EmulateHistogram macro

The first step is to call the %EmulateHistogram macro. The Plates data set and the %EmulateHistogram macro are defined in the previous article. The %EmulateHistogram macro writes a data set called _HistBins that contains variables (_MIDPT_, _COUNT_, _PCT_, and _ZERO_) that are used to create the high-low plot. It also creates several useful macro variables (_VARNAME, _BINSTART, _BINEND, _BINWIDTH, and _NOBS). The following statement runs the macro on the Gap variable in the Plates data set:

/* The %EmulateHistogram macro is defined at 
   https://blogs.sas.com/content/iml/2025/10/13/high-low-emulate-histogram.html */
%EmulateHistogram(dsIn=Plates, varIn=Gap)   /* creates the _HistBins data set */

Create scaled density estimates in long form

The second step is to create a data set that contains the (x,y) coordinates of the curves. You can convert a density to the count or percentage scale by multiplying the height of a density curve by a scaling factor. Let h be the width of the histogram bins and let n be the number of observations. If f(x) is the density estimate, then:

Count scale: an estimate for the count scale is n*h*f(x).
Percent scale: an estimate for the percent scale is 100%*h*f(x).

The parameter n is stored in the _NOBS macro variable. The parameter h is stored in the _BINWIDTH macro variable. Both macro variables were created by the call to the %EmulateHistogram macro.
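
To make the scaling concrete, the following sketch evaluates both formulas at a single point. It hard-codes n=50 and h=3, which are the values that the macro reports for the Plates data, and it uses a lognormal density with parameters 1.7 and 0.5 (the same one used in the example below) evaluated at x=6. The point x=6 is an arbitrary choice for illustration.

data _null_;
   n = 50;                               /* number of observations (_NOBS for the Plates data) */
   h = 3;                                /* bin width (_BINWIDTH for the Plates data) */
   f = pdf('Lognormal', 6, 1.7, 0.5);    /* density at x=6 */
   Pred_Count = n * h * f;               /* count scale */
   Pred_Pct   = 100 * h * f;             /* percent scale */
   put f= Pred_Count= Pred_Pct=;
run;

Because n=50 for these data, the value on the percent scale is exactly twice the value on the count scale.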

As an example, the following DATA step creates density estimates for the Lognormal(1.7, 0.5) and the Gamma(4.1, 1.5) distributions. These are chosen because they are familiar distributions, and we can check the results by using PROC UNIVARIATE, which can overlay these curves automatically. The value of this technique, however, is for distributions that are NOT supported by any SAS procedure.
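
For reference, a PROC UNIVARIATE call along the following lines could serve as that check. This is a sketch that assumes the parameters map to the ZETA=, SIGMA=, and ALPHA= options as indicated, that the threshold parameter is left at its default of zero, and that the bins are centered at 3, 6, ..., 18 as in the previous article.

proc univariate data=Plates;
   var Gap;
   histogram Gap / lognormal(zeta=1.7 sigma=0.5)
                   gamma(alpha=4.1 sigma=1.5)
                   midpoints=3 to 18 by 3;
run;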

The technique is easiest if you store the curves in the "long form", which means that there is a group variable (Distrib) that contains a unique name for each curve, a variable (_X) for the X values, and variables (Pred_Pct and Pred_Count) for the scaled Y values of the curves. The following DATA step evaluates the PDFs on the range of the histogram bins, which is [b1-h/2, bn+h/2], where b1 is the center of the first bin, bn is the center of the last bin, and h is the bin width. These values are stored in the _BINSTART, _BINEND, and _BINWIDTH macro variables, respectively. The DATA step then scales the PDFs onto the count and percentage scales by using the scaling formulas given earlier.

/* Create a data set that evaluates the PDF for density estimates. 
   Scale the curves to PCT and COUNT scales. 
   This data set is in the LONG form, so use a GROUP= option to graph the curves.
   _X         : name of the variable on the X axis at which to evaluate the PDF
   Pred_Pct   : 100*h*PDF(x), which scales the PDF onto the percentage scale
   Pred_Count : n*h*PDF(x), which scales the PDF onto the count scale
   Distrib    : Group variable with unique values for each curve
 
   See https://blogs.sas.com/content/iml/2024/06/19/scale-density-curve-histogram.html
*/
/* For this example, evaluate the Logn(1.7, 0.5) and the Gamma(4.1, 1.5) distributions on the range of the histogram bins. */
data _Curves;
length Distrib $10;
keep Distrib _x Pred_Pct Pred_Count;
label Distrib="Distribution"
      _x="&_varName"
      Pred_Pct = "Predicted Percentage"
      Pred_Count="Predicted Count";
array Distribs[2] $10  _TEMPORARY_ ('Lognormal' 'Gamma');  /* unique names for curves */
array Params[2,2]      _TEMPORARY_ (1.7 0.5                /* lognormal estimates (zeta, sigma) */
                                    4.1 1.5 );             /* gamma estimates (alpha, sigma) */
nPts = 50;     /* number of evaluation points for curves */
do i = 1 to dim(Distribs);
   Distrib = Distribs[i];
   p1 = Params[i,1];
   p2 = Params[i,2];
   do _x = &_binStart - &_binWidth/2 to 
           &_binEnd + &_binWidth/2 by 
           (&_binEnd - &_binStart + &_binWidth)/(nPts-1);
      PDF = PDF(Distrib, _x, p1, p2);           /* density scale */
      Pred_Pct   = &_binWidth * 100 * PDF;      /* percent scale */
      Pred_Count = &_NOBS * &_binWidth * PDF;   /* count scale */
      output;
   end;
end;
run;

Merge the data sets

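The third step is to merge the _HistBins and _Curves data sets. The SET statement stacks the two data sets, so the variables from one data set are missing on the rows that come from the other: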

/* merge the histogram and curve data sets */
data Overlay;
   set _HistBins _Curves;
run;

Overlay the histogram and density curves

The fourth and last step is to use PROC SGPLOT to overlay the histogram and the model curves. You can use the HIGHLOW statement to display the histogram. You can use the SERIES statement to display the curves. The two statements should use the same scale:

For the percentage scale: use HIGH=_PCT_ on the HIGHLOW statement and use Y=PRED_PCT on the SERIES statement.
For the count scale: use HIGH=_COUNT_ on the HIGHLOW statement and use Y=PRED_COUNT on the SERIES statement.

The following call to PROC SGPLOT displays the graph by using the percent scale:

title "Histogram and Density Estimates";
proc sgplot data=Overlay;
   highlow x=_midpt_ low=_zero_ high=_pct_ / type=bar barwidth=1;
   series x=_x y=Pred_Pct / group=Distrib nomissinggroup;
   yaxis min=0 offsetmin=0 grid;
   xaxis values=(&_binStart to &_binEnd by &_binWidth) valueshint;
run;
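
For the count scale, the same call can be adapted by swapping in the count variables, as described in the list above. The following variation assumes that the Overlay data set and the macro variables from the previous steps exist:

title "Histogram and Density Estimates (Count Scale)";
proc sgplot data=Overlay;
   highlow x=_midpt_ low=_zero_ high=_count_ / type=bar barwidth=1;
   series x=_x y=Pred_Count / group=Distrib nomissinggroup;
   yaxis min=0 offsetmin=0 grid;
   xaxis values=(&_binStart to &_binEnd by &_binWidth) valueshint;
run;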

Summary

SAS provides several built-in procedures to overlay a histogram and density estimates for common probability models. If you want to overlay a density estimate from a custom distribution, you can either use a GTL template or use the technique shown in this article, which uses the HIGHLOW statement to display a histogram and the SERIES statement to overlay curves. This process is simplified if you use the %EmulateHistogram macro to create a data set that can be merged with the data for the curves. Remember to scale the heights of the curves to match the scale of the histogram.

The post Overlay multiple custom density curves on a histogram in SAS appeared first on The DO Loop.

Use a high-low plot to emulate a histogram in SAS

Rick Wicklin — Mon, 13 Oct 2025 09:27:34 +0000

SAS has several procedures that can fit a probability distribution to data, plot a histogram, and overlay one or more density estimates:

PROC UNIVARIATE in Base SAS enables you to overlay parametric density curves from about 20 common continuous probability distributions, such as normal, lognormal, and gamma. It also enables you to overlay a nonparametric kernel density estimate.
PROC SEVERITY in SAS/ETS software enables you to overlay density curves for severity models. It also enables you to define your own probability distribution and overlay a fit from that distribution on the histogram.

In addition to these built-in models, SAS supports ways for you to define your own probability distribution and fit the parameters, typically by using maximum likelihood estimation. The visualization of these models requires overlaying a curve (the model density) on a histogram (the empirical distribution), which is not as easy as I wish it were. I like to use PROC SGPLOT, which is designed to overlay "compatible plot types." However, the only plot statement compatible with the HISTOGRAM statement is the DENSITY statement, which supports only normal curves and kernel density estimates. Specifically, you cannot combine the HISTOGRAM statement and a SERIES statement because those plot types are not compatible.

There are two ways to work around this issue:

You can write a GTL template that overlays the histogram and a custom density estimate.
You can use a high-low plot to emulate a histogram. The HIGHLOW statement and the SERIES statement are compatible statements, so it's easy to overlay one or more curves.

Recently, I needed to overlay an arbitrary number of curves. I wanted the same code to work whether I had one, two, three, or more curves to overlay. For this application, it is easier to use the "high-low emulation" technique. To simplify the process, this article provides a SAS macro that creates a data set and macro variables that you can use to emulate a histogram by using the HIGHLOW statement. A subsequent article shows how to overlay multiple density estimates.

Example data

First, let's create some example data. The documentation for PROC UNIVARIATE uses a data set of measurements (in mm) of the gaps between 50 welded plates. The following SAS DATA step creates the data:

data Plates;
   label Gap = 'Plate Gap (mm)';
   input Gap @@;
   datalines;
7.46 3.57 3.76 3.27 4.85 17.41 2.41 7.77 7.68 4.09 
2.52 5.12 5.34 16.56 7.42 3.78 7.14 11.21 5.97 2.31 
5.41 8.05 6.82 4.18 5.06 5.01 2.47 9.22 8.8 3.44 
5.19 13.02 2.75 6.01 3.88 4.5 8.45 3.19 4.86 5.29 
15.47 6.9 6.76 3.14 7.36 6.43 4.83 3.52 6.36 10.8 
;

You can use PROC UNIVARIATE to create a histogram of the data, which is shown above. The next section creates a similar histogram by using the HIGHLOW statement in PROC SGPLOT.
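
Before moving on, the compatibility restriction mentioned earlier can be illustrated with these data. The following sketch overlays the two kinds of curves that the DENSITY statement supports on a histogram in PROC SGPLOT; it is shown only for contrast with the high-low technique.

proc sgplot data=Plates;
   histogram Gap;
   density Gap / type=normal;    /* normal density estimate */
   density Gap / type=kernel;    /* kernel density estimate */
run;

Replacing a DENSITY statement with a SERIES statement is not permitted in this graph, which is why the high-low emulation is useful.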

Create data for a high-low plot

The high-low plot requires a data set that contains three variables:

The X location for the middle of each bar.
The Y location of the top of each bar.
The Y location of the bottom of each bar.

You can use the OUTHIST= option on the HISTOGRAM statement in PROC UNIVARIATE to obtain a data set that contains the middle of each bar and the height of each bar. You can manually add a variable that has the constant value 0, which represents the bottom of the bars.

To simplify the process, I wrapped the relevant steps in a macro, which you can call whenever you need to use the high-low plot to emulate a histogram. The syntax for the macro call is documented below. The macro also creates several macro variables that contain useful information such as the width of the bins and the location of the first and last bins.

/* Syntax:
     %EmulateHistogram(dsIn=DATASET, varIn=VARIABLE)
   where 
     DATASET = name of a SAS data set 
     VARIABLE= name of variable in data set whose distribution you want to model
 
   The macro does the following:
      1. Writes a data set called _HistBins that contains variables 
         _MIDPT_ : Centers of histogram bins
         _COUNT_ : Frequency count in each bin
         _PCT_   : Percentage of observations in each bin
         _ZERO_  : The constant value 0, which is the lower boundary of the high-low plot
      2. Creates the following macro variables:
         &_VARNAME  : the name of the variable whose distribution is modeled
         &_BINSTART : the value of the center of the first bin
         &_BINEND   : the value of the center of the last bin
         &_BINWIDTH : the width of the bins
         &_NOBS     : the number of nonmissing observations in the data
   You can emulate a histogram by using the HIGHLOW stmt in PROC SGPLOT:
   proc sgplot data=_HistBins;
      highlow x=_midpt_ low=_zero_ high=_pct_ / type=bar barwidth=1;
      yaxis min=0 offsetmin=0 grid;
      xaxis values=(&_binStart to &_binEnd by &_binWidth) valueshint;
   run;
*/
%macro EmulateHistogram(dsIn=, varIn=);
%global _varName _binStart _binEnd _binWidth _NObs;
proc univariate data=&dsIn noprint;
   var &varIn;
   histogram &varIn / outhist=_HistBins(rename=(_OBSPCT_=_PCT_)) noplot;
   output out=_HistOut n=_NOBS_;        /* number of nonmissing observations */
run;
data _HistBins;
   set _HistBins;
   _ZERO_ = 0;        /* add baseline for histogram */
   label _MIDPT_="&varIn"   _PCT_="Percent"  _COUNT_="Count";
run;
/* create some useful macro variables */
data _null_;
   set _HistBins end=EOF;
   if _N_=1 then 
      call symputx("_binStart", _MIDPT_);
   h = dif(_MIDPT_);
   if EOF then do;
      call symputx("_binEnd", _MIDPT_);
      call symputx("_binWidth", h);
      call symputx("_varName", "&varIn");
   end;
run;
data _null_;
   set _HistOut;
   call symputx("_NOBS", _NOBS_);
run;
%mend;

Let's call the macro on the Gap variable in the Plates data set. Running the macro creates a data set named _HistBins and several macro variables.

%EmulateHistogram(dsIn=Plates, varIn=Gap)   /* creates the _HistBins data set */
 
proc print data=_HistBins;
run;

The output data set is displayed. This histogram has only six bins. The first bin is centered at Gap=3. The last bin is centered at Gap=18. The width of the bins is 3 mm, which is the difference between adjacent midpoints. You can see that the _MIDPT_ variable provides the center of the bins. The height of the bins is provided by the _COUNT_ variable (the bin counts) or the _PCT_ variable (the bin percentages).

The macro defines several macro variables. You can use the %PUT statement to display their values in the SAS log. These statistics can be useful for customizing the high-low plot.

%PUT The following macro variables have been defined:;
%PUT &=_varName &=_binStart &=_binEnd &=_binWidth &=_NObs;

The following macro variables have been defined:
_VARNAME=Gap _BINSTART=3 _BINEND=18 _BINWIDTH=3 _NOBS=50
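
As one example of how these values can be used, the following sketch computes the left edge of the first bin and the right edge of the last bin, which together give the full range that the bars cover. The macro variable names _leftEdge and _rightEdge are hypothetical; the macro does not create them.

%let _leftEdge  = %sysevalf(&_binStart - &_binWidth/2);
%let _rightEdge = %sysevalf(&_binEnd   + &_binWidth/2);
%put &=_leftEdge &=_rightEdge;

For the Plates data, the edges are 1.5 and 19.5.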

Emulate a histogram by using the high-low plot

When you use the HIGHLOW statement to emulate a histogram, use the TYPE=BAR option to display the high-low plot as bars. You can use the BARWIDTH=1 option to eliminate gaps between adjacent bars. In essence, the HIGHLOW statement emulates a "HISTOGRAMPARM" statement, which is not a supported statement in PROC SGPLOT.

title "High-Low Plot: Emulation of a Histogram";
proc sgplot data=_HistBins;
   highlow x=_midpt_ low=_zero_ high=_pct_ / type=bar barwidth=1;  /* or use _COUNT_ to use the count scale */
   yaxis min=0 offsetmin=0 grid;
   xaxis values=(&_binStart to &_binEnd by &_binWidth) valueshint;
run;

The high-low plot looks similar to the histogram that was created by PROC UNIVARIATE. I used three macro variables to place tick marks at the center of the bins. If you do not manually specify the tick values, the X axis will contain ticks in locations that might not correspond to the centers of the bars.

Summary

PROC UNIVARIATE (and PROC SEVERITY) enable SAS users to overlay about 20 common density curves on a histogram of data. To overlay a custom density curve requires some manual effort. For one curve, you can use a GTL method to overlay the histogram and the curve. For overlaying multiple curves, you can emulate a histogram by using a high-low plot. The %EmulateHistogram macro in this article enables you to quickly create a data set and macro variables that you can use on the HIGHLOW statement in PROC SGPLOT. The high-low plot is compatible with many other plot types, including the SERIES statement, so this is the first step towards overlaying custom density curves on a "histogram." In a subsequent article, I show how to create and overlay custom density curves.

The post Use a high-low plot to emulate a histogram in SAS appeared first on The DO Loop.