<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The DO Loop</title>
	<atom:link href="https://blogs.sas.com/content/iml/feed" rel="self" type="application/rss+xml" />
	<link>https://blogs.sas.com/content/iml/</link>
	<description>Statistical programming in SAS with an emphasis on SAS/IML programs</description>
	<lastBuildDate>Fri, 15 May 2026 14:22:09 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.6</generator>
	<item>
		<title>Compute the orthant probability for the 4-D multivariate normal distribution</title>
		<link>https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 18 May 2026 09:21:20 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58755</guid>

					<description><![CDATA[<p>This is the last article in a series about computing probabilities for the multivariate normal distribution. A previous article shows how to compute the probability of a standardized multivariate normal (MVN) distribution in an orthant for dimensions k=1, 2, and 3 by using the Childs-Sun formulas. The formulas are exact [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html">Compute the orthant probability for the 4-D multivariate normal distribution</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
This is the last article in a series about computing probabilities for the multivariate normal distribution.
A previous article shows <a href="https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html">how to compute the probability of a standardized multivariate normal (MVN) distribution in an orthant for dimensions k=1, 2, and 3 by using the Childs-Sun formulas.</a> 
The formulas are exact and are expressed in terms of the arcsine of the correlation coefficients of a general correlation matrix. 
For example, if X~MVN(0, R) is a standard trivariate normal random variable, the probability that X is in the first orthant 
{X | X1 &gt; 0, X2 &gt; 0, X3 &gt; 0}
is given by the formula:
<br />
<span class='MathJax_Preview'>\(P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)\)</span><script type='math/tex'>P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)</script>
<br />
where R<sub>12</sub>, R<sub>13</sub>, and R<sub>23</sub> are the off-diagonal elements of a general 3x3 correlation matrix. 
</p>
<p>
You might hope that higher-dimensional Childs-Sun formulas are similarly simple.
Alas, they are not!
For the higher-dimensional formulas, the probabilities are defined recursively.
As the dimension increases, the number of terms explodes combinatorially.
This article shows a SAS IML module that implements the Childs-Sun formula for the orthant probability of a 4-D standardized normal distribution.
After implementing the 4-D case, I have no desire to attempt higher dimensions!
</p>

<h3>The Childs-Sun formula for dimension 4</h3>
<p>
Appendix 1 shows a snippet from p. 12 of Genz and Bretz (2009), which describes the Childs-Sun formula for an MVN orthant probability. 
The formula relies on the Childs-Sun reduction method (Childs, 1967; Sun, 1988). 
Similar to 
the lower-dimensional formulas, the 4-D formula contains a sum of arcsines of the correlation coefficients.
However, the formula adds additional terms that involve an integral, I<sub>4</sub>. 
The integrands are complicated expressions of the off-diagonal elements of a 4x4 correlation matrix. 
</p><p>
In 4-D, there are 16 orthants. As discussed in a previous article, if you can find the probability in the "first orthant" (the one where all components have positive signs), you can use a clever "reflection trick" to find the probability in the other 15 octants.
</p><p>
In Genz and Bretz (2009), the Childs-Sun formula for the orthant probability, which they call P<sub>4</sub>, has an unfortunate typo.
The correct formula is shown in Appendix 1, where the red "delete" symbol removes an erroneous "plus sign."
It took me many hours to decipher the notation. By consulting other references (Gehrlein and Saniga (1977); Sun and Asano (1990)), 
and with some assistance from the Google Gemini AI, I was able to decipher the formulas and create a 
SAS IML function that evaluates the orthant probability. The IML function is presented in this article. 
</p>

<h3>Conditional probability and the geometry of the formula</h3>
<p>
I usually explain the mathematical details of my SAS programs, but I'm not going to do that today. 
The formula is too complicated to describe in a short blog post.
Some details are in <a href="https://doi.org/10.1093/biomet/54.1-2.293">the original Childs (1967) paper</a>, 
which is unfortunately behind a paywall. 
</p><p>
The general formula (see Appendix 1) is very complicated. But if you restrict to k=4 dimensions (hence, n=2), the formula 
looks something like this:
<br />
<span class='MathJax_Preview'>\(P_4 = \frac{1}{16} + \frac{1}{8\pi} \sum_{i < j}^4 \sin^{-1}(R_{ij}) + \frac{1}{4\pi^2} I_4(R)\)</span><script type='math/tex'>P_4 = \frac{1}{16} + \frac{1}{8\pi} \sum_{i < j}^4 \sin^{-1}(R_{ij}) + \frac{1}{4\pi^2} I_4(R)</script>
<br />
where I<sub>4</sub> is a complicated sum of one-variable integrals. Each integral 
depends on a lower-dimensional integration, called I<sub>2</sub>, which is why this is a recursive formula.
The notation is explained further in Appendix 2.
</p><p>

The one-variable integrals are the key.
In general, to evaluate the probability of a 4-D distribution, you must solve a 4-D volume integral. 
The Childs-Sun method uses the geometry of the multivariate normal distribution to reduce this volume to a sum of
iterated integrals that accumulate lower-dimensional slices. 
</p>
<p>
This formula is an application of <em>conditional probability</em>. 
The dummy integration variable, <em>x</em>, is a continuous variable on [0,1]. 
For a fixed value of <em>x</em>, the term R<sup>1i</sup>(x) evaluates to a conditional correlation matrix. 
The variables inside the integrand (e.g., the quadratic denominator w = 1 - R<sub>1i</sub><sup>2</sup> x<sup>2</sup>) 
are <a href="https://online.stat.psu.edu/stat505/Lesson06">conditional variances and partial correlations</a>.
</p>
<p>
The I<sub>2</sub> function evaluates a 2-D orthant probability (which has an exact arcsine solution) for the remaining two variables, given that the other variables are fixed at a slice determined by <em>x</em>. You can use the QUAD subroutine in SAS IML to integrate over <em>x</em> in [0,1], 
which aggregates these 2-D conditional probabilities to sweep out the exact probability mass of the 4-D volume.
</p>

<h3>Implementing the P4 formula in SAS IML</h3>
<p>
The main SAS IML function is called P4_childs, which computes the 4-D orthant probability. 
It calls the QUAD subroutine to compute I<sub>4</sub>. 
The integrand is ProbInt4, which builds terms from the off-diagonal 
elements of the 4x4 correlation matrix. The DO loop evaluates three different conditional correlation matrices. 
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* The base case: P2. For a scalar correlation, rho, returning the orthant probability, P2.
   See https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html */</span>
start P2<span style="color: #66cc66;">&#40;</span>rho<span style="color: #66cc66;">&#41;</span>;
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> <span style="color: #2e8b57; font-weight: bold;">0.25</span> + <span style="color: #0000ff;">arsin</span><span style="color: #66cc66;">&#40;</span>rho<span style="color: #66cc66;">&#41;</span> / <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #006400; font-style: italic;">*constant(&quot;pi&quot;)) );</span>
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* The sum of the I4 integrands. Instead of evaluating the three integrals separately,
   evaluate the sum of the integrals and perform one numerical integration.   
   Evaluates the Sun (1988) residual partial correlation matrix.
   Whereas Genz/Bretz Eqn 2.8-2.9 uses a general covariance matrix, this function assumes 
   R is a correlation matrix, so R[i,i]=1  */</span>
start ProbInt4<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>R_global<span style="color: #66cc66;">&#41;</span>;
  R = R_global;
  pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;pi&quot;</span><span style="color: #66cc66;">&#41;</span>;
  x2 = <span style="color: #0000ff;">x</span> # <span style="color: #0000ff;">x</span>;
  <span style="color: #0000ff;">sum</span> = <span style="color: #2e8b57; font-weight: bold;">0</span>;
&nbsp;
  <span style="color: #006400; font-style: italic;">/* The sum of three terms: i=2, 3, and 4 */</span>
  <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">2</span> to <span style="color: #2e8b57; font-weight: bold;">4</span>;
    <span style="color: #006400; font-style: italic;">/* Determine the indices (j, k) of the two remaining variables */</span>
    <span style="color: #0000ff;">if</span> i=<span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;      j=<span style="color: #2e8b57; font-weight: bold;">3</span>; k=<span style="color: #2e8b57; font-weight: bold;">4</span>; <span style="color: #0000ff;">end</span>;
    <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">if</span> i=<span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>; j=<span style="color: #2e8b57; font-weight: bold;">2</span>; k=<span style="color: #2e8b57; font-weight: bold;">4</span>; <span style="color: #0000ff;">end</span>;
    <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">do</span>;             j=<span style="color: #2e8b57; font-weight: bold;">2</span>; k=<span style="color: #2e8b57; font-weight: bold;">3</span>; <span style="color: #0000ff;">end</span>;
&nbsp;
    <span style="color: #006400; font-style: italic;">/* R_ii term: The quadratic denominator */</span>
    w = <span style="color: #2e8b57; font-weight: bold;">1</span> - R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,i<span style="color: #66cc66;">&#93;</span>##<span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #006400; font-style: italic;">* x2;</span>          <span style="color: #006400; font-style: italic;">/* R[i,i]=1 for all i */</span>
    <span style="color: #0000ff;">if</span> w&lt; 1E-12 <span style="color: #0000ff;">then</span> w = 1E-12;      <span style="color: #006400; font-style: italic;">/* Numerical safeguard */</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* Compute the conditional variances and covariance. These are a_11, a_12, etc, in Sun and Asano */</span>
    a_jj = <span style="color: #2e8b57; font-weight: bold;">1</span>      -R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,j<span style="color: #66cc66;">&#93;</span>##<span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #006400; font-style: italic;">* x2   -((R[i,j] - R[1,i]*R[1,j]*x2)##2) / w;</span>
    a_kk = <span style="color: #2e8b57; font-weight: bold;">1</span>      -R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,k<span style="color: #66cc66;">&#93;</span>##<span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #006400; font-style: italic;">* x2   -((R[i,k] - R[1,i]*R[1,k]*x2)##2) / w;</span>
    a_jk = R<span style="color: #66cc66;">&#91;</span>j,k<span style="color: #66cc66;">&#93;</span> -R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,j<span style="color: #66cc66;">&#93;</span><span style="color: #006400; font-style: italic;">*R[1,k]*x2 -(R[i,j] - R[1,i]*R[1,j]*x2)*(R[i,k] - R[1,i]*R[1,k]*x2) / w;</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* rho is the off-diagonal element of the conditional correlation matrix */</span>
    rho = a_jk / <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>a_jj <span style="color: #006400; font-style: italic;">* a_kk);</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* ensure rho is in [-1,1]. See
       https://blogs.sas.com/content/iml/2026/02/04/clip-values.html */</span>
    rho = <span style="color: #66cc66;">&#40;</span> -<span style="color: #2e8b57; font-weight: bold;">1</span> &lt;&gt; <span style="color: #66cc66;">&#40;</span>rho &gt;&lt; <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* clamp to [-1,1] */</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* Call P2 function to get the I2 term */</span>
    I2_val = <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #006400; font-style: italic;">* pi * ( P2(rho) - 0.25 );</span>     <span style="color: #006400; font-style: italic;">/* this actually simplifies to ARSIN(rho) :-) */</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* Add the i-th term to the sum */</span>
    <span style="color: #0000ff;">sum</span> = <span style="color: #0000ff;">sum</span> + <span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,i<span style="color: #66cc66;">&#93;</span> / <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>w<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #006400; font-style: italic;">* I2_val;</span>
  <span style="color: #0000ff;">end</span>;
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">sum</span><span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* The main P4 function, which evaluates the Childs-Sun formula for 4 dimensions */</span>
start P4_Childs<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>R_global<span style="color: #66cc66;">&#41;</span>;
  R_global = R; <span style="color: #006400; font-style: italic;">/* Set the global correlation matrix for QUAD */</span>
&nbsp;
  <span style="color: #006400; font-style: italic;">/* Integrate ProbInt4 over the interval [0, 1] */</span>
  <span style="color: #0000ff;">call</span> quad<span style="color: #66cc66;">&#40;</span>I4, <span style="color: #a020f0;">&quot;ProbInt4&quot;</span>, <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
  <span style="color: #006400; font-style: italic;">/* Apply the Childs (1967) formula */</span>
  pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;pi&quot;</span><span style="color: #66cc66;">&#41;</span>;
  sum_arsin = <span style="color: #0000ff;">sum</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">arsin</span><span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">7</span> <span style="color: #2e8b57; font-weight: bold;">8</span> <span style="color: #2e8b57; font-weight: bold;">12</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* sum of arcsine for all 6 off-diagonal elements */</span>
  prob = <span style="color: #2e8b57; font-weight: bold;">1</span>/<span style="color: #2e8b57; font-weight: bold;">16</span> + sum_arsin / <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">8</span><span style="color: #006400; font-style: italic;">*pi) + I4 / (4 * pi**2);</span>  <span style="color: #006400; font-style: italic;">/* fixes the typo in Genz and Bretz, Eqn 2.8 */</span>
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span>prob<span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* --- Simple Test --- */</span>
R = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1.0</span>  <span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">0.3</span>  <span style="color: #2e8b57; font-weight: bold;">0.2</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">1.0</span>  <span style="color: #2e8b57; font-weight: bold;">0.4</span>  <span style="color: #2e8b57; font-weight: bold;">0.3</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.3</span>  <span style="color: #2e8b57; font-weight: bold;">0.4</span>  <span style="color: #2e8b57; font-weight: bold;">1.0</span>  <span style="color: #2e8b57; font-weight: bold;">0.5</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.2</span>  <span style="color: #2e8b57; font-weight: bold;">0.3</span>  <span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">1.0</span><span style="color: #66cc66;">&#125;</span>;
&nbsp;
ans = p4_Childs<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span>;
print ans;</pre></td></tr></table></div>




<img decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb5.png" alt="" width="83" height="68" class="alignnone size-full wp-image-58803" />


<p>
When you run this code, the exact Childs-Sun formula produces an orthant probability of <code class="preserve-code-formatting">0.1611218</code> for the given matrix.
You can verify that value by using a Monte Carlo simulation. The following Monte Carlo simulation 
uses one million random points from the MVN(0, R) distribution. In addition to the 
Monte Carlo estimate, I generate a 95% confidence interval for the probability.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Use Monte Carlo simulation to estimate the probability that a 
   MVN random variable is less than a specified value in each coordinate.
   Sigma is a kxk covariance matrix; mu is an option row vector with k elements.
   The row vector b specifies the upper limits of integration. 
   The function returns an estimate of 
   P(X1&lt;b[1] &amp; X2&lt;b[2] &amp; ... &amp; Xk&lt;b[k] | X~MVN(mu, Sigma))
   by simulating N random variates from MVN(mu, Sigma) and returning the proportion
   that are in the specified region.
*/</span>
start MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, Sigma, mu=j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,ncol<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">X</span> = randnormal<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, mu, Sigma<span style="color: #66cc66;">&#41;</span>;
   inRegion = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
      v = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> &lt; b<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;
      inRegion = inRegion &amp; v;
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>inRegion<span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #0000ff;">N</span> = 1E6;
prob_MC = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#125;</span>, R<span style="color: #66cc66;">&#41;</span>;
SE_MC = <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span> prob_MC <span style="color: #006400; font-style: italic;">* (1-prob_MC)/N );</span>
Lower95 = prob_MC - <span style="color: #2e8b57; font-weight: bold;">1.96</span><span style="color: #006400; font-style: italic;">*SE_MC;</span>
Upper95 = prob_MC + <span style="color: #2e8b57; font-weight: bold;">1.96</span><span style="color: #006400; font-style: italic;">*SE_MC;</span>
print prob_MC Lower95 Upper95;</pre></td></tr></table></div>





<img decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb6.png" alt="" width="220" height="64" class="alignnone size-full wp-image-58800" />


<p>The output from the Monte Carlo simulation shows that the Monte Carlo estimate is close to the value returned by the Childs-Sun formula. In addition, the 95% confidence interval includes the exact value, which gives me "confidence" (pun intended!) that I implemented the Childs-Sun formula correctly.
You can repeat the computations for other 4x4 correlation matrices and obtain similar results.
</p>

<h3>Summary</h3>
<p>
As my dear mother used to say when faced with a challenging situation, "this is not for the faint of heart." 
The Childs-Sun formula for the orthant probability of a multivariate normal distribution is a complicated equation.
In 2-D and 3-D, it simplifies into a sum of arcsines. In higher dimensions, you must compute conditional correlation matrices 
and integrate them to evaluate the formula. I have shown a SAS IML function that implements the formula in 4-D. 
</p>
<p>
Despite the complexity, this is an amazing mathematical and computational result!
The Childs-Sun formula is an exact analytical solution to a 4-dimensional multivariate normal volume integral over an orthant.
And, as shown previously, you can use the P4_Childs function
to compute the MVN probability for any 4x4 correlation matrix in any orthant.
</p>


<h3>Appendix 1: The general Childs-Sun formula for orthant probabilities</h3>
<p>
The following image is from p. 12 of Genz and Bretz (2009), <em>Computation of Multivariate
Normal and t Probabilities</em>. The last term of Eqn 2.8 has a typo in Genz and Bretz. The equation contains a plus sign that should not be there. This typo caused me hours of confusion! I finally cross-referenced the equation with Sun and Asano (1990) and the original article by Childs (1967), which both display the correct equation.
</p><p>
Note the somewhat confusing notation in the Childs-Sun formula. 
The superscript on the argument to the I<sub>2j</sub> function describes the submatrix to use in computing the conditional probabilities.
This is not explained further in G&amp;B.
For the details of the implementation, see the SAS IML function in this article.

The superscript on the argument to the I<sub>4</sub> function describes the submatrix to use in computing the conditional probabilities.
This is not explained further in G&amp;B.
For the details of the implementation, see the SAS IML function in this article.
</p>

<img fetchpriority="high" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb3.png" alt="" width="548" height="776" class="alignnone size-full wp-image-58794" srcset="https://blogs.sas.com/content/iml/files/2026/05/orthantProb3.png 548w, https://blogs.sas.com/content/iml/files/2026/05/orthantProb3-212x300.png 212w" sizes="(max-width: 548px) 100vw, 548px" />


<h3>Appendix 2: The sum notation</h3>
<p>
The Genz and Bretz notation for the sum of integrals is very dense. Here is what it means in the 4-D case. For this case, n=2, which means that the outer sum reduces to j=2. The inner sum looks like 
<br />
<span class='MathJax_Preview'>\(\sum_{i_1 < \ldots < i_4}^4 I_4(R^{i_1,\ldots,i_4})\)</span><script type='math/tex'>\sum_{i_1 < \ldots < i_4}^4 I_4(R^{i_1,\ldots,i_4})</script>.
</p>
<p>
The notation is a combination sum. It means, "Sum over all possible combinations of the distinct indices chosen from the set {1, 2, 3, 4}
subject to the constraint that the combinations are in increasing order." Well, for four indices, there in only one combination that satisfies this constraint, and it is the tuple (1, 2, 3, 4). So the inner summation also vanishes and becomes I<sub>4</sub>(R). You can drop the superscript because you are sending in all four columns of the 4x4 matrix, R.
</p>
<p>
The I<sub>4</sub> term is defined recursively in terms of I<sub>2</sub>. Now the sum matters. When j=1, the summation is 
<br />
<span class='MathJax_Preview'>\(\sum_{i_1 < i_2}^4 I_2(R^{i_1,i_2})\)</span><script type='math/tex'>\sum_{i_1 < i_2}^4 I_2(R^{i_1,i_2})</script>.
There are six ways to choose two distinct ordered indices from the set {1,2,3,4}. Since  I<sub>2</sub>(R<sup>i1, i2</sup>) = arcsine(R[i1, i2]),
we get the sum-of-arcsine terms 
<span class='MathJax_Preview'>\(\sum_{i < j} \sin^{-1}( R_{ij} )\)</span><script type='math/tex'>\sum_{i < j} \sin^{-1}( R_{ij} )</script>.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html">Compute the orthant probability for the 4-D multivariate normal distribution</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2023/11/probbnrm8-150x150.png" />
	</item>
		<item>
		<title>Compute multivariate normal orthant probabilities</title>
		<link>https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 11 May 2026 09:24:15 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[Simulation]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58590</guid>

					<description><![CDATA[<p>In general, it is very difficult to compute a probability for a multivariate continuous distribution. For all continuous distributions, the probability requires solving a complicated multiple integral. For example, the probability for a bivariate normal distribution requires integrating the bivariate normal density over a two-dimensional (2-D) area. The probability for [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html">Compute multivariate normal orthant probabilities</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
In general, it is very difficult to compute a probability for a multivariate continuous distribution.
For all continuous distributions, the probability requires solving a complicated multiple integral.
For example, the probability for a bivariate normal distribution requires integrating the bivariate normal density 
over a two-dimensional (2-D) area.
The probability for a k-dimensional distribution requires integrating the density over a k-D volume.
</p><p>
The probability depends on the density (PDF) of the distribution as well as the region of integration.
One special region is the <em>orthant</em>. An orthant is a generalization of a quadrant.
In two dimensions, the Cartesian coordinate system divides the plane into four <em>quadrants</em>. 
In three dimensions, the Cartesian coordinate system divides the plane into eight <em>octants</em>. 
In k dimensions, there are 2<sup>k</sup> <em>orthants</em>. 
</p><p>
It is rare to find an exact formula for the probability of a multivariate distribution.
Amazingly, there are formulas for the probability of a standardized multivariate normal (MVN) distribution in an orthant
(Genz and Bretz (2009) p. 11-13, who cite Sheppard (1898), D. R. Childs (1967), and H. J. Sun (1988)).
</p><p>
This article shows the formulas for dimensions 1, 2, and 3. In a subsequent article, I will discuss
a SAS IML function that implements a formula for dimension k=4. 
The general case was solved by Childs (1967) and extended by Sun (1988), so 
I refer to these formulas as the Childs-Sun formulas.
</p>
<p>As shown in a previous article, <a href="https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html">it suffices to compute a probability for the standardized MVN distribution</a>
because a probability for a general MVN distribution can always be found by standardizing the problem.
To save typing, this article deals exclusively with the standardized MVN distribution and often omits the word "standardized".
</p>  

<h3>Orthant probabilities</h3>
<p>
For a vector X, <a href="https://blogs.sas.com/content/iml/2025/06/16/find-orthant.html">the signs of X are constant within each orthant</a>.
For example, a 2-D vector has signs (+1, +1) when it is in the first quadrant, and signs (-1, -1) when it is in the third quadrant. 
The term "orthant probability" traditionally refers to the first orthant, where all signs are positive.
</p><p>
The MVN distribution is symmetric under the transformation X &rarr; -X, so the probability that X~MVN(0,R) is in the "all-signs-positive" octant 
equals the probability that X is in the "all-signs-negative" octant. 
The "all-signs-negative" octant is important in probability and statistics. 
It represents the "CDF region," which is a left-tail probability. For example, for a 2-D random vector X=(X1,X2), the CDF 
evaluated at (0,0) is &Phi;<sub>2</sub>(0,0) = P(X1&lt;0 and X2&lt;0), which is the probability that X is in the third quadrant.  In terms of an integral,
<br />
<span class='MathJax_Preview'>\(
\Phi_2(0,0; \rho)
=
\int_{-\infty}^{0} \int_{-\infty}^{0} \phi_2(x_1, x_2; \rho)\, dx_2 \, dx_1
\)</span><script type='math/tex'>
\Phi_2(0,0; \rho)
=
\int_{-\infty}^{0} \int_{-\infty}^{0} \phi_2(x_1, x_2; \rho)\, dx_2 \, dx_1
</script>
<br />
Here, &rho; is the Pearson correlation between X1 and X2, &phi;<sub>2</sub> is the bivariate standard normal density,
and &Phi;<sub>2</sub> is the bivariate standard normal CDF.
</p><p>
This concept generalizes to higher-dimensional distributions: the "CDF orthant" is the "all-signs-negative" orthant for which every coordinate of X has a negative sign.
The CDF of the standardized normal distribution at the origin is the integral of the distribution's density over that orthant.
This article shows some Childs-Sun formulas for these probabilities for a general correlation matrix.
</p>

<h3>Orthant probabilities in dimensions 1 and 2</h3>
<p>
For the one-dimensional problem, the 1-D orthant is the set {X &lt; 0}.
The CDF at zero is P<sub>1</sub> = &Phi;<sub>1</sub>(0) = 0.5. You can obtain this value by solving an integration problem, but it is also evident from 
the fact that the normal distribution is symmetric about 0.
For this article, the important fact is that there is a formula for the 1-D orthant probability: P<sub>1</sub> = 1/2.
</p>
<p>
For 2-D, the correlation matrix has one parameter, &rho; = R[1,2], which is the correlation between the bivariate normal variates X1 and X2.
The 2-D (bivariate) formula (originally found by Sheppard, 1898) is 
<br />
<span class='MathJax_Preview'>\(P_2 = \Phi_2(0,0; \rho) = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho)\)</span><script type='math/tex'>P_2 = \Phi_2(0,0; \rho) = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho)</script>
</p>
<p>
Let's run a simple DATA step to verify this result. In SAS, you can use the PROBBNRM function to compute a bivariate normal probability:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* 2-D orthant probability for various rho values */</span>
<span style="color: #000080; font-weight: bold;">data</span> MVN2D;
<span style="color: #0000ff;">input</span> rho @@;
pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'pi'</span><span style="color: #66cc66;">&#41;</span>;
P2_formula = <span style="color: #2e8b57; font-weight: bold;">1</span>/<span style="color: #2e8b57; font-weight: bold;">4</span> + <span style="color: #2e8b57; font-weight: bold;">1</span>/<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #006400; font-style: italic;">*pi) * arsin(rho);</span>
P2_integral = probbnrm<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">0</span>, rho<span style="color: #66cc66;">&#41;</span>;
diff = P2_formula - P2_integral;
<span style="color: #0000ff;">drop</span> pi;
datalines;
-<span style="color: #2e8b57; font-weight: bold;">0.8</span> -<span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">0.8</span> 
;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc print</span> <span style="color: #000080; font-weight: bold;">data</span>=MVN2D noobs; 
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb1.png" alt="" width="241" height="171" class="alignnone size-full wp-image-58728" />

<p>
As shown in the output, the formula agrees with the PROBBNRM function to within machine precision.
</p>


<h3>Orthant probabilities in dimension 3</h3>
<p>
In 3-D the standard multivariate normal distribution has three parameters: the off-diagonal elements of the 3x3 correlation matrix, R.
The formula for the orthant probability in dimension 3 is given by 
<br />
<span class='MathJax_Preview'>\(P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)
\)</span><script type='math/tex'>P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)
</script>
</p><p>
Notice that you can write P3 by using a summation formula in which you sum over the indices 
of the upper triangular elements of the correlation matrix, R.
</p><p>
To validate this formula, let's use <a href="https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html">a Monte Carlo estimation routine</a>, which was developed in a previous article. The following SAS IML program 
evaluates the formula for a few correlation matrices, and compares the formula and the Monte Carlo estimate. 
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* 3-D orthant probability for various correlation matrices, R */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* Use Monte Carlo simulation to estimate the probability that a 
   MVN random variable is less than a specified value in each coordinate.
   See https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html
   The function estimates CDF(b).
*/</span>
start MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, Sigma, mu=j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,ncol<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">X</span> = randnormal<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, mu, Sigma<span style="color: #66cc66;">&#41;</span>;
   inRegion = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
      v = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> &lt; b<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* binary variable for the i_th coordinate */</span>
      inRegion = inRegion &amp; v;  <span style="color: #006400; font-style: italic;">/* find the intersection of all conditions */</span>
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>inRegion<span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* proportion of random points that satisfy criterion */</span>
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Evaluate the MVN CDF at the origin (=&quot;orthant probability&quot;) by using the exact P3 formula */</span>
start Orthant_P3<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span>;
  pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;pi&quot;</span><span style="color: #66cc66;">&#41;</span>;
  upper_tri_corrs = R<span style="color: #66cc66;">&#91;</span> <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">6</span><span style="color: #66cc66;">&#125;</span> <span style="color: #66cc66;">&#93;</span>;
  sum_arsin = <span style="color: #0000ff;">sum</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">arsin</span><span style="color: #66cc66;">&#40;</span> upper_tri_corrs <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
  prob = <span style="color: #2e8b57; font-weight: bold;">1</span>/<span style="color: #2e8b57; font-weight: bold;">8</span> + sum_arsin / <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #006400; font-style: italic;">* pi);</span>
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span>prob<span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #0000ff;">N</span> = 1E6;     <span style="color: #006400; font-style: italic;">/* size of Monte Carlo sample */</span>
b = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#125;</span>; <span style="color: #006400; font-style: italic;">/* the origin */</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Example 1: AR(1) correlation matrix */</span>
R = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1</span>    <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">0.25</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">0.5</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.25</span> <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">1</span>    <span style="color: #66cc66;">&#125;</span>;
P3_formula = Orthant_P3<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span>;
P3_MC_est  = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, R<span style="color: #66cc66;">&#41;</span>;
diff       = P3_formula - P3_MC_est;
print P3_formula P3_MC_est diff;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Example 2: positive and negative correlations */</span>
R = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1</span>   -<span style="color: #2e8b57; font-weight: bold;">0.5</span> -<span style="color: #2e8b57; font-weight: bold;">0.1</span>,
    -<span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">1</span>    <span style="color: #2e8b57; font-weight: bold;">0.8</span>,
    -<span style="color: #2e8b57; font-weight: bold;">0.1</span>  <span style="color: #2e8b57; font-weight: bold;">0.8</span>  <span style="color: #2e8b57; font-weight: bold;">1</span>    <span style="color: #66cc66;">&#125;</span>;
P3_formula = Orthant_P3<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span>;
P3_MC_est  = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, R<span style="color: #66cc66;">&#41;</span>;
diff       = P3_formula - P3_MC_est;
print P3_formula P3_MC_est diff;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb2.png" alt="" width="241" height="136" class="alignnone size-full wp-image-58725" />

<p>
Results are shown for two different correlation matrices. The formula in the Orthant_P3 function is exact.
The Monte Carlo estimate is close to the exact value.
</p>

<h3>The computation for other 2-D quadrants</h3>
<p>
The formula is more powerful than it initially seems because of reflective symmetries of the multivariate normal distribution.
The Childs-Sun formula enables you to compute the probability in <em>all</em> orthants, not just the "CDF orthant" where 
every component of the random variable has a negative sign. This is because you can change the signs of the components of the 
MVN random variable to obtain the other quadrants. Geometrically, "changing the sign" means considering the transformation 
X<sub>i</sub> &rarr; -X<sub>i</sub> for some coordinate. This transformation changes the sign of the correlation 
between X<sub>i</sub> and the other variables. Thus, you can obtain the correlation matrix for any transformation that reflects 
one or more coordinates.
</p>
<p>
Let's see how this works in two dimensions. 
Let &rho; be the off-diagonal (R[1,2]) correlation. Here is how to get the probability in all four quadrants:
</p>
<ol>
<li>Quadrant I: This is the orthant probability, P2 = 1/4 + arsin(&rho;) / (2&pi;).</li>
<li>Quadrant II: To find the probability in the region Q2 = {X1 &lt; 0, X2 &gt; 0}, 
define Y1 = -X1 and Y2 = X2. Y is bivariate normal with correlation -&rho;. Because arsin(-&rho;) = -arsin(&rho;),
the probability in Q2 is 1/4 - arsin(&rho;) / (2&pi;). </li>
<li>Quadrant III: To find the probability in the region Q3 = {X1 &lt; 0, X2 &lt; 0}, 
define Y1 = -X1 and Y2 = -X2. Y is bivariate normal with correlation &rho;, so the probability over Q3 is P2.
</li>
<li>Quadrant IV: To find the probability in the region Q4 = {X1 &gt; 0, X2 &lt; 0}, 
define Y1 = X1 and Y2 = -X2. Y is bivariate normal with correlation -&rho;, therefore 
the probability in Q4 is 1/4 - arsin(&rho;) / (2&pi;). </li>
</ol>

<h3>The computation for other k-D orthants</h3>
<p>
The same idea applies for any dimension. 
The Childs-Sun formula gives the orthant probability in the first quadrant where all coordinates are positive.
Every orthant is identified by a sign vector s=(s1, s2, ..., sk), where each component is either +1 or -1.
To compute the probability that X~MVN(0, R) is in the orthant with the sign vector s, do the following:
</p>
<ol>
<li>Define a new MVN variable, Y = (s1*X1, s2*X2, ..., sk*Xk). 
</li>
<li>Compute the correlation matrix for Y. This requires changing some of the signs of the off-diagonal elements in R. See the Appendix for details.
</li>
<li>Use the Childs-Sun formula to find the probability that Y is in the first orthant. This is also the 
probability that X is in the orthant with sign vector s.
</li>
</ol>

<h3>Summary</h3>
<p>
In summary, this article shows three Childs-Sunformulas for computing the orthant probability for a 
standardized multivariate normal distribution 
with correlation matrix R in dimension k:
</p>
<ul>
<li>If k=1, the univariate probability is <span class='MathJax_Preview'>\(P_1 = \frac{1}{2}\)</span><script type='math/tex'>P_1 = \frac{1}{2}</script>.
</li>
<li>If k=2, the bivariate probability is <span class='MathJax_Preview'>\(P_2 = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(R_{12})\)</span><script type='math/tex'>P_2 = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(R_{12})</script>.
</li>
<li>If k=3, the trivariate probability is 
<span class='MathJax_Preview'>\(P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)\)</span><script type='math/tex'>P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)</script>
</li>
</ul>
<p>
You can apply these formulas as written to compute the probability in the first orthant (all signs positive) or in the "CDF orthant" (all signs negative). You can use the reflective symmetry of the MVN distribution to find the probabilities in every other orthant.
</p>


<h3>Appendix: How the correlation matrix changes under coordinate reflections</h3>
<p>
This main article indicates that you can apply reflection transformations for coordinates of X~MVN(0,R) and find a 
new correlation matrix, C, such that Childs-Sun formula applies to C gives the probability that X is in some orthant.
This appendix shows the linear algebra steps.
</p><p>
Suppose X~MVN(0,R), and you want to find the probability that X is in an orthant defined by a sign vector s=(s1, s2, ..., sk). In this vector, each s<sub>i</sub> is either +1 or -1.
We can use s to define a new random vector, Y, such that Y is in the first orthant exactly when X is in the target orthant.
</p>
<p>
Let S be the k x k diagonal matrix whose diagonal elements are the components of the sign vector, s.
Define the linear transformation Y = S X.
Because S is a diagonal matrix of signs, the i_th component of Y is simply Y<sub>i</sub> = s<sub>i</sub> * X<sub>i</sub>.
By definition, if X is in the target orthant, then the sign of X<sub>i</sub> matches s<sub>i</sub>.
Therefore, the product s<sub>i</sub> * X<sub>i</sub> is always positive, which means Y is in the first orthant.
</p>
<p>
Because Y is a linear transformation of MVN variables, it is also multivariate normal.
The expected value of Y is E[Y] = S E[X] = 0.
The covariance matrix for Y (call it C) is given by the standard formula for the variance of a linear transformation:
<br />
<span class='MathJax_Preview'>\(
C = \text{Cov}(Y) = S \text{Cov}(X) S^T = S R S
\)</span><script type='math/tex'>
C = \text{Cov}(Y) = S \text{Cov}(X) S^T = S R S
</script>

</p>
<p>
If you multiply the matrices, you will find that the elements of C are given by
C<sub>ij</sub> = s<sub>i</sub> s<sub>j</sub> R<sub>ij</sub>.
Because s<sub>i</sub><sup>2</sup> = 1, the diagonal elements of C are 1, which means C is a valid correlation matrix.
The off-diagonal elements of C simply flip the sign of the correlation whenever exactly one of the two corresponding variables is reflected.
</p>
<p>
In summary, Y~MVN(0, C). The Childs-Sun formula with the correlation matrix C 
gives the probability that Y is in the first orthant.
This is exactly the probability that X is in the orthant with the given sign vector, s.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html">Compute multivariate normal orthant probabilities</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2023/11/probbnrm8-150x150.png" />
	</item>
		<item>
		<title>On standardizing multivariate normal probabilities</title>
		<link>https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 04 May 2026 09:24:47 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58605</guid>

					<description><![CDATA[<p>A course in elementary statistics always introduces the "Z-score." A Z-score is the result of standardizing a normally distributed random variable. By subtracting the distribution's mean and dividing by its standard deviation, you transform a general normal random variable into a standardized variable that has zero mean and unit standard [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html">On standardizing multivariate normal probabilities</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A course in elementary statistics always introduces the "Z-score." 
A Z-score is the result of standardizing a normally distributed random variable. By subtracting the
distribution's  mean and dividing by its standard deviation, you transform a general 
normal random variable into a standardized variable that has zero mean and unit standard deviation. 
Historically, this transformation was essential because it enabled statisticians to look up probabilities in a single printed table. 
Conceptually, it enables a uniform analysis of normally distributions by measuring all quantities in units of the 
standard deviation about the mean.  
</p>

<p>The concept of a "Z-score" also applies to multivariate normal distributions. 
In multidimensional statistics, you can standardize a multivariate normal (MVN) probability calculation by transforming the covariance matrix into the corresponding correlation matrix. This article shows how to shift and scale high-dimensional MVN distributions, both mathematically and by using the SAS IML language.
</p>

<h3>The 1-D Z-score transformation</h3>

<p>Let's start by reviewing the Z-score transformation for the 1-D normal distribution.
Let <em>X</em> be a normally distributed random variable with mean &mu; and standard deviation &sigma;.
In notation, <em>X</em> &sim; N(&mu;, &sigma;). (Some authors use N(&mu;, &sigma;<sup>2</sup>), but the SAS probability functions (CDF, PDF, QUANTILE, etc.) use the standard deviation as a parameter.)
If you want to compute a left-tail probability, such as P(<em>X</em> &lt; <em>b</em>), you can 
compute the probability on the original (data) scale, or you can apply a transformation and compute the 
probability on the standardized scale. This is useful in numerical computations because it 
reduces the problem to two steps: first, standardize the problem, then compute a probability on the standardized problem. 
Each of these steps can be developed, debugged, and tested independently. 
</p>
<p>
Let's apply the Z-score transformation. Define a new random variable <em>Z</em> = (<em>X</em> - &mu;)/&sigma;. Because a linear transformation of a normal random variable is also normal, <em>Z</em> follows a standard normal distribution: <em>Z</em> &sim; N(0, 1).
By applying the same transformation to the integration limit <em>b</em>, we define a standardized limit <em>c</em> = (<em>b</em> - &mu;)/&sigma;. Consequently, computing the probability on the original data scale is mathematically equivalent to computing the probability on the standardized scale:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
P(<em>X</em> &lt; <em>b</em> | X~N(&mu;, &sigma;)) = P(<em>Z</em> &lt; <em>c</em> | Z~N(0, 1))
</p>

<p>In SAS, you can compute these probabilities directly using the <code class="preserve-code-formatting">CDF</code> function. 
The following SAS IML program demonstrates this equivalence. 
The goal is to find the probability that a random variable from N(&mu;, &sigma;) is less than <em>x</em>. 
You can compute this directly on the original scale, or you can convert the 
limit to a Z-score and evaluate it using the standard normal distribution:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">data</span> Prob;
<span style="color: #0000ff;">input</span> mu sigma @@;
<span style="color: #0000ff;">do</span> <span style="color: #0000ff;">x</span> = -<span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #2e8b57; font-weight: bold;">1</span>;
   <span style="color: #006400; font-style: italic;">/* 1. Compute probability on original scale */</span>
   prob_x = <span style="color: #0000ff;">CDF</span><span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, <span style="color: #0000ff;">x</span>, mu, sigma<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #006400; font-style: italic;">/* 2. Compute probability on standardized scale */</span>
   z = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span> - mu<span style="color: #66cc66;">&#41;</span>/sigma;
   prob_z = <span style="color: #0000ff;">CDF</span><span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, z<span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* N(0,1) */</span>
   <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
datalines;
 <span style="color: #2e8b57; font-weight: bold;">2</span>   <span style="color: #2e8b57; font-weight: bold;">3</span>
 <span style="color: #2e8b57; font-weight: bold;">1.5</span> <span style="color: #2e8b57; font-weight: bold;">2</span>
-<span style="color: #2e8b57; font-weight: bold;">0.5</span>   <span style="color: #2e8b57; font-weight: bold;">0.5</span>
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc print</span> noobs; <span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/stdmvn1.png" alt="" width="289" height="281" class="alignnone size-full wp-image-58653" />

<p>
The program uses several examples of <em>x</em>, &mu; and &sigma;.
For all examples, the PROB_X and the PROB_Z columns are identical. 
Geometrically, the X variable represents a data value whereas the Z variable measures the number of standard deviations from the mean.
As shown in the next section, you can generalize this concept to multivariate normal distributions.
</p>

<h3>Standardizing the multivariate normal distribution</h3>
<p>
Suppose you want to compute the joint cumulative probability for a multivariate normal distribution. 
Let <em>X</em> be a <em>k</em>-dimensional random variable such that <em>X</em> &sim; MVN(&mu;, &Sigma;),
 where &mu; = (&mu;<sub>1</sub>, &mu;<sub>2</sub>, ..., &mu;<sub>k</sub>) is the mean vector and 
&Sigma; is a <em>k</em>&nbsp;x&nbsp;<em>k</em> covariance matrix. 
The CDF or left-tail probability is 
P(<em>X</em> &lt; <em>b</em>), where <em>b</em> = (<em>b</em><sub>1</sub>, <em>b</em><sub>2</sub>, ..., <em>b</em><sub>k</sub>) is a vector of upper limits, and the inequality is evaluated component by component.
</p>

<p>
To standardize this problem, you can standardize each original variable. Instead of doing this one variable at a time,
you can use matrix-vector notation and transform all variables simultaneously.
Let <em>D</em> be the diagonal matrix of standard deviations. That is, the diagonal elements of <em>D</em> are the square roots of the diagonal elements of &Sigma;. Then you can express the covariance matrix as &Sigma; = D*R*D, where the matrix <em>R</em> has 1s on the diagonal. 
The matrix <em>R</em> is the correlation matrix for the original variables (and the covariance matrix for the new standardized variables).
</p>
<p>
The standardized random variables are defined by <em>Z</em> = D<sup>-1</sup>(<em>X</em> - &mu;).
<em>Z</em> is multivariate normal. To find its parameters, you can compute its expected value and variance. The expected value is simply a vector of zeros. The variance of <em>Z</em> is <em>R</em>. Thus, the transformed variable <em>Z</em> follows a standardized multivariate normal distribution with mean zero and correlation matrix <em>R</em>. In notation, <em>Z</em> ~ MVN(0, <em>R</em>). 
</p>

<p>To complete the probability transformation, apply the same linear transformation to the integration limits, <em>b</em>. 
The standardized limits are <em>c</em> = D<sup>-1</sup>(<em>b</em> - &mu;). 
A probability for <em>X</em> on the data scale (covariance scale) is equal to the probability for <em>Z</em> on the correlation scale:
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
P(X &lt; b | X ~ MVN(&mu;, &Sigma;)) = P(Z &lt; c | Z ~ MVN(0, R))
</p>
<p>
This is why, for example, <a href="https://blogs.sas.com/content/iml/2023/12/04/bivariate-normal-probability-sas.html">the PROBBNRM function in SAS</a> computes probabilities for the standardized variable Z~BVN(0, R), where BVN is the bivariate normal distribution, and R is a 2x2 correlation matrix.
The developer who implemented the function assumed that users can transform any problem to the standardized scale.
</p>

<h3>Converting from covariance to correlation in SAS IML</h3>
<p>
The SAS IML language supports <a href="https://blogs.sas.com/content/iml/2012/02/17/convert-a-covariance-matrix-to-a-correlation-matrix-in-sas.html">the COV2CORR function</a>, which returns the correlation matrix that is associated with a covariance matrix. 
By using the math in the previous section, you can write a SAS IML function that will transform the limits of integration. 
The following IML function computes
<em>c</em> = D<sup>-1</sup>(<em>b</em> - &mu;). 
The SAS IML language enables you to perform this operation efficiently by using vectorized division, rather than multiplying by 
the inverse of a diagonal matrix:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* Sigma is a kxk covariance matrix and b is a row vector with k elements.
   Convert b to inv(D)*(b-mu), where D is the diagonal matrix of
   standard deviations: sqrt(vecdiag(Sigma)).
   The mu vector is optional and defaults to the zero vector.
*/</span>
start Xform_Limits_Cov2Corr<span style="color: #66cc66;">&#40;</span> b, Sigma, mu=j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,ncol<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
   D = <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>vecdiag<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* extract diagonal (variances) for X */</span>
   c = <span style="color: #66cc66;">&#40;</span>b - mu<span style="color: #66cc66;">&#41;</span> / rowvec<span style="color: #66cc66;">&#40;</span>D<span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* new limits for Z */</span>
   <span style="color: #0000ff;">return</span> c;
finish;</pre></td></tr></table></div>




<p>
As we did for the 1-D case, let's show the equivalency between the CDF on the data scale and the CDF for the standardized problem.
The exact computation of multivariate normal probabilities in high dimensions is difficult. Instead, 
I will use a simple Monte Carlo simulation to estimate the probabilities. This technique is easy to code and will suffice to demonstrate 
the equivalency of the probabilities.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Use Monte Carlo simulation to estimate the probability that a 
   MVN random variable is less than a specified value in each coordinate.
   Sigma is a kxk covariance matrix; mu is an optional row vector with k elements.
   The row vector b specifies the upper limits of integration. 
   The function returns an estimate of 
   P(X1&lt;b[1] &amp; X2&lt;b[2] &amp; ... &amp; Xk&lt;b[k] | X~MVN(mu, Sigma))
   by simulating N random variates from MVN(mu, Sigma) and returning the proportion
   that are in the specified region.
*/</span>
start MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, Sigma, mu=j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,ncol<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">X</span> = randnormal<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, mu, Sigma<span style="color: #66cc66;">&#41;</span>;
   inRegion = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
      v = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> &lt; b<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* binary variable for the i_th coordinate */</span>
      inRegion = inRegion &amp; v;  <span style="color: #006400; font-style: italic;">/* find the intersection of all conditions */</span>
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>inRegion<span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* proportion of random points that satisfy criterion */</span>
finish;</pre></td></tr></table></div>




<h3>An Example: The 5-D Min Matrix</h3>

<p>
You can use the built-in <code class="preserve-code-formatting">COV2CORR</code> function and the newly defined <code class="preserve-code-formatting">Xform_Limits_Cov2Corr</code> function to 
transform a problem on the covariance (data) scale into an equivalent standardized problem.
We will use a 5-dimensional "Min Matrix" as our covariance matrix. <a href="https://blogs.sas.com/content/iml/2026/02/16/min-matrix.html">The Min Matrix is a structured matrix</a> where the (<em>i</em>, <em>j</em>)-th element is equal to min(<em>i</em>, <em>j</em>). 
The mean vector, &mu;, and the integration limits, <em>b</em>, are arbitrary vectors.
</p>

<p>
The following program estimates the probability twice, both times using one million random variates in a Monte Carlo simulation.
First, it uses the <code class="preserve-code-formatting">MC_CDFMVN</code> function for the 5-D Min Matrix. 
Then it transforms the limits, creates the associated correlation matrix, and estimates the probability on the standardized scale.
</p>
<p>
So that the estimates are the same, and not just close to each other, you can use the <code class="preserve-code-formatting">RANDSEED</code> subroutine
to reset the random number stream prior to each simulation. This ensures that both computations use the exact same stream of pseudo-random numbers.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Example: use a 5-D &quot;Min Matrix&quot;. 
   See: https://blogs.sas.com/content/iml/2026/02/16/min-matrix.html
*/</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* Define the parameters for MVN(mu, Sigma) */</span>
Sigma = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">1</span>,
         <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">2</span>,
         <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">3</span>,
         <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">4</span>,
         <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">5</span> <span style="color: #66cc66;">&#125;</span>;
mu = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">5</span><span style="color: #66cc66;">&#125;</span>;
<span style="color: #0000ff;">N</span> = 1E6;
&nbsp;
<span style="color: #006400; font-style: italic;">/* estimate the CDF probability at b for MVN(mu, Sigma)*/</span>
b = <span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">7</span><span style="color: #66cc66;">&#125;</span>;
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* reset the RNG */</span>
prob_cov = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, Sigma, mu<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* standardize the parameters to the correlation scale */</span>
c = Xform_Limits_Cov2Corr<span style="color: #66cc66;">&#40;</span>b, Sigma, mu<span style="color: #66cc66;">&#41;</span>;
R = cov2corr<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* estimate the CDF probability at c for MVN(0, R)*/</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* reset the RNG */</span>
prob_corr = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, c, R<span style="color: #66cc66;">&#41;</span>;
&nbsp;
print prob_cov prob_corr;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/stdmvn2.png" alt="" width="150" height="65" class="alignnone size-full wp-image-58650" />

<p>The output shows that the probability estimate for the original problem is identical to the estimate for the 
standardized problem. 
</p>

<h3>Summary</h3>

<p>
You can use techniques from linear algebra to extend the concept of a Z-score to multivariate normal distributions.
You do not have to physically standardize each column of the data.
Rather, you can convert the covariance matrix to the associated correlation matrix, apply a linear transformation to the 
limits of integration, and&mdash;voil&agrave;!&mdash;you can compute all multivariate normal probabilities 
in a standardized way. 
</p>

<p>
This transformation is not purely academic. In computational statistics, algorithms that
evaluate multivariate distributions are often implemented for the standardized correlation scale to ensure 
stable numerical computations. By scaling the problem and then computing answers on the correlation scale, 
the numerical routines tend to be more accurate and reliable.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html">On standardizing multivariate normal probabilities</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2023/11/probbnrm8-150x150.png" />
	</item>
		<item>
		<title>Specify knots for a spline basis</title>
		<link>https://blogs.sas.com/content/iml/2026/04/27/define-knots-spline.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/04/27/define-knots-spline.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 27 Apr 2026 09:29:05 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[Regression]]></category>
		<category><![CDATA[SAS Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58363</guid>

					<description><![CDATA[<p>A previous article discusses regression splines and how to use the EFFECT statement in SAS regression procedures to specify the location of knots for a regressor variable, X. Knots are breakpoints that partition the range of X into subintervals. The splines are defined on a set of adjacent subintervals. The [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/27/define-knots-spline.html">Specify knots for a spline basis</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A previous article discusses regression splines and <a href="https://blogs.sas.com/content/iml/2026/04/22/viz-knots-splines.html">how to use the EFFECT statement in SAS regression procedures to specify the location of <em>knots</em></a> for a regressor variable, X. 
Knots are breakpoints that partition the range of X into subintervals. The splines are defined on a set of
adjacent subintervals. The EFFECT statement supports several common knot-placement methods including equally spaced knots, and knots placed at percentiles of the data.
B-splines not only require placing "internal" knots within the range of the data but also use "external" or "boundary" knots. 
The EFFECT statement provides two methods for placing external knots: you can stack knots at the minimum and maximum value of X, or you can place 
the knots at equally spaced intervals outside of the data range.
</p>
<p>
The EFFECT statement provides a convenient way to specify knots and splines from within a SAS regression procedure. 
However, certain applications (such as simulation studies) require more control of knot placement.
For these applications, SAS IML provides <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/imlug/imlug_langref_sect072.htm">the BSPLINE function</a>, which enables you to evaluate B-spline basis functions
at a set of X values, given the knots. This article demonstrates the BSPLINE function and 
shows how to generate knots according to the common schemes that are supported by the EFFECT statement.
</p>


<h3>The example data</h3>
<p>
Unlike other statistical modeling methods, knots and splines can be defined without using any data. 
If you specify an interval, [a, b], and a set of knots,
you can construct a B-spline basis on the interval. However, in practice, the 
interval is [min(X), max(X)] for some data variable, X. Furthermore, a common 
method to place the knots inside the interval is to use percentiles of the data.
To accommodate these two facts, let's use the Horsepower variable in 
the Sashelp.cars data set as "X". 
</p><p>
So that you can easily run the examples on your own data, 
I have put the name of the data set and the name of the variable into macro variables (DSName and XVar, respectively). 
By modifying those two macro variables, the code in this article should work for any data set and any numerical variable.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* example data */</span>
<span style="color: #000080; font-weight: bold;">data</span> Have;
<span style="color: #0000ff;">set</span> sashelp.cars;
<span style="color: #0000ff;">keep</span> Horsepower;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* define these macro variables to point to your data set and numerical variable */</span>
<span style="color: #0000ff;">%let</span> <span style="color: #0000ff;">DSName</span> = Have;
<span style="color: #0000ff;">%let</span> XVar = Horsepower;</pre></td></tr></table></div>




<h3>The BSPLINE function</h3>
<p>
PROC IML supports <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/imlug/imlug_langref_sect072.htm">the BSPLINE function</a>, 
which enables you to evaluate a B-spline basis function on an arbitrary set of X values.
</p>
<p>
First, let me demonstrate that you don't need any data to generate a B-spline basis.
For the Horsepower variable, X, min(X) = 73, max(X) = 500, and the 25th, 50th, and 75th percentiles of the data are 
165, 210, 255.
By using only these numbers, you can construct a B-spline basis on the interval [73, 500].
The following IML statements place internal knots at the values 165, 210, 255.
There are two "clamped" boundary knots at the endpoints of the interval. 
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Use the BSPLINE function in PROC IML to reproduce an example from 
   https://blogs.sas.com/content/iml/2026/04/22/viz-knots-splines.html
   which uses X=Horsepower variable from Sashelp.Cars
*/</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
deg = <span style="color: #2e8b57; font-weight: bold;">2</span>;
<span style="color: #006400; font-style: italic;">/*       |-min(X)-|-25th 50th 75th Pctl -|-max(X)-|  */</span>
knots = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">73</span>, <span style="color: #2e8b57; font-weight: bold;">73</span>,     <span style="color: #2e8b57; font-weight: bold;">165</span>, <span style="color: #2e8b57; font-weight: bold;">210</span>, <span style="color: #2e8b57; font-weight: bold;">255</span>,       <span style="color: #2e8b57; font-weight: bold;">500</span>, <span style="color: #2e8b57; font-weight: bold;">500</span><span style="color: #66cc66;">&#125;</span>;
t = T<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">do</span><span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">73</span>, <span style="color: #2e8b57; font-weight: bold;">500</span>, <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">500</span>-<span style="color: #2e8b57; font-weight: bold;">73</span><span style="color: #66cc66;">&#41;</span>/<span style="color: #2e8b57; font-weight: bold;">99</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* evaluate at 100 points in range of X */</span>
basis = bspline<span style="color: #66cc66;">&#40;</span>t, deg, knots<span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* 100x6 matrix. Columns are spline functions */</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* columns of BASIS are the splines evaluated at t */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Second Spline Basis&quot;</span>;
<span style="color: #0000ff;">call</span> series<span style="color: #66cc66;">&#40;</span> t, basis<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#41;</span> grid=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff;">x</span> y<span style="color: #66cc66;">&#125;</span> other=<span style="color: #a020f0;">&quot;refline 165 210 255/axis=x;&quot;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML1.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML1.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58518" srcset="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML1.png 640w, https://blogs.sas.com/content/iml/files/2026/04/BsplineIML1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
The BASIS variable is a matrix that has 100 rows and six columns. 
Each column is the result of evaluating a B-spline basis function 
at a set of evenly spaced values in the interval [73, 500]. 
The second basis function has support (that is, nonzero values) on the interval [73, 210].
The graph of that basis function is shown above.
</p><p>
The BASIS matrix contains the spline values in "wide form." 
If you want to visualize the basis by plotting all six
functions on the same graph, you can use the WIDETOLONG function to reshape the data, 
then use the GROUP= option on the SERIES call to plot the functions, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* if you want to plot all columns, convert wide to long and use the GROUP= option 
   See https://blogs.sas.com/content/iml/2015/02/27/multiple-series-iml.html
*/</span>
<span style="color: #000080; font-weight: bold;">run</span> WideToLong<span style="color: #66cc66;">&#40;</span>tt, Value, Spline_Number,              <span style="color: #006400; font-style: italic;">/* output (long) */</span>
               basis, t, <span style="color: #a020f0;">&quot;Spl1&quot;</span>:<span style="color: #a020f0;">&quot;Spl6&quot;</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* input (Basis is wide) */</span>
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;All B-Splines (deg=2)&quot;</span>;
<span style="color: #0000ff;">call</span> series<span style="color: #66cc66;">&#40;</span>tt, Value<span style="color: #66cc66;">&#41;</span> <span style="color: #0000ff;">group</span>=Spline_Number grid=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff;">X</span> Y<span style="color: #66cc66;">&#125;</span> 
                       other=<span style="color: #a020f0;">&quot;refline 165 210 255/axis=x;&quot;</span> <span style="color: #0000ff;">label</span>=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;&amp;XVar&quot;</span> <span style="color: #a020f0;">&quot;Spline Value&quot;</span><span style="color: #66cc66;">&#125;</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML2.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML2.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58515" srcset="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML2.png 640w, https://blogs.sas.com/content/iml/files/2026/04/BsplineIML2-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
This graph shows the same B-spline basis as for <a href="https://blogs.sas.com/content/iml/files/2026/04/Bsplines04.png">the PERCENTILES(3) example</a> in my previous article. 
The main difference is that the previous graph shows the splines evaluated at the observed values of X, whereas the graph in this article evaluates the splines
on an evenly spaced grid of points. The second basis function is now displayed in a red color.
</p>


<h3>A SAS IML function for common knot-placement schemes</h3>
<p>
Although B-splines can be defined without any reference to data, in practice they are often used as regression splines.
As such, the interval on which they are defined corresponds to the range of the X data. 
For some knot-placement schemes, the internal knots are placed at percentiles of the X variable.
Furthermore, to create a design matrix for fitting a regression model, you must evaluate the splines at the observed values of X.
So, in practice, it is useful to extract the knot positions from data.
</p><p>
As discussed in <a href="https://blogs.sas.com/content/iml/2026/04/22/viz-knots-splines.html">a previous article</a>, the EFFECT statement in SAS regression procedures handles these details automatically.
It is useful to define a SAS IML function that can generate the same knot-placement schemes that the EFFECT statement uses.
The schemes that depend on the data are shown in the following table:
</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Parameter</th>
      <th>Equivalent KNOTMETHOD= option</th>
      <th>External Knots</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>"PCTL"</td>
      <td>scalar n</td>
      <td>PERCENTILES(n)</td>
      <td>Boundary</td>
    </tr>
    <tr>
      <td>"PCTL"</td>
      <td>vector v</td>
      <td>PERCENTILELIST(v), where 0 &lt; v &lt; 100</td>
      <td>Boundary</td>
    </tr>
    <tr>
      <td>"EQUAL"</td>
      <td>scalar n</td>
      <td>EQUAL(n)</td>
      <td>Evenly spaced</td>
    </tr>
    <tr>
      <td>"RANGE"</td>
      <td>vector v</td>
      <td>RANGEFRACTIONS(v), where 0 &lt; v &lt; 1</td>
      <td>Boundary</td>
    </tr>
    <tr>
      <td>"LIST"</td>
      <td>vector v</td>
      <td>LIST(v), where min(X) &lt;= v &lt;= max(X)</td>
      <td>Boundary</td>
    </tr>
  </tbody>
</table>
<br />
<p>
It is straightforward to define a SAS IML function that returns the same set of knots as the EFFECT statement. (You can skip the "LIST" method, which assumes that the user 
already knows the positions of the knots.) The following function shows one implementation, which uses separate helper functions for generating internal and external knots. You can test the function by loading the Horsepower variable and sending that data vector into the function, along with a parameter that specifies the number and location of knots. So that the main ideas are evident, this implementation does not perform any argument validation.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* reproduce the knot-placement methods in the EFFECT statement:
   Method   Parm        Equivalent KNOTMETHOD= option            External Knots
   ======   ========    ======================================   ==============
   &quot;PCTL&quot;   scalar n    PERCENTILES(n)                           Boundary
   &quot;PCTL&quot;   vector v    PERCENTILELIST(v), where 0 &lt; v &lt; 100     Boundary
   &quot;EQUAL&quot;  scalar n    EQUAL(n)                                 Evenly spaced
   &quot;RANGE&quot;  vector v    RANGEFRACTIONS(v), where 0 &lt; v &lt; 1       Boundary
*/</span>
start Knots<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, deg, method, parm<span style="color: #66cc66;">&#41;</span>;
    int_knots = Knots_Internal<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, method, parm<span style="color: #66cc66;">&#41;</span>;
    <span style="color: #0000ff;">if</span> <span style="color: #0000ff;">upcase</span><span style="color: #66cc66;">&#40;</span>method<span style="color: #66cc66;">&#41;</span>=<span style="color: #a020f0;">&quot;EQUAL&quot;</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;
       delta = int_knots<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span> - int_knots<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span>;
       knots = Ext_Knots<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;EQUAL_L&quot;</span>, deg, <span style="color: #0000ff;">min</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>, delta<span style="color: #66cc66;">&#41;</span> //
               int_knots //
               Ext_Knots<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;EQUAL_R&quot;</span>, deg, <span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>, delta<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">else</span>
      knots = Ext_Knots<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;BOUNDARY&quot;</span>, deg, <span style="color: #0000ff;">min</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> //
              int_knots //
              Ext_Knots<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;BOUNDARY&quot;</span>, deg, <span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> knots <span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Helper: Function to create internal knots for B-splines.
   Valid methods are &quot;PCTL&quot;, &quot;EQUAL&quot;, or &quot;RANGE&quot;.
*/</span>
start Knots_Internal<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, method, _parm<span style="color: #66cc66;">&#41;</span>;
   parm = colvec<span style="color: #66cc66;">&#40;</span>_parm<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">if</span> <span style="color: #0000ff;">upcase</span><span style="color: #66cc66;">&#40;</span>method<span style="color: #66cc66;">&#41;</span>=<span style="color: #a020f0;">&quot;PCTL&quot;</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;
      <span style="color: #0000ff;">if</span> nrow<span style="color: #66cc66;">&#40;</span>parm<span style="color: #66cc66;">&#41;</span>=<span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #0000ff;">then</span> 
         <span style="color: #0000ff;">call</span> qntl<span style="color: #66cc66;">&#40;</span>knots, <span style="color: #0000ff;">x</span>, T<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:parm<span style="color: #66cc66;">&#41;</span>/<span style="color: #66cc66;">&#40;</span>parm+<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* PERCENTILES(n) */</span>
      <span style="color: #0000ff;">else</span> 
         <span style="color: #0000ff;">call</span> qntl<span style="color: #66cc66;">&#40;</span>knots, <span style="color: #0000ff;">x</span>, parm/<span style="color: #2e8b57; font-weight: bold;">100</span><span style="color: #66cc66;">&#41;</span>;             <span style="color: #006400; font-style: italic;">/* PERCENTILELIST(v) */</span>
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">if</span> <span style="color: #0000ff;">upcase</span><span style="color: #66cc66;">&#40;</span>method<span style="color: #66cc66;">&#41;</span>=<span style="color: #a020f0;">&quot;EQUAL&quot;</span> <span style="color: #0000ff;">then</span>
      knots = <span style="color: #0000ff;">min</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span> + <span style="color: #0000ff;">range</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span><span style="color: #006400; font-style: italic;">*T(1:parm)/(parm+1);</span> <span style="color: #006400; font-style: italic;">/* EQUAL(n) */</span>
   <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">if</span> <span style="color: #0000ff;">upcase</span><span style="color: #66cc66;">&#40;</span>method<span style="color: #66cc66;">&#41;</span>=<span style="color: #a020f0;">&quot;RANGE&quot;</span> <span style="color: #0000ff;">then</span> 
      knots = <span style="color: #0000ff;">min</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span> + <span style="color: #0000ff;">range</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span><span style="color: #006400; font-style: italic;">*parm;</span>               <span style="color: #006400; font-style: italic;">/* RANGEFRACTIONS(v) */</span>
   <span style="color: #0000ff;">else</span>
      knots = .;                                    <span style="color: #006400; font-style: italic;">/* unsupported method */</span>
   <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> knots <span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Helper: Function to create external or boundary knots for B-splines.
   Valid methods are &quot;BOUNDARY&quot;, &quot;EQUAL_L&quot;, &quot;EQUAL_R&quot; */</span>
start Ext_Knots<span style="color: #66cc66;">&#40;</span>method, deg, value, delta=<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">if</span> <span style="color: #0000ff;">upcase</span><span style="color: #66cc66;">&#40;</span>method<span style="color: #66cc66;">&#41;</span>=<span style="color: #a020f0;">&quot;BOUNDARY&quot;</span> <span style="color: #0000ff;">then</span>
      ext_knots = <span style="color: #0000ff;">repeat</span><span style="color: #66cc66;">&#40;</span>value, deg<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">if</span> <span style="color: #0000ff;">upcase</span><span style="color: #66cc66;">&#40;</span>method<span style="color: #66cc66;">&#41;</span>=<span style="color: #a020f0;">&quot;EQUAL_L&quot;</span> <span style="color: #0000ff;">then</span>
      ext_knots = value - T<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#40;</span>deg-<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>:<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span><span style="color: #006400; font-style: italic;">*delta;</span>
   <span style="color: #0000ff;">if</span> <span style="color: #0000ff;">upcase</span><span style="color: #66cc66;">&#40;</span>method<span style="color: #66cc66;">&#41;</span>=<span style="color: #a020f0;">&quot;EQUAL_R&quot;</span> <span style="color: #0000ff;">then</span>
      ext_knots = value + T<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span>:<span style="color: #66cc66;">&#40;</span>deg-<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #006400; font-style: italic;">*delta;</span>
   <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> ext_knots <span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* read in example data */</span>
use &amp;<span style="color: #0000ff;">DSName</span>;
   read all <span style="color: #0000ff;">var</span> <span style="color: #a020f0;">&quot;&amp;XVar&quot;</span> <span style="color: #0000ff;">into</span> <span style="color: #0000ff;">x</span>;
<span style="color: #0000ff;">close</span>;
deg = <span style="color: #2e8b57; font-weight: bold;">2</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* get knots for common placement schemes */</span>
k_Pctl3     = Knots<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, deg, <span style="color: #a020f0;">&quot;PCTL&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span>;
k_PctlList  = Knots<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, deg, <span style="color: #a020f0;">&quot;PCTL&quot;</span>, <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">20</span>, <span style="color: #2e8b57; font-weight: bold;">50</span>, <span style="color: #2e8b57; font-weight: bold;">80</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#41;</span>;
k_eq3       = Knots<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, deg, <span style="color: #a020f0;">&quot;EQUAL&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span>;
k_RangeList = Knots<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, deg, <span style="color: #a020f0;">&quot;RANGE&quot;</span>, <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">0.25</span>, <span style="color: #2e8b57; font-weight: bold;">0.50</span>, <span style="color: #2e8b57; font-weight: bold;">0.75</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* display the results in a single table */</span>
results = k_Pctl3 || k_PctlList || k_eq3 || k_RangeList;
rowHeader = <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'L Ext 1'</span>,<span style="color: #a020f0;">'L Ext 2'</span>,<span style="color: #a020f0;">'Int 1'</span>,<span style="color: #a020f0;">'Int 2'</span>,<span style="color: #a020f0;">'Int 3'</span>,<span style="color: #a020f0;">'R Ext 1'</span>,<span style="color: #a020f0;">'R Ext 2'</span><span style="color: #66cc66;">&#125;</span>;
colHeader = <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'Pctl(3)'</span>,<span style="color: #a020f0;">'Pctl List'</span>,<span style="color: #a020f0;">'Equal(3)'</span>,<span style="color: #a020f0;">'Range List'</span><span style="color: #66cc66;">&#125;</span>;
print results<span style="color: #66cc66;">&#91;</span>c=colHeader r=rowHeader L=<span style="color: #a020f0;">&quot;Knot-Placement Methods&quot;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML3.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML3.png" alt="" width="317" height="256" class="alignnone size-full wp-image-58545" srcset="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML3.png 317w, https://blogs.sas.com/content/iml/files/2026/04/BsplineIML3-300x242.png 300w, https://blogs.sas.com/content/iml/files/2026/04/BsplineIML3-168x137.png 168w" sizes="(max-width: 317px) 100vw, 317px" /></a>

<p>
I reviewed the previous article and verified that these are the same knot positions that are generated by the EFFECT statement.
</p>

<h3>A design matrix for B-splines in SAS IML</h3>
<p>
If you plan to use B-splines in a regression model, you need to create the design matrix for the X variable.
To generate a design matrix, you first specify a knot-placement scheme, which gives you a vector of knot locations. 
You can then call the BSPLINE function to produce the spline basis and evaluate the 
basis function at each observed value of X. The following statements show an example:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* to generate a design matrix, first get the knots, then evaluate the splines on the X data */</span>
knots = Knots<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, deg, <span style="color: #a020f0;">&quot;PCTL&quot;</span>, <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">20</span>, <span style="color: #2e8b57; font-weight: bold;">50</span>, <span style="color: #2e8b57; font-weight: bold;">80</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#41;</span>;
basis = bspline<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, deg, knots<span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* 428x6 matrix. Columns are spline functions */</span></pre></td></tr></table></div>




<p>
The BASIS matrix is the design matrix for the splines. It does not include an intercept column.
For a multivariable regression model, you would horizontally concatenate this design matrix with the data vectors or design matrices
for additional regressors.
</p>


<h3>Visualize the B-spline basis</h3>
<p>
If you want to visualize the spline basis,
you can add X to the design matrix, sort the columns, and then repeat the wide-to-long visualization from earlier in this article:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* if you want to visualize, sort the data by the X variable, then convert wide to long */</span>
M = <span style="color: #0000ff;">x</span> || basis;   <span style="color: #006400; font-style: italic;">/* concatenate design matrix with X, then sort by X */</span>
<span style="color: #0000ff;">call</span> sort<span style="color: #66cc66;">&#40;</span>M, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
t = M<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span>;
basis = M<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span>:ncol<span style="color: #66cc66;">&#40;</span>M<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>;
<span style="color: #000080; font-weight: bold;">run</span> WideToLong<span style="color: #66cc66;">&#40;</span>tt, Value, Spline_Number,              <span style="color: #006400; font-style: italic;">/* output (long) */</span>
               basis, t, <span style="color: #a020f0;">&quot;Spl1&quot;</span>:<span style="color: #a020f0;">&quot;Spl6&quot;</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* input (Basis is wide) */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;B-Splines (deg=2)&quot;</span>;
title2 <span style="color: #a020f0;">&quot;Knots Placed by using PERCENTILELIST({20, 50, 80})&quot;</span>;
<span style="color: #0000ff;">call</span> series<span style="color: #66cc66;">&#40;</span>tt, Value<span style="color: #66cc66;">&#41;</span> <span style="color: #0000ff;">group</span>=Spline_Number grid=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff;">X</span> Y<span style="color: #66cc66;">&#125;</span> 
                       <span style="color: #0000ff;">label</span>=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;&amp;XVar&quot;</span> <span style="color: #a020f0;">&quot;Spline Value&quot;</span><span style="color: #66cc66;">&#125;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML4.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML4.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58554" srcset="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML4.png 640w, https://blogs.sas.com/content/iml/files/2026/04/BsplineIML4-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<h3>Summary</h3>
<p>
This article shows how to write a function in SAS IML that reproduces common knot-placement methods for B-splines.
The B-splines use both Internal and external (or boundary) knots. Internal knots are placed 
within an interval, which is often the observed range of the data variable.
Boundary knots are stacked at the endpoints
of the interval. True external knots are placed outside the interval. 
After you generate the knots, you can then call the built-in BSPLINE function in IML to generate a spline design matrix.
</p><p>
Since most SAS regression procedures support the EFFECT statement, why would you want to generate knots and splines in IML?
I can think of a few reasons:
</p>
<ul>
<li>Varying the knot-placement scheme is useful for understanding whether the
parameter estimates in 
a spline regression model are sensitive to the knot-placement method.
</li>
<li>
If you want to generate your own knot-placement scheme, you can modify the IML functions in this article.
</li>
<li>By moving the generation of the spline design matrix into IML, you can easily run
simulation studies and bootstrap analyses. 
</li>
<li>B-splines are not the only type of spline. If you want to build a 
regression model that uses a new type of spline, you can reuse the standard knot-placement schemes for the new spline type.
</li>
</ul>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/27/define-knots-spline.html">Specify knots for a spline basis</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/04/27/define-knots-spline.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/04/BsplineIML4-150x150.png" />
	</item>
		<item>
		<title>Visualize the placement of knots for regression splines</title>
		<link>https://blogs.sas.com/content/iml/2026/04/22/viz-knots-splines.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/04/22/viz-knots-splines.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Wed, 22 Apr 2026 09:29:19 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Regression]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58402</guid>

					<description><![CDATA[<p>Splines are useful tools for fitting regression models to data. A spline replaces a single variable (call it X) with several other variables, which are a spline basis for X. When using a spline basis, the shape and location of the basis functions depend on the placement of knots. Knots [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/22/viz-knots-splines.html">Visualize the placement of knots for regression splines</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<a href="https://blogs.sas.com/content/iml/files/2026/04/Bsplines04.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines04.png" alt="" width="360" height="270" class="alignright size-full wp-image-58437" srcset="https://blogs.sas.com/content/iml/files/2026/04/Bsplines04.png 640w, https://blogs.sas.com/content/iml/files/2026/04/Bsplines04-300x225.png 300w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<p>
Splines are useful tools for fitting regression models to data. 
A spline replaces a single variable (call it X) with several other variables,
which are a spline basis for X.
When using a spline basis, the shape and location of the basis functions depend on the placement of <em>knots</em>.
Knots are breakpoints that partition the range of a variable into subintervals. The regression model can 
fit the behavior of the data separately on each subinterval.
</p><p>
This article explains, compares, and visualizes common  
knot-placement schemes for regression splines. 

</p><p>

Previous articles about regression splines have visualized 
natural cubic splines and truncated power functions (TPF).
This article focuses on B-splines. B-splines not only require placing "internal" knots 
which are within the range of the data for X, but also use "external" or "boundary" knots. 
Consequently, this article also discusses how the placement of external knots 
affects the spline basis.
A typical visualization is shown to the right for a B-spline basis. The vertical reference lines
indicate the locations of knots.
</p><p>
For previous articles about splines, see 
</p>
<ul>
<li>
<a href="https://blogs.sas.com/content/iml/2017/04/19/restricted-cubic-splines-sas.html">"Regression with restricted cubic splines in SAS"</a>: How to use the EFFECT statement in SAS to define and use splines in regression models. This article discusses the KNOTMETHOD= option, which enables you to specify the placement of the knots.
</li>
<li>
<a href="https://blogs.sas.com/content/iml/2019/10/16/visualize-regression-splines.html">"Visualize a regression with splines"</a>: How
to visualize the spline basis and interpret the parameters in a regression model that includes splines.
</li>
</ul>


<h3>Why use splines for regression?</h3>
<p>
Why are splines useful in regression modelling? 
Briefly, splines enable you to use <em>linear</em> regression models to fit data that have local or even <em>nonlinear</em> effects.
In an elementary linear model, you model the response variable (Y) as a linear combination of 
one or more independent variables. In the simplest case of one independent variable, the model is Y = b0 + b1*X1 + error.
This model is only reasonable if Y and X are linearly related across the entire range of X, which is not always true.
<a href="https://blogs.sas.com/content/iml/2017/04/05/nonsmooth-models-spline-effects.html">Splines enable you to fit 
a piecewise-linear model where the model changes on various subintervals of the X range.</a> The knots define the subintervals.
You can also use knots and splines to fit a piecewise-polynomial model. 
This is shown in subsequent sections.
It is possible because the basis functions for B-splines (and other polynomial-based splines) are formed by piecewise polynomials of degree d that are joined together so that they are continuously differentiable up to order d-1. 
</p>

<h3>Sample data</h3>
<p>
I'll demonstrate the knot placement and the spline visualization by using the Horsepower variable in the SasHelp.Cars data set, which has 428 observations. In this article, the phrase "the X variable" refers to the Horsepower variable. 
So that you can easily run the examples on your own data, I have put the name of the data set and the name of the variable into macro variables (DSName and XVar, respectively). By modifying those two macro variables, the code in this article should work for any data set and any numerical variable.
</p><p>
The following SAS statements define the data and the macro variables:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* example data */</span>
<span style="color: #000080; font-weight: bold;">data</span> Have;
<span style="color: #0000ff;">set</span> sashelp.cars;
<span style="color: #0000ff;">keep</span> Horsepower;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* define these macro variables to point to your data set and numerical variable */</span>
<span style="color: #0000ff;">%let</span> <span style="color: #0000ff;">DSName</span> = Have;
<span style="color: #0000ff;">%let</span> XVar = Horsepower;
&nbsp;
<span style="color: #006400; font-style: italic;">/* you should not have to modify any code below this line */</span>
<span style="color: #000080; font-weight: bold;">data</span> SplDS / <span style="color: #0000ff;">view</span>=SplDS;
   <span style="color: #0000ff;">set</span> &amp;<span style="color: #0000ff;">DSName</span>;
   _Fake = <span style="color: #2e8b57; font-weight: bold;">0</span>;    <span style="color: #006400; font-style: italic;">/* add fake variable for the MODEL statement in PROC GLIMMIX */</span>
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<p>
The code creates <a href="https://blogs.sas.com/content/iml/2016/05/09/data-step-view.html">a DATA step view</a> that adds a dummy variable  (_Fake) to the data set.
You can use PROC MEANS to obtain the five-number summary of the data:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc means</span> <span style="color: #000080; font-weight: bold;">data</span>=SplDS <span style="color: #0000ff;">min</span> Q1 median Q3 <span style="color: #0000ff;">max</span> ndec=<span style="color: #2e8b57; font-weight: bold;">0</span>;
   <span style="color: #0000ff;">var</span> <span style="color: #0000ff; font-weight: bold;">&amp;XVar</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines01.png" alt="" width="423" height="92" class="alignnone size-full wp-image-58416" srcset="https://blogs.sas.com/content/iml/files/2026/04/Bsplines01.png 423w, https://blogs.sas.com/content/iml/files/2026/04/Bsplines01-300x65.png 300w" sizes="(max-width: 423px) 100vw, 423px" />

<p>
The range of the data is [73, 500]. 
</p>

<h3>Overview of knot placement methods</h3>
<p>
In SAS regression procedures, the location of knots is determined by <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/statug_introcom_sect020.htm">the KNOTMETHOD= option in the EFFECT statement</a>. In this article, the examples always use the BASIS=BSPLINE and DEGREE=2 options. In practice, many people use DEGREE=3, which results in more basis functions.
</p>
<p>
The knot placement options are as follows:
</p>
<ul>
<li><strong>PERCENTILES(n) and PERCENTILELIST(percentile-list)</strong>: The knot locations are placed at percentiles of the X variable. 
For a B-spline basis, the external knots are placed at min(X) and max(X).
</li>
<li><strong>EQUAL(n)</strong>: The knot locations are placed at n evenly spaced points inside the interval [min(X), max(X)]. 
For a B-spline basis, the external knots are placed at evenly spaced points less than or equal to min(X) and greater or equal to max(X).
The distance between all knots (exterior and interior) is the same.
</li>
<li>
<strong>RANGEFRACTIONS(fraction-list)</strong>: If the fraction list is (f1, f2, f3,...), the knot locations are placed at min(X) + (max(X)-min(X))*f_i,  where 0 &lt; f_i &le; 1.
For a B-spline basis, the external knots are placed at min(X) and max(X).
</li>
<li>
<strong>LIST(number-list) and LISTWITHBOUNDARY(number-list)</strong>: The knot locations are placed at specified locations in data coordinates.
For a B-spline basis, the LIST option places external knots at min(X) and max(X).
The LISTWITHBOUNDARY option enables you to specify the location of the internal and external knots.
</li>
</ul>

<h3>The PERCENTILES and PERCENTILELIST methods</h3>
<p>
The knot locations are placed at percentiles of the X variable. 
For a B-spline basis, the external knots are placed at min(X) and max(X).
The following call to PROC GLIMMIX includes an EFFECT statement. The KNOTMETHOD= option specifies the PERCENTILES(3) option,
which places knots at the 25th, 50th, and 75th percentile of the X variable.
<a href="https://blogs.sas.com/content/iml/2016/02/24/create-a-design-matrix-in-sas.html">The OUTDESIGN= option in the PROC GLIMMIX statement</a> 
creates a SAS data set that contains the spline bases evaluated at the values of the X variable. For a DEGREE=2 basis, the 
columns of the data set are named Spl1--Spl6.  The NOINT option in the MODEL statement prevents the 
OUTDESIGN= data set from containing a constant column of 1s.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* PERCENTILES(n) specifies the locations for internal knots as n even percentiles of the data. 
   For a B-spline basis, the extremes of the data are used as boundary knots.
*/</span>
<span style="color: #000080; font-weight: bold;">proc glimmix</span> <span style="color: #000080; font-weight: bold;">data</span>=SplDS NOFIT
             outdesign<span style="color: #66cc66;">&#40;</span>names <span style="color: #0000ff;">X</span>=Spl<span style="color: #66cc66;">&#41;</span>=Pctl_BSpl;
   effect Spl=spline<span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff; font-weight: bold;">&amp;XVar</span> / basis=BSpline details degree=<span style="color: #2e8b57; font-weight: bold;">2</span>
                      knotmethod=percentiles<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* &lt;== KNOTMETHOD */</span>
   model _Fake = Spl / NOINT;
   ods <span style="color: #0000ff;">select</span> SplineKnots BSplineDetails;
   ods <span style="color: #0000ff;">output</span> SplineKnots=KnotPositions;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines02.png" alt="" width="316" height="251" class="alignnone size-full wp-image-58443" srcset="https://blogs.sas.com/content/iml/files/2026/04/Bsplines02.png 316w, https://blogs.sas.com/content/iml/files/2026/04/Bsplines02-300x238.png 300w" sizes="(max-width: 316px) 100vw, 316px" />
<br />
<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines03.png" alt="" width="315" height="224" class="alignnone size-full wp-image-58440" srcset="https://blogs.sas.com/content/iml/files/2026/04/Bsplines03.png 315w, https://blogs.sas.com/content/iml/files/2026/04/Bsplines03-300x213.png 300w, https://blogs.sas.com/content/iml/files/2026/04/Bsplines03-269x192.png 269w" sizes="(max-width: 315px) 100vw, 315px" />

<p>
The ODS SELECT statement selects two output tables, which are shown above. The SplineKnots table provides information about the placement of knots.
For the PERCENTILES(3) method, there are d knots placed at the minimum value of the X variable, 3 interior knots placed at the 25th, 50th, and 75th percentiles of the X variable, and d knots placed at the maximum value of the X variable. The exterior or boundary knots are indicated by an asterisk. Notice that it is mathematically acceptable to stack two or more knots at the same position.
</p>
<p>
The BSplineDetails was created by using the DETAILS option in the EFFECT statement. It tells you that the basis consists of 6 spline functions.
The first basis function has support (that is, nonzero values) on the interval defined by the first three knots, which is [min(X), 25_th_Pctl].
The second basis function has support  on the interval defined by the knots 1-4, which is [min(X), 50_th_Pctl].
This information continues until the sixth basis function, which has support on the interval defined by the knots 5-7, which is [75_th_Pctl, max(X)].
</p>
<p>
You can visualize the spline basis on the range of the X variable. 
If X has widely spaced values, the splines might look "chunky" when evaluated at the values of X.
However, the basis functions are piecewise polynomials in the intervals between the knots, 
so they are continuous and smooth. For a spline basis of degree d, 
the basis functions are continuously differentiable d-1 times at the knot locations. 
The following statements create the visualization by using the %SplineViz macro, which is defined in the Appendix. 
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Three Interior Knots: percentile(3)&quot;</span>;
title2 <span style="color: #a020f0;">&quot;Exterior Knots on Boundary&quot;</span>;
%SplineViz<span style="color: #66cc66;">&#40;</span> Pctl_Knots <span style="color: #66cc66;">&#41;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/04/Bsplines04.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines04.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58437" srcset="https://blogs.sas.com/content/iml/files/2026/04/Bsplines04.png 640w, https://blogs.sas.com/content/iml/files/2026/04/Bsplines04-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
The six spline basis functions are shown. The vertical lines indicate the placement of knots.
The knots define subintervals on which the splines are defined. The splines look smooth on the left because X has many unique values there.
The splines look "chunky" on the right because there are only a few unique values of X that are greater than 350.
</p><p>
The shape of the interior splines might remind you of probability distributions. Some are roughly "bell-shaped" whereas others are skewed. 
Because the exterior knots are clamped at the extremes of the data, the first and last spline functions look like truncated polynomials that 
have the values 0 and 1 at the endpoint of the subinterval on which they are defined.
</p>

<h3>The EQUAL method</h3>
<p>
The KNOTMETHOD=EQUAL(<em>n</em>) option is similar to the PERCENTILES(<em>n</em>) option, 
except the knot locations for EQUAL(<em>n</em>) are placed at <em>n</em> evenly spaced points inside the interval [min(X), max(X)]. 
For a B-spline basis, the external knots are also evenly spaced points. The leftmost knots are less than or equal to min(X); the rightmost knots are greater than or equal to max(X).
The distance between all knots (exterior and interior) is the same.
</p>
<p>
For example, the following call to PROC GLIMMIX uses the EQUAL(3) option. Otherwise, the syntax to PROC GLIMMIX is unchanged.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* EQUAL(n) specifies that n equally spaced knots be positioned between the 
   extremes of the data. For a B-spline basis, any needed boundary knots 
  continue to be equally spaced. You can use the DATABOUNDARY option to override this behavior.
*/</span>
<span style="color: #000080; font-weight: bold;">proc glimmix</span> <span style="color: #000080; font-weight: bold;">data</span>=SplDS NOFIT
             outdesign<span style="color: #66cc66;">&#40;</span>names <span style="color: #0000ff;">X</span>=Spl<span style="color: #66cc66;">&#41;</span>=Equal_BSpl;
   effect Spl=spline<span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff; font-weight: bold;">&amp;XVar</span> / details basis=BSpline degree=<span style="color: #2e8b57; font-weight: bold;">2</span>
                              knotmethod=EQUAL<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* &lt;== KNOTMETHOD */</span>
   model _Fake = Spl / NOINT; 
   ods <span style="color: #0000ff;">select</span> SplineKnots;
   ods <span style="color: #0000ff;">output</span> SplineKnots=KnotPositions;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines05.png" alt="" width="266" height="254" class="alignnone size-full wp-image-58446" />

<p>
The output shows that the knots are equally spaced. The distance between adjacent knots is &Delta;<em>x</em> = 106.75. The exterior knots on the left are at min(X)-&Delta;<em>x</em> and min(X). The interior knots are at min(X) + i*&Delta;<em>x</em> for i=1,2,3. The exterior knots on the right are placed at max(X) and max(X)+&Delta;<em>x</em>.
The following statements visualize the spline basis for this set of knots:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Three Interior Knots: equal(3)&quot;</span>;
title2 <span style="color: #a020f0;">&quot;Exterior Knots Are Equally Spaced&quot;</span>;
%SplineViz<span style="color: #66cc66;">&#40;</span> Equal_BSpl <span style="color: #66cc66;">&#41;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/04/Bsplines06.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines06.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58449" srcset="https://blogs.sas.com/content/iml/files/2026/04/Bsplines06.png 640w, https://blogs.sas.com/content/iml/files/2026/04/Bsplines06-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
These basis functions are essentially identical, ignoring the "chunky" artifacts on the right. The functions differ by a shift. 
</p>



<h3>The RANGEFRACTIONS method</h3>
<p>
The RANGEFRACTIONS method is similar to EQUAL method but enables you to set knots at unequal positions within the range of the X variable. 
If the fraction list is (f1, f2, f3,...), the knot locations are placed at min(X) + (max(X)-min(X))*f_i,  where 0 &lt; f_i &le; 1.
So, you can recreate the EQUAL(3) spacing by using RANGEFRACTIONS(0.25 0.5 0.75). If you want unequal spacing, you can use a syntax such as 
RANGEFRACTIONS(0.2 0.5 0.8) to position the knots at 20%, 50%, and 80% of the range.
</p><p>
For a B-spline basis, the RANGEFRACTIONS method places the external knots at min(X) and max(X). Let's place three interior knots at evenly spaced positions within the range but clamp the exterior knots at the extremes of the data. This will enable us to compare spline bases that differ only by the exterior knots:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* RANGEFRACTIONS(fraction-list) specifies that internal knots be placed at each fraction of the range of X.
   For a B-spline basis, the data extremes are used as boundary knots.
*/</span>
<span style="color: #000080; font-weight: bold;">proc glimmix</span> <span style="color: #000080; font-weight: bold;">data</span>=SplDS nofit
             outdesign<span style="color: #66cc66;">&#40;</span>names <span style="color: #0000ff;">X</span>=Spl<span style="color: #66cc66;">&#41;</span>=Range_BSpl;
   effect Spl =spline<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff; font-weight: bold;">&amp;XVar</span> / basis=BSpline details degree=<span style="color: #2e8b57; font-weight: bold;">2</span>
                                  knotmethod=rangefractions<span style="color: #66cc66;">&#40;</span>.25 .5 .75<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
   model _Fake = Spl / NOINT; 
   ods <span style="color: #0000ff;">select</span> SplineKnots;
   ods <span style="color: #0000ff;">output</span> SplineKnots=KnotPositions;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines07.png" alt="" width="269" height="253" class="alignnone size-full wp-image-58452" />

<p>
The output confirms that the interior knots are placed the same as for the EQUAL(3) option, but the exterior knots are clamped at the extreme values of the data. The following statements visualize the spline basis:


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Three Interior Knots: rangefractions(.25 .5 .75)&quot;</span>;
title2 <span style="color: #a020f0;">&quot;Exterior Knots Are Clamped at Min/Max&quot;</span>;
%SplineViz<span style="color: #66cc66;">&#40;</span> Range_BSpl <span style="color: #66cc66;">&#41;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/04/Bsplines08.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Bsplines08.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58455" srcset="https://blogs.sas.com/content/iml/files/2026/04/Bsplines08.png 640w, https://blogs.sas.com/content/iml/files/2026/04/Bsplines08-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

</p><p>
This spline basis differs from the EQUAL(3) basis, which was highly symmetric.
Placing the exterior knots at the boundary of the data means that the range of the first and last spline is [0,1]. The interior splines are not the same size and shape. (Or, perhaps I should quip that they are KNOT the same shape!)
</p>


<h3>The LIST and LISTWITHBOUNDARY methods</h3>
<p>
The LIST and LISTWITHBOUNDARY methods enable you to place the knots by using the data scale. I will not show an example. You merely list the data values at which you want the knots.
</p>


<h3>Summary</h3>
<p>
This article shows how to position knots for regression splines by using the KNOTMETHOD= option in the EFFECT statement, which is supported by many SAS regression procedures. You can use the OUTDESIGN= option in the PROC GLIMMIX statement to output the splines (evaluated at the data locations) to a data set. The output from PROC GLIMMIX includes the location of the knots. This article showed several common ways to place knots and, for each method, visualized the resulting set of spline basis functions. 
</p>


<h3>Appendix: The SplineViz macro</h3>
<p>
The following macro visualizes the spline bases that are generated by each call to PROC GLIMMIX. 
The name of the design matrix is passed in; inside the macro it is referred to as &amp;DS.
The columns for the spline basis are named Spl1, Spl2, Spl3, ....
The positions of the knots are read from the KnotPositions data set, which was created from the 
ODS OUTPUT statement, which writes the SplineKnots table to a data set.
The macro creates a series plot that shows the spline basis functions evaluated on the values of the data variable, X.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* macro to visualize the spline bases that are generated by each call to PROC GLIMMIX. 
   The original data should be sorted by the X variable.
   The name of the design matrix is passed in as &amp;DS.
   The columns for the spline basis are named Spl1, Spl2, Spl3, ...
*/</span>
<span style="color: #0000ff;">%macro</span> SplineViz<span style="color: #66cc66;">&#40;</span> DS <span style="color: #66cc66;">&#41;</span>;
<span style="color: #006400; font-style: italic;">/* 1. Transpose design matrix from wide to long. The new variables are 
      SplNum : identifies the i_th basis function, i=1, 2, 3, ...
      Value  : the values of the X variable at which the i_th spline is evaluated
      Y      : the value of the i_th spline function evaluated at X
*/</span>
<span style="color: #000080; font-weight: bold;">data</span> BSplineLong;
   <span style="color: #0000ff;">set</span> <span style="color: #0000ff; font-weight: bold;">&amp;DS</span>;
   <span style="color: #0000ff;">array</span> basis Spl:;
   value = <span style="color: #0000ff; font-weight: bold;">&amp;XVar</span>;
   <span style="color: #0000ff;">do</span> SplNum = <span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #0000ff;">dim</span><span style="color: #66cc66;">&#40;</span>basis<span style="color: #66cc66;">&#41;</span>;
      Y = basis<span style="color: #66cc66;">&#91;</span>SplNum<span style="color: #66cc66;">&#93;</span>;
      <span style="color: #0000ff;">output</span>;
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">keep</span> SplNum Value Y;
   <span style="color: #0000ff;">label</span> SplNum=<span style="color: #a020f0;">&quot;Spline Number&quot;</span> Value=<span style="color: #a020f0;">&quot;&amp;XVar&quot;</span> Y=<span style="color: #a020f0;">&quot;Spline Value&quot;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc sort</span> <span style="color: #000080; font-weight: bold;">data</span>=BSplineLong;
   <span style="color: #0000ff;">by</span> SplNum Value;
<span style="color: #000080; font-weight: bold;">run</span>;
<span style="color: #006400; font-style: italic;">/* add a label variable. You will need to change the format of these values
   if the scale of your data is larger or smaller than the example data.
*/</span>
<span style="color: #000080; font-weight: bold;">data</span> SplineViz;
   <span style="color: #0000ff;">set</span> BSplineLong KnotPositions<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">keep</span>=<span style="color: #0000ff; font-weight: bold;">&amp;XVar</span> <span style="color: #0000ff;">rename</span>=<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff; font-weight: bold;">&amp;XVar</span>=KnotPos<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">length</span> KnotPosLabel $3;
   KnotPosLabel = <span style="color: #0000ff;">putn</span><span style="color: #66cc66;">&#40;</span>KnotPos, <span style="color: #a020f0;">&quot;3.0&quot;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=SplineViz;
   series <span style="color: #0000ff;">x</span>=Value y=Y / <span style="color: #0000ff;">group</span>=SplNum nomissinggroup;
   fringe Value;
   refline KnotPos / axis=<span style="color: #0000ff;">x</span> labelloc=inside <span style="color: #0000ff;">label</span>=KnotPosLabel labelattrs=<span style="color: #66cc66;">&#40;</span>size=<span style="color: #2e8b57; font-weight: bold;">8</span><span style="color: #66cc66;">&#41;</span>;
   yaxis <span style="color: #0000ff;">max</span>=<span style="color: #2e8b57; font-weight: bold;">1</span>;     <span style="color: #006400; font-style: italic;">* for Viz, put all plots on same vertical scale;</span>
<span style="color: #000080; font-weight: bold;">run</span>;
<span style="color: #0000ff;">%mend</span>;</pre></td></tr></table></div>


<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/22/viz-knots-splines.html">Visualize the placement of knots for regression splines</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/04/22/viz-knots-splines.html/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/04/Bsplines04-150x150.png" />
	</item>
		<item>
		<title>The sample median is a biased estimator for skewed distributions</title>
		<link>https://blogs.sas.com/content/iml/2026/04/20/sample-median-bias.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/04/20/sample-median-bias.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 20 Apr 2026 09:21:21 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[SAS Programming]]></category>
		<category><![CDATA[Simulation]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58330</guid>

					<description><![CDATA[<p>Did you know that, for skewed distributions, the sample median is a biased estimator of the population median? You can show this in two ways. For any distribution, you can run a Monte Carlo simulation that reveals the bias. And, for some distributions (such as the exponential distribution), you can [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/20/sample-median-bias.html">The sample median is a biased estimator for skewed distributions</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
Did you know that, 
for skewed distributions, the sample median is a biased estimator of the population median?
You can show this in two ways. For any distribution, you can run a Monte Carlo simulation that 
reveals the bias. And, for some distributions (such as the exponential distribution), you can explicitly write a formula
for the expected value of the sample median. The formula proves that the sample median is a biased estimate, and 
it reveals the size of the bias.
</p>

<h3>Is bias a big deal?</h3>
<p>
For data analysis tasks, I don't often worry about biased statistics. 
<a href="https://en.wikipedia.org/wiki/Bias_of_an_estimator">Bias is defined as the difference</a> between the expected value of an estimator and the population parameter.
For most statistics, 
the standard error of the statistic is much larger than its bias, thus 
bias isn't very important in practice. 
Furthermore, many statistics are asymptotically unbiased, so the bias is small for large samples. 
</p><p>
However, knowing whether a statistic is biased is important for Monte Carlo simulation studies. 
In a simulation study, you know the exact value of a population parameter. 
When I develop a simulation, I often verify its correctness by
running many simulations and checking that the Monte Carlo estimate converges to the population parameter.
However, <em>a biased statistic does not converge to the population parameter</em>! The Monte Carlo estimate converges to the 
<em>expected value</em> of the statistic. For a biased estimator, the expected value is different from the population parameter,
by definition.
As the number of Monte Carlo iterations (B) increases, the standard error of the simulation shrinks toward zero, but the bias remains constant.
</p>

<h3>The median of an exponential distribution</h3>
<p>
I recently wrote <a href="https://blogs.sas.com/content/iml/2026/04/13/monitor-convergence-monte-carlo.html">an article about the convergence of Monte Carlo estimates</a>.
The example in that article studied the sample mean for random samples drawn from an exponential distribution
that has a scale parameter (&lambda;) with the value 10.
You can look up in Wikipedia or a textbook that the median of the exponential distribution is 
log(2)*&lambda;. So, when &lambda;=10, the median of the Expo(10) distribution is <strong>6.9315</strong>.
</p><p>
In the previous article, the sample size was N=100. 
This article uses N=99 instead. Since the sample median is the "middle value" in the sample,
using an odd sample size makes the theoretical formulas a little easier to understand. 
</p>

<h3>A Monte Carlo simulation for the median</h3>
<p>
To demonstrate that the sample median is a biased statistic, let's run the following Monte Carlo simulation:
</p>
<ul>
<li>Choose B=10,000 as the number of simulations.</li>
<li>Draw B random samples of size N=99 from an Expo(&lambda;=10) distribution.</li>
<li>For each random sample, compute the sample median.</li>
<li>Average the sample statistics to obtain the Monte Carlo estimate of the population median.</li>
<li>Compare the Monte Carlo estimate to the known value of the distribution's median. The difference is the bias of the sample median in samples of size 99.</li>
</ul>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Monte Carlo simulation for median of EXPO(10) distribution. Estimate bias in the 
   sample median for samples of size N=99.
   The true median is log(2)*lambda ~ 6.93147. */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
lambda = <span style="color: #2e8b57; font-weight: bold;">10</span>;                   <span style="color: #006400; font-style: italic;">/* generate random sample of size N from Expo(lambda) */</span>
expo_median = <span style="color: #0000ff;">log</span><span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span><span style="color: #006400; font-style: italic;">*lambda;</span>
&nbsp;
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">12345</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">99</span>;                        <span style="color: #006400; font-style: italic;">/* sample size is odd ==&gt; median is middle order statistic */</span>
B = <span style="color: #2e8b57; font-weight: bold;">10000</span>;                     <span style="color: #006400; font-style: italic;">/* number of Monte Carlo samples */</span>
&nbsp;
sample_median = j<span style="color: #66cc66;">&#40;</span>B,<span style="color: #2e8b57; font-weight: bold;">1</span>,.<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">x</span> = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to B;
   <span style="color: #0000ff;">call</span> randgen<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, <span style="color: #a020f0;">&quot;Expo&quot;</span>, lambda<span style="color: #66cc66;">&#41;</span>;
   sample_median<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span> = median<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">end</span>;
&nbsp;
MC_est = <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>sample_median<span style="color: #66cc66;">&#41;</span>;
MC_SE_est = <span style="color: #0000ff;">std</span><span style="color: #66cc66;">&#40;</span>sample_median<span style="color: #66cc66;">&#41;</span> / <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>B<span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* std error for Monte Carlo estimate */</span>
width = <span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #006400; font-style: italic;">*MC_SE_est;</span>                      <span style="color: #006400; font-style: italic;">/* 3*SE ==&gt; 99.7% CI */</span>
Lower_CL = MC_est - width;
Upper_CL = MC_est + width;
print MC_est expo_median <span style="color: #66cc66;">&#40;</span>MC_est-expo_median<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Est Bias&quot;</span><span style="color: #66cc66;">&#93;</span>;
print MC_est MC_SE_est width Lower_CL Upper_CL;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/medianBias1.png" alt="" width="380" height="135" class="alignnone size-full wp-image-58339" srcset="https://blogs.sas.com/content/iml/files/2026/04/medianBias1.png 380w, https://blogs.sas.com/content/iml/files/2026/04/medianBias1-300x107.png 300w" sizes="(max-width: 380px) 100vw, 380px" />

<p>The simulation shows two facts. 
First, the Monte Carlo estimate is larger than the median of the distribution. The estimated bias is approximately 0.0457.
Is that bias large? Well, 
"large" or "small" must be compared to some standard unit, and for a statistic the "unit" is the standard error of the statistic.
The asymptotic standard error of the median statistic is approximately &lambda;/sqrt(N) ~ 1.005.
So, the bias is less than 5% of the standard error. In other words, if you are analyzing one single data set of size N=99, the 
standard error of the median is 20 times larger than the bias. This is why we typically can ignore bias. 
</p><p>
The second fact is the accuracy of the Monte Carlo estimate. After 10,000 simulations, a 
99.7% confidence interval for the estimate has a half-width of 0.03. We are fairly confident that the expected value of the 
sample median is within 0.03 of the computed Monte Carlo estimate. 
The next section shows that the expected value for the sample median is 6.982, which is indeed inside the confidence interval for this set of random trials.
</p>

<h3>Order statistics for the exponential distribution</h3>
<p>
For the exponential distribution, you can use order statistics to explicitly compute the expected value of the sample median.
To simplify matters, let's assume that N=2*k - 1 is an odd number. 
To simplify notation, let's sort the data and denote the values by 
<br />
X<sub>1</sub> &le; X<sub>2</sub> &le; ...&le; X<sub>k</sub> &le; ...&le; X<sub>N</sub>.
</p><p>
The median is the "k_th order statistic," which is simply the k_th element in the sorted list of values.
As shown in <a href="https://en.wikipedia.org/wiki/Order_statistic#Order_statistics_sampled_from_an_exponential_distribution">a Wikipedia article on order statistics</a>, 
the expected value of the <em>k</em>-th smallest observation is known to be:
<br />
E[X<sub>k</sub>] = &lambda; &sum;<sub>i=1</sub><sup>k</sup>  1 / (N &ndash; i + 1) 
</p>

<p>
For N=99, k=50. The following SAS IML function evaluates the expected value of the sample median for an Expo(10) sample:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* For a sample of size N drawn from an Expo distribution,
   the expected value of the k_th smallest obs (the k_th order statistic)
   is given by an explicit formula. 
*/</span>
start EExpOrderStat<span style="color: #66cc66;">&#40;</span>k, <span style="color: #0000ff;">N</span>, lambda=<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   i = T<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:k<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> lambda <span style="color: #006400; font-style: italic;">* sum( 1 / (N - i + 1) ) );</span>
finish;
&nbsp;
<span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">99</span>;
lambda = <span style="color: #2e8b57; font-weight: bold;">10</span>;
<span style="color: #006400; font-style: italic;">/* expected value of the median for an ODD sample ~ Expo(lambda) */</span>
k = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>+<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>/<span style="color: #2e8b57; font-weight: bold;">2</span>;
Emedian_N99 = EExpOrderStat<span style="color: #66cc66;">&#40;</span>k, <span style="color: #0000ff;">N</span>, lambda<span style="color: #66cc66;">&#41;</span>;
print Emedian_N99;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/medianBias2.png" alt="" width="110" height="66" class="alignnone size-full wp-image-58345" />

<p>The expected value is <strong>6.98172</strong>. Notice that this value is inside the 99.7% confidence interval from our Monte Carlo simulation.
</p>

<h3>Convergence of a biased statistic</h3>
<p>
What does the convergence look like if you graph it? 
You will see an estimate that converges, but not to the parameter value. Instead, it converges to the expected value of the statistic.
The following graph shows the Monte Carlo estimate of the sample median after k simulations, k=1..B.
You can see that the estimate converges, but there is a big gap (the bias) between the estimate and the population parameter.
</p>
<a href="https://blogs.sas.com/content/iml/files/2026/04/medianBias3.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/medianBias3.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58354" srcset="https://blogs.sas.com/content/iml/files/2026/04/medianBias3.png 640w, https://blogs.sas.com/content/iml/files/2026/04/medianBias3-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>


<h3>Conclusion</h3>

<p>
For symmetric distributions,
the sample median is an unbiased estimator of the population median. 
Furthermore, the estimator is <em>asymptotically</em> unbiased, which means the bias is negligible for very large samples.
However, for right-skewed distributions like the exponential distribution, the sample median is biased upwards. 
This bias is important for Monte Carlo simulations because the Monte Carlo estimate converges to the expected value 
of a statistic. When the statistic is biased, the Monte Carlo estimate converges to a value that is different from 
the population parameter.  This is most apparent when you are simulating small samples.
</p>


<h3>Appendix: The standard error of the median for an exponential distribution</h3>
<p>
In the text, I claimed that the asymptotic standard error of the sample median of Expo(&lambda;) data is 
approximately &lambda;/sqrt(N). But wait, isn't that the well-known formula for the standard error of the <em>mean</em>? We want the standard error of the <em>median</em>!
</p>
<p>
The formula is correct. The formula relies on the fact that the sampling distribution of the median is asymptotically normal for continuous distributions.
This asymptotic normality is true for all percentiles. It is the basis for the CIPCTLNORMAL option in PROC UNIVARIATE,
which provides confidence intervals, assuming the sampling distribution of the percentiles are normal.
If you like math, you can <a href="https://www.math.mcgill.ca/dstephens/OldCourses/556-2006/Math556-Median.pdf">read the proof in these lecture notes from McGill University</a>.
</p>
<p>
Asymptotically, the distribution of the sample median is normally distributed with standard error
<br />
SE<sub>med</sub> = 1 / (2*sqrt(N)*f(M))
<br />
where f is the PDF of the continuous distribution and M is the true population median. 
(Obvious, we need f(M) &ne; 0.)
For the exponential distribution,
f(x) = (1/&lambda;) exp(-x/&lambda;) and M = &lambda;*log(2). 
Thus, a several terms cancel and 
<br />
f(M) = 1/(2&lambda;)
</p>
<p>
Consequently, the asymptotic standard error is SE<sub>med</sub> = &lambda;/sqrt(N), which is the same formula as the standard error of the mean.
</p>


<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/20/sample-median-bias.html">The sample median is a biased estimator for skewed distributions</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/04/20/sample-median-bias.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/04/medianBias3-150x150.png" />
	</item>
		<item>
		<title>On monitoring the convergence of Monte Carlo simulations</title>
		<link>https://blogs.sas.com/content/iml/2026/04/13/monitor-convergence-monte-carlo.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/04/13/monitor-convergence-monte-carlo.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 13 Apr 2026 09:27:00 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Simulation]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58261</guid>

					<description><![CDATA[<p>A previous article discusses Welford's one-pass method for computing the sample mean and variance. The article mentions that a useful application of Welford's formulas is to monitor the convergence of a Monte Carlo simulation. This article shows how to implement that strategy in SAS. The Goldilocks Principle for Monte Carlo [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/13/monitor-convergence-monte-carlo.html">On monitoring the convergence of Monte Carlo simulations</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A previous article discusses <a href="https://blogs.sas.com/content/iml/2026/04/06/welford-mean-var.html">Welford's one-pass method for computing the sample mean and variance</a>. 
The article mentions that a useful application of Welford's formulas is to monitor the 
convergence of a Monte Carlo simulation.
This article shows how to implement that strategy in SAS.
</p>

<h3>The Goldilocks Principle for Monte Carlo methods</h3>
<p>

<a href="https://en.wikipedia.org/wiki/Goldilocks_principle">The Goldilocks Principle</a>
is a name given to efforts to achieve a balance between extremes.
For example, suppose my goal is to ensure that my house is clean. 
There are two extremes. 
If I spend every hour of every weekend cleaning, dusting, and vacuuming, 
my house will look pristine. However, I expend a lot of time and effort. 
In contrast, if I spend no time cleaning, the house will be dirty.
Somewhere between those two extremes is the Goldilocks strategy: I exert enough effort to make the house look nice, but the effort
doesn't consume all my resources.  The optimal amount of effort depends on how clean I want the house to be.
</p><p>
The same trade-offs apply in many areas of computational statistics.
In Monte Carlo simulations, you can get better estimates if you run more simulations, but more simulations require more time and computing resources.
A traditional Monte Carlo simulation looks like this:
</p>

<ol>
   <li>Choose the number of simulations, <em>B</em>, that you plan to run. This is essentially a GUESS!</li>
   <li>Generate a random sample of size <em>N</em> and compute statistics for that sample.</li>
   <li>Repeat this <em>B</em> times. Store all sample statistics in an array.</li>
   <li>Compute the Monte Carlo mean and report a measure of accuracy, such as the standard error or a confidence interval.</li>
</ol>

<p>
The problem is we do not know <em>a priori</em> how large to choose <em>B</em>.
If <em>B</em> is too small, the simulation study is underpowered, and the 
final Monte Carlo estimate is not very accurate. Consequently, you might need to repeat the simulation 
with a larger value of B to obtain a more accurate estimate. 
On the other hand, if <em>B</em> is too large, the Monte Carlo estimate is very accurate, but you wasted time 
and CPU costs needlessly.
</p>

<p>
By using Welford's update formula inside a simulation loop, you don't have to guess a value for B, number of Monte Carlo simulations.
Instead, you treat the statistic from each random sample as a piece of streaming data. You keep track of the mean and variance of the Monte Carlo estimate.
When a confidence interval for the estimate is sufficiently small, you stop the simulation loop.
This is sometimes called an <em>adaptive</em> Monte Carlo simulation. 
</p>

<h3>Example of the traditional Monte Carlo simulation in SAS</h3>
<p>
Before looking at the adaptive Monte Carlo method, let's run a traditional simulation study.
Suppose you have data (N=100) that you believe to be exponentially distributed. You compute the mean of the sample
to estimate the population mean. A classic problem in statistics is to infer
how close the estimate is to the true population mean.
To do that, you need to understand the sampling distribution of the mean in a random exponential sample of size N=100.
There are several ways to do that, but I want to use a Monte Carlo simulation.
</p><p>
The following Monte Carlo simulation uses PROC IML to generate B=10,000 random samples from 
the Expo(&lambda;=10) distribution, where &lambda; is a scale parameter. 
The program computes the sample mean for each random sample. 
It then computes the Monte Carlo estimate of the statistic by taking the mean of the 10,000
samples statistics. 
Lastly, it estimates the standard error of the Monte Carlo estimate and uses that to 
form a 99.7% confidence interval for the estimate.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* traditional Monte Carlo simulation: Compute the Monte Carlo estimate of the 
   mean of a random sample of size N. Do this a large number of times (B) to ensure
   that the estimate is accurate. 
&nbsp;
   For this example, we sample from the Exp(lambda) distribution where lambda=10 
   is a scale parameter. The true mean is lambda. 
*/</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">100</span>;                       <span style="color: #006400; font-style: italic;">/* sample size */</span>
B = <span style="color: #2e8b57; font-weight: bold;">10000</span>;                     <span style="color: #006400; font-style: italic;">/* number of Monte Carlo samples */</span>
lambda = <span style="color: #2e8b57; font-weight: bold;">10</span>;                   <span style="color: #006400; font-style: italic;">/* scale parameter for Expo distrib */</span>
&nbsp;
sample_mean = j<span style="color: #66cc66;">&#40;</span>B,<span style="color: #2e8b57; font-weight: bold;">1</span>,.<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">x</span> = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to B;
   <span style="color: #0000ff;">call</span> randgen<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, <span style="color: #a020f0;">&quot;Expo&quot;</span>, lambda<span style="color: #66cc66;">&#41;</span>;
   sample_mean<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span> = <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">end</span>;
&nbsp;
MC_est = <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>sample_mean<span style="color: #66cc66;">&#41;</span>;           <span style="color: #006400; font-style: italic;">/* MC estimate of statistic */</span>
SE_est = <span style="color: #0000ff;">std</span><span style="color: #66cc66;">&#40;</span>sample_mean<span style="color: #66cc66;">&#41;</span> / <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>B<span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* MC standard error */</span>
width = <span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #006400; font-style: italic;">*SE_est;</span>                     <span style="color: #006400; font-style: italic;">/* half-width of CI: 3*SE ==&gt; 99.7% CI */</span>
Lower_CL = MC_est - width;
Upper_CL = MC_est + width;
print MC_est SE_est width Lower_CL Upper_CL;
&nbsp;
<span style="color: #006400; font-style: italic;">/* we saved all the Monte Carlo estimates, so plot the sampling distribution */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;MC Distribution&quot;</span>;
title2 <span style="color: #a020f0;">&quot;N=100;  B=10,000&quot;</span>;
<span style="color: #0000ff;">call</span> histogram<span style="color: #66cc66;">&#40;</span>sample_mean<span style="color: #66cc66;">&#41;</span> other=<span style="color: #a020f0;">&quot;refline 9.9860983 9.9560683	10.016128/axis=x;&quot;</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Welford4.png" alt="" width="370" height="66" class="alignnone size-full wp-image-58288" srcset="https://blogs.sas.com/content/iml/files/2026/04/Welford4.png 370w, https://blogs.sas.com/content/iml/files/2026/04/Welford4-300x54.png 300w" sizes="(max-width: 370px) 100vw, 370px" />
<br />
<a href="https://blogs.sas.com/content/iml/files/2026/04/Welford5.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Welford5.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58291" srcset="https://blogs.sas.com/content/iml/files/2026/04/Welford5.png 640w, https://blogs.sas.com/content/iml/files/2026/04/Welford5-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
What do we learn from this output about the accuracy of the Monte Carlo estimate? 
First, the histogram shows that the sample mean in random samples of size N=100 is quite variable, but approximately normally distributed.
The Monte Carlo estimate of the population mean is the average of all these sample values.
The Monte Carlo estimate and a 99.7% confidence interval are overlaid on the histogram.
After 
B=10,000 simulations, a 99.7% confidence interval for the Monte Carlo estimate has a half-width of about 0.03. 
</p><p>
If we want the Monte Carlo estimate to be more accurate, we can run additional 
simulations. How many simulations would we need for the 99.7% CI to have a half-width of 0.01 units?
Theory tells us that the Monte Carlo standard error is proportional 
to 1/sqrt(B). The standard interpretation of this result is that if you want to cut the standard error in half, 
you must quadruple the number of simulations. In this case, we want to reduce the standard error from 
0.03 to 0.01 (a ratio of 1/3), so we expect this to happen if we perform 9 times as many simulations, or roughly B=90,000.
</p>
<p>
These calculations are based on having run an initial simulation study with low power.
This is often a good idea.
But if you monitor the standard error during the simulation loop, you might not need the initial study.
Instead, you can keep iterating until a confidence interval for the standard error is below some pre-determined tolerance.
</p>
<p>

<h3>Adaptive Monte Carlo simulation in SAS</h3>
</p><p>
This section implements an adaptive Monte Carlo simulation in SAS IML. 
The main difference between an adaptive program and a standard Monte Carlo simulation is that the 
adaptive program uses a DO-UNTIL statement to repeat the trials until 
a confidence interval for the estimate is sufficiently small.
</p>
<p>In the previous example, the accuracy of the MC estimate after 10,000 iterations is about 0.03, as estimated by the half-width of a 99.7% confidence interval. 
Instead of running the entire simulation and then checking the accuracy at the end, 
you can use Welford's formula to provide a running standard error.
You can use DO-UNTIL logic to loop until the Monte Carlo estimate is sufficiently accurate
</p>
<p>The following IML program implements this adaptive method. Inside the iteration loop, Welford's formula updates the mean and variance. 
The program always generates at least 1,000 random samples (minB) but never computes more than 100,000 samples (maxB).
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Adaptive Monte Carlo simulation using Welford's formula */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">100</span>;                       <span style="color: #006400; font-style: italic;">/* sample size */</span>
maxB = <span style="color: #2e8b57; font-weight: bold;">100000</span>;                 <span style="color: #006400; font-style: italic;">/* maximum number of Monte Carlo samples */</span>
minB =   <span style="color: #2e8b57; font-weight: bold;">1000</span>;                 <span style="color: #006400; font-style: italic;">/* run at least this many trials */</span>
lambda = <span style="color: #2e8b57; font-weight: bold;">10</span>;                   <span style="color: #006400; font-style: italic;">/* scale parameter for Expo distrib */</span>
&nbsp;
<span style="color: #0000ff;">x</span> = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
m = <span style="color: #2e8b57; font-weight: bold;">0</span>; varsum = <span style="color: #2e8b57; font-weight: bold;">0</span>;             <span style="color: #006400; font-style: italic;">/* Welford's running totals */</span>
tol = <span style="color: #2e8b57; font-weight: bold;">0.01</span>;                    <span style="color: #006400; font-style: italic;">/* desired accuracy */</span>
err = <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #006400; font-style: italic;">*tol;</span>                   <span style="color: #006400; font-style: italic;">/* make sure err &gt; tol to enter loop */</span>
<span style="color: #0000ff;">do</span> k = <span style="color: #2e8b57; font-weight: bold;">1</span> to maxB <span style="color: #0000ff;">until</span><span style="color: #66cc66;">&#40;</span>err &lt; tol<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #006400; font-style: italic;">/* you can put ANY simulation here */</span>
   <span style="color: #0000ff;">call</span> randgen<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, <span style="color: #a020f0;">&quot;Expo&quot;</span>, lambda<span style="color: #66cc66;">&#41;</span>;
   y = <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #006400; font-style: italic;">/* use Welford's online mean and variance formulas to monitor convergence */</span>
   varsum = varsum + <span style="color: #66cc66;">&#40;</span>k-<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #006400; font-style: italic;">* (y - m)**2 / k;</span>  <span style="color: #006400; font-style: italic;">/* update the sum of squared differences */</span>
   m = m + <span style="color: #66cc66;">&#40;</span>y - m<span style="color: #66cc66;">&#41;</span> / k;                       <span style="color: #006400; font-style: italic;">/* update the mean */</span>
   <span style="color: #0000ff;">if</span> k &gt; minB <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;
      sd = <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>varsum/<span style="color: #66cc66;">&#40;</span>k-<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
      err = <span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #006400; font-style: italic;">*sd / sqrt(k);</span>                 <span style="color: #006400; font-style: italic;">/* 3*SE ==&gt; 99.7% confidence */</span>
    <span style="color: #0000ff;">end</span>;
<span style="color: #0000ff;">end</span>;
&nbsp;
print k m err<span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">'error'</span><span style="color: #66cc66;">&#93;</span>;
<span style="color: #000080; font-weight: bold;">quit</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Welford6.png" alt="" width="199" height="68" class="alignnone size-full wp-image-58294" />

<p>
By using Welford's streaming mean and variance formulas inside the DO-UNTIL loop, the simulation runs for exactly as long as required. 
When the tolerance is 0.01, the simulation (for this random-number seed) generates 90,374 samples.
This is an amazingly close agreement with the theoretical estimate, which predicted 90,000 samples. 
</p>

<h3>Summary</h3>
<p>
This article shows a modern application of Welford's formulas for the streaming mean and variance of data.
The application is to iterate an algorithm until the accuracy of an estimate is less than some specified tolerance.
Here, the algorithm is a Monte Carlo simulation, and the estimate is the Monte Carlo estimate of a sample mean.
</p><p>
I recently used this approach while computing the Monte Carlo approximation of an integral. 
I wanted to ensure that my integral estimate was accurate to within a certain tolerance. By monitoring the 
estimate and its standard error, I was able to guarantee the accuracy of the integral calculation.
</p>

<h3>Appendix: The width of the confidence interval</h3>
<p>
I didn't want this article to be too long, but I wanted to show a graph of the running Monte Carlo estimate and its 99.7% confidence interval.
The Monte Carlo estimate converges quickly, as shown in the graph below. However, the estimate bounces around.
You can see that when B=5000, the MC estimate is very close to the population mean, which is 10.
However, the confidence interval for the estimate (shown by the gray-blue band) has a half-width of about 0.025.
Indeed, as we add more simulations, the MC estimate jitters up and down, so we cannot say with confidence that we have an accurate
estimate. 
</p><p>

<a href="https://blogs.sas.com/content/iml/files/2026/04/Welford7.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Welford7.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58324" srcset="https://blogs.sas.com/content/iml/files/2026/04/Welford7.png 640w, https://blogs.sas.com/content/iml/files/2026/04/Welford7-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>
</p>
<p>
You can make a similar graph all the way out to the 90,000th simulation, but the half-width changes so slowly that it
becomes hard to discern that it is decreasing at all. Instead, the following graph shows ONLY the half-width of the CI versus the 
number of simulations. In this graph, you can see the slow rate at which the CI decreases. The half-width doesn't drop below 0.01 until after 90,000 simulations.
</p>

<a href="https://blogs.sas.com/content/iml/files/2026/04/Welford8.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Welford8.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58375" srcset="https://blogs.sas.com/content/iml/files/2026/04/Welford8.png 640w, https://blogs.sas.com/content/iml/files/2026/04/Welford8-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a><p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/13/monitor-convergence-monte-carlo.html">On monitoring the convergence of Monte Carlo simulations</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/04/13/monitor-convergence-monte-carlo.html/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/04/Welford7-150x150.png" />
	</item>
		<item>
		<title>One-pass formulas for mean and variance</title>
		<link>https://blogs.sas.com/content/iml/2026/04/06/welford-mean-var.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/04/06/welford-mean-var.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 09:23:31 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58201</guid>

					<description><![CDATA[<p>In statistical programming, we often assume that the data are stored in a matrix or a data set and can be read at will. For example, if you want to compute a sample mean or standard deviation, you simply pass a vector to a built-in function, and the software spits [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/06/welford-mean-var.html">One-pass formulas for mean and variance</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<a href="https://blogs.sas.com/content/iml/files/2026/04/Welford1.jpg"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Welford1.jpg" alt="" width="358" height="358" class="alignright size-full wp-image-58204" srcset="https://blogs.sas.com/content/iml/files/2026/04/Welford1.jpg 358w, https://blogs.sas.com/content/iml/files/2026/04/Welford1-300x300.jpg 300w, https://blogs.sas.com/content/iml/files/2026/04/Welford1-150x150.jpg 150w" sizes="(max-width: 358px) 100vw, 358px" /></a>

<p>
In statistical programming, we often assume that the data are stored in a matrix or a data set and can be read at will.
For example, if you want to compute a sample mean or standard deviation, you simply pass a vector to a built-in function, 
and the software spits out the statistics. The function that computes the mean passes through the data once. The textbook
formula for the standard deviation requires two passes through the data. 
</p>

<p>
Modern computers are so fast, and RAM is so cheap and plentiful, that 
today's data analysts probably don't think twice about
using multi-pass algorithms to compute statistics.
But it didn't used to be that way! 
</p><p>
This year SAS is turning 50 years old. In the 1970s, 
RAM was expensive. It was also small: A mainframe computer might have 64KB or less! 
Furthermore, if the data was on a physical media such as a tape, rewinding the tape to read the data again was a time-consuming operation.
Consequently, many research papers in the 1960s and '70s focused on how to compute statistics with one pass through the data.
One such paper was by 
Welford (1962, <em>Technometrics</em>), who showed a simple method for calculating corrected sums of squares and products. 
This one-pass method could be used to compute the mean, variance, and other central moments without using 
any arrays or extra memory.
</p>

<p>
An old proverb states, "everything old is new again," so perhaps it is not surprising that Welford's one-pass formula 
for the mean and standard deviation has applications in the modern era. 
One application is streaming data. If you want to compute statistics for a utility meter or a weather station that monitors conditions continuously,
it would be inefficient to store all the data. It is better to compute statistics on-the-fly, if you can, and maybe store the hourly mean value.
</p>
<p>
Another modern application is for Monte Carlo simulations. Often, we guess how many iterations we need, run the simulation, and hope that the resulting estimate is accurate enough to use. Welford's method provides a way to compute the standard error while the simulation runs, 
which means that we can program logic to exit the simulation when the Monte Carlo estimate satisfies a predetermined level of accuracy.
</p>

<p>This article discusses how to use Welford's one-pass formula to compute moments for streaming data. 
A subsequent article shows how to use this technique to monitor and stop a Monte Carlo simulation.
</p>

<h3>The traditional "offline" mean and variance</h3>
<p>
In the old days, algorithms were sometimes described as "online" or "offline." 
Since these terms are before the creation of the internet, they sound a little strange to the modern programmer.
In computer science, an "offline" algorithm 
requires that the complete data are available before the algorithm is run.
In contrast, an "online" algorithm processes its input serially, one datum at a time, and doesn't store the raw data. 
I prefer to call this type of algorithm a one-pass algorithm that requires little or no auxiliary storage.
</p>
<p>
The familiar "textbook" method for computing means and variances is an offline algorithm.
To compute a mean, you sum the full set of N numbers then divide by N. For the variance, 
you pass through the data again, this time computing the sum of squared deviances (SSE) from the mean.
You divide the SSE by N-1 to obtain the variance. If desired, take the square root to obtain the standard deviation.
For example, the following SAS IML program shows the traditional offline algorithm:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
y = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">48</span>, <span style="color: #2e8b57; font-weight: bold;">51</span>, <span style="color: #2e8b57; font-weight: bold;">46</span>, <span style="color: #2e8b57; font-weight: bold;">37</span>, <span style="color: #2e8b57; font-weight: bold;">45</span>, <span style="color: #2e8b57; font-weight: bold;">50</span>, <span style="color: #2e8b57; font-weight: bold;">46</span>, <span style="color: #2e8b57; font-weight: bold;">51</span>, <span style="color: #2e8b57; font-weight: bold;">51</span>, <span style="color: #2e8b57; font-weight: bold;">55</span><span style="color: #66cc66;">&#125;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* A traditional or &quot;offline&quot; calculation of the mean and standard deviation
   uses two passes through the data: 
   One for the mean, and a second pass for the sum of squared deviations, 
   which gives the variance or standard deviation. */</span>
trad_m = <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>y<span style="color: #66cc66;">&#41;</span>;
trad_sse = ssq<span style="color: #66cc66;">&#40;</span>y - trad_m<span style="color: #66cc66;">&#41;</span>;
trad_sd = <span style="color: #0000ff;">std</span><span style="color: #66cc66;">&#40;</span>y<span style="color: #66cc66;">&#41;</span>;
print trad_m trad_sse trad_sd;   <span style="color: #006400; font-style: italic;">/* sample mean = 48; StdDev ~ 5 */</span></pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Welford2.png" alt="" width="198" height="65" class="alignnone size-full wp-image-58237" />

<p>This two-pass approach is perfectly fine for everyday data analysis. 
It requires that the data vector <code class="preserve-code-formatting">y</code> is known and fits in memory. 
But it is possible to compute these statistics without using extra memory. 
The next section shows how to compute the 
mean and standard deviation by using a one-pass algorithm.
</p>

<h3>The Welford update formula for "online" data</h3>

<p>Welford (1962) published a one-pass algorithm for computing the mean and variance. 
This method was popularized when Knuth featured it in his seminal book,
<em>The Art of Computer Programming, Volume II: Seminumerical Algorithms</em> (1981). 
The Welford method maintains two running totals: one for the mean, and another for the sum of squared differences. 
For each new datum, the algorithm updates these running totals.
The following IML program implements Welford's algorithm by explicitly looping over the elements of the <code class="preserve-code-formatting">y</code> vector:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* The Welford update formula:
   First, initialize the Welford variables for the one-pass mean and std dev */</span>
m = <span style="color: #2e8b57; font-weight: bold;">0</span>;
varsum = <span style="color: #2e8b57; font-weight: bold;">0</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Update the running totals for each new datum y[k]. */</span> 
<span style="color: #0000ff;">do</span> k = <span style="color: #2e8b57; font-weight: bold;">1</span> to nrow<span style="color: #66cc66;">&#40;</span>y<span style="color: #66cc66;">&#41;</span>;
   varsum = varsum + <span style="color: #66cc66;">&#40;</span>k-<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #006400; font-style: italic;">* (y[k] - m)**2 / k;</span>  <span style="color: #006400; font-style: italic;">/* update the sum of squared differences */</span>
   <span style="color: #0000ff;">if</span> k &gt; <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #0000ff;">then</span> sd = <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>varsum/<span style="color: #66cc66;">&#40;</span>k-<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
   m = m + <span style="color: #66cc66;">&#40;</span>y<span style="color: #66cc66;">&#91;</span>k<span style="color: #66cc66;">&#93;</span> - m<span style="color: #66cc66;">&#41;</span> / k;                       <span style="color: #006400; font-style: italic;">/* update the mean */</span>
<span style="color: #0000ff;">end</span>;
print m varsum sd;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/04/Welford3.png" alt="" width="160" height="66" class="alignnone size-full wp-image-58234" />

<p>
In the program, the value of m on the right side of the mean-update formula 
is the sample mean after seeing k-1 observations, so denote it by m<sub>k-1</sub>.
The formula overwrites m with the mean of k observations, so denote the left side as m<sub>k</sub>.
It's not hard to see why Welford's update formula is true.
You can rewrite the mean of k observations by 
writing the sum of k observations as the k_th observation plus the sum of the earlier k-1 observations, as follows:
<br />
<span class='MathJax_Preview'>\(
\begin{aligned}
m_k &= \frac{1}{k} \sum_{i=1}^{k} x_i \\
    &= \frac{1}{k} \left( x_k + \sum_{i=1}^{k-1} x_i \right) \\
    &= \frac{1}{k} \left( x_k + \frac{k-1}{k-1} \sum_{i=1}^{k-1} x_i \right) \\
    &= \frac{1}{k} \left( x_k + (k-1) m_{k-1} \right) \\
    &= m_{k-1} + ( x_k - m_{k-1} ) / k
\end{aligned}
\)</span><script type='math/tex'>
\begin{aligned}
m_k &= \frac{1}{k} \sum_{i=1}^{k} x_i \\
    &= \frac{1}{k} \left( x_k + \sum_{i=1}^{k-1} x_i \right) \\
    &= \frac{1}{k} \left( x_k + \frac{k-1}{k-1} \sum_{i=1}^{k-1} x_i \right) \\
    &= \frac{1}{k} \left( x_k + (k-1) m_{k-1} \right) \\
    &= m_{k-1} + ( x_k - m_{k-1} ) / k
\end{aligned}
</script>
</p>
<p>
The derivation of the formula for the variance is only slightly more complicated. 
<a href="https://dev.to/kylepena/derivation-of-recurrence-relation-of-variance-based-on-second-central-moments-welfords-algorithm-305n">
You can find the derivation online.</a>
</p>

<h3>Other useful one-pass methods</h3>
<p>
Since SAS is celebrating its 50th birthday, I should mention 
the importance of one-pass methods in early SAS procedures. 
In the 1970s, SAS was able to leverage one-pass methods to compute basic statistics, regression analyses, and multivariate analyses on data sets that had 
a huge number of observations.
The regression and multivariate analyses were possible because 
<a href="https://blogs.sas.com/content/iml/2019/04/10/compute-sscp-matrix.html">SAS computes the 
uncorrected sum of squares and crossproducts (USSCP or X`X) matrix in one pass through the data.</a>. 
From the USSCP matrix, <a href="https://blogs.sas.com/content/iml/2018/04/18/sweep-operator-sas.html">you can use the 
SWEEP operator</a> to obtain the centered SSCP, which then yields the covariance and correlation matrices.
SAS used these and other one-pass methods to serve customers who needed to analyze Big Data quickly.
</p>


<h3>Summary</h3>
<p>
This article discusses the value of Welford's method for the mean and variance.
Welford's method is an example of a one-pass formula.
Today, memory is cheap and plentiful, so one-pass formulas are not as necessary as they were
in the 1970s when SAS was founded.
However, they are still useful. Modern sensors generate hundreds of kilobytes of data every minute,
so it makes sense to compute statistics "on the fly" rather than storing all the raw data.
In addition, the next article shows how you can use Welford's formulas in Monte Carlo simulations 
to monitor convergence.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/04/06/welford-mean-var.html">One-pass formulas for mean and variance</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/04/06/welford-mean-var.html/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/04/Welford1-150x150.jpg" />
	</item>
		<item>
		<title>An NMF analysis of Scotch whiskies</title>
		<link>https://blogs.sas.com/content/iml/2026/03/30/nmf-whisky.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/03/30/nmf-whisky.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 30 Mar 2026 09:28:33 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Matrix Computations]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58087</guid>

					<description><![CDATA[<p>This is the last article in a series about the nonnegative matrix factorization (NMF). In this article, I run and visualize an NMF analysis of the Scotch whisky data and compare it to a principal component analysis (PCA). Previous articles in the series provide information about the whisky data, the [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/03/30/nmf-whisky.html">An NMF analysis of Scotch whiskies</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
This is the last article in a series about the nonnegative matrix factorization (NMF).
In this article, I run and visualize an NMF analysis of the Scotch whisky data and compare it to a principal component analysis (PCA).
</p><p>
Previous articles in the series provide information about the whisky data, the PCA analysis, and issues related to two-factor low-rank factorizations such as the NMF:
</p>
<ul>
<li><a href="https://blogs.sas.com/content/iml/2026/02/23/whisky-pca.html">PCA Analysis of whisky</a>: What is the Scotch whisky data? How can you use a PCA to identify whiskies that have similar flavor characteristics?
</li>
<li><a href="https://blogs.sas.com/content/iml/2026/03/16/on-the-nonuniqueness-of-low-rank-matrix-factors.html">A two-factor rank-k decomposition has inherent ambiguities</a>: You can arbitrarily scale and permute the order of the factors.
</li>
<li><a href="https://blogs.sas.com/content/iml/2026/03/23/nmf-vs-svd.html">A comparison of the PCA and NMF decompositions</a> on simulated data, which reveals why the NMF might be preferred for nonnegative data.
</li>
</ul>

<p>
In SAS, you can obtain the NMF by using PROC NMF in SAS Viya, or by implementing the algorithm directly in SAS IML. 
Sharad Saxena has written three blog posts about how to use PROC NMF in SAS Viya: 
</p>
<ol>
<li>
<a href="https://communities.sas.com/t5/SAS-Communities-Library/Non-negative-Matrix-Factorization-Part-1-Understanding-Its/ta-p/973592">How to Understand the NMF</a>
</li>
<li>
<a href="https://communities.sas.com/t5/SAS-Communities-Library/Non-negative-Matrix-Factorization-Part-2-Discovering-Topics-from/ta-p/976933">How to Extract Topics from a Set of Documents</a>
</li>
<li>
<a href="https://communities.sas.com/t5/SAS-Communities-Library/Non-negative-Matrix-Factorization-Part-3-Making-Recommendations/ta-p/980752">How to Use the NMF to Build a Recommender System</a>
</li>
</ol>

<h3>Dimension reduction by using the NMF</h3>
<p>
A previous article uses <a href="https://blogs.sas.com/content/iml/2026/02/23/whisky-pca.html">classical principal component analysis (PCA) to analyze the flavor profiles of 86 Scotch whiskies</a>. 
Each whisky is assigned a value 0-4 according to 12 flavors:  
Tobacco, Medicinal, Smoky, Body, Spicy, Winey, Nutty, Honey, Malty, Fruity, Sweetness, and Floral.
The PCA and NMF analyses were discussed in an article by Young, Fogel, and Hawkins (2006, <a href="https://www.niss.org/sites/default/files/ScotchWhisky.pdf">"Clustering scotch whiskies using non-negative matrix factorization"</a>, <em>SPES Newsletter</em>). 
</p>
<p>
By using the PCA, you can find a low-dimensional linear subspace in the 12-dimensional variable space
such that the subspace captures most of the variance in the data. 
This is called <em>dimension reduction</em>. For the whisky data, 
you can use PCA to choose a four-dimensional subspace that explains  67% of the variability in the data.
The PCA factorization has two important properties: it maximizes the explained variance of the data within a lower-dimensional linear subspace,
and it provides orthogonal basis vectors for the subspace.
</p><p>
Whereas the PCA can analyze any centered data matrix, 
the NMF factorization is designed for dimension-reduction of a nonnegative data matrix.
The NMF seeks to decompose the data matrix into profile vectors (which are basis vectors for a lower-dimensional subspace)
and a matrix of weights so that the product approximates the data matrix. In addition, the NMF requires that 
the profile vectors and weights be nonnegative. This last condition is why the NMF factors 
are often more interpretable than the PC vectors. 
However, the NMF decomposition does not have the maximum-variance or orthogonal-basis properties that
make the PCA useful.
</p>

<h3>How to interpret the NMF factors</h3>
<p>
The whisky data can be represented as an
86x12 nonnegative data matrix, X.
Each row represents flavor characteristics of a whisky.
Each whisky is assigned a flavor profile that consists of the values 0-4 for the 12 flavors.
</p><p>
The NMF creates a rank-4 approximation to X.
The approximation is represented as a product W*H,
where W is 86x4 and H is 4x12.
The rows of H form a basis for a 4-D linear subspace.
These are the "flavor profile" vectors that are analogous to the principal components (eigenvectors) 
in a PCA analysis.
</p><p>
The rows of the W matrix are weights (or coefficients) for the 4-factor approximation of the rows of X. 
If W<sub>i</sub> is the i_th row of W, the product 
W<sub>i</sub> * H approximates the flavor profile for the i_th whisky.
Equivalently, if  H<sub>1</sub>-H<sub>4</sub> are the rows of H, then the linear combination
<br />
W<sub>i1</sub>H<sub>1</sub> +  W<sub>i2</sub>H<sub>2</sub> + W<sub>i3</sub>H<sub>3</sub> + W<sub>i4</sub>H<sub>4</sub> 
<br />
is an approximation of the flavor profile for the i_th whisky.
In this sense, W provides the weights for the approximation of observations in terms of the NMF profile vectors.
</p>
<p>
You can also interpret H as a set of weights for an approximation to the variables of X.
The columns of W can be considered "latent factors" for X. The linear combination W*H[,j] is an approximation 
of the j_th variable, 
X<sub>j</sub>.
</p>
<p>
Be aware that the role played by W and H depends on how the data are arranged in X.
For example, it is equally valid to represent the data by using a 12x86 data matrix, Y, where the rows are the flavors and the columns are the whiskies. In that case,
Y` = X, and Y = H`*W` is a valid NMF factorization. 
</p>


<h3>An NMF analysis of the whisky data</h3>
<p>
I'm a statistician, not a Scotch connoisseur. Nevertheless, let's perform an NMF analysis of the whisky data.
I used SAS IML to implement a simple version of the NMF algorithm.
You can <a href="https://github.com/sascommunities/the-do-loop-blog/blob/master/NMF/whisky_NMF.sas">download the SAS programs in this article from GitHub</a>
and run the analysis yourself.
After reading the whisky data into a matrix, the following IML statements 
perform an NMF analysis and visualize the H matrix.
Since the rows of H are the basis vectors for a 4-D subspace, I transpose the matrix 
and display the columns. 
As shown in a previous article, 
<a href="https://blogs.sas.com/content/iml/2026/03/23/nmf-vs-svd.html">you can scale the NMF factors so 
that they are approximately on the same scale as the PCA factors</a>.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">%let</span> varNames = Tobacco Medicinal Smoky Body Spicy Winey Nutty Honey Malty Fruity Sweetness Floral;
<span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* Apply NMF to the Scotch whisky data */</span>
varNames = propcase<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;varNames</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#41;</span>;
use Whisky;
   read all <span style="color: #0000ff;">var</span> varNames <span style="color: #0000ff;">into</span> <span style="color: #0000ff;">X</span>;
   read all <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;Distillery&quot;</span> <span style="color: #a020f0;">&quot;Selected&quot;</span><span style="color: #66cc66;">&#125;</span>;
<span style="color: #0000ff;">close</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* The PCA analysis of the whisky data kept four PCs.
   Load the nmf_mult function and compute a four-factor NMF */</span>
load module = <span style="color: #0000ff;">_ALL_</span>;
names = <span style="color: #a020f0;">'NMF 1'</span>:<span style="color: #a020f0;">'NMF 4'</span>;
<span style="color: #000080; font-weight: bold;">run</span> nmf_mult<span style="color: #66cc66;">&#40;</span>W1, H1, <span style="color: #0000ff;">X</span>, ncol<span style="color: #66cc66;">&#40;</span>names<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* scale the H matrix to use the same scale used in the PCA article */</span>
maxValPCA = <span style="color: #2e8b57; font-weight: bold;">0.8506</span>;             <span style="color: #006400; font-style: italic;">/* max value in the PCA eigenbasis */</span>
scale = maxValPCA / H1<span style="color: #66cc66;">&#91;</span>,&lt;&gt;<span style="color: #66cc66;">&#93;</span>;    <span style="color: #006400; font-style: italic;">/* scale each row by maximum, then multiply by maxVal */</span>
W = W1 / scale`;
H = H1 # scale;
print  <span style="color: #66cc66;">&#40;</span>W<span style="color: #66cc66;">&#91;</span>&lt;&gt;,<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Max W Cols&quot;</span><span style="color: #66cc66;">&#93;</span>,   <span style="color: #66cc66;">&#40;</span>H<span style="color: #66cc66;">&#91;</span>,&lt;&gt;<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Max H Rows&quot;</span><span style="color: #66cc66;">&#93;</span>;
&nbsp;
<span style="color: #0000ff;">%let</span> Reds = CXF7F7F7 CXFDDBC7 CXF4A582 CXD6604D CXB2182B ;
ods graphics / width=360px height=480px;
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>H`<span style="color: #66cc66;">&#41;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;NMF Factors (H`) for Whiskies&quot;</span>
                     colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;Reds</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span>||<span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span>H<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
                     xvalues=names yvalues=varNames;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF1.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF1.png" alt="" width="360" height="480" class="alignnone size-full wp-image-58123" srcset="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF1.png 360w, https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF1-225x300.png 225w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<a href="https://blogs.sas.com/content/iml/files/2026/02/whisky_heat2.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/02/whisky_heat2.png" alt="" width="360" height="480" class="alignnone size-full wp-image-57782" srcset="https://blogs.sas.com/content/iml/files/2026/02/whisky_heat2.png 360w, https://blogs.sas.com/content/iml/files/2026/02/whisky_heat2-225x300.png 225w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<p>
For comparison, I also show <a href="https://blogs.sas.com/content/iml/2026/03/02/heatmap-pca.html">a heat map of the factors (basis elements) for the four-component PCA analysis from my previous article</a>.
The colors in the left image correspond to the interval [0, 0.8506]. 
The colors in the right image correspond to the interval [-0.8506, 0.8506], where white represents 0, so that the red colors represent the same range in each image. 
The images enable you to compare the NMF and PCA factors. Here are some comparisons:
</p>
<ul>
<li>Both images show latent factors for a four-factor approximation of the whisky flavors. That is, the flavor characteristics of all whiskies are approximated by 
linear combinations of these four factors.</li>
<li>The factors for the NMF are all positive. Because the weights are also positive, the flavor of each whisky is approximated by using an additive sum of these NMF components. 
In contrast, the PCA factors have both positive and negative values. The flavor of each whisky is approximated by using a sum of these contrasts. 
</li>
<li>The first NMF factor is a strong combination of the Fruity, Sweetness, and Floral variables, with smaller amounts of Smoky, Spicy, and Malty. 
At the risk of promoting stereotypes, these dominant flavors are sometimes characterized as "feminine" flavors. 
</li>
<li>The second NMF factor is a combination of 10 flavors, with an emphasis on Body. 
Unlike the first factor, it includes the Winey, Nutty, and Honey flavors.
This factor might be described as modeling "overall" flavors. Or perhaps as a "base" or "average" profile for Scotch whiskies. 
In a subsequent section, you will see that the second factor has a substantial weight for almost all whiskies.
</li>
<li>The third NMF factor emphasizes "masculine" flavors: Tobacco, Medicinal, Smoky. This factor also includes the Body and Spicy variables.
</li>
<li>The fourth NMF factor emphasizes the Nutty flavors, but also includes the Malty and Sweetness variables.
</li>
<li>It is difficult to compare the NMF factors with the PCA factors. The first PCA factor is a contrast between the masculine and feminine flavors. The second PCA factor 
seems to emphasize overall flavor. The fourth factor is primarily a contrast between Nutty and Sweetness.  The fact that the PCA factors are harder to interpret is one reason why NMF 
is used in applications where interpretability is important.
</li>
</ul>

<h3>Approximating whisky flavors in terms of factors</h3>
<p>
The goal of the PCA and NMF analyses is to choose a small number of factors that 
can be used to reduce the dimensionality of the data.
Instead of associating each whisky with 12 variables (flavors), the PCA and NMF analyses 
both choose four new "compound" factors. 
The flavor profile of each whisky is approximated as a linear combination of these factors.
A previous article discusses 
<a href="https://blogs.sas.com/content/iml/2026/03/02/heatmap-pca.html">how to visualize the weights for the four-component PCA analysis</a>.
In a PCA analysis, the weights can be positive or negative.
</p>
<p>
For the NMF analysis, the weights are always nonnegative. The following heat map visualizes 
a selected subset of whiskies in terms of the NMF components.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;">ods graphics / width=480px height=640px;
<span style="color: #006400; font-style: italic;">/* choose a subset of whiskies */</span>
selIdx = loc<span style="color: #66cc66;">&#40;</span>Selected<span style="color: #66cc66;">&#41;</span>;
W_sel = W<span style="color: #66cc66;">&#91;</span>selIdx, <span style="color: #66cc66;">&#93;</span>;
Whisky_name = Distillery<span style="color: #66cc66;">&#91;</span>selIdx<span style="color: #66cc66;">&#93;</span>;
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>W_sel<span style="color: #66cc66;">&#41;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;NMF Characteristics of Selected Whiskies (W)&quot;</span>
         colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;Reds</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span>||<span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span>W<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
         xvalues=names yvalues=Whisky_name;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF3.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF3.png" alt="" width="360" height="480" class="alignnone size-full wp-image-58138" srcset="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF3.png 480w, https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF3-225x300.png 225w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<p>
For each whisky (each row), the heat map shows the weights for the approximation of the whisky's flavor in terms of the four NMF factors.
You can use this heat map in place of scatter plots to visualize the four-dimensional scores for each whisky.
A few features are apparent:
</p>
<ul>
<li>The second NMF factor is used for almost all whiskies. In that sense, the second factor is a "base" set of flavors.
Most individual whiskies start with this flavor profile and then add additional features by using other factors.
The Aberlour, Balvenie, Glenlivet, Linkwood, and Macallan whiskies have similar flavors, which is based on the second factor.
The similarity in these whiskies can be explained by noting that these distilleries are all located in the same region (Speyside) Scotland. 
</li>
<li>
The first NMF factor is dominant in the Glenfiddich whisky and five others (Auchentoshan, GlenGarioch,...).
Recall that the first NMF factor emphasizes "feminine" flavors such as Fruity, Sweetness, and Floral.
</li>
<li>
The third NMF factor is dominant in the Ardbeg, Lagavulin, and Laphroaig whiskies.
Recall that the third NMF factor emphasizes "masculine" flavors. Indeed, all three whiskies are often described as "earthy" or "peaty,"
and the distilleries are all on the island of Islay.
</li>
<li>
The fourth NMF factor is primarily related to the Nutty variable.
This factor is not the sole component of whisky.
Rather, this factor is added to other components. 
The Edradour whisky is primarily composed of the "base" flavor with Nutty enhancements. 
</li>
</ul>


<h3>The orthogonality (not!) of NMF factors</h3>
<p>
Unlike the PCA factors, the NMF factors are not orthogonal to each other. 
The NMF factors are correlated, as shown by the following computation:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* show that the profile vectors are not orthogonal */</span>
R = corr<span style="color: #66cc66;">&#40;</span>H`<span style="color: #66cc66;">&#41;</span>;
print R<span style="color: #66cc66;">&#91;</span>c=names r=names F=Bestd5.<span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF2.png" alt="" width="239" height="172" class="alignnone size-full wp-image-58129" />


<h3>How much variance is captured by the NMF factors?</h3>
<p>
When you use a k-factor PCA for dimension-reduction, the factors form a basis for a k-dimensional linear 
subspace that explains the most variance in the data. 
Since the NMF factors are not orthogonal, the subspace spanned by the NMF factors must explain less variance.
But how much less? That answer depends on the data, but you can write a short IML function that provides the answer.
</p><p>
The math is standard linear algebra, but a little long to explain. 
Let the columns of A represent a set of linearly independent basis vectors for an arbitrary subspace.
To compute the variance explained by the subspace, 
you can project the data onto that subspace, then compute the covariance matrix of the projected data.
</p><p>
Let S be the covariance matrix of the original data.
Briefly, you can use the following method to compute
the proportion of the variance in S that is explained by the column space of A.
</p>
<ul>
<li>
Let Q be an orthonormal basis for the column space of A.
</li>
<li>
The projection matrix onto this subspace is P = Q*Q`.
</li>
<li>
The covariance of the projected data is P*S*P.
Therefore, the explained variance is trace(P*S*P).
</li>
<li>
The <a href="https://en.wikipedia.org/wiki/Trace_(linear_algebra)#Cyclic_property">trace of a product of square matrices is invariant under a cyclic shift</a>
of the matrices in the product. If you substitute P = Q*Q`, shift the order, and simplify, you discover that
the explained variance is trace(Q`*S*Q).
</li>
</ul>

<p>
This method is implemented in the following SAS IML function, which accepts the covariance matrix, S, and a matrix of basis elements, A.
The function returns the proportion of the variance in S that is explained by the subspace defined by the columns of A.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Let S be an nxn covariance matrix. 
   Let A be an nxk matrix with k linearly indep columns.
   This function returns the proportion of the variance in S that is explained by column space of A.
*/</span>
start VarExplained<span style="color: #66cc66;">&#40;</span>A, S<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">call</span> svd<span style="color: #66cc66;">&#40;</span>U, D, V, A<span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* Find orthonormal basis, Q, for the span of A */</span>
   k = ncol<span style="color: #66cc66;">&#40;</span>A<span style="color: #66cc66;">&#41;</span>;
   Q = U<span style="color: #66cc66;">&#91;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>:k<span style="color: #66cc66;">&#93;</span>; 
&nbsp;
   var_explained = trace<span style="color: #66cc66;">&#40;</span>Q`<span style="color: #006400; font-style: italic;">*S*Q);</span>   <span style="color: #006400; font-style: italic;">/* variance explained by colspace(A) */</span>
   total_var = trace<span style="color: #66cc66;">&#40;</span>S<span style="color: #66cc66;">&#41;</span>;            <span style="color: #006400; font-style: italic;">/* total variance in S */</span>
   prop_var = var_explained / total_var; <span style="color: #006400; font-style: italic;">/* proportion of var explained */</span>
   <span style="color: #0000ff;">return</span> prop_var;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* For the PCA, find the proportion of variance explained by the first 4 PCs */</span>
<span style="color: #0000ff;">call</span> eigen<span style="color: #66cc66;">&#40;</span>D, U, S<span style="color: #66cc66;">&#41;</span>;                 <span style="color: #006400; font-style: italic;">/* PCA decomp */</span>
U = U<span style="color: #66cc66;">&#91;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #2e8b57; font-weight: bold;">4</span><span style="color: #66cc66;">&#93;</span>;
prop_var_PCA = VarExplained<span style="color: #66cc66;">&#40;</span>U, S<span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* somewhat redundant, since Q=U */</span>
<span style="color: #006400; font-style: italic;">/* For the NMF, find the proportion of variance explained by the rowspace of H */</span>
prop_var_NMF = VarExplained<span style="color: #66cc66;">&#40;</span>H`, S<span style="color: #66cc66;">&#41;</span>;
print prop_var_PCA<span style="color: #66cc66;">&#91;</span>F=PERCENT7.1<span style="color: #66cc66;">&#93;</span> prop_var_NMF<span style="color: #66cc66;">&#91;</span>F=PERCENT7.1<span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF4.png" alt="" width="279" height="83" class="alignnone size-full wp-image-58174" />

<p>
The output shows that the maximum variance that can be captured in the 4-D linear subspace is 67.1%.
By comparison, the NMF factors explain 60.3% of the variance.
</p>



<h3>Summary</h3>
<p>
The purpose of this article is to use the NMF factorization on the Scotch whisky data set, which classifies the flavors in 86 Scotch whiskies according to 12 flavor characteristics. 
These data were analyzed by Young, Fogel, and Hawkins (2006).
The NMF decomposes the data matrix into two matrices: A matrix of factors (H) and a matrix of weights (W).
The NMF enables you to reduce the dimensionality of the data by replacing the 12 original variables with four factors that
are nonnegative combinations of the variables.
The rows of the weights matrix provides the coefficients for a rank-4 approximation of the data.
The flavor profile of each whisky is approximated by using a linear combination of factors. 
If you visualize the weights matrix, you can identify which whiskies have similar flavor profiles. 
</p><p>
You can <a href="https://github.com/sascommunities/the-do-loop-blog/blob/master/NMF/whisky_NMF.sas">download the SAS programs in this article from GitHub</a>.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/03/30/nmf-whisky.html">An NMF analysis of Scotch whiskies</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/03/30/nmf-whisky.html/feed</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/03/whisky_NMF1-150x150.png" />
	</item>
		<item>
		<title>A comparison of SVD and NMF factorizations</title>
		<link>https://blogs.sas.com/content/iml/2026/03/23/nmf-vs-svd.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/03/23/nmf-vs-svd.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 23 Mar 2026 09:25:52 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=57980</guid>

					<description><![CDATA[<p>The classical singular value decomposition (SVD) has a long and venerable history. I have described it as a fundamental theorem of linear algebra. Nevertheless, the mathematical property that makes it so useful in general (namely, the orthogonality of the matrix factors) also makes it less useful for certain applications. In [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/03/23/nmf-vs-svd.html">A comparison of SVD and NMF factorizations</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
<a href="https://blogs.sas.com/content/iml/2017/08/28/singular-value-decomposition-svd-sas.html">The classical singular value decomposition (SVD)</a> has a long and venerable history.
I have described it as a fundamental theorem of linear algebra. 
Nevertheless, the mathematical property that makes it so useful in general (namely, the orthogonality of the matrix factors) 
also makes it less useful for certain applications. 
In response, researchers developed the nonnegative matrix factorization (NMF), which 
removes the requirement of orthogonality and replaces it with a requirement that 
the elements in the matrix factors be nonnegative.
</p>
<p>
The purpose of this article is to use an example to compare 
the classical singular value decomposition (SVD) and the nonnegative matrix factorization (NMF).
An excellent paper that presented the NMF to the statistical community is
Fogel, Hawkins, Beecher, Luta, and Young (2013),
<a href="https://doi.org/10.1080/00031305.2013.845607">"A tale of two matrix factorizations."</a>
One section of the paper 
that caught my eye is Section 4.2, "Realistic synthetic mixture,"
which compares and contrasts the NMF and the classical SVD.
This article presents a similar example. It also improves the visualizations in the Fogel et al. paper.
The images in this blog post are better for visualizing how the SVD and the NMF methods differ.
</p>

<h3>Simulating patient-gene data for a disease</h3>
<p>
The Fogel et al. example is based on 
clinical study that involves gene expression levels for 30 genes across 40 patient samples.
(Actually, the paper uses 40 genes, but the example is more enlightening if the number of variables and observations are different.)
Assume a disease (such as diabetes) has two distinct genetic etiologies.
The L matrix ("L" for "left side") represents 40 patients.
The first column of L represents a measure of one etiology.
The second column of L represents a measure of a second etiology.
In this example, 10 patients have only the first etiology,
10 have only the second etiology, and 10 have both.
</p>
<p>
The R matrix ("R" for "right side") represents expression levels for 30 genes. 
This example assumes the first 10 genes are related to the first etiology, and the next 10 genes are related to the second etiology. 
The last 10 genes are not related to the disease.
</p>
<p>
You can use SAS IML to simulate these factors and to construct the data matrix as their product.
Because the data are simulated, we know that the data matrix is exactly rank-2, and we know a true set of rank-2 matrices whose product equals the data matrix.
You can simulate the L and R matrices by randomly assigning a value between 1 and 2 to specific cells. 
Notice that these factors are entirely nonnegative, so the product matrix, X = L*R, is also nonnegative.
The following SAS IML statements simulate the L and R matrices and create heat maps to visualize them.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #0000ff;">reset</span> <span style="color: #0000ff;">fuzz</span>;
<span style="color: #006400; font-style: italic;">/*** Simulation of L and R for clinical example of two etiologies ***/</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">123</span><span style="color: #66cc66;">&#41;</span>;
nrow = <span style="color: #2e8b57; font-weight: bold;">40</span>;
ncol = <span style="color: #2e8b57; font-weight: bold;">30</span>;
L = j<span style="color: #66cc66;">&#40;</span>nrow, <span style="color: #2e8b57; font-weight: bold;">2</span>, <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;
idx1 = <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">11</span>:<span style="color: #2e8b57; font-weight: bold;">20</span><span style="color: #66cc66;">&#41;</span> || <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">31</span>:<span style="color: #2e8b57; font-weight: bold;">40</span><span style="color: #66cc66;">&#41;</span>;
L<span style="color: #66cc66;">&#91;</span>idx1, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span> = randfun<span style="color: #66cc66;">&#40;</span> ncol<span style="color: #66cc66;">&#40;</span>idx1<span style="color: #66cc66;">&#41;</span>, <span style="color: #a020f0;">&quot;Uniform&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* 1st col */</span>
idx2 =  <span style="color: #2e8b57; font-weight: bold;">21</span>:<span style="color: #2e8b57; font-weight: bold;">40</span>;
L<span style="color: #66cc66;">&#91;</span>idx2, <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span> = randfun<span style="color: #66cc66;">&#40;</span> ncol<span style="color: #66cc66;">&#40;</span>idx2<span style="color: #66cc66;">&#41;</span>, <span style="color: #a020f0;">&quot;Uniform&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* 2nd col */</span>
&nbsp;
R = j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">2</span>, ncol, <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;
idx1 = <span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #2e8b57; font-weight: bold;">10</span>;
R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>, idx1<span style="color: #66cc66;">&#93;</span> = randfun<span style="color: #66cc66;">&#40;</span> <span style="color: #2e8b57; font-weight: bold;">1</span>|| ncol<span style="color: #66cc66;">&#40;</span>idx1<span style="color: #66cc66;">&#41;</span>, <span style="color: #a020f0;">&quot;Uniform&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* 1st row */</span>
idx2 = <span style="color: #2e8b57; font-weight: bold;">11</span>:<span style="color: #2e8b57; font-weight: bold;">20</span>;
R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">2</span>, idx2<span style="color: #66cc66;">&#93;</span> = randfun<span style="color: #66cc66;">&#40;</span> <span style="color: #2e8b57; font-weight: bold;">1</span>||ncol<span style="color: #66cc66;">&#40;</span>idx2<span style="color: #66cc66;">&#41;</span>, <span style="color: #a020f0;">&quot;Uniform&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* 2nd row */</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* simplify L and R by rounding to 2 decimal places */</span>
L = <span style="color: #0000ff;">round</span><span style="color: #66cc66;">&#40;</span>L, <span style="color: #2e8b57; font-weight: bold;">0.01</span><span style="color: #66cc66;">&#41;</span>;
R = <span style="color: #0000ff;">round</span><span style="color: #66cc66;">&#40;</span>R, <span style="color: #2e8b57; font-weight: bold;">0.01</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #006400; font-style: italic;">/*** end simulation of L and R ***/</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* create blue-white-red color ramp from ColorBrewer */</span>
<span style="color: #0000ff;">%let</span> BlueReds = CX2166AC CX4393C3 CX92C5DE CXD1E5F0 CXF7F7F7 CXFDDBC7 CXF4A582 CXD6604D CXB2182B ;
<span style="color: #0000ff;">%let</span> Reds = CXF7F7F7 CXFDDBC7 CXF4A582 CXD6604D CXB2182B ;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Visualize L (40 rows, transposed for viewing) and R (30 columns) */</span>
ods graphics / width=640px height=84px; 
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>L`<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;BlueReds</span><span style="color: #66cc66;">&#125;</span> yvalues=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'Real 1'</span> <span style="color: #a020f0;">'Real 2'</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">2.2</span> <span style="color: #2e8b57; font-weight: bold;">2.2</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;Original L`&quot;</span>;
&nbsp;
ods graphics / width=480px height=84px;
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;BlueReds</span><span style="color: #66cc66;">&#125;</span> yvalues=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'Real 1'</span> <span style="color: #a020f0;">'Real 2'</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">2.2</span> <span style="color: #2e8b57; font-weight: bold;">2.2</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;Original R&quot;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD01.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD01.png" alt="" width="640" height="84" class="alignnone size-full wp-image-58007" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD01.png 640w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD01-300x39.png 300w" sizes="(max-width: 640px) 100vw, 640px" /></a>
<br />
<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD02.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD02.png" alt="" width="480" height="84" class="alignnone size-full wp-image-58004" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD02.png 480w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD02-300x53.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
The columns of the L matrix are the factors for gene activation. 
The rows of the L matrix represent patients. (Note: the image is for the transposed matrix, L`).
You can visualize the etiologies for the four groups.
For the R matrix, the rows are the factors and the columns are the genes.
The heat maps are entirely red or white because all values of 
L and R are nonnegative. The product, X = L*R, represents the gene expression levels for 
30 genes across 40 patient samples. The following statements visualize the data matrix:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">X</span> = L<span style="color: #006400; font-style: italic;">*R;</span>
ods graphics / width=360px height=480px; 
<span style="color: #006400; font-style: italic;">/* Note: Dark red now represents 4, not 2.2 */</span>
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;Reds</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">0.01</span> <span style="color: #2e8b57; font-weight: bold;">4.04</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;Original (Synthetic) Data Matrix&quot;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD03.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD03.png" alt="" width="360" height="480" class="alignnone size-full wp-image-58001" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD03.png 360w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD03-225x300.png 225w" sizes="(max-width: 360px) 100vw, 360px" /></a>


<p>
In the real world, the order of the patients and the order of the genes would be permuted arbitrarily. 
Also, there would be random noise that appears in the 40x30 panel of measurements. 
Regardless, when presented with these data, a researcher would like to recover the underlying factors that generated them. 
If researchers know or suspect that the disease has two etiologies, they might try to approximate the matrix 
by using a rank-2 decomposition of factors. 
In this example, we know the true factors whose product forms the X matrix, but in the real world, 
those factors are unknown and must be estimated from X.
</p>

<h3>The SVD factorization</h3>
<p>
The classical way to decompose X into rank-2 factors is to use the singular value decomposition (SVD). 
A newer decomposition is the nonnegative matrix factorization (NMF). 
Following Fogel et al. (2013), the remainder of this article compares these two factorizations.
</p>
<p>
A previous article discusses the fact that <a href="https://blogs.sas.com/content/iml/2026/03/16/on-the-nonuniqueness-of-low-rank-matrix-factors.html">a low-rank factorization is not unique, but can be arbitrarily rescaled and permuted</a>. 
Fogel et al. visualized the results by using heat maps.
However, their heat maps are individually scaled to use the minimum and maximum value in each matrix. 
Because the SVD factors have positive and negative elements whereas the NMF factors do not, 
it is hard to compare the two factorization methods unless the colors in the heat maps represent the same scale.
In this article, I carefully scale the factors so that red colors always represent positive values, white represents zero, 
and blue always represents negative values.
</p>
<p>
Let's look at an SVD analysis of the simulated data matrix, X. The following statements call the SVD subroutine in SAS IML to compute X = U*D*V`. 
To get the rank-2 approximation of X, the first two columns of U and V are extracted. 
For this example, the rank-2 product exactly equals X because we constructed X as the product of two rank-2 matrices
and did not add any noise.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;">k = <span style="color: #2e8b57; font-weight: bold;">2</span>;
<span style="color: #0000ff;">call</span> SVD<span style="color: #66cc66;">&#40;</span>A, D, B, <span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#41;</span>;
i = <span style="color: #2e8b57; font-weight: bold;">1</span>:k;   
L_SVD =   A<span style="color: #66cc66;">&#91;</span>, i<span style="color: #66cc66;">&#93;</span> # <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>D<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>`;   <span style="color: #006400; font-style: italic;">/* scale each factor evenly */</span>
R_SVD = T<span style="color: #66cc66;">&#40;</span>B<span style="color: #66cc66;">&#91;</span>, i<span style="color: #66cc66;">&#93;</span> # <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>D<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>`<span style="color: #66cc66;">&#41;</span>;
X2 = L_SVD<span style="color: #006400; font-style: italic;">*R_SVD;</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* show that the SVD recovers X exactly */</span>
maxDiff = <span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">abs</span><span style="color: #66cc66;">&#40;</span>X-X2<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
print maxDiff;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD04.png" alt="" width="70" height="64" class="alignnone size-full wp-image-58013" />

<h3>Interpreting the SVD factors</h3>
<p>
Although the product of the SVD exactly equals X, the SVD factors do NOT look like the nonnegative factors L and R that were used
to create X. 
The reason is that the factors in an SVD are orthogonal, which means that the factors must contain both positive and negative elements when k &ge; 2. 
(Technically, if X is block-diagonal, then the SVD can recover the original features, but that is not the case in this example.) 
Because X contains strictly nonnegative values and isn't block-diagonal, 
<em>it is mathematically impossible for the SVD to return orthogonal factors that are entirely nonnegative</em>.
</p>
<p>
This fact is shown by using the following heat maps, which display the SVD factors on the same color scale as the true factors.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Visualize true L` (40 columns) and the left factor from the SVD */</span>
ods graphics / width=640px height=84px; 
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>L`<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;BlueReds</span><span style="color: #66cc66;">&#125;</span> yvalues=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'Real 1'</span> <span style="color: #a020f0;">'Real 2'</span><span style="color: #66cc66;">&#125;</span>   <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">2.2</span> <span style="color: #2e8b57; font-weight: bold;">2.2</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;Real Factor L`&quot;</span>;
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>L_SVD`<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;BlueReds</span><span style="color: #66cc66;">&#125;</span> yvalues=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'SVD 1'</span> <span style="color: #a020f0;">'SVD 2'</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">2.2</span> <span style="color: #2e8b57; font-weight: bold;">2.2</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;SVD Factor&quot;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD05.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD05.png" alt="" width="640" height="84" class="alignnone size-full wp-image-58019" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD05.png 640w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD05-300x39.png 300w" sizes="(max-width: 640px) 100vw, 640px" /></a>
<br />
<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD06.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD06.png" alt="" width="640" height="84" class="alignnone size-full wp-image-58037" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD06.png 640w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD06-300x39.png 300w" sizes="(max-width: 640px) 100vw, 640px" /></a>

<p>
The heat maps show (the transpose of) the original factor, L, and the SVD factor, U.
Whereas the true factors for X have strictly nonnegative values, the second factor in the SVD contains negative values (the blue cells).
The negative values are required because of orthogonality: the inner product of the columns must equal 0! 
The negative values represent contrasts, which can be difficult to interpret when the true underlying components of the data are strictly additive.
</p>
<p>
You can use similar statement to visualize the right-hand SVD factor, V, and compare it to the original factor, R. The results are shown below:
</p>

<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD07.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD07.png" alt="" width="480" height="84" class="alignnone size-full wp-image-58034" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD07.png 480w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD07-300x53.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>
<br />
<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD08.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD08.png" alt="" width="480" height="84" class="alignnone size-full wp-image-58031" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD08.png 480w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD08-300x53.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<h3>The NMF factorization</h3>
<p>
The SVD cannot return purely nonnegative factors, which makes interpretation of the factors difficult when the 
data are nonnegative. Researchers developed the NMF to address this issue. 
The NMF decomposes a nonnegative matrix into a product W*H, where the elements in W and H are constrained to be nonnegative.
The decomposition is not unique.
</p>
<p>
Let's use NMF to produce rank-2 factors of X. 
In SAS, you can obtain the NMF by using PROC NMF in SAS Viya, or by implementing the algorithm directly in SAS IML. 
This article computes the NMF by running an IML module that implements the Lee and Seung multiplicative update algorithm.
The factorization you get might depend on the way that you specify an "initial guess"  for the factors. 
Consequently, the result that I present is just one possible NMF factorization.
</p>
<p>
The NMF returns two rank-2 matrices, W and H, such that X &asymp; W * H. 
As with the SVD, the product is a rank-2 approximation to X. 
However, due to the algorithm and initialization that I used, the 
product of the NMF factors does not exactly equal X.
The NMF algorithm attempts to optimize a non-convex objective function. It can converge to local minima.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Run the Lee and Seung &quot;multiplicative update&quot; algorithm for NMF.
   The IML implementation can be downloaded from GitHub. */</span>
<span style="color: #000080; font-weight: bold;">run</span> nmf_mult<span style="color: #66cc66;">&#40;</span>W1, H1, <span style="color: #0000ff;">X</span>, <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Let's see whether the product W1*H1 is a reasonable approximation of X. */</span>
X_est = W1<span style="color: #006400; font-style: italic;">*H1;</span>
ods graphics / width=360px height=480px; 
<span style="color: #006400; font-style: italic;">/* Note: Dark red now represents ~4, not 2.2 */</span>
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>X_est<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;Reds</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">0.01</span> <span style="color: #2e8b57; font-weight: bold;">4.04</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;NMF Estimate of Matrix&quot;</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD09.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD09.png" alt="" width="360" height="480" class="alignnone size-full wp-image-58046" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD09.png 360w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD09-225x300.png 225w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<p>
According to the heat map, the product W1*H1 does a very good job approximating X. An obvious difference is seen for Patients 32 and 39. 
Whereas these patients actually have positive gene expression levels for genes 1-10, 
the rank-2 NMF approximation does not fully recover that fact.
</p>

<h3>Comparing the NMF factors</h3>
<p>
As explained in the <a href="https://blogs.sas.com/content/iml/2026/03/16/on-the-nonuniqueness-of-low-rank-matrix-factors.html">previous article</a>, there are two inherent ambiguities in the factors of the NMF:
</p>
<ol>
<li>The columns of W and the rows of H can be arbitrarily scaled, provided they are scaled reciprocally.</li>
<li>The columns of W and the rows of H can be arbitrarily permuted, provided they are permuted together.</li>
</ol>
<p>
To compare the factors from the NMF with the original factors, it is best that the colors represent the same scale. 
We could standardize the factors L and R, but since we have already visualized them, I will instead adjust the 
scaling of W and H so that their maximum elements are approximately equal to 2, 
which was the maximum value in the original L and R factors. This is not a standard way to standardize (pun intended!),
but it is useful for this example.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* The following statements rescale W and H to try to put all values in [0,2]. */</span>
maxVal = <span style="color: #2e8b57; font-weight: bold;">2</span>;
scale = maxVal / W1<span style="color: #66cc66;">&#91;</span>&lt;&gt;,<span style="color: #66cc66;">&#93;</span>;    <span style="color: #006400; font-style: italic;">/* scale each col by maximum, then multiply by maxVal */</span>
W = W1 # scale;              <span style="color: #006400; font-style: italic;">/* scale columns of W */</span>
H = H1 / scale`;             <span style="color: #006400; font-style: italic;">/* reciprocally scale rows of H */</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* Now, visualize the NMF factors, W and H, and compare with L and R from the original simulation. */</span>
ods graphics / width=640px height=70px; 
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>L`<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;BlueReds</span><span style="color: #66cc66;">&#125;</span> yvalues=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'Real 1'</span> <span style="color: #a020f0;">'Real 2'</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">2.2</span> <span style="color: #2e8b57; font-weight: bold;">2.2</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;Real Factor L`&quot;</span>;
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>W`<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff; font-weight: bold;">&amp;BlueReds</span><span style="color: #66cc66;">&#125;</span> yvalues=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'NMF 1'</span> <span style="color: #a020f0;">'NMF 2'</span><span style="color: #66cc66;">&#125;</span>   <span style="color: #0000ff;">range</span>=<span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">2.2</span> <span style="color: #2e8b57; font-weight: bold;">2.2</span><span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">title</span>=<span style="color: #a020f0;">&quot;NMF Factor W`&quot;</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD05.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD05.png" alt="" width="640" height="84" class="alignnone size-full wp-image-58019" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD05.png 640w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD05-300x39.png 300w" sizes="(max-width: 640px) 100vw, 640px" /></a>
<br />
<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD10.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD10.png" alt="" width="640" height="70" class="alignnone size-full wp-image-58055" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD10.png 640w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD10-300x33.png 300w" sizes="(max-width: 640px) 100vw, 640px" /></a>

<p>
For some data, you might need to permute the columns of the left NMF factor and the rows of the right factor before you can compare them to 
real factors.
(An easy way to compare: sort the columns of all left factors from largest to smallest by using the Euclidean norm.)
In this example, we got lucky and can immediately compare the NMF and real factors. 
The similarity between the left factors is impressive. The NMF factor captures the structure of the real factor except for Patients 32 and 39.
In this case, the NMF algorithm converged to a solution that is not the true factor. 
This happens frequently. One of the drawbacks of NMF is that the solution you get depend on an initial guess for the factors,
and there are many competing ideas about how to provide an initial guess.
</p><p>
The agreement between the right factors is also impressive, as shown below:
</p>

<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD07.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD07.png" alt="" width="480" height="84" class="alignnone size-full wp-image-58034" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD07.png 480w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD07-300x53.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>
<br />
<a href="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD11.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD11.png" alt="" width="480" height="84" class="alignnone size-full wp-image-58058" srcset="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD11.png 480w, https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD11-300x53.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
This output highlights two critical features of the NMF algorithm:
</p>
<ol>
<li>Both factors contain only nonnegative elements. It immediately follows that the product W*H must also be nonnegative.</li>
<li>The NMF decomposition does an excellent job of recovering the true underlying factors. 
If you compare the true factors with the NMF factors, you can see that the patterns are remarkably similar. 
The most obvious difference is that the first row of the transposed W factor does not correctly estimate 
the values for Patients 32 and 39. The NMF estimate indicates that these patients only exhibit 
the second etiology, whereas in reality, they exhibit both.
</li>
</ol>

<h3>Summary</h3>
<p>
This article uses SAS IML to present an example that is similar to Section 4.2 in Fogel, et al. (2013).
The example simulates a data matrix that represents the expression profiles of 30 target genes for 40 clinical study participants. 
A classical SVD analysis produces factors whose product perfectly equals X. 
The SVD uses deterministic linear algebra, and is guaranteed to find 
an optimal rank-k approximation.
However, the factors contain both positive and negative values 
because the SVD enforces orthogonality of the factors. 
</p>
<p>
In contrast, the NMF produces factors that contain strictly nonnegative values.
When the true underlying features of the data involve positive quantities whose effects are additive, 
the NMF factors are more natural and interpretable.
However, the NMF attempts to optimize a non-convex objective function. The algorithm's solution depends on how the 
algorithm is initialized. Some results might correspond to local minimum instead of a global one.
</p>
<p>
You can <a href="https://github.com/sascommunities/the-do-loop-blog/blob/master/NMF/nmf_vs_svd.sas">download the complete SAS program that I used to simulate the data, run the NMF algorithm, and visualize the results</a>.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/03/23/nmf-vs-svd.html">A comparison of SVD and NMF factorizations</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/03/23/nmf-vs-svd.html/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/03/NMFvsSVD09-150x150.png" />
	</item>
	</channel>
</rss>
