<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The DO Loop</title>
	<atom:link href="https://blogs.sas.com/content/iml/feed" rel="self" type="application/rss+xml" />
	<link>https://blogs.sas.com/content/iml/</link>
	<description>Statistical programming in SAS with an emphasis on SAS/IML programs</description>
	<lastBuildDate>Thu, 25 Jun 2026 17:11:03 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.6</generator>
	<item>
		<title>Confidence limits for counts in a multinomial distribution</title>
		<link>https://blogs.sas.com/content/iml/2026/06/29/confidence-limits-multinomial.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/06/29/confidence-limits-multinomial.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 29 Jun 2026 09:23:49 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Simulation]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=59246</guid>

					<description><![CDATA[<p>A recent article discussed Pareto charts and how to create them in SAS. A Pareto chart is used in quality control to identify the most likely reasons that a manufacturing process produces defective items. Suppose there are k different reasons why a process can fail, and you intend to randomly [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/29/confidence-limits-multinomial.html">Confidence limits for counts in a multinomial distribution</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A recent article discussed Pareto charts and how to create them in SAS.
A Pareto chart is used in quality control to identify the most likely reasons that 
a manufacturing process produces defective items.
Suppose there are <em>k</em> different reasons why a process can fail, and you intend to 
randomly select <em>N</em> defective items to examine.
The number of defective items for each reason can be modeled by using a <a href="https://blogs.sas.com/content/iml/2015/10/02/balls-and-urns2.html">multinomial distribution</a>.
</p><p>
A variation of the Pareto chart (<a href="https://www.tandfonline.com/doi/abs/10.1198/000313006X152243">Wilkinson, <em>TAS</em>, 2006</a>) uses
confidence intervals for multinomial counts
to enhance the information on the chart.
This article supplies the necessary background for 
how to estimate confidence intervals for multinomial counts. A subsequent article shows how to construct Wilkinson's Pareto chart in SAS.
</p>

<h3>The multinomial distribution</h3>
<p>
Suppose a categorical variable, X, can take on <em>k</em> different values.
A statistical model of the variable is to assume that X can be modeled by a multinomial distribution.
A multinomial distribution assumes that
the probability of the i_th value is <em>p<sub>i</sub></em>, where &Sigma;<sub>i</sub> <em>p<sub>i</sub></em> = 1.
</p><p>
In a sample of size <em>N</em>, there will be <em>N</em><sub>1</sub> items of the first type,
<em>N</em><sub>2</sub> items of the second type, and so forth, where &Sigma;<sub>i</sub> <em>N<sub>i</sub></em> = <em>N</em>. 
The <em>k</em>-tuple of counts
(<em>N</em><sub>1</sub>, <em>N</em><sub>2</sub>, ..., <em>N</em><sub><em>k</em></sub>) is a single observation from the multinomial distribution.  
</p><p>
The expected value of <em>N</em><sub><em>i</em></sub> is <em>N</em>*<em>p</em><sub><em>i</em></sub>,
but in an actual sample the count could be higher or lower than that value.
Confidence intervals enable you to answer the question, "what is the likely range of each count?"
</p><p>
A previous article describes <a href="https://blogs.sas.com/content/iml/2017/02/15/confidence-intervals-multinomial-proportions.html">how to construct simultaneous confidence intervals for the proportions</a>. But the Wilkinson variation of the Pareto chart
uses <em>marginal confidence limits</em>, which are different. 
Marginal confidence limits can be easily generated by running a simulation of the multinomial distribution and using percentiles of the results.
In SAS, the easiest way to simulate from the multinomial distribution is to use <a href="https://blogs.sas.com/content/iml/2013/08/05/simulate-from-multinomial-distribution.html">the RandMultinomial function in SAS IML</a>.
</p>


<h3>Simulating multinomial frequencies</h3>
<p>
Suppose a quality engineer samples 40 defective items. She finds that there are six causes for the defective items.
In the sample,  
<em>N</em><sub>1</sub>=17 items are defective because of the first cause, <em>N</em><sub>2</sub>=10 are defective because of the second cause, and the remaining causes 
are responsible for 5, 3, 3, and 2 counts, respectively.
The best estimate of the unknown probabilities is obtained by dividing the counts by <em>N</em>=40.
Thus, the empirical probabilities are 
p = {0.425, 0.25, 0.125, 0.075, 0.075, 0.05}.
</p><p>
The following call to PROC IML specifies these parameters and generates five random samples. Each sample is a random draw from the multinomial distribution.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* CI for marginal probability parameters in a multinomial distribution */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* observed frequencies */</span>
freq = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">17</span>, <span style="color: #2e8b57; font-weight: bold;">10</span>, <span style="color: #2e8b57; font-weight: bold;">5</span>, <span style="color: #2e8b57; font-weight: bold;">3</span>, <span style="color: #2e8b57; font-weight: bold;">3</span>, <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#125;</span>;
<span style="color: #0000ff;">N</span> = <span style="color: #0000ff;">sum</span><span style="color: #66cc66;">&#40;</span>freq<span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* sample size */</span>
p = freq / <span style="color: #0000ff;">N</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* 5 random samples of size N from Multinom(p) */</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">123</span><span style="color: #66cc66;">&#41;</span>;
X5 = randmultinomial<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">5</span>, <span style="color: #0000ff;">N</span>, p<span style="color: #66cc66;">&#41;</span>;
print X5<span style="color: #66cc66;">&#91;</span>c=<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'C1'</span>:<span style="color: #a020f0;">'C6'</span><span style="color: #66cc66;">&#41;</span> r=<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'S1'</span>:<span style="color: #a020f0;">'S5'</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img fetchpriority="high" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/multinomialCL1.png" alt="" width="251" height="250" class="alignnone size-full wp-image-59264" srcset="https://blogs.sas.com/content/iml/files/2026/06/multinomialCL1.png 251w, https://blogs.sas.com/content/iml/files/2026/06/multinomialCL1-150x150.png 150w" sizes="(max-width: 251px) 100vw, 251px" />

<p>
Each row shows the count in a random sample.
The i_th column shows the counts for the i_th defective cause across all five samples.
In this small simulation, the counts for the first cause range from 11-18, 
the counts for the second cause range from 7-11, and so forth.
Because the probabilities for the 4th-6th causes are relatively small, some samples did not contain any items
for those causes.
</p><p>
The small output gives us the intuition behind confidence intervals.
We'd like to know ranges that cover the counts for 95% of the samples.
You can do this by generating a large number of samples and calculating percentiles.
</p>


<h3>Confidence intervals for the multinomial counts</h3>
<p>
The previous output shows the range of counts for each variable for five random samples.
Now imagine generating many more random samples from the same multinomial distribution.
If you compute the 2.5th and 97.5th percentiles for each column, you obtain 
a simulation-based estimate of a 95% confidence interval for each column's values.
The following IML statements perform these calculations for a simulation that uses 10,000 random draws.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Use simulation to estimate (1-alpha)*100% confidence intervals 
   for counts for each category.
   - N = sample size
   - p = parameter vector for Multinomial(N, p) distribution
   - B is the number of Monte Carlo simulations
   - alpha = significance level (default 0.05) ==&gt; 95% CL   
*/</span>
start MultinomialCL<span style="color: #66cc66;">&#40;</span>B, <span style="color: #0000ff;">N</span>, p, alpha=<span style="color: #2e8b57; font-weight: bold;">0.05</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">X</span> = RandMultinomial<span style="color: #66cc66;">&#40;</span>B, <span style="color: #0000ff;">N</span>, p<span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* generate B random samples */</span>
   <span style="color: #006400; font-style: italic;">/* Find the (alpha / 2) and (1 - alpha / 2) percentiles.
      For alpha = 0.05, this estimates the 2.5th and 97.5th percentiles. */</span>
   w = <span style="color: #66cc66;">&#40;</span>alpha/<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span> // <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>-alpha/<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">call</span> qntl<span style="color: #66cc66;">&#40;</span>CL, <span style="color: #0000ff;">X</span>, w<span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* QNTL computes the quantiles of each column */</span>
   <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> CL <span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* call function and print the results in a table */</span>
limits = MultinomialCL<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">10000</span>, <span style="color: #0000ff;">N</span>, p<span style="color: #66cc66;">&#41;</span>;
Category = <span style="color: #a020f0;">'C1'</span>:<span style="color: #a020f0;">'C6'</span>;
print limits<span style="color: #66cc66;">&#91;</span>r=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'Lower95'</span> <span style="color: #a020f0;">'Upper95'</span><span style="color: #66cc66;">&#125;</span> c=Category L=<span style="color: #a020f0;">&quot;Marginal Limits (N=40)&quot;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/multinomialCL2.png" alt="" width="299" height="148" class="alignnone size-full wp-image-59261" srcset="https://blogs.sas.com/content/iml/files/2026/06/multinomialCL2.png 299w, https://blogs.sas.com/content/iml/files/2026/06/multinomialCL2-164x82.png 164w" sizes="(max-width: 299px) 100vw, 299px" />

<p>
The table shows 95% confidence limits 
for the counts from a multinomial sample of size <em>N</em>=40 that has a vector of probability parameters <em>p</em>.
For the most probable category (C1), you can expect the count to be between 11 and 23 for 95% of samples.
For the least probable category (C6), you can expect the count to be between 0 and 5.
</p>
<p>
Although the table provides all relevant information, I like to graph the confidence intervals as 
bars in a chart that graphs the observed frequency and confidence intervals versus the categories, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* graph the empirical frequencies and the marginal CLs */</span>
Lower95 = limits<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #66cc66;">&#93;</span>;
Upper95 = limits<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">2</span>,<span style="color: #66cc66;">&#93;</span>;
<span style="color: #0000ff;">create</span> MultinomialLimits <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'Category'</span> <span style="color: #a020f0;">'Freq'</span> <span style="color: #a020f0;">'Lower95'</span> <span style="color: #a020f0;">'Upper95'</span><span style="color: #66cc66;">&#125;</span>;
   append;
<span style="color: #0000ff;">close</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;95% Confidence Limits for Frequencies&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=MultinomialLimits noautolegend;
   highlow <span style="color: #0000ff;">x</span>=Category low=Lower95 high=Upper95 / 
           type=bar barwidth=<span style="color: #2e8b57; font-weight: bold;">0.8</span>;<span style="color: #006400; font-style: italic;">* primary=true;</span>
   scatter <span style="color: #0000ff;">x</span>=Category y=Freq;
   yaxis offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span> grid;
   xaxis type=discrete discreteorder=<span style="color: #000080; font-weight: bold;">data</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2026/06/multinomialCL3.png"><img decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/multinomialCL3.png" alt="" width="480" height="360" class="alignnone size-full wp-image-59267" srcset="https://blogs.sas.com/content/iml/files/2026/06/multinomialCL3.png 640w, https://blogs.sas.com/content/iml/files/2026/06/multinomialCL3-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
The graph enables you to see at a glance the ranges of counts for a sample of size <em>N</em> from the multinomial(N, p) distribution.  
</p>

<h3>Summary</h3>
<p>
The statistics of small random samples can be highly variable. This article looks at the counts in a multinomial distribution.
By running a simple simulation study, you can 
construct 95% confidence intervals for the counts. This is helpful for quality engineers who need to know whether an increase or decrease in an observed count is associated with a change in the manufacturing process or is merely random variation.
These confidence intervals are used in constructing Wilkinson's variation of the Pareto chart, which I will construct in a future article.
</p>

<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/29/confidence-limits-multinomial.html">Confidence limits for counts in a multinomial distribution</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/06/29/confidence-limits-multinomial.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/06/multinomialCL3-150x150.png" />
	</item>
		<item>
		<title>Sort the rows or columns of a matrix independently</title>
		<link>https://blogs.sas.com/content/iml/2026/06/24/sort-rows-cols-matrix.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/06/24/sort-rows-cols-matrix.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Wed, 24 Jun 2026 09:27:46 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[SAS Programming]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=59179</guid>

					<description><![CDATA[<p>When you have a data matrix, the rows represent observations and the columns represent variables. If you sort the matrix by one or more columns, the sorting occurs in a way that preserves the elements within rows. Although the rows of the sorted matrix are permuted by the sort, the [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/24/sort-rows-cols-matrix.html">Sort the rows or columns of a matrix independently</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
When you have a data matrix, the rows represent observations and the columns represent variables.
If you sort the matrix by one or more columns, the sorting occurs in a way that preserves the elements within rows. 
Although the rows of the sorted matrix are permuted by the sort, 
the sorted matrix is still a data matrix, but now the observations are in a different order.
</p><p>
However, there are instances in which a numerical matrix is merely a way to organize a bunch of numbers.
For example, in a simulation study, you might construct a matrix whose i_th column contains
the generated data that results from the i_th independent random sample.
In this instance, sorting each column of the matrix is useful in analyzing the sampling distribution of the 
quantiles. For example, if you sort all columns, then the first row of the sorted matrix contains the distribution of the minimum statistic,
and the last row contains the distribution of the maximum statistic.
</p>
<p>
This article shows how to sort the rows and columns of a matrix independently in the SAS IML matrix language.
</p>

<h3>A matrix from a simulation study</h3>
<p>
Suppose you want to run a simulation study to approximate the sampling distribution of quantiles in small,
normally distributed, random samples. You decide to  study samples of size N=25 and you want to graph the 
approximate distribution of the quantiles based on generating B samples from N(0,1).
</p><p>
In a matrix language like SAS IML, it is efficient to simulate all random samples by using a single 
call to the RANDFUN functions. You can fill an N&nbsp;x&nbsp;B matrix with N(0,1) variates and interpret each column as a random sample of size N.
In a real simulation study, you would choose a large number for B (such as 10,000), but for this article I will choose only B=200
samples because I want to show some visualizations by using a heat map.
The following SAS IML program generates the data for the simulation study:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* simulation study: If X ~ N(0,1), 
   what is the distribution of the sample quantiles? */</span>
<span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">25</span>;                      <span style="color: #006400; font-style: italic;">/* sample size */</span>
B = <span style="color: #2e8b57; font-weight: bold;">200</span>;                     <span style="color: #006400; font-style: italic;">/* number of simulations of X ~ N(0,1) */</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span><span style="color: #66cc66;">&#41;</span>;
M = randfun<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>//B, <span style="color: #a020f0;">&quot;Normal&quot;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* NxB matrix; each col is random sample */</span></pre></td></tr></table></div>




<p>
If you sort each column in the matrix, then it is easier to compute certain rank-based statistics and quantiles.
For example, in the sorted matrix, the first row will contain the minimum values for each sample.
Similarly, the last row contains the maximum values, and the 13th row contains the median values.
(You could also use <a href="https://blogs.sas.com/content/iml/2012/03/12/compute-sample-quantiles-by-using-the-qntl-call.html">the QNTL function</a>, which computes quantiles for each column in a matrix.)
</p>


<h3>Sorting each column of a matrix</h3>
<p>
If you use CALL SORT in IML (or the SORT procedure in Base SAS), the rows of the matrix are arranged in order according to 
one or more key columns. To sort the columns independently, you need to loop over the columns, sort each one, and then overwrite the column with the sorted version.
</p><p>
The following IML module, SortMat, can sort the columns or rows of the matrix in either ascending or descending order.
The first argument should be either "row" or "col". The second argument is the matrix to sort. The third argument is optional. If specified, use 0 to sort the rows or columns in ascending order and use 1 to sort in descending order.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* sort the columns of a matrix independently */</span>
start SortCols<span style="color: #66cc66;">&#40;</span>M, descend=<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;
  S = M;
  <span style="color: #0000ff;">if</span> ^descend <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;
     <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span>S<span style="color: #66cc66;">&#41;</span>;
        <span style="color: #0000ff;">x</span> = S<span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span>;
        <span style="color: #0000ff;">call</span> sort<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;        <span style="color: #006400; font-style: italic;">/* sort each column in ascending order */</span>
        S<span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> = <span style="color: #0000ff;">x</span>;
     <span style="color: #0000ff;">end</span>;
  <span style="color: #0000ff;">end</span>;
  <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">do</span>;
     <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span>S<span style="color: #66cc66;">&#41;</span>;
        <span style="color: #0000ff;">x</span> = S<span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span>;
        <span style="color: #0000ff;">call</span> sort<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* sort each column in descending order */</span>
        S<span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> = <span style="color: #0000ff;">x</span>;
     <span style="color: #0000ff;">end</span>;
  <span style="color: #0000ff;">end</span>;
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> S <span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* sort the rows or columns of a matrix independently.
   SYNTAX:
   A_row = SortMat(&quot;row&quot;, A &lt;,descend=0&gt;)
   A_col = SortMat(&quot;col&quot;, A &lt;,descend=0&gt;)
*/</span>
start SortMat<span style="color: #66cc66;">&#40;</span>direction, M, descend=<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;
  isRow = <span style="color: #66cc66;">&#40;</span>ksubstr<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">upcase</span><span style="color: #66cc66;">&#40;</span>direction<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span> = <span style="color: #a020f0;">&quot;ROW&quot;</span><span style="color: #66cc66;">&#41;</span>;
  <span style="color: #0000ff;">if</span> isRow <span style="color: #0000ff;">then</span> 
     <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> T<span style="color: #66cc66;">&#40;</span>SortCols<span style="color: #66cc66;">&#40;</span>M`, descend<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
  <span style="color: #0000ff;">else</span> 
     <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> SortCols<span style="color: #66cc66;">&#40;</span>M, descend<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
finish;
store module=<span style="color: #66cc66;">&#40;</span>SortCols SortMat<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* sort matrix by cols and make a heat map to visualize the result */</span>
t = <span style="color: #a020f0;">&quot;Matrix Sorted Ascending by Cols&quot;</span>;
A = SortMat<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;col&quot;</span>, M<span style="color: #66cc66;">&#41;</span>;
ods graphics / width=480px height=240px;
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>A<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #a020f0;">&quot;ThreeColor&quot;</span> displayoutlines=<span style="color: #2e8b57; font-weight: bold;">0</span> xaxistop=<span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #0000ff;">title</span>=t;
<span style="color: #006400; font-style: italic;">/* Note: To sort the columns in descending order, use 
   A = SortMat(&quot;col&quot;, M, 1);
*/</span></pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/06/SortRows1.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/SortRows1.png" alt="" width="480" height="240" class="alignnone size-full wp-image-59198" srcset="https://blogs.sas.com/content/iml/files/2026/06/SortRows1.png 480w, https://blogs.sas.com/content/iml/files/2026/06/SortRows1-300x150.png 300w, https://blogs.sas.com/content/iml/files/2026/06/SortRows1-164x82.png 164w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
The heat map visualizes the column-wise sorting. For each column, the values in the sorted matrix are 
stored in increasing order. Thus, the first row contains the smallest value in each sample, the second row contains the 
second smallest value, and so on, until the last row, which contains the largest value in each sample.
Equivalently, each column contains the order statistics for the sample.
</p>
<p>
If you want to analyze, for example, the distribution of the minimum values in the samples, 
you can extract the first row. The following statements extract the first row and create a histogram of the 
sampling distribution of the minimum statistic:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* The smallest values are in the first row of the sorted matrix */</span>
sample_min = A<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #66cc66;">&#93;</span>;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Distribution of the Sample Min for N(0,1) Data&quot;</span>;
<span style="color: #0000ff;">call</span> histogram<span style="color: #66cc66;">&#40;</span>sample_min<span style="color: #66cc66;">&#41;</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2026/06/SortRows2.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/SortRows2.png" alt="" width="480" height="240" class="alignnone size-full wp-image-59195" srcset="https://blogs.sas.com/content/iml/files/2026/06/SortRows2.png 480w, https://blogs.sas.com/content/iml/files/2026/06/SortRows2-300x150.png 300w, https://blogs.sas.com/content/iml/files/2026/06/SortRows2-164x82.png 164w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
The histogram shows the distribution of the minimum statistic. This distribution is <a href="https://blogs.sas.com/content/iml/2019/07/22/extreme-value-normal-data.html">one of the extreme-value distributions, known as the Gumbel distribution</a>. 
For a standardized normal sample of this size, it is common that the minimum value is near -2, but values less than -3 are possible.
</p>


<h3>Sorting each row of a matrix</h3>
<p>
In the same way, you can pass "row" as the first argument of the SortMat function to independently sort each row of a matrix:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Suppose you organize the simulation so that each row is a random sample. Then the
   simulated samples are the rows of M`. You can sort the rows by calling the SortMat(&quot;row&quot;,...) function. */</span>
M = M`;
t = <span style="color: #a020f0;">&quot;Matrix Sorted Ascending by Rows&quot;</span>;
B = SortMat<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;row&quot;</span>, M<span style="color: #66cc66;">&#41;</span>;
ods graphics / width=280px height=480px;
<span style="color: #0000ff;">call</span> heatmapcont<span style="color: #66cc66;">&#40;</span>B<span style="color: #66cc66;">&#41;</span> colorramp=<span style="color: #a020f0;">&quot;ThreeColor&quot;</span> displayoutlines=<span style="color: #2e8b57; font-weight: bold;">0</span> xaxistop=<span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #0000ff;">title</span>=t;
<span style="color: #006400; font-style: italic;">/* To sort in descending order, use
B = SortMat(&quot;row&quot;, M, 1);  */</span></pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/06/SortRows3-1.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/SortRows3-1.png" alt="" width="280" height="480" class="alignnone size-full wp-image-59237" srcset="https://blogs.sas.com/content/iml/files/2026/06/SortRows3-1.png 280w, https://blogs.sas.com/content/iml/files/2026/06/SortRows3-1-175x300.png 175w" sizes="(max-width: 280px) 100vw, 280px" /></a>


<p>
Instead of generating new data, I simply transposed the previous simulated matrix.
For the transposed matrix, each row is a random sample.
If you sort by rows, the columns contain the distributions of the quantiles.
The i_th column contains the i_th smallest value in each sample. 
</p>

<h3>Summary</h3>
<p>
This article creates a SAS IML function that can independently sort each column or each row of a matrix. 
One application is that you can easily analyze the distribution of order statistics, including minima, maxima,
and other quantiles. 
Another application is to <a href="https://blogs.sas.com/content/iml/2016/06/08/lasagna-plot-in-sas.html">sort the columns of a lasagna plot</a>, 
which can visualize changes to the distribution of a response variable over time.
</p>

<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/24/sort-rows-cols-matrix.html">Sort the rows or columns of a matrix independently</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/06/24/sort-rows-cols-matrix.html/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/06/SortRows1-150x150.png" />
	</item>
		<item>
		<title>Create two types of Pareto charts in SAS</title>
		<link>https://blogs.sas.com/content/iml/2026/06/22/pareto-charts-sas.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/06/22/pareto-charts-sas.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 22 Jun 2026 09:20:27 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=59116</guid>

					<description><![CDATA[<p>A Pareto chart is a popular chart for statistical quality control. It is often used to display the relative frequencies of issues that affect the quality of a manufacturing process. A bar chart displays the frequency of each issue that causes a defect. The bars are ordered by height: the [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/22/pareto-charts-sas.html">Create two types of Pareto charts in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A Pareto chart is a popular chart for statistical quality control.
It is often used to display the relative frequencies of issues that affect the 
quality of a manufacturing process. A bar chart displays the frequency of each issue that causes a defect. 
The bars are ordered by height:
the most common issues (tall bars) are displayed to the left of the chart, whereas
less common issues (short bars) are displayed to the right. 
In many cases, the prevalence of issues follows a <a href="https://blogs.sas.com/content/iml/2018/04/23/80-20-rule-for-blogs.html">Pareto principle</a>.
The classical Pareto principle is that 80% of the problems are caused by 20% of the issues. 
Therefore, the left side of a Pareto chart reveals which issue (or issues) should be addressed to have the most impact on the quality of
the process.
</p>
<a href="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart1.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart1.png" alt="" width="480" height="360" class="alignright size-full wp-image-59161" srcset="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart1.png 640w, https://blogs.sas.com/content/iml/files/2026/06/ParetoChart1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>


<p> 
There are two common variations of the Pareto plot.
The most familiar is a bar chart of frequencies that is overlaid with a cumulative frequency  curve.
This graph is shown to the right.
The other is <a href="https://blogs.sas.com/content/iml/2015/04/27/cascade-chart.html">a "cascade chart" (sometimes called a waterfall chart)</a>
that displays the cumulative frequencies as a staircase of bars. 
</p><p>
These charts are produced automatically by <a href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/qcug/qcug_pareto_overview.htm">the PARETO procedure in SAS/QC software</a>.
The PARETO procedure has many options for displaying Pareto charts. 
However, not every SAS user has a license for SAS/QC. This article shows how to create basic
Pareto charts by using Base SAS procedures and the SGPLOT procedure.
</p>

<h3>The data</h3>
<p>
To illustrate a Pareto chart, let's use a modification of data from the documentation of the PARETO procedure.
A random sample of 40 defective parts are examined, and the cause of each defect is recorded, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Modification of PROC PARETO example */</span>
<span style="color: #000080; font-weight: bold;">data</span> Failure;
<span style="color: #0000ff;">INFILE</span> DATALINES delimiter=<span style="color: #a020f0;">','</span>;
<span style="color: #0000ff;">length</span> Cause $ <span style="color: #2e8b57; font-weight: bold;">16</span>;
<span style="color: #0000ff;">label</span> Cause = <span style="color: #a020f0;">'Cause of Failure'</span>;
<span style="color: #0000ff;">input</span> Cause @@;
datalines;
Corrosion, Oxide Defect, Contamination, Oxide Defect
Oxide Defect, Oxide Defect, Contamination, Metallization
Oxide Defect, Contamination, Contamination, Oxide Defect
Contamination, Contamination, Contamination, Corrosion
Silicon Defect, Contamination, Contamination, Contamination
Contamination, Contamination, Doping, Oxide Defect
Oxide Defect, Metallization, Contamination, Corrosion
Silicon Defect, Contamination, Corrosion, Corrosion
Metallization, Oxide Defect, Contamination, Contamination
Oxide Defect, Doping, Doping, Contamination
;</pre></td></tr></table></div>




<p>
If you have a license for SAS/QC software, you can create a basic Pareto chart with an overlaid cumulative curve by using the PARETO procedure. So that you can easily reuse the SAS code for your own data, I have defined macro variables for the name of the data set and the name of the categorical variable.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">%let</span> <span style="color: #0000ff;">DSName</span> = Failure;
<span style="color: #0000ff;">%let</span> <span style="color: #0000ff;">VarName</span> = Cause;
<span style="color: #000080; font-weight: bold;">proc pareto</span> <span style="color: #000080; font-weight: bold;">data</span>=&amp;<span style="color: #0000ff;">DSName</span>;
   vbar &amp;<span style="color: #0000ff;">VarName</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<p>
The graph is shown at the top of this article. It shows that "Contamination" is responsible for 42.5% of the failures. 
The first three categories account for 80% of the defects, so those are the issues that quality engineers should strive to improve.
Whereas the classic "Pareto rule" is that 20% of the issues result in 80% of the defects, the cumulative curve 
shows that, for this process, 50% of the issues (the first three categories) are responsible for about 80% of the defects. 
</p>

<h3>A Pareto chart in Base SAS</h3>
<p>
If you do not have a license for SAS/QC software, you can still create a basic Pareto chart by using PROC FREQ to
compute the frequencies and PROC SGPLOT to display a sorted bar chart and a cumulative curve.
There are a few useful options to know about:
</p>
<ol>
<li>The PROC FREQ statement supports the ORDER=FREQ option, which sorts the frequencies in descending order.
</li>
<li>The TABLE statement supports the OUT= option, which you can use to 
create a data set that contains the data for the Pareto chart. Use the OUTCUM option to output the cumulative statistics.
</li>
<li>The SGPLOT procedure supports the VBAR and VLINE statements, which enables you to overlay a bar chart and a cumulative curve.
You can use the Y2AXIS option on the VLINE statement to display a separate axis for the cumulative scale.
</li>
</ol>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* create the Pareto charts by hand in SGPLOT */</span>
<span style="color: #000080; font-weight: bold;">proc freq</span> <span style="color: #000080; font-weight: bold;">data</span>=&amp;<span style="color: #0000ff;">DSName</span> <span style="color: #0000ff;">order</span>=Freq noprint;
   <span style="color: #0000ff;">where</span> <span style="color: #0000ff;">not</span> <span style="color: #0000ff;">missing</span><span style="color: #66cc66;">&#40;</span>&amp;<span style="color: #0000ff;">VarName</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">table</span> &amp;<span style="color: #0000ff;">VarName</span> / out=FreqOut outcum;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Standard Pareto Chart&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=FreqOut noautolegend;
   vbar &amp;<span style="color: #0000ff;">VarName</span> / response=Percent;
   xaxis type=discrete discreteorder=<span style="color: #000080; font-weight: bold;">data</span>;
   yaxis grid <span style="color: #0000ff;">min</span>=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">max</span>=<span style="color: #2e8b57; font-weight: bold;">100</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Percent&quot;</span>;
   <span style="color: #006400; font-style: italic;">/* overlay the cumulative percentage on the Y2 axis */</span>
   vline &amp;<span style="color: #0000ff;">VarName</span> /response=cum_Pct markers datalabel y2axis;
   y2axis <span style="color: #0000ff;">min</span>=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">max</span>=<span style="color: #2e8b57; font-weight: bold;">100</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Cumulative Percent&quot;</span>;
   <span style="color: #0000ff;">format</span> cum_Pct best4.;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart2.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart2.png" alt="" width="480" height="360" class="alignnone size-full wp-image-59158" srcset="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart2.png 640w, https://blogs.sas.com/content/iml/files/2026/06/ParetoChart2-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>


<p>
The output is similar to the previous Pareto chart.
I decided to add grid lines and to add markers to the cumulative curve. Otherwise,
the information in this chart is the same as the one produced by PROC PARETO.
</p>

<h3>Problem with the classical Pareto chart</h3>
<p>
The classical Pareto chart suffers from three problems <a href="https://www.tandfonline.com/doi/abs/10.1198/000313006X152243">(Wilkinson, TAS, 2006)</a>:
</p>
<ol>
<li>Dual scale: Mathematically, the cumulative curve is the integral (or sum) of the bar heights.
It can be confusing to plot both quantities in the same graph. The "dual scale" problem is even worse if you show the scale of the bars as counts instead of percentages, as is often done.
</li>
<li>Range: When there are many categories (or when no category is responsible for most of the defects),
the bars are short, whereas the cumulative curve will always range to 100%. 
In that situation, it is difficult to judge the height of the bars, which get squashed down to the bottom of the frame.
</li>
<li>Interpolation: It is wrong to use line segments to connect the cumulative frequencies.
It makes it seem like the cumulative probability is a continuous function that increases linearly between categories.
It is not. For each category, it is a single number. The number increases discontinuously. The
categories are discrete so there is nothing "between" them.  
</li>
<li>
Reference distribution: Implicitly, calling something a "Pareto chart" implies that the data follows a Pareto distribution.
In a Pareto distribution, the frequencies follow a power law. The assumption of an underlying power law 
justifies focusing on the issue that has the largest frequency. If the categories are equally likely
to occur, you might as well focus on the issues that are cheapest or quickest to resolve. 
</li>
</ol>


<h3>A Pareto cascade chart</h3>
<p>
An alternative to the classical Pareto chart addresses 
the first three problems. I call the revised chart <a href="https://blogs.sas.com/content/iml/2015/04/27/cascade-chart.html">a cascade chart</a> or a <em>cumulative Pareto chart</em>.
A bar chart places the base of each bar along a common baseline at zero.
In contrast, a cascade chart aligns the base of each bar with the
top of the preceding bar. This results in a staircase-shaped graph.
The height of each "step" is proportional to the relative frequency of each category.
The tops of the steps show the cumulative distribution of frequencies.
</p>
<p>
If you have access to the PARETO procedure, you can use the CHARTTYPE=CUMULATIVE option
to create a cascade chart, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc pareto</span> <span style="color: #000080; font-weight: bold;">data</span>=&amp;<span style="color: #0000ff;">DSName</span>;
   vbar &amp;<span style="color: #0000ff;">VarName</span> / charttype=cumulative;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart3.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart3.png" alt="" width="480" height="360" class="alignnone size-full wp-image-59155" srcset="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart3.png 640w, https://blogs.sas.com/content/iml/files/2026/06/ParetoChart3-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
In this graph, the tops of the "steps" show the cumulative distribution of frequencies.
The heights of the steps show each category's contributions.
</p>
<p>
You can also create the chart in Base SAS. First, run the PROC FREQ step that was shown earlier.
This creates the FreqOut data set. The following DATA step uses the LAG function to compute the 
base level for each step.
You can then use a high-low plot to display the staircase-shaped cascade chart:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* add upper and lower variables for the HIGHLOW plot */</span>
<span style="color: #000080; font-weight: bold;">data</span> CumPareto;
<span style="color: #0000ff;">set</span> FreqOut;
_Lower = <span style="color: #0000ff;">lag</span><span style="color: #66cc66;">&#40;</span>cum_Pct<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">if</span> <span style="color: #0000ff;">_N_</span> = <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #0000ff;">then</span> 
   _Lower = <span style="color: #2e8b57; font-weight: bold;">0</span>; 
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Cumulative Pareto Chart&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=CumPareto;
   highlow <span style="color: #0000ff;">x</span>=&amp;<span style="color: #0000ff;">VarName</span> low=_Lower high=cum_Pct / 
           type=bar barwidth=<span style="color: #2e8b57; font-weight: bold;">1</span> highlabel=Percent;
   yaxis offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span> grid;
   xaxis type=discrete discreteorder=<span style="color: #000080; font-weight: bold;">data</span>;
   <span style="color: #0000ff;">format</span> Percent best4.;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart4.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart4.png" alt="" width="480" height="360" class="alignnone size-full wp-image-59152" srcset="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart4.png 640w, https://blogs.sas.com/content/iml/files/2026/06/ParetoChart4-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
Again, I have added gridlines and labels to the bars, but otherwise the information is the same as the cumulative chart that is created by PROC PARETO.
This chart has only a single axis, which shows the cumulative probability. Short bars are no longer squashed at the bottom of the chart. By adding labels, you can see the size of the relative frequencies and the cumulative distribution on a common scale.
</p>
<p>
There is a third type of Pareto chart that addresses the problem of not being able to see a reference distribution. I will discuss that variation in a separate article.
</p>

<h3>Summary</h3>
<p>
In SAS, the PARETO procedure in SAS/QC software provides options to create many types of Pareto charts. 
If you do not have a license for SAS/QC software, this article shows how to use Base SAS to create two simple Pareto charts. You can use a Pareto chart to identify the most frequent categories. In practice,
the most frequent categories are analyzed by quality engineers in hopes that addressing those issues will lead to a major improvement in the quality of a manufacturing process.
</p>
<p>
To simplify the creation of basic Pareto charts, I have encapsulated the techniques in this article into two SAS macros, which are shown in the Appendix.
</p>

<h3>Appendix: SAS macros that create basic Pareto charts</h3>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Macros to create a basic Pareto chart, written by Rick Wicklin. For details, see
   https://blogs.sas.com/content/iml/2026/06/22/pareto-charts-sas.html
*/</span>
<span style="color: #0000ff;">%macro</span> StdPareto<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">DSName</span>, <span style="color: #0000ff;">VarName</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">proc freq</span> <span style="color: #000080; font-weight: bold;">data</span>=&amp;<span style="color: #0000ff;">DSName</span> <span style="color: #0000ff;">order</span>=Freq noprint;
   <span style="color: #0000ff;">where</span> <span style="color: #0000ff;">not</span> <span style="color: #0000ff;">missing</span><span style="color: #66cc66;">&#40;</span>&amp;<span style="color: #0000ff;">VarName</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">table</span> &amp;<span style="color: #0000ff;">VarName</span> / out=_FreqOut outcum;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=_FreqOut noautolegend;
   vbar &amp;<span style="color: #0000ff;">VarName</span> / response=Percent;
   xaxis type=discrete discreteorder=<span style="color: #000080; font-weight: bold;">data</span>;
   yaxis grid <span style="color: #0000ff;">min</span>=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">max</span>=<span style="color: #2e8b57; font-weight: bold;">100</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Percent&quot;</span>;
   <span style="color: #006400; font-style: italic;">/* overlay the cumulative percentage on the Y2 axis */</span>
   vline &amp;<span style="color: #0000ff;">VarName</span> /response=cum_Pct markers datalabel y2axis;
   y2axis <span style="color: #0000ff;">min</span>=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">max</span>=<span style="color: #2e8b57; font-weight: bold;">100</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Cumulative Percent&quot;</span>;
   <span style="color: #0000ff;">format</span> cum_Pct best4.;
<span style="color: #000080; font-weight: bold;">run</span>;
<span style="color: #0000ff;">%mend</span>;
&nbsp;
<span style="color: #0000ff;">%macro</span> CumPareto<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">DSName</span>, <span style="color: #0000ff;">VarName</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #000080; font-weight: bold;">proc freq</span> <span style="color: #000080; font-weight: bold;">data</span>=&amp;<span style="color: #0000ff;">DSName</span> <span style="color: #0000ff;">order</span>=Freq noprint;
      <span style="color: #0000ff;">where</span> <span style="color: #0000ff;">not</span> <span style="color: #0000ff;">missing</span><span style="color: #66cc66;">&#40;</span>&amp;<span style="color: #0000ff;">VarName</span><span style="color: #66cc66;">&#41;</span>;
      <span style="color: #0000ff;">table</span> &amp;<span style="color: #0000ff;">VarName</span> / out=_FreqOut outcum;
   <span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
   <span style="color: #006400; font-style: italic;">/* add upper and lower variables for the HIGHLOW plot */</span>
   <span style="color: #000080; font-weight: bold;">data</span> _CumPareto;
   <span style="color: #0000ff;">set</span> _FreqOut;
   _Lower = <span style="color: #0000ff;">lag</span><span style="color: #66cc66;">&#40;</span>cum_Pct<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">if</span> <span style="color: #0000ff;">_N_</span> = <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #0000ff;">then</span> 
      _Lower = <span style="color: #2e8b57; font-weight: bold;">0</span>; 
   <span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
   <span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=_CumPareto;
      highlow <span style="color: #0000ff;">x</span>=&amp;<span style="color: #0000ff;">VarName</span> low=_Lower high=cum_Pct / 
              type=bar barwidth=<span style="color: #2e8b57; font-weight: bold;">1</span> highlabel=Percent;
      yaxis  offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span> grid;
      xaxis type=discrete discreteorder=<span style="color: #000080; font-weight: bold;">data</span>;
      <span style="color: #0000ff;">format</span> Percent best4.;
   <span style="color: #000080; font-weight: bold;">run</span>;
<span style="color: #0000ff;">%mend</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* show how to call each macro. The categorical variable can be 
   numeric or character */</span>
<span style="color: #0000ff;">title</span>;
<span style="color: #006400; font-style: italic;">/* character variable */</span>
%StdPareto<span style="color: #66cc66;">&#40;</span>sashelp.cars, Type<span style="color: #66cc66;">&#41;</span>;
%CumPareto<span style="color: #66cc66;">&#40;</span>sashelp.cars, Type<span style="color: #66cc66;">&#41;</span>;
<span style="color: #006400; font-style: italic;">/* numeric variable */</span>
%StdPareto<span style="color: #66cc66;">&#40;</span>sashelp.cars, Cylinders<span style="color: #66cc66;">&#41;</span>;
%CumPareto<span style="color: #66cc66;">&#40;</span>sashelp.cars, Cylinders<span style="color: #66cc66;">&#41;</span>;</pre></td></tr></table></div>





<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/22/pareto-charts-sas.html">Create two types of Pareto charts in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/06/22/pareto-charts-sas.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/06/ParetoChart2-150x150.png" />
	</item>
		<item>
		<title>Create a confidence band for a normal Q-Q plot</title>
		<link>https://blogs.sas.com/content/iml/2026/06/15/confidence-band-qqplot.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/06/15/confidence-band-qqplot.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 15 Jun 2026 09:23:04 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=59026</guid>

					<description><![CDATA[<p>This article shows how to compute a confidence band for a Q-Q plot in SAS. A previous article shows how to construct confidence bands for the CDF of continuous univariate data. The bands can be added to a plot of the empirical CDF (ECDF) for the data. One of the [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/15/confidence-band-qqplot.html">Create a confidence band for a normal Q-Q plot</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
This article shows how to 
compute a confidence band for a Q-Q plot in SAS.
</p>
<p>
A previous article shows <a href="https://blogs.sas.com/content/iml/2026/06/08/confidence-bands-ecdf.html">how to construct confidence bands for the CDF of continuous univariate data</a>.
The bands can be added to a plot of the empirical CDF (ECDF) for the data.
One of the drawbacks of an ECDF plot is that the cumulative distribution is an S-shaped curve. If you display 
two S-shaped curves that represent cumulative distributions, it is difficult for the human eye to detect
differences between them, which is why statisticians plot histograms (estimates of the probability density) more often than 
ECDF curves (estimates of the cumulative distribution).
</p><p>
If you want to compare the data distribution to a theoretical parametric distribution (such as normal or Weibull), 
<a href="https://blogs.sas.com/content/iml/2011/10/28/modeling-the-distribution-of-data-create-a-qq-plot.html">the quantile-quantile (Q-Q) plot</a>
is a useful alternative to the CDF plot.
A Q-Q plot is a visual representation of a statistical goodness-of-fit test: it helps you assess whether the data are a 
random sample from a theoretical distribution. 
Data that are sampled from the specified 
parametric distribution appear to be linear in a Q-Q plot.
</p><p>
More correctly, they appear to be linear for most random samples. 
As discussed in a previous article, <a href="https://blogs.sas.com/content/iml/2016/11/23/sampling-variation-small-samples.html">a Q-Q plot might look nonlinear due to random variation</a>, especially for a small sample. 
Consequently, it can be useful to add a confidence band to the Q-Q plot. 
</p>

<h3>The relationship between the CDF plot and the Q-Q plot</h3>
<p>
It turns out that you can transform the ECDF and confidence bands from the previous article to "straighten out" the cumulative distributions.
The result is a Q-Q plot with a confidence band.
</p>
<p>
Recall that if X is distributed according to some distribution, F, then F(X) is uniformly distributed. 
And if U is uniformly distributed, then F<sup>-1</sup> is distributed according to F.
Details and an example are discussed in <a href="https://blogs.sas.com/content/iml/2021/06/23/probability-integral-transform.html">a previous article on the probability integral transformation</a>.
You can use this result to "straighten out" the ECDF and the confidence bands, provided that you choose a distribution 
to use for the transformation. I will transform the simple Kolmogorov band from the previous article, but you can apply this technique to any other confidence band.
</p>

<a href="https://blogs.sas.com/content/iml/files/2026/06/normalQuantile.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/normalQuantile.png" alt="" width="360" height="270" class="alignright size-full wp-image-59062" srcset="https://blogs.sas.com/content/iml/files/2026/06/normalQuantile.png 640w, https://blogs.sas.com/content/iml/files/2026/06/normalQuantile-300x225.png 300w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<p>
The most familiar distribution is the standard normal distribution. The standard cumulative distribution function is usually denoted as
F = &Phi;. In SAS, you can apply &Phi; by calling the function <code class="preserve-code-formatting">CDF(&quot;Normal&quot;, x)</code>.
You can apply the inverse transformation,  &Phi;<sup>-1</sup>, by calling the SAS function <code class="preserve-code-formatting">QUANTILE(&quot;Normal&quot;, p)</code>.
The inverse normal CDF transformation is shown to the right. Notice that the function is very steep when the probability is near 0 or 1.
This means that a small change in probability results in a very large change in the associated quantiles.
</p>

<h3>Constructing the Q-Q Plot Confidence Bands in SAS IML</h3>
<p>
For data, let's analyze <a href="https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html">the breaking strength data of fiber-optic cords</a> from previous posts. 
</p>
<p>
A previous article shows <a href="https://blogs.sas.com/content/iml/2011/10/28/modeling-the-distribution-of-data-create-a-qq-plot.html">how to construct a normal Q-Q plot in SAS</a>
by using the following steps:
</p>
<ol>
<li>
Sort the data.
</li>
<li>
Compute n evenly spaced points in the interval (0,1), where n is the number of data points in your sample.
SAS procedures often use Blom's formula, where the i_th point is <em>v<sub>i</sub> = (i - 0.375) / (n + 0.25)</em>. 
</li>
<li>
Compute the quantiles (inverse CDF) of the evenly spaced points. This gives you quantiles of the standard normal distribution.
</li>
<li>
Create a scatter plot of the sorted data versus the standard quantiles computed in Step 3.  It is traditional to put the data
on the vertical axis and the theoretical quantiles on the horizontal axis. This is in contrast to histograms and ECDF plots,
which place the data on the horizontal axis. 
</li>
If you want to add confidence bands to the Q-Q plot, you can transform confidence bands for the CDF.
The transformation maps probabilities in (0,1) into standardized quantiles.
</ol>

<p>
The following SAS IML program uses the <code class="preserve-code-formatting">ECDF</code> and <code class="preserve-code-formatting">ECDF_KSCL</code> modules
from earlier articles. 
For your convenience, you can <a href="https://github.com/sascommunities/the-do-loop-blog/blob/master/ECDF/ECDF_QQ.sas">download these functions from GitHub</a>.
The program compute Kolmogorov bands for the ECDF, transforms them into quantile scale, and writes the results to a data set for graphing. 
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Before running this program, STORE the ECDF and ECDF_KSCL modules and define the 
   Cord data set. See
   https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html
   https://blogs.sas.com/content/iml/2026/06/08/confidence-bands-ecdf.html
*/</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* In SAS Viya, the ECDF function is built into SAS IML and does not need to be loaded. */</span>
load module=<span style="color: #66cc66;">&#40;</span>ECDF ECDF_KSCL<span style="color: #66cc66;">&#41;</span>;  
&nbsp;
<span style="color: #006400; font-style: italic;">/* Read and sort the data */</span>
use Cord;  read all <span style="color: #0000ff;">var</span> <span style="color: #a020f0;">&quot;Strength&quot;</span> <span style="color: #0000ff;">into</span> <span style="color: #0000ff;">x</span>;  <span style="color: #0000ff;">close</span>;
<span style="color: #0000ff;">call</span> sort<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">n</span> = countn<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Compute theoretical plotting positions and map to standardized quantiles (Blom, 1958) */</span>
v = <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">n</span><span style="color: #66cc66;">&#41;</span> - <span style="color: #2e8b57; font-weight: bold;">0.375</span><span style="color: #66cc66;">&#41;</span> / <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">n</span> + <span style="color: #2e8b57; font-weight: bold;">0.25</span><span style="color: #66cc66;">&#41;</span>;  
q = quantile<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, v<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Compute the ECDF and the Kolmogorov probability bands */</span>
y = ECDF<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
KS_band = ECDF_KSCL<span style="color: #66cc66;">&#40;</span>y<span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* 95% CL */</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* Optionally, transform probability bands to quantile bounds.
   The domain of the QUANTILE function is the open interval (0,1),
   so clip the band values.
*/</span>
F_Lower = choose<span style="color: #66cc66;">&#40;</span>KS_band<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span> &gt; <span style="color: #2e8b57; font-weight: bold;">0</span>, KS_band<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span>, .<span style="color: #66cc66;">&#41;</span>;
F_Upper = choose<span style="color: #66cc66;">&#40;</span>KS_band<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span> &lt; <span style="color: #2e8b57; font-weight: bold;">1</span>, KS_band<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span>, .<span style="color: #66cc66;">&#41;</span>;
<span style="color: #006400; font-style: italic;">/* For a normal Q-Q plot, use the quantile of the normal distribution.
   Use the quantile function for other distributions (e.g., &quot;Exponential&quot;) to 
   construct confidence intervals for other Q-Q plots. */</span>
Q_Lower = quantile<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, F_Lower<span style="color: #66cc66;">&#41;</span>;
Q_Upper = quantile<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, F_Upper<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Create a dataset for plotting */</span>
<span style="color: #0000ff;">create</span> QQ_Bands <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;x&quot;</span> <span style="color: #a020f0;">&quot;q&quot;</span> <span style="color: #a020f0;">&quot;Q_Lower&quot;</span> <span style="color: #a020f0;">&quot;Q_Upper&quot;</span><span style="color: #66cc66;">&#125;</span>;
append;
<span style="color: #0000ff;">close</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Create the Q-Q plot. The data is plotted on the vertical axis.
   The theoretical quantiles are plotted on the horizontal axis. */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Normal Q-Q Plot with 95% Confidence Bands&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=QQ_Bands noautolegend;
   <span style="color: #006400; font-style: italic;">/* Draw the confidence bands */</span>
   series <span style="color: #0000ff;">x</span>=Q_Lower y=<span style="color: #0000ff;">x</span> / lineattrs=<span style="color: #66cc66;">&#40;</span>color=gray<span style="color: #66cc66;">&#41;</span>;
   series <span style="color: #0000ff;">x</span>=Q_Upper y=<span style="color: #0000ff;">x</span> / lineattrs=<span style="color: #66cc66;">&#40;</span>color=gray<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #006400; font-style: italic;">/* Overlay a scatter plot of the quantiles of the data vs the standard normal quantiles. */</span>
   scatter <span style="color: #0000ff;">x</span>=q y=<span style="color: #0000ff;">x</span>;
   xaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Theoretical Normal Quantiles&quot;</span> grid;
   yaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Sample Quantiles (Strength)&quot;</span> grid;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/06/QQConfidence2.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/QQConfidence2.png" alt="" width="480" height="360" class="alignnone size-full wp-image-59068" srcset="https://blogs.sas.com/content/iml/files/2026/06/QQConfidence2.png 640w, https://blogs.sas.com/content/iml/files/2026/06/QQConfidence2-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
For your convenience, I have also written <a href="https://github.com/sascommunities/the-do-loop-blog/blob/master/ECDF/ECDF_QQ.sas">an IML function (QQ_KSCL) that encapsulates the statements that generate the Q-Q plot and confidence bands</a>. 
</p>

<h3>Interpreting the Q-Q Plot Bands</h3>

<p>
Intuitively, 
the Q-Q plot and its confidence bands are a visual depiction of a hypothesis test.
The Q-Q plot indicates whether the data might be a random sample from a normal distribution.
If the scatter plot is approximately linear, then the assumption of normality seems reasonable.
</p><p>

But what does "approximately linear" mean? By how much can a Q-Q plot deviate from linearity without rejecting the assumption of normality?
The confidence bands indicate that the scatter plot can bend a little bit in the center
of the distribution and can bend more severely in the tails. 
It is said that George Box used to place a fat pencil on a Q-Q plot. 
If he could adjust the pencil so that it covered all points,
then he assumed that the data were approximately normal. The confidence band 
provides a similar visual indicator. If you can draw a straight line 
that sits entirely within the confidence bands, the data might be approximately normal. 
You can run a more formal analysis to test that hypothesis.
</p>
<p>
The widening of the confidence bands in the extreme quantiles explains why points in the tails of a Q-Q plot often deviate from the linear reference line or exhibit a slight bend. Without confidence bands, you might mistakenly conclude that these deviations are evidence of skewness, heavy tails, or non-normality. The confidence bands indicate that deviations in the tail are expected due to natural sampling variability. If the scatter points are inside the bands, there is insufficient evidence to reject the null hypothesis that the data comes from a normal distribution.
</p>

<h3>A normal Q-Q plot for non-normal data</h3>
<p>
Before concluding, let's take a quick look at a similar graph for data that are not normal. I generated a random sample of 50 data from an exponential distribution and ran the same SAS code. The Q-Q plot for the exponential data looks like the following graph:
</p>

<a href="https://blogs.sas.com/content/iml/files/2026/06/QQConfidence1.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/06/QQConfidence1.png" alt="" width="480" height="360" class="alignnone size-full wp-image-59059" srcset="https://blogs.sas.com/content/iml/files/2026/06/QQConfidence1.png 640w, https://blogs.sas.com/content/iml/files/2026/06/QQConfidence1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
If you apply Box's "fat pencil" test, it is clear that you cannot cover the scatter points with a pencil. 
You should conclude that the normal distribution is not a good fit for these data. 
</p>
<p>
The Kolmogorov bands in this plot are so wide that you 
might actually be able to draw a line that stays inside these confidence bands. However, the scatter points are not near that line.
A weakness of the Kolmogorov-Smirnov test for normality (and the Kolmogorov confidence bands) is low statistical power for small samples.
The Kolmogorov bounds themselves do not rule out the possibility of fitting a normal distribution to these data, but the nonlinear shape of the data points suggest that the fit is not good.
</p>

<h3>Summary</h3>

<p>
The standard normal quantile function is a transformation that maps probabilities onto quantiles. 
By transforming the non-parametric Kolmogorov bands from an ECDF, you can generate simultaneous confidence bands for a normal Q-Q plot. 
You could use a different quantile function to obtain Q-Q plots for other distributions.
</p>
<p>
For moderately sized samples, if you can draw a straight line that stays inside the bands, you might want to perform a formal 
test for the hypothesis that the data are normally distributed. The Kolmogorov bands have low power, so be careful using them for small samples.
</p>
<p>
You can <a href="https://github.com/sascommunities/the-do-loop-blog/blob/master/ECDF/ECDF_QQ.sas">download all functions from the ECDF and Q-Q plot articles</a>. You can also <a href="https://github.com/sascommunities/the-do-loop-blog/blob/master/ECDF/ECDF_QQ_examples.sas">download a program that creates all tables and graphs in this series of blog posts</a>.</p><p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/15/confidence-band-qqplot.html">Create a confidence band for a normal Q-Q plot</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/06/15/confidence-band-qqplot.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/06/QQConfidence2-150x150.png" />
	</item>
		<item>
		<title>How to add confidence bands on the ECDF plot in SAS</title>
		<link>https://blogs.sas.com/content/iml/2026/06/08/confidence-bands-ecdf.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/06/08/confidence-bands-ecdf.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 08 Jun 2026 09:24:17 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58927</guid>

					<description><![CDATA[<p>A previous article shows how to construct an empirical cumulative distribution function (ECDF) for univariate data by using PROC UNIVARIATE or in the SAS IML language. The ECDF is a tool for visualizing the distribution of a sample and is helpful for estimating quantiles in the data. In statistics, we [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/08/confidence-bands-ecdf.html">How to add confidence bands on the ECDF plot in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A previous article shows <a href="https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html">how to construct an empirical cumulative distribution function (ECDF)</a> for univariate data by using PROC UNIVARIATE or in the SAS IML language. The ECDF is a tool for visualizing the distribution of a sample and is helpful for estimating quantiles in the data.
</p>
<p>
In statistics, we assume that a sample is a random draw from an underlying distribution.
Therefore, <a href="https://blogs.sas.com/content/iml/2026/06/01/visualize-var-ecdf.html">an ECDF has a sampling variability</a>. 
As with other point estimates in statistics, you can 
construct a confidence band for the ECDF. The confidence band assumes that the samples are drawn 
randomly and independently from a population. If so, the true CDF of the population is entirely
within the confidence bands for 95% of random samples.
</p>
<p>
This article shows how to use SAS to create a Kolmogorov confidence band for the ECDF.
The Kolmogorov is nonparametric, which means it does not assume that the data is generated from some well-known parametric distribution such as Normal, exponential, beta, Weibull, and so forth.
</p>

<h3>The Kolmogorov confidence bands for an ECDF</h3>
<p>
One method for constructing confidence bands for an ECDF uses the Kolmogorov-Smirnov (K-S) test statistic, often denoted as <em>D</em>. 
As shown in a previous article, <a href="https://blogs.sas.com/content/iml/2019/05/15/kolmogorov-d-statistic.html">the K-S statistic measures the maximum absolute distance between the ECDF and the true CDF</a>. By using the asymptotic distribution of the K-S statistic, 
you can calculate a 95% confidence envelope. The mathematics behind building the confidence band is presented in 
<a href="https://online.stat.psu.edu/stat415/lesson/22/22.3">lecture notes from Pennsylvania State University</a>
and in <a href="https://www.stat.umn.edu/geyer/5601/examp/kolmogorov.html">lecture notes by Charles Geyer at U. MN.</a>
</p>
<p>
To construct the 95% K-S confidence bands, you need the critical value of the Kolmogorov distribution. As noted in <a href="https://blogs.sas.com/content/iml/2019/05/20/critical-values-kolmogorov-test.html">a previous article about critical values for the K-S statistic</a>, the two-sided 95% critical value for the asymptotic K-S distribution is approximately 1.36. (More precisely, 1.358099.)
The width of the confidence band at a point on the ECDF is the critical value divided by the square root of the sample size: D<sub>crit</sub>/&radic;<em>n</em>. Because probabilities cannot be less than 0 or greater than 1, the confidence bands are truncated at these values.
(Note: for small samples and for other significance levels, you can use <a href="https://blogs.sas.com/content/iml/2020/06/24/kolmogorov-d-distribution-exact.html">a SAS IML program to compute the exact critical values for any significance level</a>.)
</p>
<p>
The following SAS IML program calculates a 95% confidence band for the ECDF of the Strength variable on the Cord data set.
The program uses 
<a href="https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html">the <code class="preserve-code-formatting">ECDF</code> module from the previous post</a>, 
and the data from the Cord data set (N=50), which is defined in the same article. So, define the data set and STORE the 
ECDF function before running the following program.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Before running this program in SAS 9.4, define and STORE the ECDF function from 
   https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html 
   Also define the CORD data set from that blog post. */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
load module=<span style="color: #66cc66;">&#40;</span>ECDF<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* The asymptotic 95th percentile of the K-S distribution is 1.358099.
   See https://blogs.sas.com/content/iml/2019/05/20/critical-values-kolmogorov-test.html
   or the lecture notes from Charles J. Geyer at U. MN. at
   https://www.stat.umn.edu/geyer/5601/examp/kolmogorov.html
*/</span>
ks_crit = <span style="color: #2e8b57; font-weight: bold;">1.358099</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* read the data for the ECDF */</span>
use Cord;  read all <span style="color: #0000ff;">var</span> <span style="color: #a020f0;">&quot;Strength&quot;</span> <span style="color: #0000ff;">into</span> <span style="color: #0000ff;">x</span>;  <span style="color: #0000ff;">close</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* evaluate the ECDF at all data points */</span>
<span style="color: #0000ff;">call</span> sort<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
ecdf = ECDF<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">n</span> = countn<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* count the nonmissing values */</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* Compute the standard error for the K-S bands */</span>
SE_KS = ks_crit / <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">n</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* construct upper and lower 95% CI bands centered on the ECDF.
   The &gt;&lt; operator returns the minimum, ensuring the max value is 1.
   The &lt;&gt; operator returns the maximum, ensuring the min value is 0.
   See https://blogs.sas.com/content/iml/2026/02/04/clip-values.html */</span>
CL95_Upper = <span style="color: #66cc66;">&#40;</span>ecdf + SE_KS<span style="color: #66cc66;">&#41;</span> &gt;&lt; <span style="color: #2e8b57; font-weight: bold;">1</span>;
CL95_Lower = <span style="color: #66cc66;">&#40;</span>ecdf - SE_KS<span style="color: #66cc66;">&#41;</span> &lt;&gt; <span style="color: #2e8b57; font-weight: bold;">0</span>;
&nbsp;
<span style="color: #0000ff;">create</span> ECDF_bands <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;x&quot;</span> <span style="color: #a020f0;">&quot;ECDF&quot;</span> <span style="color: #a020f0;">&quot;CL95_Upper&quot;</span> <span style="color: #a020f0;">&quot;CL95_Lower&quot;</span> <span style="color: #66cc66;">&#125;</span>;
  append;
<span style="color: #0000ff;">close</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Graph the step functions by using PROC SGPLOT */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Empirical CDF and 95% Confidence Bands&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=ECDF_bands noautolegend;
   <span style="color: #0000ff;">label</span> <span style="color: #0000ff;">x</span>=<span style="color: #a020f0;">&quot;Breaking Strength (psi)&quot;</span> ECDF=<span style="color: #a020f0;">&quot;Cumulative Proportion&quot;</span>;
   step <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">X</span> y=ECDF / lineattrs=<span style="color: #66cc66;">&#40;</span>thickness=<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;
   step <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">X</span> y=CL95_Lower / lineattrs=<span style="color: #66cc66;">&#40;</span>color=gray<span style="color: #66cc66;">&#41;</span>;
   step <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">X</span> y=CL95_Upper / lineattrs=<span style="color: #66cc66;">&#40;</span>color=gray<span style="color: #66cc66;">&#41;</span>;
   xaxis grid <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;x&quot;</span>;
   yaxis grid <span style="color: #0000ff;">min</span>=<span style="color: #2e8b57; font-weight: bold;">0</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0.03</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/ECDF_CL1.png" alt="" width="480" height="360" class="alignnone size-full wp-image-59005" srcset="https://blogs.sas.com/content/iml/files/2026/05/ECDF_CL1.png 640w, https://blogs.sas.com/content/iml/files/2026/05/ECDF_CL1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" />

<p>
The SGPLOT procedure overlays three separate <code class="preserve-code-formatting">STEP</code> statements to display the ECDF, the 95% lower band, and the 95% upper band.
I did not use the BAND statement (which creates a filled region) because it is designed for continuous curves. You could use a POLYGON statement if you need to display a filled region.  I leave that as an exercise.
</p>

<h3>How to interpret the confidence limits</h3>

<p>
The K-S bands are <em>simultaneous</em> confidence bands, not <em>pointwise</em> confidence intervals. There is a big difference!
</p>
<p>
If these were pointwise intervals, a 95% interval at a specific value like <em>x</em> = 7 would mean, "the true CDF value at <em>x</em> = 7
is within the vertical range for 95% of random samples." 
However, the interpretation is different for simultaneous bands.
The bands indicate that, in 95% of random samples, the <em>entire</em> CDF curve for the population will lie between these upper and lower step functions for <em>all</em> values of <em>x</em>. This is a stronger statement, which is why the Kolmogorov bands are generally wide for small samples. The Cord data set has 50 observations, so wide bands.
</p>
<p>
Incidentally, it is 
straightforward to wrap the computation in this section into a SAS IML function that computes the confidence bands for any ECDF.
Furthermore, you can compute the bands for any confidence level, not just for the 95% level.
The Appendix defines a SAS IML function that computes Kolmogorov confidence bands for an ECDF for any confidence level.
</p>


<h3>Advantages and disadvantages of Kolmogorov bands</h3>

<p>
Some advantages of Kolmogorov bands include:
</p>
<ul>
<li>
<strong>Nonparametric:</strong> The Kolmogorov bands do not assume any specific distribution for the data. (For example, it is not assumed to be normal.)
Thus, the bands are very general.
</li>
<li>
<strong>Easy to Compute</strong>: The K-S bands are easy to compute. You merely need to add and subtract a term that depends on the sample size and the critical value of the Kolmogorov D statistic.
</li>
</ul>
<p>
Kolmogorov bands have a few statistical shortcomings:
</p>
<ul>
<li><strong>Constant Width:</strong> Before the limits are truncated at 0 and 1, the K-S bands have the same vertical width at the center of the data as they do in the tails. In reality, the variance of an ECDF approaches 0 at the tails (because the CDF must exactly reach 0 and 1). Consequently, K-S bands are overly conservative (too wide) in the tails.
</li>
<li><strong>Continuous Assumption:</strong> The critical values of the standard K-S test assume the underlying population distribution is perfectly continuous. If your data contains many tied values (perhaps due to rounding or limitations of the instruments used to measure the data), the bands will be conservative, meaning the true coverage probability might be higher than the nominal 95%.
</li>
</ul>


<h3>Summary</h3>
<p>
Plotting an ECDF is one way to summarize univariate data. Like any statistic, the ECDF has uncertainty due to random sampling.
You can use the asymptotic distribution of the Kolmogorov-Smirnov statistic to calculate simultaneous 95% confidence bands 
for the ECDF. Thus, you can overlay confidence bands on an ECDF. 
The K-S bands are somewhat conservative at the extremes of the data, but they are a classic, nonparametric, and easily interpretable 
way to visualize uncertainty in an ECDF.
</p>
<p>
The ECDF is a valuable tool for theoretical purposes and for ECDF-based goodness-of-fit tests.
However, when comparing the data to a parametric distribution, many practitioners prefer Q-Q plots.
Q-Q plots are preferred because it is easier to judge whether points fall on a straight line than it is to
see the difference between two S-shaped curves. You can use the inverse-CDF transformation to map the K-S bands
from the ECDF plot onto a Q-Q plot. 
I will demonstrate this transformation in a future article.
</p>



<h3>Appendix: A SAS IML function that computes Kolmogorov confidence bands for an ECDF</h3>
<p>
It is straightforward to wrap the computation in this article into a SAS IML function. The following program defines a function named ECDF_KSCL. It computes the K-S confidence bands for any confidence level in the range [0.75, 0.99], in increments of 0.01.  If you pass in a 
level such as 0.975, it will be rounded to the nearest 0.01.
The critical values for the K-S statistic were computed by using an IML program similar to the one in 
<a href="https://www.stat.umn.edu/geyer/5601/examp/kolmogorov.html">the lecture notes by Charles Geyer at U. MN.</a>
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* Compute the Kolmogorov-Smirnov confidence bands for a given ECDF.
   The CL parameter specifies the confidence level, which defaults to 0.95.
   The CL parameter should be in the range [0.75, 0.99]. */</span>
start ECDF_KSCL<span style="color: #66cc66;">&#40;</span>_ECDF, CL=<span style="color: #2e8b57; font-weight: bold;">0.95</span><span style="color: #66cc66;">&#41;</span>;
   ECDF = colvec<span style="color: #66cc66;">&#40;</span>_ECDF<span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* ensure column vector */</span>
   <span style="color: #0000ff;">n</span> = countn<span style="color: #66cc66;">&#40;</span>ECDF<span style="color: #66cc66;">&#41;</span>;          <span style="color: #006400; font-style: italic;">/* count the nonmissing values */</span>
   KS_band = j<span style="color: #66cc66;">&#40;</span>nrow<span style="color: #66cc66;">&#40;</span>ECDF<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">2</span>, .<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #006400; font-style: italic;">/* Enter a pre-tabulated set of critical values for confidence levels */</span>
   dAlpha = <span style="color: #2e8b57; font-weight: bold;">0.01</span>;
   alpha = <span style="color: #0000ff;">do</span><span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0.01</span>, <span style="color: #2e8b57; font-weight: bold;">0.25</span>, dAlpha<span style="color: #66cc66;">&#41;</span>;
   conf_level = <span style="color: #0000ff;">round</span><span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span> - alpha, dAlpha<span style="color: #66cc66;">&#41;</span>;
   ks_crit = <span style="color: #66cc66;">&#123;</span> <span style="color: #2e8b57; font-weight: bold;">1.6276236</span> <span style="color: #2e8b57; font-weight: bold;">1.517427</span> <span style="color: #2e8b57; font-weight: bold;">1.4490862</span> <span style="color: #2e8b57; font-weight: bold;">1.3985734</span> <span style="color: #2e8b57; font-weight: bold;">1.3580986</span> 
               <span style="color: #2e8b57; font-weight: bold;">1.3241093</span> <span style="color: #2e8b57; font-weight: bold;">1.2946745</span> <span style="color: #2e8b57; font-weight: bold;">1.2686236</span> <span style="color: #2e8b57; font-weight: bold;">1.2451911</span> <span style="color: #2e8b57; font-weight: bold;">1.2238479</span> 
               <span style="color: #2e8b57; font-weight: bold;">1.2042125</span> <span style="color: #2e8b57; font-weight: bold;">1.1860005</span> <span style="color: #2e8b57; font-weight: bold;">1.1689938</span> <span style="color: #2e8b57; font-weight: bold;">1.1530214</span> <span style="color: #2e8b57; font-weight: bold;">1.1379465</span> 
               <span style="color: #2e8b57; font-weight: bold;">1.1236583</span> <span style="color: #2e8b57; font-weight: bold;">1.110065</span> <span style="color: #2e8b57; font-weight: bold;">1.0970904</span> <span style="color: #2e8b57; font-weight: bold;">1.0846701</span> <span style="color: #2e8b57; font-weight: bold;">1.0727492</span> 
               <span style="color: #2e8b57; font-weight: bold;">1.0612805</span> <span style="color: #2e8b57; font-weight: bold;">1.0502232</span> <span style="color: #2e8b57; font-weight: bold;">1.0395418</span> <span style="color: #2e8b57; font-weight: bold;">1.0292048</span> <span style="color: #2e8b57; font-weight: bold;">1.0191847</span> <span style="color: #66cc66;">&#125;</span>;
   j = loc<span style="color: #66cc66;">&#40;</span> conf_level = <span style="color: #0000ff;">round</span><span style="color: #66cc66;">&#40;</span>CL, dAlpha<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">if</span> ncol<span style="color: #66cc66;">&#40;</span>j<span style="color: #66cc66;">&#41;</span>=<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;
      print <span style="color: #a020f0;">&quot;ERROR in ECDF_KSCL: The confidence level must be between 0.75 and 0.99&quot;</span>;
      <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> KS_band <span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">end</span>;
   <span style="color: #006400; font-style: italic;">/* Compute the standard error for the K-S bands */</span>
   SE_KS = ks_crit<span style="color: #66cc66;">&#91;</span>j<span style="color: #66cc66;">&#93;</span> / <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">n</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
   <span style="color: #006400; font-style: italic;">/* construct upper and lower 95% CI bands centered on the ECDF.
      The &gt;&lt; operator returns the minimum, ensuring max value is 1.
      The &lt;&gt; operator returns the maximum, ensuring min value is 0.
      See https://blogs.sas.com/content/iml/2026/02/04/clip-values.html */</span>
   CL_Upper = <span style="color: #66cc66;">&#40;</span>ECDF + SE_KS<span style="color: #66cc66;">&#41;</span> &gt;&lt; <span style="color: #2e8b57; font-weight: bold;">1</span>;
   CL_Lower = <span style="color: #66cc66;">&#40;</span>ECDF - SE_KS<span style="color: #66cc66;">&#41;</span> &lt;&gt; <span style="color: #2e8b57; font-weight: bold;">0</span>;
   KS_band = CL_Lower || CL_Upper;
   <span style="color: #0000ff;">return</span> KS_Band;
finish;
store module=<span style="color: #66cc66;">&#40;</span>ECDF_KSCL<span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Test the ECDF_KSCL function */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
load module=<span style="color: #66cc66;">&#40;</span>ECDF ECDF_KSCL<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* read the data for the ECDF */</span>
use Cord;  read all <span style="color: #0000ff;">var</span> <span style="color: #a020f0;">&quot;Strength&quot;</span> <span style="color: #0000ff;">into</span> <span style="color: #0000ff;">x</span>;  <span style="color: #0000ff;">close</span>;
<span style="color: #006400; font-style: italic;">/* evaluate the ECDF at all data points */</span>
<span style="color: #0000ff;">call</span> sort<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
ecdf = ECDF<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
CL = ECDF_KSCL<span style="color: #66cc66;">&#40;</span>ecdf, <span style="color: #2e8b57; font-weight: bold;">0.90</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* get bands for any confidence levels in [0.75, 0.99] */</span>
CL_Lower = CL<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span>;
CL_Upper = CL<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span>;
<span style="color: #0000ff;">create</span> ECDF_bands <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;x&quot;</span> <span style="color: #a020f0;">&quot;ECDF&quot;</span> <span style="color: #a020f0;">&quot;CL_Lower&quot;</span> <span style="color: #a020f0;">&quot;CL_Upper&quot;</span> <span style="color: #66cc66;">&#125;</span>;
  append;
<span style="color: #0000ff;">close</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Graph the step functions by using PROC SGPLOT */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Empirical CDF and 90% Confidence Bands&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=ECDF_bands noautolegend;
   <span style="color: #0000ff;">label</span> <span style="color: #0000ff;">x</span>=<span style="color: #a020f0;">&quot;Breaking Strength (psi)&quot;</span> ECDF=<span style="color: #a020f0;">&quot;Cumulative Proportion&quot;</span>;
   step <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">X</span> y=ECDF / lineattrs=<span style="color: #66cc66;">&#40;</span>thickness=<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;
   step <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">X</span> y=CL_Lower / lineattrs=<span style="color: #66cc66;">&#40;</span>color=gray<span style="color: #66cc66;">&#41;</span>;
   step <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">X</span> y=CL_Upper / lineattrs=<span style="color: #66cc66;">&#40;</span>color=gray<span style="color: #66cc66;">&#41;</span>;
   xaxis grid <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;x&quot;</span>;
   yaxis grid <span style="color: #0000ff;">min</span>=<span style="color: #2e8b57; font-weight: bold;">0</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0.03</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>


<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/08/confidence-bands-ecdf.html">How to add confidence bands on the ECDF plot in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/06/08/confidence-bands-ecdf.html/feed</wfw:commentRss>
			<slash:comments>8</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/05/ECDF_CL1-150x150.png" />
	</item>
		<item>
		<title>How to visualize the variability in an ECDF</title>
		<link>https://blogs.sas.com/content/iml/2026/06/01/visualize-var-ecdf.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/06/01/visualize-var-ecdf.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 01 Jun 2026 09:20:29 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Simulation]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=42784</guid>

					<description><![CDATA[<p>A previous article shows how to construct an empirical cumulative distribution function (ECDF) in SAS. The most common way is by using PROC UNIVARIATE, but you can also call the ECDF function in the SAS IML language. The ECDF is a tool for visualizing the distribution of a univariate sample [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/01/visualize-var-ecdf.html">How to visualize the variability in an ECDF</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A previous article shows <a href="https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html">how to construct an empirical cumulative distribution function (ECDF) in SAS</a>.
The most common way is by using PROC UNIVARIATE, but you can also call the ECDF function in the SAS IML language. The ECDF is a tool for visualizing the distribution of a univariate sample and is helpful for estimating quantiles in the data.
</p>
<p>
In statistics, we assume that a sample is a random draw from an underlying distribution.
Each random sample has a different ECDF. Therefore, you can think of an ECDF as a "point estimate" of the underlying cumulative distribution. 
If you were to draw a different sample of the same size, you would obtain a different ECDF function.
Assuming that the data came from the same data-generating process (that is, the same probability distribution)
an interesting question is: How much variability should you expect in a random ECDF?
</p><p>
I like to answer questions like this by using a simulation approach. 
A simulation builds intuition with a minimal amount of math.
After understanding the main ideas and concepts, the math usually becomes easier to understand.
</p>
<p>
This article creates a simulation in SAS that enables you to visualize the variability for the ECDF of a random normal sample.
</p>

<h3>How much should you expect an ECDF to vary?</h3>
<p>
This section uses simulation to generate many normal samples. For each sample, you can compute the ECDF plot and display them in a single graph. 
This visualizes the concept of variability of the ECDF for a random sample. 
<a href="https://blogs.sas.com/content/iml/2016/11/23/sampling-variation-small-samples.html">To see variability in histograms and Q-Q plots, you can 
read a previous article</a>. 
</p><p>
A previous article showed how to compute the ECDF of a real data set containing the breaking strength of fiber-optic cords. 
That data appears to be approximately normally distributed with mean &mu; = 7 and standard deviation &sigma; = 0.13.
</p><p>
Let's simulate 1,000 random samples from N(7, 0.13) and compute the ECDF for each. Let's then plot the ECDFs and overlay two curves:
the true normal CDF and the ECDF for the real data. 
To do this, you can use the ECDF function in SAS IML. The ECDF function is natively supported in SAS Viya, or you can load <a href="https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html">the SAS 9.4 function that I created in the last article</a>:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Before running this program in SAS 9.4, define and STORE the ECDF function from 
   https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
load module=<span style="color: #66cc66;">&#40;</span>ECDF<span style="color: #66cc66;">&#41;</span>;
<span style="color: #006400; font-style: italic;">/* One way to visualize the variability in the ECDF estimate is 
   to run a simulation.
   Repeat B times:
     Generate a random sample of size 50 from N(mu, sigma)
     Compute the ECDF
   Then plot the B ECDFs from the simulated samples and overlay 
   the ECDF for the data. The following carries out that plan.
*/</span>
mu = <span style="color: #2e8b57; font-weight: bold;">7</span>;
sigma = <span style="color: #2e8b57; font-weight: bold;">0.13</span>;
NSamples = <span style="color: #2e8b57; font-weight: bold;">1000</span>;
NObs = <span style="color: #2e8b57; font-weight: bold;">50</span>;
<span style="color: #0000ff;">X</span> = j<span style="color: #66cc66;">&#40;</span>NObs, NSamples, .<span style="color: #66cc66;">&#41;</span>;     <span style="color: #006400; font-style: italic;">/* each column of X is a random sample */</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">call</span> randgen<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span>, <span style="color: #a020f0;">&quot;Normal&quot;</span>, mu, sigma<span style="color: #66cc66;">&#41;</span>;
&nbsp;
ECDF = j<span style="color: #66cc66;">&#40;</span>NObs, NSamples, .<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to NSamples;
  ECDF<span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> = ecdf<span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* the i_th column is an ECDF of X[,i] */</span>
<span style="color: #0000ff;">end</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* the CREATE/APPEND statement outputs data row-wise, so transpose.
   Now the i_th row X[i,] is a sample and ECDF[i,] is the ECDF. */</span>
<span style="color: #0000ff;">X</span> = <span style="color: #0000ff;">X</span>`;
ECDF = ECDF`;
<span style="color: #0000ff;">create</span> ECDF_Sim <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;X&quot;</span> <span style="color: #a020f0;">&quot;ECDF&quot;</span><span style="color: #66cc66;">&#125;</span>;  append;  <span style="color: #0000ff;">close</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* create some overlays */</span>
<span style="color: #006400; font-style: italic;">/* First, overlay the parent probability distribution */</span>
t = <span style="color: #0000ff;">do</span><span style="color: #66cc66;">&#40;</span>mu - <span style="color: #2e8b57; font-weight: bold;">3.5</span><span style="color: #006400; font-style: italic;">*sigma, mu + 3.5*sigma, sigma/100);</span>
<span style="color: #0000ff;">CDF</span> = <span style="color: #0000ff;">cdf</span><span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, t, mu, sigma<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">create</span> <span style="color: #0000ff;">CDF</span> <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;t&quot;</span> <span style="color: #a020f0;">&quot;CDF&quot;</span><span style="color: #66cc66;">&#125;</span>;  append;  <span style="color: #0000ff;">close</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Next, read data, compute ECDF, write to data set */</span>
use Cord;  read all <span style="color: #0000ff;">var</span> <span style="color: #a020f0;">&quot;Strength&quot;</span>;  <span style="color: #0000ff;">close</span>;
<span style="color: #0000ff;">call</span> sort<span style="color: #66cc66;">&#40;</span>Strength<span style="color: #66cc66;">&#41;</span>;
ecdf_data = ECDF<span style="color: #66cc66;">&#40;</span>Strength<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">create</span> ECDF_Data <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;Strength&quot;</span> <span style="color: #a020f0;">&quot;ECDF_Data&quot;</span><span style="color: #66cc66;">&#125;</span>;  append;  <span style="color: #0000ff;">close</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* merge the ECDF, the simulated ECDFs, and the CDF from N(mu, sigma) */</span> 
<span style="color: #000080; font-weight: bold;">data</span> ECDF_All;
<span style="color: #0000ff;">set</span> ECDF_Sim <span style="color: #0000ff;">CDF</span> ECDF_Data;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* overlay the 1,000 simulated ECDFs and the real ECDF */</span> 
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;1,000 Random Normal ECDFs&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=ECDF_All noautolegend;
   <span style="color: #0000ff;">label</span> <span style="color: #0000ff;">x</span>=<span style="color: #a020f0;">&quot;Breaking Strength (psi)&quot;</span> ECDF=<span style="color: #a020f0;">&quot;Cumulative Proportion&quot;</span>;
   scatter <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">X</span> y=ECDF / transparency=<span style="color: #2e8b57; font-weight: bold;">0.98</span>;
   series <span style="color: #0000ff;">x</span>=t y=<span style="color: #0000ff;">CDF</span> / lineattrs=<span style="color: #66cc66;">&#40;</span>color=black thickness=<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;
   step <span style="color: #0000ff;">x</span>=Strength y=ECDF_Data / lineattrs=<span style="color: #66cc66;">&#40;</span>color=DarkRed thickness=<span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span>;
   xaxis grid <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;x&quot;</span>;
   yaxis grid;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/05/CDFPlot6.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/CDFPlot6.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58969" srcset="https://blogs.sas.com/content/iml/files/2026/05/CDFPlot6.png 640w, https://blogs.sas.com/content/iml/files/2026/05/CDFPlot6-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
Let's understand this graph. The blue point cloud contains points for 1,000 ECDFs from random samples of size N=50 from a normal distribution. 
(Technically, I should have overlaid 1,000 step functions, but plotting the points with high transparency creates a cleaner "density cloud" effect.) 
On top of the cloud, I overlaid a black curve, which is the true CDF of the population distribution. 
I also overlaid a dark red step function, which is the ECDF of the Strength variable in the Cord data set.
This ECDF seems to fit inside the blue cloud, which indicates that you cannot rule out that the Strength data 
came from an N(7, 0.13) distribution.
</p>
<p>
You can use this visual cloud to answer two related questions about sampling variability:
</p>
<ul>
<li><strong>Given a specific value X, what is its likely percentile?</strong>
You can estimate the answer by drawing a vertical line on the graph. The intersection of that line with the vertical spread of the blue cloud indicates the variability of the quantiles. For example, if you draw a vertical line at the mean X=7, the quantile range spans approximately [0.35, 0.65]. This indicates that the value 7 in any given random sample of size 50 could reasonably fall anywhere between the 35th and 65th percentiles.
</li>
<li><strong>Given a specific percentile, what is the likely range of X values?</strong>
You can estimate this by drawing a horizontal line. The intersection of the horizontal line and the horizontal width of the blue cloud indicates the variability of the X values for that exact quantile. For example, if you draw a line for the median (q=0.5), the intersection with the cloud appears to span X values roughly in the range [6.95, 7.05]. This indicates where the sample median is likely to fall.
</li>
</ul>

<p>
Notice that the dark red ECDF from the Cord data fits almost entirely inside the densest part of the blue cloud. 
This suggests that it is reasonable to hypothesize that the Strength data came from an N(7, 0.13) distribution.
You might want to run a formal statistical test for this hypothesis. 
You can use PROC UNIVARIATE to run several ECDF-based goodness-of-fit tests, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=Cord <span style="color: #0000ff;">normal</span>;
  cdfplot Strength / <span style="color: #0000ff;">normal</span><span style="color: #66cc66;">&#40;</span>mu=<span style="color: #2e8b57; font-weight: bold;">7</span> sigma=<span style="color: #2e8b57; font-weight: bold;">0.13</span><span style="color: #66cc66;">&#41;</span>;
  ods <span style="color: #0000ff;">select</span> TestsForNormality CDFPlot;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/CDF_plot7.png" alt="" width="399" height="177" class="alignnone size-full wp-image-58966" srcset="https://blogs.sas.com/content/iml/files/2026/05/CDF_plot7.png 399w, https://blogs.sas.com/content/iml/files/2026/05/CDF_plot7-300x133.png 300w" sizes="(max-width: 399px) 100vw, 399px" />

<p>
The output includes the TestsForNormality table, which shows tests statistics and p-values for several tests for normality.
All tests fail to reject the null hypothesis, which tells us that the data are consistent with N(7, 0.13).
</p>

<h3>Summary</h3>
<p>
Statistics on small random samples have large variability. Graphical plots of the data (such as histograms, Q-Q plots, and ECDF plots)
will naturally vary from sample to sample. 
The simulation approach in this article visualizes a region where an ECDF is likely to be for a random sample.
If your observed data falls within this region, it is plausible the data came from the hypothetical distribution. 
This intuitive idea motivates looking at adding confidence intervals to Q-Q plots and ECDF plots, which is a topic for a future article.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/06/01/visualize-var-ecdf.html">How to visualize the variability in an ECDF</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/06/01/visualize-var-ecdf.html/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/05/CDFPlot6-150x150.png" />
	</item>
		<item>
		<title>Create and graph the ECDF function in SAS</title>
		<link>https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Tue, 26 May 2026 09:20:26 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58879</guid>

					<description><![CDATA[<p>The empirical cumulative distribution function (ECDF) is an important tool in statistics. It is one of several plots you can use to visualize the shape of a data distribution. It is not used as often as the histogram, the kernel density estimate, and the box plot, but it is essential [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html">Create and graph the ECDF function in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
The empirical cumulative distribution function (ECDF) is an important tool in statistics. 
It is one of 
<a href="https://blogs.sas.com/content/iml/2026/01/07/9-dataviz-distrib-sas-part1.html">several plots you can use to 
visualize the shape of a data distribution</a>.
It is not used as often as the histogram, the kernel density estimate, and the box plot, 
but it is essential to visualize ECDF-based hypothesis tests that assess whether the data might have come from a specified distribution. 
</p>
<p>
This article shows how to construct and plot an ECDF in SAS. The easy way is to use PROC UNIVARIATE. 
However, for certain data analysis applications, it is useful to be able to call a function in the SAS IML matrix language that manually evaluates the ECDF.
<a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_067/casimllang/casimllang_common_sect104.htm">The ECDF function was added to the SAS IML language</a> in the 2025.01 release of SAS Viya. For customers who are still on SAS 9.4, this article provides a similar ECDF function that you can use.
</p>

<h3>What is an ECDF?</h3>
<p>
Given a set of univariate data <em>x</em><sub>1</sub> &le; <em>x</em><sub>2</sub> &le; ... &le; <em>x</em><sub>n</sub>, the empirical distribution function (ECDF) is the step function defined by:
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<em>F(t)</em> = (number of data values &le; <em>t</em>) / <em>n</em>
</p>

<p>
The ECDF is a piecewise-constant function that is flat on each half-open interval [<em>x<sub>i</sub></em>, <em>x<sub>i+1</sub></em>). 
At each unique data value, it increases by an amount proportional to the frequency of data at that value.
</p>


<h3>The ECDF in PROC UNIVARIATE</h3>
<p>
SAS provides the CDFPLOT statement in PROC UNIVARIATE, which computes and displays the ECDF and optionally overlays the CDF of a specified probability distribution. For example, the PROC UNIVARIATE documentation includes data about the breaking strength (in PSI) for 50  fiber-optic cords: 
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">data</span> Cord;
   <span style="color: #0000ff;">label</span> Strength=<span style="color: #a020f0;">&quot;Breaking Strength (psi)&quot;</span>;
   <span style="color: #0000ff;">input</span> Strength @@;
datalines;
<span style="color: #2e8b57; font-weight: bold;">6.94</span> <span style="color: #2e8b57; font-weight: bold;">6.97</span> <span style="color: #2e8b57; font-weight: bold;">7.11</span> <span style="color: #2e8b57; font-weight: bold;">6.95</span> <span style="color: #2e8b57; font-weight: bold;">7.12</span> <span style="color: #2e8b57; font-weight: bold;">6.70</span> <span style="color: #2e8b57; font-weight: bold;">7.13</span> <span style="color: #2e8b57; font-weight: bold;">7.34</span> <span style="color: #2e8b57; font-weight: bold;">6.90</span> <span style="color: #2e8b57; font-weight: bold;">6.83</span>
<span style="color: #2e8b57; font-weight: bold;">7.06</span> <span style="color: #2e8b57; font-weight: bold;">6.89</span> <span style="color: #2e8b57; font-weight: bold;">7.28</span> <span style="color: #2e8b57; font-weight: bold;">6.93</span> <span style="color: #2e8b57; font-weight: bold;">7.05</span> <span style="color: #2e8b57; font-weight: bold;">7.00</span> <span style="color: #2e8b57; font-weight: bold;">7.04</span> <span style="color: #2e8b57; font-weight: bold;">7.21</span> <span style="color: #2e8b57; font-weight: bold;">7.08</span> <span style="color: #2e8b57; font-weight: bold;">7.01</span>
<span style="color: #2e8b57; font-weight: bold;">7.05</span> <span style="color: #2e8b57; font-weight: bold;">7.11</span> <span style="color: #2e8b57; font-weight: bold;">7.03</span> <span style="color: #2e8b57; font-weight: bold;">6.98</span> <span style="color: #2e8b57; font-weight: bold;">7.04</span> <span style="color: #2e8b57; font-weight: bold;">7.08</span> <span style="color: #2e8b57; font-weight: bold;">6.87</span> <span style="color: #2e8b57; font-weight: bold;">6.81</span> <span style="color: #2e8b57; font-weight: bold;">7.11</span> <span style="color: #2e8b57; font-weight: bold;">6.74</span>
<span style="color: #2e8b57; font-weight: bold;">6.95</span> <span style="color: #2e8b57; font-weight: bold;">7.05</span> <span style="color: #2e8b57; font-weight: bold;">6.98</span> <span style="color: #2e8b57; font-weight: bold;">6.94</span> <span style="color: #2e8b57; font-weight: bold;">7.06</span> <span style="color: #2e8b57; font-weight: bold;">7.12</span> <span style="color: #2e8b57; font-weight: bold;">7.19</span> <span style="color: #2e8b57; font-weight: bold;">7.12</span> <span style="color: #2e8b57; font-weight: bold;">7.01</span> <span style="color: #2e8b57; font-weight: bold;">6.84</span>
<span style="color: #2e8b57; font-weight: bold;">6.91</span> <span style="color: #2e8b57; font-weight: bold;">6.89</span> <span style="color: #2e8b57; font-weight: bold;">7.23</span> <span style="color: #2e8b57; font-weight: bold;">6.98</span> <span style="color: #2e8b57; font-weight: bold;">6.93</span> <span style="color: #2e8b57; font-weight: bold;">6.83</span> <span style="color: #2e8b57; font-weight: bold;">6.99</span> <span style="color: #2e8b57; font-weight: bold;">7.00</span> <span style="color: #2e8b57; font-weight: bold;">6.97</span> <span style="color: #2e8b57; font-weight: bold;">7.01</span>
;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">'Cumulative Distribution Function of Breaking Strength'</span>;
<span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=Cord noprint;
   cdfplot Strength / odstitle = <span style="color: #0000ff;">title</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2026/05/CDFplot1.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/CDFplot1.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58894" srcset="https://blogs.sas.com/content/iml/files/2026/05/CDFplot1.png 640w, https://blogs.sas.com/content/iml/files/2026/05/CDFplot1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
From the plot, you can see that there is about a 50% chance that a random fiber-optic cord will break if subjected to 7.0 PSI of stress.
If the stress is kept below 6.9 PSI, there is only a 20% chance of random cord will break.
The graph can be improved to answer questions like these by 
using the STATREF= option on the CDFPLOT statement to add vertical reference lines at the data value that correspond to arbitrary percentiles.
For example, the option <code class="preserve-code-formatting">statref=P 10 Q1 Q2 Q3 P 90</code> add vertical lines at the 10th, 25th, 50th, 75th, and 90th percentiles of the data.
</p>
<p>
In addition to displaying the ECDF, the CDFPLOT statement enables you to overlay the theoretical CDF of common probability distributions. 
This is useful for visualizing ECDF-based hypothesis tests.
For example, the Kolmogorov-Smirnov (K-S) test for normality uses the largest vertical gap between the ECDF and the hypothesized CDF to 
construct a test statistic. If you add the <code class="preserve-code-formatting">NORMAL</code> option to the CDFPLOT statement, you can visualize the K-S test and other ECDF-based goodness-of-fit tests
for normality.
</p>


<h3>How to construct an ECDF manually</h3>
<p>
When conducting a simulation study, a bootstrap analysis, or exploring ECDF-based statistics, it is convenient to compute ECDF values in SAS IML.
To construct the ECDF manually, do the following:
</p>
<ul>
<li>Find the unique values in the data and the number of observations for each unique value. In SAS IML, you can use the 
<code class="preserve-code-formatting">TABULATE</code> function for that. 
</li>
<li>Since the ECDF is a piecewise constant function that steps up at these locations, 
use these unique values as the endpoints of bins. You can use the 
<code class="preserve-code-formatting">BIN</code> function for that. 
</li>
<li>For each bin, associate the value of the ECDF function at the left endpoint of the bin. 
Since the ECDF is a piecewise constant function that increases at these locations, this tells you the values of the ECDF for any point in any bin.
</li>
<li>For points that are outside of any bin, the ECDF is either 0 or 1.
 If <em>t</em> is less than the smallest data value, <em>ECDF(t)</em> = 0. If <em>t</em> is greater than or equal to the largest data value, then <em>ECDF(t)</em> = 1.</li>
</ul>

<p>
In SAS Viya, the ECDF function is supported natively by the SAS IML language. For customers who are still on SAS 9.4, the following function manually implements the basic ECDF algorithm:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* Compute the empirical distribution function (ECDF) if a column vector of data.
   Optionally, evaluate the ECDF at values in a column vector, t. 
   If t is not specified, use t=x. 
   SYNTAX:
     y = ECDF(x); 
     t = T( do( min(x), max(x), (max(x)-min(x))/100 ) );
     y = ECDF(x, t); 
*/</span>
start ECDF<span style="color: #66cc66;">&#40;</span>_x, _t=<span style="color: #66cc66;">&#41;</span>;
    <span style="color: #0000ff;">x</span> = colvec<span style="color: #66cc66;">&#40;</span>_x<span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* ensure a column vector */</span>
    <span style="color: #006400; font-style: italic;">/* Tabulate the unique values of x.
       levels = unique sorted elements of x
       freq = number of duplicates for each level */</span>
    <span style="color: #0000ff;">call</span> tabulate<span style="color: #66cc66;">&#40;</span>levels, freq, <span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
    levels = colvec<span style="color: #66cc66;">&#40;</span>levels<span style="color: #66cc66;">&#41;</span>;
    freq = colvec<span style="color: #66cc66;">&#40;</span>freq<span style="color: #66cc66;">&#41;</span>;
    y = cusum<span style="color: #66cc66;">&#40;</span>freq<span style="color: #66cc66;">&#41;</span> / <span style="color: #0000ff;">sum</span><span style="color: #66cc66;">&#40;</span>freq<span style="color: #66cc66;">&#41;</span>;
&nbsp;
    <span style="color: #0000ff;">if</span> isSkipped<span style="color: #66cc66;">&#40;</span>_t<span style="color: #66cc66;">&#41;</span> <span style="color: #0000ff;">then</span>
       t = <span style="color: #0000ff;">x</span>;
    <span style="color: #0000ff;">else</span>
       t = colvec<span style="color: #66cc66;">&#40;</span>_t<span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* ensure column vector */</span>
    <span style="color: #006400; font-style: italic;">/* build a step function by assigning t to different bins, using levels as endpoints */</span>
    bins = bin<span style="color: #66cc66;">&#40;</span>t, levels<span style="color: #66cc66;">&#41;</span>;
&nbsp;
    <span style="color: #006400; font-style: italic;">/* assign ECDF[i] for nonmissing bins */</span>
    ecdf = j<span style="color: #66cc66;">&#40;</span>nrow<span style="color: #66cc66;">&#40;</span>t<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, .<span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* initialize ECDF to missing */</span>
    idx = loc<span style="color: #66cc66;">&#40;</span>bins ^= .<span style="color: #66cc66;">&#41;</span>;
    <span style="color: #0000ff;">if</span> ncol<span style="color: #66cc66;">&#40;</span>idx<span style="color: #66cc66;">&#41;</span> &gt; <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">then</span>
        ecdf<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span> = y<span style="color: #66cc66;">&#91;</span> bins<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#93;</span>;
&nbsp;
    <span style="color: #006400; font-style: italic;">/* Three reasons bins[i] could be a missing value:
       1. if t[i] is missing. No need to change ECDF[i] b/c it was initialized to missing.
       2. if t[i] &lt; min(x). Set ECDF(t[i]) = 0.
       3. if t[i] &gt; max(x). Set ECDF(t[i]) = 1.
    */</span>
    LT_idx = loc<span style="color: #66cc66;">&#40;</span>t^=. &amp; t&lt;<span style="color: #0000ff;">min</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
    <span style="color: #0000ff;">if</span> ncol<span style="color: #66cc66;">&#40;</span>LT_idx<span style="color: #66cc66;">&#41;</span> &gt; <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">then</span> 
       ecdf<span style="color: #66cc66;">&#91;</span>LT_idx<span style="color: #66cc66;">&#93;</span> = <span style="color: #2e8b57; font-weight: bold;">0</span>;
    GT_idx = loc<span style="color: #66cc66;">&#40;</span>t^=. &amp; t&gt;=<span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
    <span style="color: #0000ff;">if</span> ncol<span style="color: #66cc66;">&#40;</span>GT_idx<span style="color: #66cc66;">&#41;</span> &gt; <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">then</span> 
       ecdf<span style="color: #66cc66;">&#91;</span>GT_idx<span style="color: #66cc66;">&#93;</span> = <span style="color: #2e8b57; font-weight: bold;">1</span>;
    <span style="color: #0000ff;">return</span> ecdf;
finish;
store module=<span style="color: #66cc66;">&#40;</span>ECDF<span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;</pre></td></tr></table></div>




<p>
Let's call this function on the same data that was used by PROC UNIVARIATE earlier in this article.
Suppose you want to estimate the probability that the breaking strength of a fiber optic cord is less than certain specified test values.
You can read the Cord data set and call the ECDF function, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
load module=<span style="color: #66cc66;">&#40;</span>ECDF<span style="color: #66cc66;">&#41;</span>;
&nbsp;
use Cord;
read all <span style="color: #0000ff;">var</span> <span style="color: #a020f0;">&quot;Strength&quot;</span> <span style="color: #0000ff;">into</span> <span style="color: #0000ff;">x</span>;
<span style="color: #0000ff;">close</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* estimate the probability that the breaking strength is less than 
   6.8, 6.9, 7.0, and 7.1 psi. Note that some of these values 
   are not observed data values.
*/</span>
t = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">6.8</span>, <span style="color: #2e8b57; font-weight: bold;">6.9</span>, <span style="color: #2e8b57; font-weight: bold;">7.0</span>, <span style="color: #2e8b57; font-weight: bold;">7.1</span><span style="color: #66cc66;">&#125;</span>;
ecdf_t = ECDF<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span>, t<span style="color: #66cc66;">&#41;</span>;
print t ecdf_t;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/CDFplot2.png" alt="" width="93" height="147" class="alignnone size-full wp-image-58900" />

<p>
The output estimates the probability that a random cord will break when subjected to the specified stresses.
About 4% will break if the stress is 6.8 PSI, whereas about 76% will break is the stress is as large as 7.1 PSI.
</p>
<p>
Notice that some of the specified locations are not data values. For each specified value,
the ECDF returns an estimate of the  probability that a random fiber-optic cord will break when subjected to 
a stress of that magnitude.
</p>

<h3>Graph the ECDF</h3>
<p>
Typically, an ECDF plot evaluates the ECDF function at the observed data values rather than arbitrary points. As discussed in a previous article, <a href="https://blogs.sas.com/content/iml/2016/09/06/graph-step-function-sas.html">it is best to use the STEP statement in PROC SGPLOT</a>
to visualize an ECDF function. 
</p>

<p>
The following IML statements sort the data, evaluate the ECDF at the data locations, and write the resulting calculations to a SAS data set.
The SGPLOT procedure then visualizes the ECDF.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* typically, an ECDF plot evaluates the ECDF at all data points */</span>
<span style="color: #0000ff;">call</span> sort<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
ecdf = ECDF<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #0000ff;">create</span> ECDF <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;x&quot;</span> <span style="color: #a020f0;">&quot;ECDF&quot;</span><span style="color: #66cc66;">&#125;</span>; append; <span style="color: #0000ff;">close</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Graph a step function. See
   https://blogs.sas.com/content/iml/2016/09/06/graph-step-function-sas.html */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Empirical CDF&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=ECDF noautolegend;
   <span style="color: #0000ff;">label</span> <span style="color: #0000ff;">x</span>=<span style="color: #a020f0;">&quot;Breaking Strength (psi)&quot;</span> ECDF=<span style="color: #a020f0;">&quot;Cumulative Proportion&quot;</span>;
   step <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">x</span> y=ecdf;
   fringe <span style="color: #0000ff;">x</span>;
   xaxis grid <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;x&quot;</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0.05</span> offsetmax=<span style="color: #2e8b57; font-weight: bold;">0.05</span>;
   yaxis grid <span style="color: #0000ff;">min</span>=<span style="color: #2e8b57; font-weight: bold;">0</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0.03</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2026/05/CDFplot3.png"><img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/CDFplot3.png" alt="" width="480" height="360" class="alignnone size-full wp-image-58903" srcset="https://blogs.sas.com/content/iml/files/2026/05/CDFplot3.png 640w, https://blogs.sas.com/content/iml/files/2026/05/CDFplot3-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
The resulting graph is similar to the one produced by PROC UNIVARIATE earlier. However, by using PROC SGPLOT, 
you have more freedom to enhance the plot with features such as a fringe plot, custom reference lines, grid lines, and so on.
The grid lines enable you to quickly estimate that about 50% of random cords will creak if subjected to a stress of 7.0 PSI.
</p>

<h3>Summary</h3>

<p>
The empirical CDF (ECDF) is a fundamental nonparametric tool for exploring data and comparing univariate distributions. 
In Base SAS, PROC UNIVARIATE provides a quick and easy way to generate these plots automatically.
In addition, it can be useful to be able to evaluate the ECDF function programmatically.
The ECDF function was added to SAS IML in SAS Viya in Release 2025.01.
For SAS 9.4 customers, this article provides a similar function.
</p>


<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html">Create and graph the ECDF function in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/05/26/create-ecdf.html/feed</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2026/05/CDFplot3-150x150.png" />
	</item>
		<item>
		<title>Compute the orthant probability for the 4-D multivariate normal distribution</title>
		<link>https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 18 May 2026 09:21:20 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58755</guid>

					<description><![CDATA[<p>This is the last article in a series about computing probabilities for the multivariate normal distribution. A previous article shows how to compute the probability of a standardized multivariate normal (MVN) distribution in an orthant for dimensions k=1, 2, and 3 by using the Childs-Sun formulas. The formulas are exact [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html">Compute the orthant probability for the 4-D multivariate normal distribution</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
This is the last article in a series about computing probabilities for the multivariate normal distribution.
A previous article shows <a href="https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html">how to compute the probability of a standardized multivariate normal (MVN) distribution in an orthant for dimensions k=1, 2, and 3 by using the Childs-Sun formulas.</a> 
The formulas are exact and are expressed in terms of the arcsine of the correlation coefficients of a general correlation matrix. 
For example, if X~MVN(0, R) is a standard trivariate normal random variable, the probability that X is in the first orthant 
{X | X1 &gt; 0, X2 &gt; 0, X3 &gt; 0}
is given by the formula:
<br />
<span class='MathJax_Preview'>\(P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)\)</span><script type='math/tex'>P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)</script>
<br />
where R<sub>12</sub>, R<sub>13</sub>, and R<sub>23</sub> are the off-diagonal elements of a general 3x3 correlation matrix. 
</p>
<p>
You might hope that higher-dimensional Childs-Sun formulas are similarly simple.
Alas, they are not!
For the higher-dimensional formulas, the probabilities are defined recursively.
As the dimension increases, the number of terms explodes combinatorially.
This article shows a SAS IML module that implements the Childs-Sun formula for the orthant probability of a 4-D standardized normal distribution.
After implementing the 4-D case, I have no desire to attempt higher dimensions!
</p>

<h3>The Childs-Sun formula for dimension 4</h3>
<p>
Appendix 1 shows a snippet from p. 12 of Genz and Bretz (2009), which describes the Childs-Sun formula for an MVN orthant probability. 
The formula relies on the Childs-Sun reduction method (Childs, 1967; Sun, 1988). 
Similar to 
the lower-dimensional formulas, the 4-D formula contains a sum of arcsines of the correlation coefficients.
However, the formula adds additional terms that involve an integral, I<sub>4</sub>. 
The integrands are complicated expressions of the off-diagonal elements of a 4x4 correlation matrix. 
</p><p>
In 4-D, there are 16 orthants. As discussed in a previous article, if you can find the probability in the "first orthant" (the one where all components have positive signs), you can use a clever "reflection trick" to find the probability in the other 15 octants.
</p><p>
In Genz and Bretz (2009), the Childs-Sun formula for the orthant probability, which they call P<sub>4</sub>, has an unfortunate typo.
The correct formula is shown in Appendix 1, where the red "delete" symbol removes an erroneous "plus sign."
It took me many hours to decipher the notation. By consulting other references (Gehrlein and Saniga (1977); Sun and Asano (1990)), 
and with some assistance from the Google Gemini AI, I was able to decipher the formulas and create a 
SAS IML function that evaluates the orthant probability. The IML function is presented in this article. 
</p>

<h3>Conditional probability and the geometry of the formula</h3>
<p>
I usually explain the mathematical details of my SAS programs, but I'm not going to do that today. 
The formula is too complicated to describe in a short blog post.
Some details are in <a href="https://doi.org/10.1093/biomet/54.1-2.293">the original Childs (1967) paper</a>, 
which is unfortunately behind a paywall. 
</p><p>
The general formula (see Appendix 1) is very complicated. But if you restrict to k=4 dimensions (hence, n=2), the formula 
looks something like this:
<br />
<span class='MathJax_Preview'>\(P_4 = \frac{1}{16} + \frac{1}{8\pi} \sum_{i < j}^4 \sin^{-1}(R_{ij}) + \frac{1}{4\pi^2} I_4(R)\)</span><script type='math/tex'>P_4 = \frac{1}{16} + \frac{1}{8\pi} \sum_{i < j}^4 \sin^{-1}(R_{ij}) + \frac{1}{4\pi^2} I_4(R)</script>
<br />
where I<sub>4</sub> is a complicated sum of one-variable integrals. Each integral 
depends on a lower-dimensional integration, called I<sub>2</sub>, which is why this is a recursive formula.
The notation is explained further in Appendix 2.
</p><p>

The one-variable integrals are the key.
In general, to evaluate the probability of a 4-D distribution, you must solve a 4-D volume integral. 
The Childs-Sun method uses the geometry of the multivariate normal distribution to reduce this volume to a sum of
iterated integrals that accumulate lower-dimensional slices. 
</p>
<p>
This formula is an application of <em>conditional probability</em>. 
The dummy integration variable, <em>x</em>, is a continuous variable on [0,1]. 
For a fixed value of <em>x</em>, the term R<sup>1i</sup>(x) evaluates to a conditional correlation matrix. 
The variables inside the integrand (e.g., the quadratic denominator w = 1 - R<sub>1i</sub><sup>2</sup> x<sup>2</sup>) 
are <a href="https://online.stat.psu.edu/stat505/Lesson06">conditional variances and partial correlations</a>.
</p>
<p>
The I<sub>2</sub> function evaluates a 2-D orthant probability (which has an exact arcsine solution) for the remaining two variables, given that the other variables are fixed at a slice determined by <em>x</em>. You can use the QUAD subroutine in SAS IML to integrate over <em>x</em> in [0,1], 
which aggregates these 2-D conditional probabilities to sweep out the exact probability mass of the 4-D volume.
</p>

<h3>Implementing the P4 formula in SAS IML</h3>
<p>
The main SAS IML function is called P4_childs, which computes the 4-D orthant probability. 
It calls the QUAD subroutine to compute I<sub>4</sub>. 
The integrand is ProbInt4, which builds terms from the off-diagonal 
elements of the 4x4 correlation matrix. The DO loop evaluates three different conditional correlation matrices. 
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* The base case: P2. For a scalar correlation, rho, returning the orthant probability, P2.
   See https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html */</span>
start P2<span style="color: #66cc66;">&#40;</span>rho<span style="color: #66cc66;">&#41;</span>;
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span> <span style="color: #2e8b57; font-weight: bold;">0.25</span> + <span style="color: #0000ff;">arsin</span><span style="color: #66cc66;">&#40;</span>rho<span style="color: #66cc66;">&#41;</span> / <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #006400; font-style: italic;">*constant(&quot;pi&quot;)) );</span>
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* The sum of the I4 integrands. Instead of evaluating the three integrals separately,
   evaluate the sum of the integrals and perform one numerical integration.   
   Evaluates the Sun (1988) residual partial correlation matrix.
   Whereas Genz/Bretz Eqn 2.8-2.9 uses a general covariance matrix, this function assumes 
   R is a correlation matrix, so R[i,i]=1  */</span>
start ProbInt4<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>R_global<span style="color: #66cc66;">&#41;</span>;
  R = R_global;
  pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;pi&quot;</span><span style="color: #66cc66;">&#41;</span>;
  x2 = <span style="color: #0000ff;">x</span> # <span style="color: #0000ff;">x</span>;
  <span style="color: #0000ff;">sum</span> = <span style="color: #2e8b57; font-weight: bold;">0</span>;
&nbsp;
  <span style="color: #006400; font-style: italic;">/* The sum of three terms: i=2, 3, and 4 */</span>
  <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">2</span> to <span style="color: #2e8b57; font-weight: bold;">4</span>;
    <span style="color: #006400; font-style: italic;">/* Determine the indices (j, k) of the two remaining variables */</span>
    <span style="color: #0000ff;">if</span> i=<span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;      j=<span style="color: #2e8b57; font-weight: bold;">3</span>; k=<span style="color: #2e8b57; font-weight: bold;">4</span>; <span style="color: #0000ff;">end</span>;
    <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">if</span> i=<span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>; j=<span style="color: #2e8b57; font-weight: bold;">2</span>; k=<span style="color: #2e8b57; font-weight: bold;">4</span>; <span style="color: #0000ff;">end</span>;
    <span style="color: #0000ff;">else</span> <span style="color: #0000ff;">do</span>;             j=<span style="color: #2e8b57; font-weight: bold;">2</span>; k=<span style="color: #2e8b57; font-weight: bold;">3</span>; <span style="color: #0000ff;">end</span>;
&nbsp;
    <span style="color: #006400; font-style: italic;">/* R_ii term: The quadratic denominator */</span>
    w = <span style="color: #2e8b57; font-weight: bold;">1</span> - R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,i<span style="color: #66cc66;">&#93;</span>##<span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #006400; font-style: italic;">* x2;</span>          <span style="color: #006400; font-style: italic;">/* R[i,i]=1 for all i */</span>
    <span style="color: #0000ff;">if</span> w&lt; 1E-12 <span style="color: #0000ff;">then</span> w = 1E-12;      <span style="color: #006400; font-style: italic;">/* Numerical safeguard */</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* Compute the conditional variances and covariance. These are a_11, a_12, etc, in Sun and Asano */</span>
    a_jj = <span style="color: #2e8b57; font-weight: bold;">1</span>      -R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,j<span style="color: #66cc66;">&#93;</span>##<span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #006400; font-style: italic;">* x2   -((R[i,j] - R[1,i]*R[1,j]*x2)##2) / w;</span>
    a_kk = <span style="color: #2e8b57; font-weight: bold;">1</span>      -R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,k<span style="color: #66cc66;">&#93;</span>##<span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #006400; font-style: italic;">* x2   -((R[i,k] - R[1,i]*R[1,k]*x2)##2) / w;</span>
    a_jk = R<span style="color: #66cc66;">&#91;</span>j,k<span style="color: #66cc66;">&#93;</span> -R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,j<span style="color: #66cc66;">&#93;</span><span style="color: #006400; font-style: italic;">*R[1,k]*x2 -(R[i,j] - R[1,i]*R[1,j]*x2)*(R[i,k] - R[1,i]*R[1,k]*x2) / w;</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* rho is the off-diagonal element of the conditional correlation matrix */</span>
    rho = a_jk / <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>a_jj <span style="color: #006400; font-style: italic;">* a_kk);</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* ensure rho is in [-1,1]. See
       https://blogs.sas.com/content/iml/2026/02/04/clip-values.html */</span>
    rho = <span style="color: #66cc66;">&#40;</span> -<span style="color: #2e8b57; font-weight: bold;">1</span> &lt;&gt; <span style="color: #66cc66;">&#40;</span>rho &gt;&lt; <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* clamp to [-1,1] */</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* Call P2 function to get the I2 term */</span>
    I2_val = <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #006400; font-style: italic;">* pi * ( P2(rho) - 0.25 );</span>     <span style="color: #006400; font-style: italic;">/* this actually simplifies to ARSIN(rho) :-) */</span>
&nbsp;
    <span style="color: #006400; font-style: italic;">/* Add the i-th term to the sum */</span>
    <span style="color: #0000ff;">sum</span> = <span style="color: #0000ff;">sum</span> + <span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,i<span style="color: #66cc66;">&#93;</span> / <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>w<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #006400; font-style: italic;">* I2_val;</span>
  <span style="color: #0000ff;">end</span>;
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">sum</span><span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* The main P4 function, which evaluates the Childs-Sun formula for 4 dimensions */</span>
start P4_Childs<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>R_global<span style="color: #66cc66;">&#41;</span>;
  R_global = R; <span style="color: #006400; font-style: italic;">/* Set the global correlation matrix for QUAD */</span>
&nbsp;
  <span style="color: #006400; font-style: italic;">/* Integrate ProbInt4 over the interval [0, 1] */</span>
  <span style="color: #0000ff;">call</span> quad<span style="color: #66cc66;">&#40;</span>I4, <span style="color: #a020f0;">&quot;ProbInt4&quot;</span>, <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
  <span style="color: #006400; font-style: italic;">/* Apply the Childs (1967) formula */</span>
  pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;pi&quot;</span><span style="color: #66cc66;">&#41;</span>;
  sum_arsin = <span style="color: #0000ff;">sum</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">arsin</span><span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">7</span> <span style="color: #2e8b57; font-weight: bold;">8</span> <span style="color: #2e8b57; font-weight: bold;">12</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* sum of arcsine for all 6 off-diagonal elements */</span>
  prob = <span style="color: #2e8b57; font-weight: bold;">1</span>/<span style="color: #2e8b57; font-weight: bold;">16</span> + sum_arsin / <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">8</span><span style="color: #006400; font-style: italic;">*pi) + I4 / (4 * pi**2);</span>  <span style="color: #006400; font-style: italic;">/* fixes the typo in Genz and Bretz, Eqn 2.8 */</span>
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span>prob<span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* --- Simple Test --- */</span>
R = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1.0</span>  <span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">0.3</span>  <span style="color: #2e8b57; font-weight: bold;">0.2</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">1.0</span>  <span style="color: #2e8b57; font-weight: bold;">0.4</span>  <span style="color: #2e8b57; font-weight: bold;">0.3</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.3</span>  <span style="color: #2e8b57; font-weight: bold;">0.4</span>  <span style="color: #2e8b57; font-weight: bold;">1.0</span>  <span style="color: #2e8b57; font-weight: bold;">0.5</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.2</span>  <span style="color: #2e8b57; font-weight: bold;">0.3</span>  <span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">1.0</span><span style="color: #66cc66;">&#125;</span>;
&nbsp;
ans = p4_Childs<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span>;
print ans;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb5.png" alt="" width="83" height="68" class="alignnone size-full wp-image-58803" />


<p>
When you run this code, the exact Childs-Sun formula produces an orthant probability of <code class="preserve-code-formatting">0.1611218</code> for the given matrix.
You can verify that value by using a Monte Carlo simulation. The following Monte Carlo simulation 
uses one million random points from the MVN(0, R) distribution. In addition to the 
Monte Carlo estimate, I generate a 95% confidence interval for the probability.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Use Monte Carlo simulation to estimate the probability that a 
   MVN random variable is less than a specified value in each coordinate.
   Sigma is a kxk covariance matrix; mu is an option row vector with k elements.
   The row vector b specifies the upper limits of integration. 
   The function returns an estimate of 
   P(X1&lt;b[1] &amp; X2&lt;b[2] &amp; ... &amp; Xk&lt;b[k] | X~MVN(mu, Sigma))
   by simulating N random variates from MVN(mu, Sigma) and returning the proportion
   that are in the specified region.
*/</span>
start MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, Sigma, mu=j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,ncol<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">X</span> = randnormal<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, mu, Sigma<span style="color: #66cc66;">&#41;</span>;
   inRegion = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
      v = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> &lt; b<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;
      inRegion = inRegion &amp; v;
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>inRegion<span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #0000ff;">N</span> = 1E6;
prob_MC = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#125;</span>, R<span style="color: #66cc66;">&#41;</span>;
SE_MC = <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span> prob_MC <span style="color: #006400; font-style: italic;">* (1-prob_MC)/N );</span>
Lower95 = prob_MC - <span style="color: #2e8b57; font-weight: bold;">1.96</span><span style="color: #006400; font-style: italic;">*SE_MC;</span>
Upper95 = prob_MC + <span style="color: #2e8b57; font-weight: bold;">1.96</span><span style="color: #006400; font-style: italic;">*SE_MC;</span>
print prob_MC Lower95 Upper95;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb6.png" alt="" width="220" height="64" class="alignnone size-full wp-image-58800" />


<p>The output from the Monte Carlo simulation shows that the Monte Carlo estimate is close to the value returned by the Childs-Sun formula. In addition, the 95% confidence interval includes the exact value, which gives me "confidence" (pun intended!) that I implemented the Childs-Sun formula correctly.
You can repeat the computations for other 4x4 correlation matrices and obtain similar results.
</p>

<h3>Summary</h3>
<p>
As my dear mother used to say when faced with a challenging situation, "this is not for the faint of heart." 
The Childs-Sun formula for the orthant probability of a multivariate normal distribution is a complicated equation.
In 2-D and 3-D, it simplifies into a sum of arcsines. In higher dimensions, you must compute conditional correlation matrices 
and integrate them to evaluate the formula. I have shown a SAS IML function that implements the formula in 4-D. 
</p>
<p>
Despite the complexity, this is an amazing mathematical and computational result!
The Childs-Sun formula is an exact analytical solution to a 4-dimensional multivariate normal volume integral over an orthant.
And, as shown previously, you can use the P4_Childs function
to compute the MVN probability for any 4x4 correlation matrix in any orthant.
</p>


<h3>Appendix 1: The general Childs-Sun formula for orthant probabilities</h3>
<p>
The following image is from p. 12 of Genz and Bretz (2009), <em>Computation of Multivariate
Normal and t Probabilities</em>. The last term of Eqn 2.8 has a typo in Genz and Bretz. The equation contains a plus sign that should not be there. This typo caused me hours of confusion! I finally cross-referenced the equation with Sun and Asano (1990) and the original article by Childs (1967), which both display the correct equation.
</p><p>
Note the somewhat confusing notation in the Childs-Sun formula. 
The superscript on the argument to the I<sub>2j</sub> function describes the submatrix to use in computing the conditional probabilities.
This is not explained further in G&amp;B.
For the details of the implementation, see the SAS IML function in this article.

The superscript on the argument to the I<sub>4</sub> function describes the submatrix to use in computing the conditional probabilities.
This is not explained further in G&amp;B.
For the details of the implementation, see the SAS IML function in this article.
</p>

<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb3.png" alt="" width="548" height="776" class="alignnone size-full wp-image-58794" srcset="https://blogs.sas.com/content/iml/files/2026/05/orthantProb3.png 548w, https://blogs.sas.com/content/iml/files/2026/05/orthantProb3-212x300.png 212w" sizes="(max-width: 548px) 100vw, 548px" />


<h3>Appendix 2: The sum notation</h3>
<p>
The Genz and Bretz notation for the sum of integrals is very dense. Here is what it means in the 4-D case. For this case, n=2, which means that the outer sum reduces to j=2. The inner sum looks like 
<br />
<span class='MathJax_Preview'>\(\sum_{i_1 < \ldots < i_4}^4 I_4(R^{i_1,\ldots,i_4})\)</span><script type='math/tex'>\sum_{i_1 < \ldots < i_4}^4 I_4(R^{i_1,\ldots,i_4})</script>.
</p>
<p>
The notation is a combination sum. It means, "Sum over all possible combinations of the distinct indices chosen from the set {1, 2, 3, 4}
subject to the constraint that the combinations are in increasing order." Well, for four indices, there in only one combination that satisfies this constraint, and it is the tuple (1, 2, 3, 4). So the inner summation also vanishes and becomes I<sub>4</sub>(R). You can drop the superscript because you are sending in all four columns of the 4x4 matrix, R.
</p>
<p>
The I<sub>4</sub> term is defined recursively in terms of I<sub>2</sub>. Now the sum matters. When j=1, the summation is 
<br />
<span class='MathJax_Preview'>\(\sum_{i_1 < i_2}^4 I_2(R^{i_1,i_2})\)</span><script type='math/tex'>\sum_{i_1 < i_2}^4 I_2(R^{i_1,i_2})</script>.
There are six ways to choose two distinct ordered indices from the set {1,2,3,4}. Since  I<sub>2</sub>(R<sup>i1, i2</sup>) = arcsine(R[i1, i2]),
we get the sum-of-arcsine terms 
<span class='MathJax_Preview'>\(\sum_{i < j} \sin^{-1}( R_{ij} )\)</span><script type='math/tex'>\sum_{i < j} \sin^{-1}( R_{ij} )</script>.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html">Compute the orthant probability for the 4-D multivariate normal distribution</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/05/18/4d-orthant-probability-mvn.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2023/11/probbnrm8-150x150.png" />
	</item>
		<item>
		<title>Compute multivariate normal orthant probabilities</title>
		<link>https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 11 May 2026 09:24:15 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Numerical Analysis]]></category>
		<category><![CDATA[Simulation]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58590</guid>

					<description><![CDATA[<p>In general, it is very difficult to compute a probability for a multivariate continuous distribution. For all continuous distributions, the probability requires solving a complicated multiple integral. For example, the probability for a bivariate normal distribution requires integrating the bivariate normal density over a two-dimensional (2-D) area. The probability for [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html">Compute multivariate normal orthant probabilities</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
In general, it is very difficult to compute a probability for a multivariate continuous distribution.
For all continuous distributions, the probability requires solving a complicated multiple integral.
For example, the probability for a bivariate normal distribution requires integrating the bivariate normal density 
over a two-dimensional (2-D) area.
The probability for a k-dimensional distribution requires integrating the density over a k-D volume.
</p><p>
The probability depends on the density (PDF) of the distribution as well as the region of integration.
One special region is the <em>orthant</em>. An orthant is a generalization of a quadrant.
In two dimensions, the Cartesian coordinate system divides the plane into four <em>quadrants</em>. 
In three dimensions, the Cartesian coordinate system divides the plane into eight <em>octants</em>. 
In k dimensions, there are 2<sup>k</sup> <em>orthants</em>. 
</p><p>
It is rare to find an exact formula for the probability of a multivariate distribution.
Amazingly, there are formulas for the probability of a standardized multivariate normal (MVN) distribution in an orthant
(Genz and Bretz (2009) p. 11-13, who cite Sheppard (1898), D. R. Childs (1967), and H. J. Sun (1988)).
</p><p>
This article shows the formulas for dimensions 1, 2, and 3. In a subsequent article, I will discuss
a SAS IML function that implements a formula for dimension k=4. 
The general case was solved by Childs (1967) and extended by Sun (1988), so 
I refer to these formulas as the Childs-Sun formulas.
</p>
<p>As shown in a previous article, <a href="https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html">it suffices to compute a probability for the standardized MVN distribution</a>
because a probability for a general MVN distribution can always be found by standardizing the problem.
To save typing, this article deals exclusively with the standardized MVN distribution and often omits the word "standardized".
</p>  

<h3>Orthant probabilities</h3>
<p>
For a vector X, <a href="https://blogs.sas.com/content/iml/2025/06/16/find-orthant.html">the signs of X are constant within each orthant</a>.
For example, a 2-D vector has signs (+1, +1) when it is in the first quadrant, and signs (-1, -1) when it is in the third quadrant. 
The term "orthant probability" traditionally refers to the first orthant, where all signs are positive.
</p><p>
The MVN distribution is symmetric under the transformation X &rarr; -X, so the probability that X~MVN(0,R) is in the "all-signs-positive" octant 
equals the probability that X is in the "all-signs-negative" octant. 
The "all-signs-negative" octant is important in probability and statistics. 
It represents the "CDF region," which is a left-tail probability. For example, for a 2-D random vector X=(X1,X2), the CDF 
evaluated at (0,0) is &Phi;<sub>2</sub>(0,0) = P(X1&lt;0 and X2&lt;0), which is the probability that X is in the third quadrant.  In terms of an integral,
<br />
<span class='MathJax_Preview'>\(
\Phi_2(0,0; \rho)
=
\int_{-\infty}^{0} \int_{-\infty}^{0} \phi_2(x_1, x_2; \rho)\, dx_2 \, dx_1
\)</span><script type='math/tex'>
\Phi_2(0,0; \rho)
=
\int_{-\infty}^{0} \int_{-\infty}^{0} \phi_2(x_1, x_2; \rho)\, dx_2 \, dx_1
</script>
<br />
Here, &rho; is the Pearson correlation between X1 and X2, &phi;<sub>2</sub> is the bivariate standard normal density,
and &Phi;<sub>2</sub> is the bivariate standard normal CDF.
</p><p>
This concept generalizes to higher-dimensional distributions: the "CDF orthant" is the "all-signs-negative" orthant for which every coordinate of X has a negative sign.
The CDF of the standardized normal distribution at the origin is the integral of the distribution's density over that orthant.
This article shows some Childs-Sun formulas for these probabilities for a general correlation matrix.
</p>

<h3>Orthant probabilities in dimensions 1 and 2</h3>
<p>
For the one-dimensional problem, the 1-D orthant is the set {X &lt; 0}.
The CDF at zero is P<sub>1</sub> = &Phi;<sub>1</sub>(0) = 0.5. You can obtain this value by solving an integration problem, but it is also evident from 
the fact that the normal distribution is symmetric about 0.
For this article, the important fact is that there is a formula for the 1-D orthant probability: P<sub>1</sub> = 1/2.
</p>
<p>
For 2-D, the correlation matrix has one parameter, &rho; = R[1,2], which is the correlation between the bivariate normal variates X1 and X2.
The 2-D (bivariate) formula (originally found by Sheppard, 1898) is 
<br />
<span class='MathJax_Preview'>\(P_2 = \Phi_2(0,0; \rho) = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho)\)</span><script type='math/tex'>P_2 = \Phi_2(0,0; \rho) = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho)</script>
</p>
<p>
Let's run a simple DATA step to verify this result. In SAS, you can use the PROBBNRM function to compute a bivariate normal probability:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* 2-D orthant probability for various rho values */</span>
<span style="color: #000080; font-weight: bold;">data</span> MVN2D;
<span style="color: #0000ff;">input</span> rho @@;
pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'pi'</span><span style="color: #66cc66;">&#41;</span>;
P2_formula = <span style="color: #2e8b57; font-weight: bold;">1</span>/<span style="color: #2e8b57; font-weight: bold;">4</span> + <span style="color: #2e8b57; font-weight: bold;">1</span>/<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #006400; font-style: italic;">*pi) * arsin(rho);</span>
P2_integral = probbnrm<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">0</span>, rho<span style="color: #66cc66;">&#41;</span>;
diff = P2_formula - P2_integral;
<span style="color: #0000ff;">drop</span> pi;
datalines;
-<span style="color: #2e8b57; font-weight: bold;">0.8</span> -<span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">0.8</span> 
;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc print</span> <span style="color: #000080; font-weight: bold;">data</span>=MVN2D noobs; 
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb1.png" alt="" width="241" height="171" class="alignnone size-full wp-image-58728" />

<p>
As shown in the output, the formula agrees with the PROBBNRM function to within machine precision.
</p>


<h3>Orthant probabilities in dimension 3</h3>
<p>
In 3-D the standard multivariate normal distribution has three parameters: the off-diagonal elements of the 3x3 correlation matrix, R.
The formula for the orthant probability in dimension 3 is given by 
<br />
<span class='MathJax_Preview'>\(P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)
\)</span><script type='math/tex'>P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)
</script>
</p><p>
Notice that you can write P3 by using a summation formula in which you sum over the indices 
of the upper triangular elements of the correlation matrix, R.
</p><p>
To validate this formula, let's use <a href="https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html">a Monte Carlo estimation routine</a>, which was developed in a previous article. The following SAS IML program 
evaluates the formula for a few correlation matrices, and compares the formula and the Monte Carlo estimate. 
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* 3-D orthant probability for various correlation matrices, R */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* Use Monte Carlo simulation to estimate the probability that a 
   MVN random variable is less than a specified value in each coordinate.
   See https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html
   The function estimates CDF(b).
*/</span>
start MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, Sigma, mu=j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,ncol<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">X</span> = randnormal<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, mu, Sigma<span style="color: #66cc66;">&#41;</span>;
   inRegion = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
      v = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> &lt; b<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* binary variable for the i_th coordinate */</span>
      inRegion = inRegion &amp; v;  <span style="color: #006400; font-style: italic;">/* find the intersection of all conditions */</span>
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>inRegion<span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* proportion of random points that satisfy criterion */</span>
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Evaluate the MVN CDF at the origin (=&quot;orthant probability&quot;) by using the exact P3 formula */</span>
start Orthant_P3<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span>;
  pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;pi&quot;</span><span style="color: #66cc66;">&#41;</span>;
  upper_tri_corrs = R<span style="color: #66cc66;">&#91;</span> <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">6</span><span style="color: #66cc66;">&#125;</span> <span style="color: #66cc66;">&#93;</span>;
  sum_arsin = <span style="color: #0000ff;">sum</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">arsin</span><span style="color: #66cc66;">&#40;</span> upper_tri_corrs <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
  prob = <span style="color: #2e8b57; font-weight: bold;">1</span>/<span style="color: #2e8b57; font-weight: bold;">8</span> + sum_arsin / <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #006400; font-style: italic;">* pi);</span>
  <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span>prob<span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #0000ff;">N</span> = 1E6;     <span style="color: #006400; font-style: italic;">/* size of Monte Carlo sample */</span>
b = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#125;</span>; <span style="color: #006400; font-style: italic;">/* the origin */</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Example 1: AR(1) correlation matrix */</span>
R = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1</span>    <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">0.25</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">0.5</span>,
     <span style="color: #2e8b57; font-weight: bold;">0.25</span> <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">1</span>    <span style="color: #66cc66;">&#125;</span>;
P3_formula = Orthant_P3<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span>;
P3_MC_est  = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, R<span style="color: #66cc66;">&#41;</span>;
diff       = P3_formula - P3_MC_est;
print P3_formula P3_MC_est diff;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Example 2: positive and negative correlations */</span>
R = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1</span>   -<span style="color: #2e8b57; font-weight: bold;">0.5</span> -<span style="color: #2e8b57; font-weight: bold;">0.1</span>,
    -<span style="color: #2e8b57; font-weight: bold;">0.5</span>  <span style="color: #2e8b57; font-weight: bold;">1</span>    <span style="color: #2e8b57; font-weight: bold;">0.8</span>,
    -<span style="color: #2e8b57; font-weight: bold;">0.1</span>  <span style="color: #2e8b57; font-weight: bold;">0.8</span>  <span style="color: #2e8b57; font-weight: bold;">1</span>    <span style="color: #66cc66;">&#125;</span>;
P3_formula = Orthant_P3<span style="color: #66cc66;">&#40;</span>R<span style="color: #66cc66;">&#41;</span>;
P3_MC_est  = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, R<span style="color: #66cc66;">&#41;</span>;
diff       = P3_formula - P3_MC_est;
print P3_formula P3_MC_est diff;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/orthantProb2.png" alt="" width="241" height="136" class="alignnone size-full wp-image-58725" />

<p>
Results are shown for two different correlation matrices. The formula in the Orthant_P3 function is exact.
The Monte Carlo estimate is close to the exact value.
</p>

<h3>The computation for other 2-D quadrants</h3>
<p>
The formula is more powerful than it initially seems because of reflective symmetries of the multivariate normal distribution.
The Childs-Sun formula enables you to compute the probability in <em>all</em> orthants, not just the "CDF orthant" where 
every component of the random variable has a negative sign. This is because you can change the signs of the components of the 
MVN random variable to obtain the other quadrants. Geometrically, "changing the sign" means considering the transformation 
X<sub>i</sub> &rarr; -X<sub>i</sub> for some coordinate. This transformation changes the sign of the correlation 
between X<sub>i</sub> and the other variables. Thus, you can obtain the correlation matrix for any transformation that reflects 
one or more coordinates.
</p>
<p>
Let's see how this works in two dimensions. 
Let &rho; be the off-diagonal (R[1,2]) correlation. Here is how to get the probability in all four quadrants:
</p>
<ol>
<li>Quadrant I: This is the orthant probability, P2 = 1/4 + arsin(&rho;) / (2&pi;).</li>
<li>Quadrant II: To find the probability in the region Q2 = {X1 &lt; 0, X2 &gt; 0}, 
define Y1 = -X1 and Y2 = X2. Y is bivariate normal with correlation -&rho;. Because arsin(-&rho;) = -arsin(&rho;),
the probability in Q2 is 1/4 - arsin(&rho;) / (2&pi;). </li>
<li>Quadrant III: To find the probability in the region Q3 = {X1 &lt; 0, X2 &lt; 0}, 
define Y1 = -X1 and Y2 = -X2. Y is bivariate normal with correlation &rho;, so the probability over Q3 is P2.
</li>
<li>Quadrant IV: To find the probability in the region Q4 = {X1 &gt; 0, X2 &lt; 0}, 
define Y1 = X1 and Y2 = -X2. Y is bivariate normal with correlation -&rho;, therefore 
the probability in Q4 is 1/4 - arsin(&rho;) / (2&pi;). </li>
</ol>

<h3>The computation for other k-D orthants</h3>
<p>
The same idea applies for any dimension. 
The Childs-Sun formula gives the orthant probability in the first quadrant where all coordinates are positive.
Every orthant is identified by a sign vector s=(s1, s2, ..., sk), where each component is either +1 or -1.
To compute the probability that X~MVN(0, R) is in the orthant with the sign vector s, do the following:
</p>
<ol>
<li>Define a new MVN variable, Y = (s1*X1, s2*X2, ..., sk*Xk). 
</li>
<li>Compute the correlation matrix for Y. This requires changing some of the signs of the off-diagonal elements in R. See the Appendix for details.
</li>
<li>Use the Childs-Sun formula to find the probability that Y is in the first orthant. This is also the 
probability that X is in the orthant with sign vector s.
</li>
</ol>

<h3>Summary</h3>
<p>
In summary, this article shows three Childs-Sunformulas for computing the orthant probability for a 
standardized multivariate normal distribution 
with correlation matrix R in dimension k:
</p>
<ul>
<li>If k=1, the univariate probability is <span class='MathJax_Preview'>\(P_1 = \frac{1}{2}\)</span><script type='math/tex'>P_1 = \frac{1}{2}</script>.
</li>
<li>If k=2, the bivariate probability is <span class='MathJax_Preview'>\(P_2 = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(R_{12})\)</span><script type='math/tex'>P_2 = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(R_{12})</script>.
</li>
<li>If k=3, the trivariate probability is 
<span class='MathJax_Preview'>\(P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)\)</span><script type='math/tex'>P_3 = \frac{1}{8} + \frac{1}{4\pi} \left( \sin^{-1}(R_{12}) + \sin^{-1}(R_{13}) + \sin^{-1}(R_{23}) \right)</script>
</li>
</ul>
<p>
You can apply these formulas as written to compute the probability in the first orthant (all signs positive) or in the "CDF orthant" (all signs negative). You can use the reflective symmetry of the MVN distribution to find the probabilities in every other orthant.
</p>


<h3>Appendix: How the correlation matrix changes under coordinate reflections</h3>
<p>
This main article indicates that you can apply reflection transformations for coordinates of X~MVN(0,R) and find a 
new correlation matrix, C, such that Childs-Sun formula applies to C gives the probability that X is in some orthant.
This appendix shows the linear algebra steps.
</p><p>
Suppose X~MVN(0,R), and you want to find the probability that X is in an orthant defined by a sign vector s=(s1, s2, ..., sk). In this vector, each s<sub>i</sub> is either +1 or -1.
We can use s to define a new random vector, Y, such that Y is in the first orthant exactly when X is in the target orthant.
</p>
<p>
Let S be the k x k diagonal matrix whose diagonal elements are the components of the sign vector, s.
Define the linear transformation Y = S X.
Because S is a diagonal matrix of signs, the i_th component of Y is simply Y<sub>i</sub> = s<sub>i</sub> * X<sub>i</sub>.
By definition, if X is in the target orthant, then the sign of X<sub>i</sub> matches s<sub>i</sub>.
Therefore, the product s<sub>i</sub> * X<sub>i</sub> is always positive, which means Y is in the first orthant.
</p>
<p>
Because Y is a linear transformation of MVN variables, it is also multivariate normal.
The expected value of Y is E[Y] = S E[X] = 0.
The covariance matrix for Y (call it C) is given by the standard formula for the variance of a linear transformation:
<br />
<span class='MathJax_Preview'>\(
C = \text{Cov}(Y) = S \text{Cov}(X) S^T = S R S
\)</span><script type='math/tex'>
C = \text{Cov}(Y) = S \text{Cov}(X) S^T = S R S
</script>

</p>
<p>
If you multiply the matrices, you will find that the elements of C are given by
C<sub>ij</sub> = s<sub>i</sub> s<sub>j</sub> R<sub>ij</sub>.
Because s<sub>i</sub><sup>2</sup> = 1, the diagonal elements of C are 1, which means C is a valid correlation matrix.
The off-diagonal elements of C simply flip the sign of the correlation whenever exactly one of the two corresponding variables is reflected.
</p>
<p>
In summary, Y~MVN(0, C). The Childs-Sun formula with the correlation matrix C 
gives the probability that Y is in the first orthant.
This is exactly the probability that X is in the orthant with the given sign vector, s.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html">Compute multivariate normal orthant probabilities</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/05/11/mvn-orthant-probability.html/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2023/11/probbnrm8-150x150.png" />
	</item>
		<item>
		<title>On standardizing multivariate normal probabilities</title>
		<link>https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html</link>
					<comments>https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 04 May 2026 09:24:47 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=58605</guid>

					<description><![CDATA[<p>A course in elementary statistics always introduces the "Z-score." A Z-score is the result of standardizing a normally distributed random variable. By subtracting the distribution's mean and dividing by its standard deviation, you transform a general normal random variable into a standardized variable that has zero mean and unit standard [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html">On standardizing multivariate normal probabilities</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A course in elementary statistics always introduces the "Z-score." 
A Z-score is the result of standardizing a normally distributed random variable. By subtracting the
distribution's  mean and dividing by its standard deviation, you transform a general 
normal random variable into a standardized variable that has zero mean and unit standard deviation. 
Historically, this transformation was essential because it enabled statisticians to look up probabilities in a single printed table. 
Conceptually, it enables a uniform analysis of normally distributions by measuring all quantities in units of the 
standard deviation about the mean.  
</p>

<p>The concept of a "Z-score" also applies to multivariate normal distributions. 
In multidimensional statistics, you can standardize a multivariate normal (MVN) probability calculation by transforming the covariance matrix into the corresponding correlation matrix. This article shows how to shift and scale high-dimensional MVN distributions, both mathematically and by using the SAS IML language.
</p>

<h3>The 1-D Z-score transformation</h3>

<p>Let's start by reviewing the Z-score transformation for the 1-D normal distribution.
Let <em>X</em> be a normally distributed random variable with mean &mu; and standard deviation &sigma;.
In notation, <em>X</em> &sim; N(&mu;, &sigma;). (Some authors use N(&mu;, &sigma;<sup>2</sup>), but the SAS probability functions (CDF, PDF, QUANTILE, etc.) use the standard deviation as a parameter.)
If you want to compute a left-tail probability, such as P(<em>X</em> &lt; <em>b</em>), you can 
compute the probability on the original (data) scale, or you can apply a transformation and compute the 
probability on the standardized scale. This is useful in numerical computations because it 
reduces the problem to two steps: first, standardize the problem, then compute a probability on the standardized problem. 
Each of these steps can be developed, debugged, and tested independently. 
</p>
<p>
Let's apply the Z-score transformation. Define a new random variable <em>Z</em> = (<em>X</em> - &mu;)/&sigma;. Because a linear transformation of a normal random variable is also normal, <em>Z</em> follows a standard normal distribution: <em>Z</em> &sim; N(0, 1).
By applying the same transformation to the integration limit <em>b</em>, we define a standardized limit <em>c</em> = (<em>b</em> - &mu;)/&sigma;. Consequently, computing the probability on the original data scale is mathematically equivalent to computing the probability on the standardized scale:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
P(<em>X</em> &lt; <em>b</em> | X~N(&mu;, &sigma;)) = P(<em>Z</em> &lt; <em>c</em> | Z~N(0, 1))
</p>

<p>In SAS, you can compute these probabilities directly using the <code class="preserve-code-formatting">CDF</code> function. 
The following SAS IML program demonstrates this equivalence. 
The goal is to find the probability that a random variable from N(&mu;, &sigma;) is less than <em>x</em>. 
You can compute this directly on the original scale, or you can convert the 
limit to a Z-score and evaluate it using the standard normal distribution:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">data</span> Prob;
<span style="color: #0000ff;">input</span> mu sigma @@;
<span style="color: #0000ff;">do</span> <span style="color: #0000ff;">x</span> = -<span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #2e8b57; font-weight: bold;">1</span>;
   <span style="color: #006400; font-style: italic;">/* 1. Compute probability on original scale */</span>
   prob_x = <span style="color: #0000ff;">CDF</span><span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, <span style="color: #0000ff;">x</span>, mu, sigma<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #006400; font-style: italic;">/* 2. Compute probability on standardized scale */</span>
   z = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span> - mu<span style="color: #66cc66;">&#41;</span>/sigma;
   prob_z = <span style="color: #0000ff;">CDF</span><span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, z<span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* N(0,1) */</span>
   <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
datalines;
 <span style="color: #2e8b57; font-weight: bold;">2</span>   <span style="color: #2e8b57; font-weight: bold;">3</span>
 <span style="color: #2e8b57; font-weight: bold;">1.5</span> <span style="color: #2e8b57; font-weight: bold;">2</span>
-<span style="color: #2e8b57; font-weight: bold;">0.5</span>   <span style="color: #2e8b57; font-weight: bold;">0.5</span>
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc print</span> noobs; <span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/stdmvn1.png" alt="" width="289" height="281" class="alignnone size-full wp-image-58653" />

<p>
The program uses several examples of <em>x</em>, &mu; and &sigma;.
For all examples, the PROB_X and the PROB_Z columns are identical. 
Geometrically, the X variable represents a data value whereas the Z variable measures the number of standard deviations from the mean.
As shown in the next section, you can generalize this concept to multivariate normal distributions.
</p>

<h3>Standardizing the multivariate normal distribution</h3>
<p>
Suppose you want to compute the joint cumulative probability for a multivariate normal distribution. 
Let <em>X</em> be a <em>k</em>-dimensional random variable such that <em>X</em> &sim; MVN(&mu;, &Sigma;),
 where &mu; = (&mu;<sub>1</sub>, &mu;<sub>2</sub>, ..., &mu;<sub>k</sub>) is the mean vector and 
&Sigma; is a <em>k</em>&nbsp;x&nbsp;<em>k</em> covariance matrix. 
The CDF or left-tail probability is 
P(<em>X</em> &lt; <em>b</em>), where <em>b</em> = (<em>b</em><sub>1</sub>, <em>b</em><sub>2</sub>, ..., <em>b</em><sub>k</sub>) is a vector of upper limits, and the inequality is evaluated component by component.
</p>

<p>
To standardize this problem, you can standardize each original variable. Instead of doing this one variable at a time,
you can use matrix-vector notation and transform all variables simultaneously.
Let <em>D</em> be the diagonal matrix of standard deviations. That is, the diagonal elements of <em>D</em> are the square roots of the diagonal elements of &Sigma;. Then you can express the covariance matrix as &Sigma; = D*R*D, where the matrix <em>R</em> has 1s on the diagonal. 
The matrix <em>R</em> is the correlation matrix for the original variables (and the covariance matrix for the new standardized variables).
</p>
<p>
The standardized random variables are defined by <em>Z</em> = D<sup>-1</sup>(<em>X</em> - &mu;).
<em>Z</em> is multivariate normal. To find its parameters, you can compute its expected value and variance. The expected value is simply a vector of zeros. The variance of <em>Z</em> is <em>R</em>. Thus, the transformed variable <em>Z</em> follows a standardized multivariate normal distribution with mean zero and correlation matrix <em>R</em>. In notation, <em>Z</em> ~ MVN(0, <em>R</em>). 
</p>

<p>To complete the probability transformation, apply the same linear transformation to the integration limits, <em>b</em>. 
The standardized limits are <em>c</em> = D<sup>-1</sup>(<em>b</em> - &mu;). 
A probability for <em>X</em> on the data scale (covariance scale) is equal to the probability for <em>Z</em> on the correlation scale:
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
P(X &lt; b | X ~ MVN(&mu;, &Sigma;)) = P(Z &lt; c | Z ~ MVN(0, R))
</p>
<p>
This is why, for example, <a href="https://blogs.sas.com/content/iml/2023/12/04/bivariate-normal-probability-sas.html">the PROBBNRM function in SAS</a> computes probabilities for the standardized variable Z~BVN(0, R), where BVN is the bivariate normal distribution, and R is a 2x2 correlation matrix.
The developer who implemented the function assumed that users can transform any problem to the standardized scale.
</p>

<h3>Converting from covariance to correlation in SAS IML</h3>
<p>
The SAS IML language supports <a href="https://blogs.sas.com/content/iml/2012/02/17/convert-a-covariance-matrix-to-a-correlation-matrix-in-sas.html">the COV2CORR function</a>, which returns the correlation matrix that is associated with a covariance matrix. 
By using the math in the previous section, you can write a SAS IML function that will transform the limits of integration. 
The following IML function computes
<em>c</em> = D<sup>-1</sup>(<em>b</em> - &mu;). 
The SAS IML language enables you to perform this operation efficiently by using vectorized division, rather than multiplying by 
the inverse of a diagonal matrix:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* Sigma is a kxk covariance matrix and b is a row vector with k elements.
   Convert b to inv(D)*(b-mu), where D is the diagonal matrix of
   standard deviations: sqrt(vecdiag(Sigma)).
   The mu vector is optional and defaults to the zero vector.
*/</span>
start Xform_Limits_Cov2Corr<span style="color: #66cc66;">&#40;</span> b, Sigma, mu=j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,ncol<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
   D = <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span>vecdiag<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* extract diagonal (variances) for X */</span>
   c = <span style="color: #66cc66;">&#40;</span>b - mu<span style="color: #66cc66;">&#41;</span> / rowvec<span style="color: #66cc66;">&#40;</span>D<span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* new limits for Z */</span>
   <span style="color: #0000ff;">return</span> c;
finish;</pre></td></tr></table></div>




<p>
As we did for the 1-D case, let's show the equivalency between the CDF on the data scale and the CDF for the standardized problem.
The exact computation of multivariate normal probabilities in high dimensions is difficult. Instead, 
I will use a simple Monte Carlo simulation to estimate the probabilities. This technique is easy to code and will suffice to demonstrate 
the equivalency of the probabilities.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Use Monte Carlo simulation to estimate the probability that a 
   MVN random variable is less than a specified value in each coordinate.
   Sigma is a kxk covariance matrix; mu is an optional row vector with k elements.
   The row vector b specifies the upper limits of integration. 
   The function returns an estimate of 
   P(X1&lt;b[1] &amp; X2&lt;b[2] &amp; ... &amp; Xk&lt;b[k] | X~MVN(mu, Sigma))
   by simulating N random variates from MVN(mu, Sigma) and returning the proportion
   that are in the specified region.
*/</span>
start MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, Sigma, mu=j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,ncol<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">X</span> = randnormal<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, mu, Sigma<span style="color: #66cc66;">&#41;</span>;
   inRegion = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
      v = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> &lt; b<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* binary variable for the i_th coordinate */</span>
      inRegion = inRegion &amp; v;  <span style="color: #006400; font-style: italic;">/* find the intersection of all conditions */</span>
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">return</span> <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>inRegion<span style="color: #66cc66;">&#41;</span>;       <span style="color: #006400; font-style: italic;">/* proportion of random points that satisfy criterion */</span>
finish;</pre></td></tr></table></div>




<h3>An Example: The 5-D Min Matrix</h3>

<p>
You can use the built-in <code class="preserve-code-formatting">COV2CORR</code> function and the newly defined <code class="preserve-code-formatting">Xform_Limits_Cov2Corr</code> function to 
transform a problem on the covariance (data) scale into an equivalent standardized problem.
We will use a 5-dimensional "Min Matrix" as our covariance matrix. <a href="https://blogs.sas.com/content/iml/2026/02/16/min-matrix.html">The Min Matrix is a structured matrix</a> where the (<em>i</em>, <em>j</em>)-th element is equal to min(<em>i</em>, <em>j</em>). 
The mean vector, &mu;, and the integration limits, <em>b</em>, are arbitrary vectors.
</p>

<p>
The following program estimates the probability twice, both times using one million random variates in a Monte Carlo simulation.
First, it uses the <code class="preserve-code-formatting">MC_CDFMVN</code> function for the 5-D Min Matrix. 
Then it transforms the limits, creates the associated correlation matrix, and estimates the probability on the standardized scale.
</p>
<p>
So that the estimates are the same, and not just close to each other, you can use the <code class="preserve-code-formatting">RANDSEED</code> subroutine
to reset the random number stream prior to each simulation. This ensures that both computations use the exact same stream of pseudo-random numbers.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Example: use a 5-D &quot;Min Matrix&quot;. 
   See: https://blogs.sas.com/content/iml/2026/02/16/min-matrix.html
*/</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* Define the parameters for MVN(mu, Sigma) */</span>
Sigma = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">1</span>,
         <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">2</span>,
         <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">3</span>,
         <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">4</span>,
         <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">5</span> <span style="color: #66cc66;">&#125;</span>;
mu = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">5</span><span style="color: #66cc66;">&#125;</span>;
<span style="color: #0000ff;">N</span> = 1E6;
&nbsp;
<span style="color: #006400; font-style: italic;">/* estimate the CDF probability at b for MVN(mu, Sigma)*/</span>
b = <span style="color: #66cc66;">&#123;</span>-<span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">7</span><span style="color: #66cc66;">&#125;</span>;
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* reset the RNG */</span>
prob_cov = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, b, Sigma, mu<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* standardize the parameters to the correlation scale */</span>
c = Xform_Limits_Cov2Corr<span style="color: #66cc66;">&#40;</span>b, Sigma, mu<span style="color: #66cc66;">&#41;</span>;
R = cov2corr<span style="color: #66cc66;">&#40;</span>Sigma<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* estimate the CDF probability at c for MVN(0, R)*/</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* reset the RNG */</span>
prob_corr = MC_CDFMVN<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>, c, R<span style="color: #66cc66;">&#41;</span>;
&nbsp;
print prob_cov prob_corr;</pre></td></tr></table></div>





<img loading="lazy" decoding="async" src="https://blogs.sas.com/content/iml/files/2026/05/stdmvn2.png" alt="" width="150" height="65" class="alignnone size-full wp-image-58650" />

<p>The output shows that the probability estimate for the original problem is identical to the estimate for the 
standardized problem. 
</p>

<h3>Summary</h3>

<p>
You can use techniques from linear algebra to extend the concept of a Z-score to multivariate normal distributions.
You do not have to physically standardize each column of the data.
Rather, you can convert the covariance matrix to the associated correlation matrix, apply a linear transformation to the 
limits of integration, and&mdash;voil&agrave;!&mdash;you can compute all multivariate normal probabilities 
in a standardized way. 
</p>

<p>
This transformation is not purely academic. In computational statistics, algorithms that
evaluate multivariate distributions are often implemented for the standardized correlation scale to ensure 
stable numerical computations. By scaling the problem and then computing answers on the correlation scale, 
the numerical routines tend to be more accurate and reliable.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html">On standardizing multivariate normal probabilities</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2026/05/04/standardize-mvn-probabilities.html/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2023/11/probbnrm8-150x150.png" />
	</item>
	</channel>
</rss>
