<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

<channel>
	<title>The DO Loop</title>
	
	<link>https://blogs.sas.com/content/iml</link>
	<description>Statistical programming in SAS with an emphasis on SAS/IML programs</description>
	<lastBuildDate>Wed, 17 Nov 2021 14:59:33 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.5.3</generator>
	<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/TheDoLoop" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="thedoloop" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">TheDoLoop</feedburner:emailServiceId><feedburner:feedburnerHostname xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">https://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>The order of vertices on a convex polygon</title>
		<link>https://blogs.sas.com/content/iml/2021/11/17/order-vertices-convex-polygon.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/11/17/order-vertices-convex-polygon.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Wed, 17 Nov 2021 10:21:44 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=36099</guid>

					<description><![CDATA[<p>In a previous article, I showed how to use theCVEXHULL function in SAS/IML to compute the convex hull of a finite set of planar points. The convex hull is a convex polygon, which is defined by its vertices. To visualize the polygon, you need to know the vertices in sequential [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/17/order-vertices-convex-polygon.html">The order of vertices on a convex polygon</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
In a previous article, I showed <a href="https://blogs.sas.com/content/iml/2021/11/15/2d-convex-hull-sas.html">how to use theCVEXHULL function in SAS/IML to compute the convex hull of a finite set of planar points</a>. 
The convex hull is a convex polygon, which is defined by its vertices. 
To visualize the polygon, you need to know the vertices <em>in sequential order</em>.
Fortunately, the CVEXHULL function returns the vertices in counterclockwise order, so you can connect the
points in the order they are provided.
</p><p>
But what if the CVEXHULL function did not output the vertices in sequential order? It turns out that you can perform a simple computation that orders the vertices of a convex polygon. This article shows how.
</p>

<h3>Connect unsorted vertices of a convex polygon</h3>
<p>
Let's start by defining six points that form the vertices of a convex polygon. The vertices are not sorted in any way, so if you connect the points in the order given, you get a star-like pattern:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
P = <span style="color: #66cc66;">&#123;</span> <span style="color: #2e8b57; font-weight: bold;">0</span>  <span style="color: #2e8b57; font-weight: bold;">2</span>,
      <span style="color: #2e8b57; font-weight: bold;">6</span>  <span style="color: #2e8b57; font-weight: bold;">0</span>,
      <span style="color: #2e8b57; font-weight: bold;">0</span>  <span style="color: #2e8b57; font-weight: bold;">0</span>,
      <span style="color: #2e8b57; font-weight: bold;">4</span> -<span style="color: #2e8b57; font-weight: bold;">1</span>,
      <span style="color: #2e8b57; font-weight: bold;">5</span>  <span style="color: #2e8b57; font-weight: bold;">2</span>, 
      <span style="color: #2e8b57; font-weight: bold;">2</span> -<span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #66cc66;">&#125;</span>; 
&nbsp;
<span style="color: #006400; font-style: italic;">/* create a helper function to connect vertices in the order they are given */</span>
start GraphVertices<span style="color: #66cc66;">&#40;</span>P<span style="color: #66cc66;">&#41;</span>;
   Poly = P // P<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #66cc66;">&#93;</span>;   <span style="color: #006400; font-style: italic;">/* repeat first point to close the polygon */</span>
   <span style="color: #0000ff;">call</span> series<span style="color: #66cc66;">&#40;</span>Poly<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span>, Poly<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> procopt=<span style="color: #a020f0;">&quot;aspect=1&quot;</span> option=<span style="color: #a020f0;">&quot;markers&quot;</span> grid=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff;">x</span> y<span style="color: #66cc66;">&#125;</span>;
finish;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Connect Unsorted Points&quot;</span>;
<span style="color: #000080; font-weight: bold;">run</span> GraphVertices<span style="color: #66cc66;">&#40;</span>P<span style="color: #66cc66;">&#41;</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2021/11/cvexhull04.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/cvexhull04.png" alt="" width="360" height="360" class="alignnone size-full wp-image-36117" srcset="https://blogs.sas.com/content/iml/files/2021/11/cvexhull04.png 480w, https://blogs.sas.com/content/iml/files/2021/11/cvexhull04-300x300.png 300w, https://blogs.sas.com/content/iml/files/2021/11/cvexhull04-150x150.png 150w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<p>
The line plot does not look like a convex polygon because the vertices were not ordered. 
However, it is not hard to order the vertices of a convex polygon.
</p>

<h3>Order vertices of a convex polygon</h3>

<a href="https://blogs.sas.com/content/iml/files/2021/11/cvexhull06.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/cvexhull06.png" alt="" width="360" height="360" class="alignright size-full wp-image-36111" srcset="https://blogs.sas.com/content/iml/files/2021/11/cvexhull06.png 480w, https://blogs.sas.com/content/iml/files/2021/11/cvexhull06-300x300.png 300w, https://blogs.sas.com/content/iml/files/2021/11/cvexhull06-150x150.png 150w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<p>
The key idea is to recall that the <em>centroid</em> of a polygon is the arithmetic average of the coordinates of its vertices. For a convex polygon, the centroid always lies in the interior of the polygon, as shown in the graph to the right. The centroid is displayed as a large dot in the center of the polygon.
</p><p>
You can use the centroid as the origin and construct the vectors from the centroid to each vertex. For each vector, you can compute the angle made with the horizontal axis. You can then sort the angles, which provides a sequential ordering of the vertices of the convex polygon. In the graph at the right, the angles are given in degrees, but you can use radians for all the calculations. For this example, the angles that the vectors make with the positive X axis are 38 degrees, 150 degrees, 187 degrees, and so forth.
</p><p>
The SAS/IML language provides a compact way to compute the centroid and the vectors. You can <a href="https://blogs.sas.com/content/iml/2015/06/10/polar-angle-curve.html">use the ATAN2 function in SAS to compute the angles</a>, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;">C = <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span>P<span style="color: #66cc66;">&#41;</span>;           <span style="color: #006400; font-style: italic;">/* centroid of a convex polygon */</span>
v = P - C;             <span style="color: #006400; font-style: italic;">/* vectors from centroid to points on convex hull */</span>
radian = atan2<span style="color: #66cc66;">&#40;</span> v<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span>, v<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* angle with X axis for each vector */</span></pre></td></tr></table></div>




<p>
At this point, you could sort the vertices by their radian measure. However, the ATAN2 function returns values in the interval (-&pi; &pi;]. If you prefer to order the vertices by using the standard "polar angle," which is in the interval [0, 2&pi;), you can add 2&pi; to any negative angle from the ATAN2 function.
You can then <a href="https://blogs.sas.com/content/iml/2011/03/16/sorting-a-matrix-by-row-or-column-statistics.html">use the angles to sort the vertices</a>, as follows:</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Optional: The ATAN2 function returns values in (-pi, pi]. 
   Convert to values in (0,2*pi] */</span>
pi = constant<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'pi'</span><span style="color: #66cc66;">&#41;</span>;
idx = loc<span style="color: #66cc66;">&#40;</span> radian&lt;<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #66cc66;">&#41;</span>;
radian<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span> = radian<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span> + <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #006400; font-style: italic;">*pi;</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* now sort the points by angle */</span>
<span style="color: #0000ff;">call</span> sortndx<span style="color: #66cc66;">&#40;</span>idx, radian<span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* get row numbers that sort the angles */</span>
Q = P<span style="color: #66cc66;">&#91;</span>idx,<span style="color: #66cc66;">&#93;</span>;                 <span style="color: #006400; font-style: italic;">/* sort the vertices of the polygon by their angle */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Connect Sorted Points&quot;</span>;
<span style="color: #000080; font-weight: bold;">run</span> GraphVertices<span style="color: #66cc66;">&#40;</span>Q<span style="color: #66cc66;">&#41;</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2021/11/cvexhull05.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/cvexhull05.png" alt="" width="360" height="360" class="alignnone size-full wp-image-36114" srcset="https://blogs.sas.com/content/iml/files/2021/11/cvexhull05.png 480w, https://blogs.sas.com/content/iml/files/2021/11/cvexhull05-300x300.png 300w, https://blogs.sas.com/content/iml/files/2021/11/cvexhull05-150x150.png 150w" sizes="(max-width: 360px) 100vw, 360px" /></a>

<p>
With this ordering, the vertices are now sorted in sequential order according to the angle each vertex makes with the centroid. 
</p>
<p>
In summary, the CVEXHULL function in SAS/IML returns vertices of a convex polygon in sequential order.
But if you are even given the vertices in a random order, you can perform a computation to
sort them by using the angle each vertex makes with the centroid. 
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/17/order-vertices-convex-polygon.html">The order of vertices on a convex polygon</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Pfvt7hB-jqc:oEqMx1-YUN8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Pfvt7hB-jqc:oEqMx1-YUN8:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Pfvt7hB-jqc:oEqMx1-YUN8:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Pfvt7hB-jqc:oEqMx1-YUN8:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Pfvt7hB-jqc:oEqMx1-YUN8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Pfvt7hB-jqc:oEqMx1-YUN8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Pfvt7hB-jqc:oEqMx1-YUN8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Pfvt7hB-jqc:oEqMx1-YUN8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Pfvt7hB-jqc:oEqMx1-YUN8:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Pfvt7hB-jqc:oEqMx1-YUN8:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/11/17/order-vertices-convex-polygon.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2021/11/cvexhull06-150x150.png" />
	</item>
		<item>
		<title>Two-dimensional convex hulls in SAS</title>
		<link>https://blogs.sas.com/content/iml/2021/11/15/2d-convex-hull-sas.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/11/15/2d-convex-hull-sas.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 15 Nov 2021 10:20:18 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=36066</guid>

					<description><![CDATA[<p>Given a cloud of points in the plane, it can be useful to identify the convex hull of the points. The convex hull is the smallest convex set that contains the observations. For a finite set of points, it is a convex polygon that has some of the points as [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/15/2d-convex-hull-sas.html">Two-dimensional convex hulls in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<a href="https://blogs.sas.com/content/iml/files/2021/11/cvexhull01.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/cvexhull01.png" alt="" width="480" height="360" class="alignright size-full wp-image-36075" srcset="https://blogs.sas.com/content/iml/files/2021/11/cvexhull01.png 640w, https://blogs.sas.com/content/iml/files/2021/11/cvexhull01-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>


<p>
Given a cloud of points in the plane, it can be useful to identify the <em>convex hull</em> of the points.
The convex hull is the smallest convex set that contains the observations. For a finite set of points, it is a convex polygon that has some of the points as its vertices. An example of a convex hull is shown
to the right. The convex hull is the polygon that encloses the points.
</p><p>
This article describes <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect105.htm">the CVEXHULL function</a> in the SAS/IML language, which computes the convex hull for a set of planar points.
</p>

<h3>How to compute a 2-D convex hull in SAS</h3><h3>
<p>
The CVEXHULL function takes a <em>N</em>&nbsp;x&nbsp;2 data matrix. Each row of the matrix is the (x,y) coordinates for a point. The CVEXHULL function returns a column vector of length <em>N</em>. 
</p>
<ul>
<li>The first few elements are positive integers. They represent the rows of the data matrix that form the convex hull.  </li>
<li>The positive integers are sorted so that you can visualize the convex hull by connecting the points in order.</li>
<li>The remaining elements are negative integers. The absolute values of these integers represent the rows of the data matrix that are contained in the convex hull or are on the boundary of the convex hull but are not vertices.</li>
</ul>

<p>The following example comes from the SAS/IML documentation. The data matrix contains 18 points. Of those, six are the vertices of the convex hull. The output of the CVEXHULL function is shown below:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>; 
points = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">0</span>  <span style="color: #2e8b57; font-weight: bold;">2</span>, <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">2</span>, <span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #2e8b57; font-weight: bold;">2</span>, <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">0.5</span> <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>  <span style="color: #2e8b57; font-weight: bold;">0</span>, 
          <span style="color: #2e8b57; font-weight: bold;">2</span> -<span style="color: #2e8b57; font-weight: bold;">1</span>,   <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">1</span>,   <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">1</span>,   <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">4</span> -<span style="color: #2e8b57; font-weight: bold;">1</span>, 
          <span style="color: #2e8b57; font-weight: bold;">5</span>  <span style="color: #2e8b57; font-weight: bold;">2</span>,   <span style="color: #2e8b57; font-weight: bold;">5</span> <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">5</span> <span style="color: #2e8b57; font-weight: bold;">0</span>,   <span style="color: #2e8b57; font-weight: bold;">6</span> <span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #66cc66;">&#125;</span>; 
&nbsp;
<span style="color: #006400; font-style: italic;">/* Find the convex hull:
   - indices on the convex hull are positive
   - the indices for the convex hull are listed first, in sequential order
   - interior indices are negative
 */</span>
Indices = cvexhull<span style="color: #66cc66;">&#40;</span> points <span style="color: #66cc66;">&#41;</span>; 
<span style="color: #0000ff;">reset</span> wide;
print <span style="color: #66cc66;">&#40;</span>Indices`<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Indices&quot;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/cvexhull02.png" alt="" width="442" height="59" class="alignnone size-full wp-image-36072" srcset="https://blogs.sas.com/content/iml/files/2021/11/cvexhull02.png 442w, https://blogs.sas.com/content/iml/files/2021/11/cvexhull02-300x40.png 300w" sizes="(max-width: 442px) 100vw, 442px" />

<p>
I have highlighted the first six elements. The indices tell you that the convex hull is formed by using the 1st, 5th, 8th, 14th, 18th, and 15th points of the data matrix. You can use the LOC statement to find the positive values in the <tt>Indices</tt> vector. You can use those values to extract the points on the convex hull, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;">hullIdx = indices<span style="color: #66cc66;">&#91;</span>loc<span style="color: #66cc66;">&#40;</span>indices&gt;<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>;   <span style="color: #006400; font-style: italic;">/* the positive indices */</span>
convexHull = points<span style="color: #66cc66;">&#91;</span>hullIdx, <span style="color: #66cc66;">&#93;</span>;      <span style="color: #006400; font-style: italic;">/* extract rows */</span>
print hullIdx convexHull<span style="color: #66cc66;">&#91;</span>c=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'cx'</span> <span style="color: #a020f0;">'cy'</span><span style="color: #66cc66;">&#125;</span> L=<span style="color: #a020f0;">&quot;&quot;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/cvexhull03.png" alt="" width="112" height="193" class="alignnone size-full wp-image-36078" />

<p>
The output shows that the convex hull is formed by the six points (0,2), (0,0), ..., (5,2).
</p>

</h3><h3>Visualize the convex hull</h3>
<p>
The graph at the beginning of this article shows the convex hull as a shaded polygon. The original points are overlaid on the polygon and labeled by the observation number. The six points that form the convex hull are colored red. This section shows how to create the graph.  
</p>
<p>
The graph uses the POLYGON statement to visualize the convex hull. 
This enables you to shade the interior of the convex hull. If you do not need the shading, you
could use a SERIES statement, but to get a <em>closed</em> polygon you would need to add the first point to the end of the list of vertices.
</p>
<p>
To create the graph, you must write the relevant information to a SAS data set so that you can use PROC SGPLOT to create the graph. The following statements write the (x,y) coordinates of the point, the observation numbers (for the data labels), the coordinates of the convex hull vertices (cx, cy), and an ID variable, which is required to use the POLYGON statement. It also creates a binary indicator variable that is used to color-code the markers in the scatter plot:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">x</span> = points<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span>; y = points<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span>; 
obsNum = t<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:nrow<span style="color: #66cc66;">&#40;</span>points<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* optional: use observation numbers for labels */</span>
<span style="color: #006400; font-style: italic;">/* The points on the convex hull are sorted in counterclockwise order.
   If you use a series plot, you must repeat the first point 
   so that the polygon is closed. For example, use
   convexHull = convexHull // convexHull[1,]; */</span>
cx = convexHull<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span>; cy = convexHull<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span>; 
ID = j<span style="color: #66cc66;">&#40;</span>nrow<span style="color: #66cc66;">&#40;</span>cx<span style="color: #66cc66;">&#41;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;        <span style="color: #006400; font-style: italic;">/* create ID variable for POLYGON statement */</span>
<span style="color: #006400; font-style: italic;">/* create a binary (0/1) indicator variable */</span> 
OnHull = j<span style="color: #66cc66;">&#40;</span>nrow<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* most points NOT vertices of the convex hull */</span>
OnHull<span style="color: #66cc66;">&#91;</span>hullIdx<span style="color: #66cc66;">&#93;</span> = <span style="color: #2e8b57; font-weight: bold;">1</span>;         <span style="color: #006400; font-style: italic;">/* these points are the vertices */</span>
&nbsp;
<span style="color: #0000ff;">create</span> CHull <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">'x'</span> <span style="color: #a020f0;">'y'</span> <span style="color: #a020f0;">'cx'</span> <span style="color: #a020f0;">'cy'</span> <span style="color: #a020f0;">'ID'</span> <span style="color: #a020f0;">'obsNum'</span> <span style="color: #a020f0;">'OnHull'</span><span style="color: #66cc66;">&#125;</span>;
append;
<span style="color: #0000ff;">close</span>;
<span style="color: #000080; font-weight: bold;">QUIT</span>;</pre></td></tr></table></div>




<p>
In the graph at the top of this article, vertices of the convex hull are colored red and the other points are blue. When you use the GROUP= option in PROC SGPLOT statements, <a href="https://blogs.sas.com/content/graphicallyspeaking/2014/10/19/consistent-group-colors-by-value/">the group colors might depend on the order of the observations in the data</a>. To ensure that the colors are consistent
regardless of the order of the data set, you can use a discrete attribute map to associate colors and values of the grouping variable. For details about using a discrete attribute maps, see <a href="https://blogs.sas.com/content/graphicallyspeaking/2016/09/13/legend-order-and-group-attributes/">Kuhfeld's 2016 article</a>. 
</p><p>
To use a discrete attribute map, you need to define it in a SAS data set, read it by using the DATTRMAP= option on the PROC SGPLOT statement, and specify it by using the ATTRID= statement on the SCATTER statement, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">data</span> DAttrs;                             <span style="color: #006400; font-style: italic;">/* use DATTRMAP=&lt;data set name&gt; */</span>
<span style="color: #0000ff;">length</span> MarkerStyleElement $11.;
ID = <span style="color: #a020f0;">&quot;HullAttr&quot;</span>;                         <span style="color: #006400; font-style: italic;">/* use ATTRID=&lt;ID value&gt; */</span>
Value = <span style="color: #2e8b57; font-weight: bold;">0</span>; MarkerStyleElement = <span style="color: #a020f0;">&quot;GraphData1&quot;</span>; <span style="color: #0000ff;">output</span>; <span style="color: #006400; font-style: italic;">/* 0 ==&gt; 1st color */</span>
Value = <span style="color: #2e8b57; font-weight: bold;">1</span>; MarkerStyleElement = <span style="color: #a020f0;">&quot;GraphData2&quot;</span>; <span style="color: #0000ff;">output</span>; <span style="color: #006400; font-style: italic;">/* 1 ==&gt; 2nd color */</span>
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Points and Convex Hull&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=CHull DATTRMAP=DAttrs;
   polygon <span style="color: #0000ff;">x</span>=cx y=cy ID=ID / fill outline lineattrs=GraphData2;
   scatter <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">x</span> y=y / datalabel=obsNum <span style="color: #0000ff;">group</span>=OnHull 
                     markerattrs=<span style="color: #66cc66;">&#40;</span>symbol=CircleFilled<span style="color: #66cc66;">&#41;</span> ATTRID=HullAttr;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<p>
The graph is shown at the top of this article. Notice that the points
(0.5, 2) and (1, 2) are on the boundary of the convex hull, but they are drawn in blue because they are not vertices of the polygon.
</p>

<h3>Summary</h3>
<p>
In summary, you can compute a 2-D convex hull by using the CVEXHULL function in SAS/IML software. 
The output is a set of indices, which you can use to extract the vertices of the convex hull and to
color-code markers in a scatter plot. 
</p><p>
By the way, there is a hidden message in the graph of the convex hull. Can you see it? It has been hiding in the SAS/IML documentation for more than 20 years.
</p><p>
In closing, I'll mention that a 2-D convex hull is one computation in the general field of
computational geometry. The SAS/IML group is working to add additional
functionality to the language, including convex hulls in higher dimensions.
In your work, do you have specific needs for results in computational geometry? If so, let me know the details in the comments.
</p><p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/15/2d-convex-hull-sas.html">Two-dimensional convex hulls in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=CX55JmmXon4:WhczTKbma8Y:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=CX55JmmXon4:WhczTKbma8Y:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=CX55JmmXon4:WhczTKbma8Y:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=CX55JmmXon4:WhczTKbma8Y:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=CX55JmmXon4:WhczTKbma8Y:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=CX55JmmXon4:WhczTKbma8Y:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=CX55JmmXon4:WhczTKbma8Y:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=CX55JmmXon4:WhczTKbma8Y:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=CX55JmmXon4:WhczTKbma8Y:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=CX55JmmXon4:WhczTKbma8Y:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/11/15/2d-convex-hull-sas.html/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2021/11/cvexhull01-150x150.png" />
	</item>
		<item>
		<title>Create a frequency polygon in SAS</title>
		<link>https://blogs.sas.com/content/iml/2021/11/10/create-frequency-polygon-sas.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/11/10/create-frequency-polygon-sas.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Wed, 10 Nov 2021 10:24:44 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=36021</guid>

					<description><![CDATA[<p>I was recently asked how to create a frequency polygon in SAS. A frequency polygon is an alternative to a histogram that shows similar information about the distribution of univariate data. It is the piecewise linear curve formed by connecting the midpoints of the tops of the bins. The graph [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/10/create-frequency-polygon-sas.html">Create a frequency polygon in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<a href="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly1.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly1.png" alt="" width="480" height="360" class="alignright size-full wp-image-36027" srcset="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly1.png 640w, https://blogs.sas.com/content/iml/files/2021/11/FreqPoly1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
I was recently asked how to create a frequency polygon in SAS.  A frequency polygon is an alternative to a histogram that shows similar information about the distribution of univariate data. It is the piecewise linear curve formed by connecting the midpoints of the tops of the bins. The graph to the right shows a histogram and a 
frequency polygon for the same data. 
This article shows how to create a frequency polygon in SAS.
</p>
<p>
In practice, frequency polygons are not used as often as histograms are, but they are useful pedagogical tools for teaching the fundamentals of density estimation. The histogram is an estimate of the density of univariate data, but it is a bar chart. Accordingly, it looks different from density estimate curves, such as parametric densities and kernel density estimates.
The frequency polygon shows the same information as a histogram but displays the information as a line plot. Therefore, you can more easily compare the frequency polygon curve and other density estimate curves.
</p>
<p>
A frequency polygon is also a good way to introduce the ideas behind a cumulative distribution. 
<a href="https://blogs.sas.com/content/iml/2016/09/26/create-ogive-sas.html">An ogive is a graph of the cumulative sum</a> of the vertical coordinates of the frequency polygon. The ogive approximates
the cumulative distribution in the same way that the frequency polygon approximates the density.
</p>


<h3>Create a frequency polygon in SAS</h3>
<p>
You can use the UNIVARIATE procedure in SAS to generate the points for a frequency polygon.
You can use the OUTHIST= option to specify a data set that contains the counts for each bar in the histogram. 
The midpoints of the histogram bins are contained in the _MIDPT_ variable. The count in each bin is
contained in the _COUNT_ variable. 
</p><p>
If you do not like the default width of the histogram bins, you can use the MIDPOINTS= option to specify your own set of midpoints. For example, the following statements create a histogram for the EngineSize variable in the Sashelp.Cars data set. You can use the SERIES statement in PROC SGPLOT to create a line plot that displays the vertical height of each histogram bar, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=sashelp.cars<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">keep</span>=EngineSize<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">var</span> EngineSize;
   histogram / outhist=OutHist grid vscale=count
               midpoints=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1.4</span> to <span style="color: #2e8b57; font-weight: bold;">8.4</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">0.4</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* use midpoints= option to specify midpoints */</span>
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* optionally, print the OutHist data */</span>
<span style="color: #006400; font-style: italic;">/* proc print data=OutHist; run;      */</span>
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Frequency Polygon&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=OutHist;
   series <span style="color: #0000ff;">x</span>=_MIDPT_ y=_COUNT_ / markers;
   yaxis grid values=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span> to <span style="color: #2e8b57; font-weight: bold;">80</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">20</span><span style="color: #66cc66;">&#41;</span> <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Count&quot;</span> offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span>;
   xaxis grid values=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1.4</span> to <span style="color: #2e8b57; font-weight: bold;">8.4</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">0.4</span><span style="color: #66cc66;">&#41;</span> <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Engine Size (L)&quot;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly2.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly2.png" alt="" width="480" height="360" class="alignnone size-full wp-image-36039" srcset="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly2.png 640w, https://blogs.sas.com/content/iml/files/2021/11/FreqPoly2-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
The frequency polygon is shown. Like a histogram, the shape of the frequency polygon depends on the
bin width and anchor position. You can change those values by using the MIDPOINTS= option.
</p>

<h3>Overlay a frequency polygon and a kernel density estimate</h3>
<p>
As I mentioned earlier, an advantage of the frequency polygon is that it is a 
curve, not a bar chart. As such, it is easier to compare to other density estimates curves. 
In PROC UNIVARIATE, you can use the KERNEL option to overlay a kernel density curve on a histogram.
You can use the OUTKERNEL= option to write the kernel density estimate to a data set. You can then overlay and compare the frequency curve (a crude histogram-based estimate) and the kernel density estimate, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=sashelp.cars<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">keep</span>=EngineSize<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">var</span> EngineSize;
   histogram / outhist=OutHist grid vscale=count
               kernel outkernel=OutKer
               midpoints=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1.4</span> to <span style="color: #2e8b57; font-weight: bold;">8.4</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">0.4</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* use midpoints= option to specify midpoints */</span>
   ods <span style="color: #0000ff;">select</span> Moments Histogram;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">data</span> Density;  <span style="color: #006400; font-style: italic;">/* combine the estimates */</span>
<span style="color: #0000ff;">set</span> OutHist OutKer<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">rename</span>=<span style="color: #66cc66;">&#40;</span>_Count_=KerCount<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Frequency Polygon and Kernel Density Estimate&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=Density;
   series <span style="color: #0000ff;">x</span>=_MIDPT_ y=_COUNT_ / legendlabel=<span style="color: #a020f0;">&quot;Frequency Polygon&quot;</span>;
   series <span style="color: #0000ff;">x</span>=_VALUE_ y=KerCount / legendlabel=<span style="color: #a020f0;">&quot;Kernel Density Estimate&quot;</span>;
   yaxis offsetmin=<span style="color: #2e8b57; font-weight: bold;">0</span> grid values=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span> to <span style="color: #2e8b57; font-weight: bold;">80</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">20</span><span style="color: #66cc66;">&#41;</span> <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Estimated Count&quot;</span>;
   xaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Engine Size (L)&quot;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly3.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly3.png" alt="" width="480" height="360" class="alignnone size-full wp-image-36036" srcset="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly3.png 640w, https://blogs.sas.com/content/iml/files/2021/11/FreqPoly3-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>


<p>
As shown in the graph, a kernel density estimate is a smoother version of the frequency polygon.
</p>

<h3>Summary</h3>
<p>
This article shows how to create a graph of the frequency polygon in SAS.
A frequency polygon is a piecewise linear curve formed by connecting the midpoints of the tops of the bars in a histogram. The frequency polygon is a curve, so it is easier to compare it with other parametric or nonparametric
density estimates.
</p><p>
One final remark: I don't like the name "frequency polygon." A polygon is a <em>closed</em> planar region formed by connecting a set of points and then connecting the first and last points. The density estimate in this article is not closed. I would prefer a term such as "frequency polyline" or "frequency curve,"  but "polygon" seems to be the standard term that appears in introductory statistics textbooks. 
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/10/create-frequency-polygon-sas.html">Create a frequency polygon in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=gw9g0OP4z2Q:_KSUMDUWa0A:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=gw9g0OP4z2Q:_KSUMDUWa0A:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=gw9g0OP4z2Q:_KSUMDUWa0A:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=gw9g0OP4z2Q:_KSUMDUWa0A:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=gw9g0OP4z2Q:_KSUMDUWa0A:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=gw9g0OP4z2Q:_KSUMDUWa0A:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=gw9g0OP4z2Q:_KSUMDUWa0A:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=gw9g0OP4z2Q:_KSUMDUWa0A:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=gw9g0OP4z2Q:_KSUMDUWa0A:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=gw9g0OP4z2Q:_KSUMDUWa0A:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/11/10/create-frequency-polygon-sas.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2021/11/FreqPoly1-150x150.png" />
	</item>
		<item>
		<title>The normal approximation and random samples of the binomial distribution</title>
		<link>https://blogs.sas.com/content/iml/2021/11/08/normal-approximation-binomial.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/11/08/normal-approximation-binomial.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 08 Nov 2021 10:21:51 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Simulation]]></category>
		<category><![CDATA[Statistical Thinking]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=35970</guid>

					<description><![CDATA[<p>Recall that the binomial distribution is the distribution of the number of successes in a set of independent Bernoulli trials, each having the same probability of success. Most introductory statistics textbooks discuss the approximation of the binomial distribution by the normal distribution. The graph to the right shows that the [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/08/normal-approximation-binomial.html">The normal approximation and random samples of the binomial distribution</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<a href="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal1.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal1.png" alt="" width="480" height="360" class="alignright size-full wp-image-35991" srcset="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal1.png 640w, https://blogs.sas.com/content/iml/files/2021/11/BinomNormal1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
Recall that the binomial distribution is the distribution of the number of successes in a set of independent Bernoulli trials, each having the same probability of success.
Most introductory statistics textbooks discuss the approximation of the binomial distribution by the normal distribution. The graph to the right shows that the normal density (the red curve, N(&mu;=9500, &sigma;=21.79)) can be a very good approximation to the binomial density (blue bars, Binom(p=0.95, nTrials=10000)). However, because the binomial density is discrete, the binomial density is defined only for positive integers, whereas the normal density is defined for all real numbers.
</p>
<p>
In this graph, the binomial density and the normal density are close. But what does the approximation look like if you overlay a bar chart of a random sample from the binomial distribution? It turns out that the bar chart can have large deviations from the normal curve, even for a large sample. Read on for an example.
</p>

<h3>The normal approximation to the binomial distribution</h3>
<p>
I have written two previous articles that discuss the normal approximation to the binomial distribution:
</p>
<ul>
<li><a href="https://blogs.sas.com/content/iml/2016/09/12/overlay-curve-bar-chart-sas.html">Overlay the binomial and normal densities</a>: This article shows how to overlay the discrete binomial density and the continuous normal density by using
VBARBASIC (or NEEDLE) statement and the SERIES statement in PROC SGPLOT. The graph to the right uses a SAS program that is presented in that article.
</li>
<li><a href="https://blogs.sas.com/content/iml/2012/03/14/the-normal-approximation-to-the-binomial-distribution-how-the-quantiles-compare.html">The Normal approximation to the binomial distribution: How the quantiles compare</a>. This article discusses the fact that the tails of the binomial distribution do not agree with the tails of the normal quantiles, even if the normal approximation is very close in the center of the distribution.</li>
</ul>

<p>
The normal approximation is used to estimate probabilities because it is often easier to use the area under the normal curve than to sum many discrete values.
However, as shown in the second article, the discrete binomial distribution can have statistical properties that are different from the normal distribution. 
</p>

<h3>Random binomial samples</h3>
<p>
It is important to remember that a <em>random sample</em> (from <em>any</em> distribution!) can look much different from the underlying probability density function. 
The following graph shows a random sample from the binomial distribution Binom(0.95, 10000). The distribution of the sample looks quite different from the density curve. 
</p>

<a href="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal2.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal2.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35985" srcset="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal2.png 640w, https://blogs.sas.com/content/iml/files/2021/11/BinomNormal2-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
In statistical terms, the observed and expected values have large deviations.
Also, note that there can be a considerable deviation between adjacent bars. For example, in the graph, some bars have about 2% of the total frequency whereas an adjacent bar might have half that value.
I was surprised to observe the large deviations in a large sample.
</p>

<p>
This graph emphasizes the fact that a random sample from the binomial distribution can look different from the smooth bell-shaped curve of the probability density. 
I think I was surprised by the magnitude of the deviations from the expected values because I have more experience visualizing <em>continuous</em> distributions. For a continuous distribution, we use a histogram to display the empirical distribution. When the bin width of the histogram is greater than 1, it smooths out the differences between adjacent bars to create a much smoother estimate of the density. For example, the following graph displays the distribution of the same random sample but uses a bin width of 10 to aggregate the frequency. The resulting histogram is much smoother and resembles the normal approximation.
</p>

<a href="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal3.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal3.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35988" srcset="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal3.png 640w, https://blogs.sas.com/content/iml/files/2021/11/BinomNormal3-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>



<h3>Summary</h3>
<p>
The normal approximation is a convenient way to compute probabilities for the binomial distribution. However, it is important to remember that the binomial distribution is a discrete distribution. A binomial random variable can assume values only for positive integers. One consequence of this observation is that a bar chart of a random
binomial sample can show considerable deviations from the theoretical density. This is normal (pun intended!). 
It is often overlooked because if you treat the random variable as if it were continuous and use a histogram to estimate the density, the histogram smooths out a lot of the bar-to-bar deviations.
</p>

<h3>Appendix</h3>
<p>
The following SAS program creates the graphs in this article.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* define parameters for the Binom(p=0.95, nTrials=10000) simulation */</span>
<span style="color: #0000ff;">%let</span> p = <span style="color: #2e8b57; font-weight: bold;">0.95</span>;                    <span style="color: #006400; font-style: italic;">/* probability of success */</span>
<span style="color: #0000ff;">%let</span> NTrials = <span style="color: #2e8b57; font-weight: bold;">10000</span>;             <span style="color: #006400; font-style: italic;">/* number of trials */</span>
<span style="color: #0000ff;">%let</span> <span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">1000</span>;                    <span style="color: #006400; font-style: italic;">/* sample size */</span>   
&nbsp;
<span style="color: #006400; font-style: italic;">/* First graph: Compute the density of the Normal and Binomial distributions. See
   https://blogs.sas.com/content/iml/2016/09/12/overlay-curve-bar-chart-sas.html
*/</span>
<span style="color: #000080; font-weight: bold;">data</span> Binom;
<span style="color: #0000ff;">n</span> = <span style="color: #0000ff; font-weight: bold;">&amp;nTrials</span>;  p = <span style="color: #0000ff; font-weight: bold;">&amp;p</span>;  q = <span style="color: #2e8b57; font-weight: bold;">1</span> - p;
mu = <span style="color: #0000ff;">n</span><span style="color: #006400; font-style: italic;">*p;</span>  sigma = <span style="color: #0000ff;">sqrt</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">n</span><span style="color: #006400; font-style: italic;">*p*q);</span>   <span style="color: #006400; font-style: italic;">/* parameters for the normal approximation */</span>
Lower = mu-<span style="color: #2e8b57; font-weight: bold;">3.5</span><span style="color: #006400; font-style: italic;">*sigma;</span>             <span style="color: #006400; font-style: italic;">/* evaluate normal density on [Lower, Upper] */</span>
Upper = mu+<span style="color: #2e8b57; font-weight: bold;">3.5</span><span style="color: #006400; font-style: italic;">*sigma;</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* PDF of normal distribution */</span>
<span style="color: #0000ff;">do</span> t = Lower to Upper <span style="color: #0000ff;">by</span> sigma/<span style="color: #2e8b57; font-weight: bold;">20</span>;
   <span style="color: #0000ff;">Normal</span> = <span style="color: #0000ff;">pdf</span><span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;normal&quot;</span>, t, mu, sigma<span style="color: #66cc66;">&#41;</span>;       <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* PMF of binomial distribution */</span>
t = .; <span style="color: #0000ff;">Normal</span> = .;        <span style="color: #006400; font-style: italic;">/* these variables are not used for the bar chart */</span>
<span style="color: #0000ff;">do</span> j = <span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #0000ff;">floor</span><span style="color: #66cc66;">&#40;</span>Lower<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> to <span style="color: #0000ff;">ceil</span><span style="color: #66cc66;">&#40;</span>Upper<span style="color: #66cc66;">&#41;</span>;
   Binomial = <span style="color: #0000ff;">pdf</span><span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Binomial&quot;</span>, j, p, <span style="color: #0000ff;">n</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
<span style="color: #006400; font-style: italic;">/* store mu and sigma in macro variables */</span>
<span style="color: #0000ff;">call</span> symput<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;mu&quot;</span>, strip<span style="color: #66cc66;">&#40;</span>mu<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;      
<span style="color: #0000ff;">call</span> symput<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;sigma&quot;</span>, strip<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">round</span><span style="color: #66cc66;">&#40;</span>sigma,<span style="color: #2e8b57; font-weight: bold;">0.01</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">label</span> Binomial=<span style="color: #a020f0;">&quot;Binomial Probability&quot;</span>  <span style="color: #0000ff;">Normal</span>=<span style="color: #a020f0;">&quot;Normal Density&quot;</span>;
<span style="color: #0000ff;">keep</span> t <span style="color: #0000ff;">Normal</span> j Binomial;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* overlay binomial density (needle plot) and normal density (series plot) */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Binomial Probability and Normal Approximation&quot;</span>;
title2 <span style="color: #a020f0;">&quot;Binom(0.95, 10000) and N(9500, 21.79)&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=Binom;
   needle <span style="color: #0000ff;">x</span>=j y=Binomial;
   series <span style="color: #0000ff;">x</span>=t y=<span style="color: #0000ff;">Normal</span> / lineattrs=GraphData2<span style="color: #66cc66;">&#40;</span>thickness=<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;
   inset <span style="color: #a020f0;">&quot;p = &amp;p&quot;</span>  <span style="color: #a020f0;">&quot;q = %sysevalf(1-&amp;p)&quot;</span> <span style="color: #a020f0;">&quot;nTrials = &amp;nTrials&quot;</span>  
         <span style="color: #a020f0;">&quot;(*ESC*){unicode mu} = np = &amp;mu&quot;</span>           <span style="color: #006400; font-style: italic;">/* use Greek letters */</span>
         <span style="color: #a020f0;">&quot;(*ESC*){unicode sigma} = sqrt(npq) = &amp;sigma&quot;</span> /
         position=topright border;
   yaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Probability&quot;</span>;
   xaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;x&quot;</span> integer;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/*************************/</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* Second graph: simulate a random sample from Binom(p, NTrials) */</span>
<span style="color: #000080; font-weight: bold;">data</span> Bin<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">keep</span>=<span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">call</span> streaminit<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to &amp;<span style="color: #0000ff;">N</span>; 
   <span style="color: #0000ff;">x</span> = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Binomial&quot;</span>, <span style="color: #0000ff; font-weight: bold;">&amp;p</span>, <span style="color: #0000ff; font-weight: bold;">&amp;NTrials</span><span style="color: #66cc66;">&#41;</span>; 
   <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
<span style="color: #000080; font-weight: bold;">run</span>; 
&nbsp;
<span style="color: #006400; font-style: italic;">/* count the frequency of each observed count */</span>
<span style="color: #000080; font-weight: bold;">proc freq</span> <span style="color: #000080; font-weight: bold;">data</span>=Bin noprint;
   tables <span style="color: #0000ff;">x</span> / out=FreqOut;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">data</span> All;
<span style="color: #0000ff;">set</span> FreqOut Binom<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">keep</span>=t <span style="color: #0000ff;">Normal</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">Normal</span> = <span style="color: #2e8b57; font-weight: bold;">100</span><span style="color: #006400; font-style: italic;">*1*Normal;</span>   <span style="color: #006400; font-style: italic;">/* match scales: 100*h*PDF, where h=binwidth */</span>
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* overlay sample and normal approximation */</span>
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Random Binomial Sample&quot;</span>;
title2 <span style="color: #a020f0;">&quot;Bar Chart of Binom(0.95, 10000)&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=All;   
   needle <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">x</span> y=Percent;
   series <span style="color: #0000ff;">x</span>=t y=<span style="color: #0000ff;">Normal</span> / lineattrs=GraphData2<span style="color: #66cc66;">&#40;</span>thickness=<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;
   inset <span style="color: #a020f0;">&quot;n = &amp;n&quot;</span>  <span style="color: #a020f0;">&quot;p = &amp;p&quot;</span> 
         <span style="color: #a020f0;">&quot;(*ESC*){unicode mu} = &amp;mu&quot;</span>           <span style="color: #006400; font-style: italic;">/* use Greek letters */</span>
         <span style="color: #a020f0;">&quot;(*ESC*){unicode sigma} = &amp;sigma&quot;</span> /
         position=topright border;
   yaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Percent / Scaled Density&quot;</span>;
   xaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;x&quot;</span> integer;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/*************************/</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* Third graph: the bar-to-bar deviations are smoothed if you use a histogram */</span>
title2 <span style="color: #a020f0;">&quot;Histogram BinWidth=10&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=Bin;   
   histogram <span style="color: #0000ff;">x</span> / scale=percent binwidth=<span style="color: #2e8b57; font-weight: bold;">10</span>;
   xaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;x&quot;</span> integer values=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">9400</span> to <span style="color: #2e8b57; font-weight: bold;">9600</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">20</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* for comparison, the histogram looks like the bar chart if you set BINWIDTH=1 */</span>
title2 <span style="color: #a020f0;">&quot;Histogram BinWidth=1&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=Bin;   
   histogram <span style="color: #0000ff;">x</span> / binwidth=<span style="color: #2e8b57; font-weight: bold;">1</span> scale=percent;
   xaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;x&quot;</span> integer;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>



<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/08/normal-approximation-binomial.html">The normal approximation and random samples of the binomial distribution</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=RvQCMgMFIFI:d1rx5riULtE:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=RvQCMgMFIFI:d1rx5riULtE:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=RvQCMgMFIFI:d1rx5riULtE:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=RvQCMgMFIFI:d1rx5riULtE:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=RvQCMgMFIFI:d1rx5riULtE:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=RvQCMgMFIFI:d1rx5riULtE:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=RvQCMgMFIFI:d1rx5riULtE:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=RvQCMgMFIFI:d1rx5riULtE:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=RvQCMgMFIFI:d1rx5riULtE:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=RvQCMgMFIFI:d1rx5riULtE:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/11/08/normal-approximation-binomial.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2021/11/BinomNormal2-150x150.png" />
	</item>
		<item>
		<title>Add reference lines to a bar chart in SAS</title>
		<link>https://blogs.sas.com/content/iml/2021/11/03/reference-lines-bar-chart-sas.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/11/03/reference-lines-bar-chart-sas.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Wed, 03 Nov 2021 09:28:43 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=35922</guid>

					<description><![CDATA[<p>A SAS programmer asked whether it is possible to add reference lines to the categorical axis of a bar chart. The answer is yes. You can use the VBAR statement, but I prefer to use the VBARBASIC (or VBARPARM) statement, which enables you to overlay a wide variety of graphs [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/03/reference-lines-bar-chart-sas.html">Add reference lines to a bar chart in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A SAS programmer asked whether it is possible to add reference lines to the categorical axis of a bar chart. The answer is yes. You can use the VBAR statement, but I prefer to use the VBARBASIC (or VBARPARM) 
statement, which enables you to overlay a wide variety of graphs on a bar chart. I have previously written about <a href="https://blogs.sas.com/content/iml/2021/04/14/overlay-graphs-basic-bar-chart.html">using the VBARBASIC statement to overlay graphs on bar charts</a>. The VBARBASIC chart is compatible with more graphs than the VBAR chart. See the documentation for <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/grstatproc/p0yud64khw8fuin1xgr85dgxbb7t.htm">a complete discussion of "compatible" plot types.</a>
</p>
<p>
This article shows two ways to overlay a reference line on the categorical axis of a bar chart. 
But the SAS programmer wanted more. He wanted to create a bar for each day of the year. That is a lot of bars! For bar charts that have many bars, I recommend using the NEEDLE statement to create a needle plot.
The second part of this article demonstrates a needle plot and overlays reference lines for certain holidays.
</p>
<p>
For simplicity, this article discusses only vertical bar charts, but all programs can be adapted to display horizontal bar charts. 
</p>

<h3>Reference lines and bar charts that use the VBAR statement</h3>

<a href="https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR1.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR1.png" alt="" width="480" height="360" class="alignright size-full wp-image-35940" srcset="https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR1.png 640w, https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
First, to be clear, you can easily add <em>horizontal</em> reference lines to a vertical bar chart. <a href="https://blogs.sas.com/content/iml/2020/04/13/horizontal-vertical-reference-lines-sas.html">This is straightforward</a>. The programmer wanted to add <em>vertical</em> reference lines to the <em>categorical</em> axis, as shown in the graph to the right. In this graph, reference lines are added behind the bars for Age=12 and Age=14. I made the bars semi-transparent so that the full reference lines are visible.
</p>
<p>
As the SAS programmer discovered, the following attempt to add reference lines does not display any reference lines:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Bar Chart with Reference Line on Categorical Axis&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=Sashelp.Class;
  refline <span style="color: #2e8b57; font-weight: bold;">12</span> <span style="color: #2e8b57; font-weight: bold;">14</span> / axis=<span style="color: #0000ff;">x</span> lineattrs=<span style="color: #66cc66;">&#40;</span>color=red<span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* DOES NOT WORK */</span>
  vbar Age / response=Weight transparency=<span style="color: #2e8b57; font-weight: bold;">0.2</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<p>
Why don't the reference lines appear? As I have previously written, <a href="https://blogs.sas.com/content/iml/2020/04/13/horizontal-vertical-reference-lines-sas.html">you must specify the <em>formatted</em> values for a categorical axis</a>. This is mentioned in <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/grstatproc/n0ezn6n9coszztn100iyzhfvwnhp.htm">the documentation for the REFLINE statement</a>, which states that "unformatted numeric values do not map to a formatted discrete axis. For example, if reference lines are drawn at points on a discrete X axis, the REFLINE values must be the formatted value that appears on the X axis."
In other words, you must change the REFLINE values to be "the formatted values," which are '12' and '14'. The following call to PROC SGPLOT displays the vertical reference lines:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=Sashelp.Class;
  refline <span style="color: #a020f0;">'12'</span> <span style="color: #a020f0;">'14'</span> / axis=<span style="color: #0000ff;">x</span> lineattrs=<span style="color: #66cc66;">&#40;</span>color=red<span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* YES! THIS WORKS! */</span>
  vbar Age / response=Weight transparency=<span style="color: #2e8b57; font-weight: bold;">0.2</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<p>The reference lines are shown in the graph at the beginning of this section.
</p>

<h3>Reference lines and the VBARBASIC statement</h3>
<p>
I prefer to use the VBARBASIC statement for most bar charts. If you use the VBARBASIC statement, you can specify the raw reference values. To be honest, I am not sure why it works, but, in general, the VBARBASIC statement is better when you need to overlay a bar chart and other graphical elements. 
If you use the VBARBASIC statement, the natural syntax works as expected:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=Sashelp.Class;
  refline <span style="color: #2e8b57; font-weight: bold;">12</span> <span style="color: #2e8b57; font-weight: bold;">14</span> / axis=<span style="color: #0000ff;">x</span> lineattrs=<span style="color: #66cc66;">&#40;</span>color=red<span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* THIS WORKS, TOO! */</span>
  vbarbasic Age / response=Weight transparency=<span style="color: #2e8b57; font-weight: bold;">0.2</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<p> 
The graph is the same as shown in the previous section. 
</p>

<h3>Reference lines for holidays on a graph of sales by date</h3>
<p>
This section discusses an example that has hundreds of bars. Suppose you want to display a bar chart for sales by date for an entire year. 
For data like these, I have two recommendations:
</p>
<ol>
<li>Do not use a vertical bar chart. Even if each bar requires only three pixels, the chart will be more than 3*365 &asymp; 1,100 pixels wide. On a monitor that displays 72 pixels per inch, this graph would be about 40 cm (15.3 inches) wide.
A better choice is to use a needle plot, which is essentially a bar chart where each bar is represented as a vertical line. 
</li>
<li>
The horizontal axis cannot be discrete. If it is, you will get 365 dates printed along the axis. Instead, you want to use the XAXIS TYPE=TIME option to display the bars along an axis where tick marks are placed according to months, not days. (If the categories are not dates but are "days since the beginning," you can use the XAXIS TYPE=LINEAR option instead.)
</li>
</ol>

<p>Recall that the SAS programmer wanted to display holidays on the graph of sales for each day. Rather than specify the holidays on the REFLINE statement (for example, '01JAN2003'd '25DEC2003'd), it is more convenient to
put the reference line values into a SAS data set and specify the name of the  You can use the HOLIDAY function in SAS to get the date associated with major government holidays.
</p><p>
The following SAS DATA step extracts a year's worth of data for the sale of potato chips (in 2003) from the Sashelp.Snacks data set.  These data are concatenated with a separate data set that contains the holidays that you want to display by using reference lines.
A needle plot shows the daily sales and the reference lines.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">data</span> Snacks;       <span style="color: #006400; font-style: italic;">/* sales of potato chips for each date in 2003 */</span>
<span style="color: #0000ff;">set</span> Sashelp.Snacks;
<span style="color: #0000ff;">where</span> <span style="color: #a020f0;">'01JAN2003'</span>d &lt;= <span style="color: #0000ff;">Date</span> &lt;= <span style="color: #a020f0;">'31DEC2003'</span>d <span style="color: #0000ff;">AND</span> Product=<span style="color: #a020f0;">&quot;Classic potato chips&quot;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">data</span> Reflines;     <span style="color: #006400; font-style: italic;">/* holidays to overlay as reference lines */</span>
<span style="color: #0000ff;">format</span> RefDate DATE9.;
RefDate = holiday<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Christmas&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">2003</span><span style="color: #66cc66;">&#41;</span>;       <span style="color: #0000ff;">output</span>;
RefDate = holiday<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Halloween&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">2003</span><span style="color: #66cc66;">&#41;</span>;       <span style="color: #0000ff;">output</span>;
RefDate = holiday<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Memorial&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">2003</span><span style="color: #66cc66;">&#41;</span>;        <span style="color: #0000ff;">output</span>;
RefDate = holiday<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;NewYear&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">2003</span><span style="color: #66cc66;">&#41;</span>;         <span style="color: #0000ff;">output</span>;
RefDate = holiday<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Thanksgiving&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">2003</span><span style="color: #66cc66;">&#41;</span>;    <span style="color: #0000ff;">output</span>;
RefDate = holiday<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;USIndependence&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">2003</span><span style="color: #66cc66;">&#41;</span>;  <span style="color: #0000ff;">output</span>;
RefDate = holiday<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Valentines&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">2003</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #0000ff;">output</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">data</span> All;          <span style="color: #006400; font-style: italic;">/* concatentate the data and reference lines */</span>
<span style="color: #0000ff;">set</span> Snacks RefLines;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Sales and US Holidays&quot;</span>;
title2 <span style="color: #a020f0;">&quot;Needle Plot&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=All;
  refline RefDate / axis=<span style="color: #0000ff;">x</span> lineattrs=<span style="color: #66cc66;">&#40;</span>color=red<span style="color: #66cc66;">&#41;</span>;
  needle <span style="color: #0000ff;">x</span>=<span style="color: #0000ff;">Date</span> y=QtySold;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR2.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR2.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35937" srcset="https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR2.png 640w, https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR2-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
Notice that you do not have to use the XAXIS TYPE=TIME option with the NEEDLE statement. The SGPLOT procedure uses TYPE=TIME option by default when the X variable has a time, date, or datetime format.  If you decide to use the VBARBASIC statement, you should include the XAXIS TYPE=TIME statement.
</p>

<h3>Summary</h3>
<p>
In summary, this article shows how to add vertical reference lines to a vertical bar chart. You can use the VBAR statement and specify the formatted reference values, but I prefer to use the VBARBASIC statement whenever I want to overlay a bar chart and other graphical elements. You can also use a needle plot, which is especially helpful when you need to display 100 or more bars.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/03/reference-lines-bar-chart-sas.html">Add reference lines to a bar chart in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=2gsvzDOeoew:kj-WxQGLo4o:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=2gsvzDOeoew:kj-WxQGLo4o:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=2gsvzDOeoew:kj-WxQGLo4o:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=2gsvzDOeoew:kj-WxQGLo4o:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=2gsvzDOeoew:kj-WxQGLo4o:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=2gsvzDOeoew:kj-WxQGLo4o:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=2gsvzDOeoew:kj-WxQGLo4o:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=2gsvzDOeoew:kj-WxQGLo4o:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=2gsvzDOeoew:kj-WxQGLo4o:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=2gsvzDOeoew:kj-WxQGLo4o:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/11/03/reference-lines-bar-chart-sas.html/feed</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2021/11/ReflineVBAR2-150x150.png" />
	</item>
		<item>
		<title>Fit a mixture of Weibull distributions in SAS</title>
		<link>https://blogs.sas.com/content/iml/2021/11/01/fit-mixture-weibull-sas.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/11/01/fit-mixture-weibull-sas.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 01 Nov 2021 09:24:46 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=35814</guid>

					<description><![CDATA[<p>A previous article discusses how to use SAS regression procedures to fit a two-parameter Weibull distribution in SAS. The article shows how to convert the regression output into the more familiar scale and shape parameters for the Weibull probability distribution, which are fit by using PROC UNIVARIATE. Although PROC UNIVARIATE [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/01/fit-mixture-weibull-sas.html">Fit a mixture of Weibull distributions in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A previous article discusses <a href="https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html">how to use SAS regression procedures to fit a two-parameter Weibull distribution in SAS</a>.
The article shows how to convert the regression output into the more familiar scale and shape parameters for the Weibull probability distribution, which are fit by using PROC UNIVARIATE. 
</p><p>
Although PROC UNIVARIATE can fit many univariate distributions, it cannot fit a mixture of distributions. For that task, you need to use PROC FMM, which fits finite mixture models.
This article discusses how to use PROC FMM to fit a mixture of two Weibull distributions and how to interpret the results. The same technique can be used to fit other mixtures of distributions. 
If you are going to use the parameter estimates in SAS functions such as the PDF, CDF, and RAND functions, you cannot use the regression parameters directly. You must transform them into the distribution parameters.
</p>

<h3>Simulate a mixture of Weibull data</h3>
<p>
You can use the RAND function in the SAS DATA step to <a href="https://blogs.sas.com/content/iml/2011/09/21/generate-a-random-sample-from-a-mixture-distribution.html">simulate a mixture distribution</a> that has two components, each drawn from a Weibull distribution.  
The RAND function samples from a two-parameter Weibull distribution Weib(&alpha;, &beta;) whose density is given by <br />
<span class='MathJax_Preview'>\(f(x; \alpha, \beta) = 
\frac{\beta}{\alpha^{\beta}} (x)^{\beta -1} \exp \left(-\left(\frac{x}{\alpha}\right)^{\beta }\right)\)</span><script type='math/tex'>f(x; \alpha, \beta) = 
\frac{\beta}{\alpha^{\beta}} (x)^{\beta -1} \exp \left(-\left(\frac{x}{\alpha}\right)^{\beta }\right)</script>
<br />
where 
&alpha; is a shape parameter and &beta; is a scale parameter. This parameterization is used by most Base SAS functions and procedures, as well as many regression procedures in SAS.
The following SAS DATA step simulates data from two Weibull distributions.
The first component is sampled from 
Weib(&alpha;=1.5, &beta;=0.8)
and the second component is sampled from 
Weib(&alpha;=4, &beta;=2). For the mixture distribution, the probability of drawing from the first distribution is 0.667 and the probability of drawing from the second distribution is 0.333.
</p><p>
After generating the data, you can call PROC UNIVARIATE to estimate the parameters for each component.
Notice that this fits each component separately. If the parameter estimates are close to the parameter values, that is evidence that the simulation generated the data correctly.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* sample from a mixture of two-parameter Weibull distributions */</span>
<span style="color: #0000ff;">%let</span> <span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">3000</span>;
<span style="color: #000080; font-weight: bold;">data</span> Have<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">drop</span>=i<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">call</span> streaminit<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">12345</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">array</span> prob <span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span> _temporary_ <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0.667</span> <span style="color: #2e8b57; font-weight: bold;">0.333</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to &amp;<span style="color: #0000ff;">N</span>;
   component = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Table&quot;</span>, of prob<span style="color: #66cc66;">&#91;</span><span style="color: #006400; font-style: italic;">*]);</span>
   <span style="color: #0000ff;">if</span> component=<span style="color: #2e8b57; font-weight: bold;">1</span> <span style="color: #0000ff;">then</span>
      d = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;weibull&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">1.5</span>, <span style="color: #2e8b57; font-weight: bold;">0.8</span><span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* C=Shape=1.5; Sigma=Scale=0.8 */</span>
   <span style="color: #0000ff;">else</span>
      d = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;weibull&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">4</span>,   <span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* C=Shape=4; Sigma=Scale=2 */</span>
   <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=Have;
   class component;
   <span style="color: #0000ff;">var</span> d;
   histogram d / weibull NOCURVELEGEND;  <span style="color: #006400; font-style: italic;">/* fit (Sigma, C) for each component */</span>
   ods <span style="color: #0000ff;">select</span> Histogram ParameterEstimates Moments;
   ods <span style="color: #0000ff;">output</span> ParameterEstimates = UniPE;
   inset weibull<span style="color: #66cc66;">&#40;</span>shape scale<span style="color: #66cc66;">&#41;</span> / pos=NE; 
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Weibull Estimates for Each Component&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc print</span> <span style="color: #000080; font-weight: bold;">data</span>=UniPE noobs;
   <span style="color: #0000ff;">where</span> Parameter <span style="color: #0000ff;">in</span> <span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'Scale'</span>, <span style="color: #a020f0;">'Shape'</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">var</span> Component Parameter Symbol Estimate;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<a href="https://blogs.sas.com/content/iml/files/2021/11/WeibMix1.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/WeibMix1.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35907" srcset="https://blogs.sas.com/content/iml/files/2021/11/WeibMix1.png 640w, https://blogs.sas.com/content/iml/files/2021/11/WeibMix1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<br />
<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/WeibMix1a.png" alt="" width="289" height="139" class="alignnone size-full wp-image-35913" />

<p>
The graph shows a histogram for data in each component. PROC UNIVARIATE overlays a Weibull density on each histogram, based on the parameter estimates.  
The estimates for both components are close to the parameter values. The first component contains 1,970 observations, which is 65.7% of the total sample, so the estimated mixing probabilities are close to the mixing parameters. I used ODS OUTPUT and PROC PRINT to display one table that contains the parameter estimates from the two groups.  PROC UNIVARIATE calls the shape parameter <em>c</em> and the scale parameter &sigma;.
</p>

<h3>Fitting a finite mixture distribution</h3>
<p>
The PROC UNIVARIATE call uses the Component variable to identify the Weibull distribution
to which each observation belongs. If you do not have the Component variable, is it still possible to estimate a two-component Weibull model?
</p><p>
The answer is yes.
<a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/statug/statug_fmm_overview.htm">The FMM procedure</a> fits statistical models for which the distribution of the response is a finite mixture of distributions.
In general, the component distributions can be from different families, but this example is a homogeneous mixture, with both components from the Weibull family. When fitting a mixture model, we assume that we do not know which observations belong to which component. We must estimate the mixing probabilities and the parameters for the components. Typically, <a href="https://blogs.sas.com/content/iml/2011/10/21/the-power-of-finite-mixture-models.html">you need a lot of data and well-separated components for this effort to be successful</a>.
</p>
<p>
The following call to PROC FMM fits a two-component Weibull model to the simulated data. As shown in a previous article, the estimates from PROC FMM are for the intercept and scale of the error term for a Weibull regression model. These estimates are different from the shape and scale parameters in the Weibull distribution. However, 
<a href="https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html">you can transform the regression estimates into the shape and scale parameters</a>, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Weibull Estimates for Mixture&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc fmm</span> <span style="color: #000080; font-weight: bold;">data</span>=Have plots=density;
   model d = / dist=weibull <span style="color: #0000ff;">link</span>=<span style="color: #0000ff;">log</span> k=<span style="color: #2e8b57; font-weight: bold;">2</span>;
   ods <span style="color: #0000ff;">select</span> ParameterEstimates MixingProbs DensityPlot;
   ods <span style="color: #0000ff;">output</span> ParameterEstimates=PE0;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Add the estimates of Weibull scale and shape to the table of regression estimates.
   See https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html */</span>
<span style="color: #000080; font-weight: bold;">data</span> FMMPE;
<span style="color: #0000ff;">set</span> PE0<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">rename</span>=<span style="color: #66cc66;">&#40;</span>ILink=WeibScale<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">if</span> Parameter=<span style="color: #a020f0;">&quot;Scale&quot;</span> <span style="color: #0000ff;">then</span> WeibShape = <span style="color: #2e8b57; font-weight: bold;">1</span>/Estimate;
<span style="color: #0000ff;">else</span> WeibShape = ._;  <span style="color: #006400; font-style: italic;">/* ._ is one of the 28 missing values in SAS */</span>
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc print</span> <span style="color: #000080; font-weight: bold;">data</span>=FMMPE; 
   <span style="color: #0000ff;">var</span> Component Parameter Estimate WeibShape WeibScale;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/WeibMix2.png" alt="" width="431" height="139" class="alignnone size-full wp-image-35898" srcset="https://blogs.sas.com/content/iml/files/2021/11/WeibMix2.png 431w, https://blogs.sas.com/content/iml/files/2021/11/WeibMix2-300x97.png 300w" sizes="(max-width: 431px) 100vw, 431px" />

<p>
The program renames the ILink column to WeibScale. It also adds a new column (WeibShape) 
to the ParameterEstimates table. These two columns display the Weibull shape and scale parameter estimates for each component. Despite not knowing which observation came from which component, the procedure provides good estimates for the Weibull parameters. PROC FMM estimates the first component as 
Weib(&alpha;=1.52, &beta;=0.74)
and the second component as
Weib(&alpha;=3.53, &beta;=1.88). 
It estimates the mixing parameters for the first component as 0.,6 and the parameter for the second component as 0.4.
</p><p>

The PLOTS=DENSITY option on the PROC FMM statement produces a plot of the data and overlays the component and mixture distributions. The plot is shown below and is discussed in the next section.
</p>


<a href="https://blogs.sas.com/content/iml/files/2021/11/WeibMix3.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/WeibMix3.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35904" srcset="https://blogs.sas.com/content/iml/files/2021/11/WeibMix3.png 640w, https://blogs.sas.com/content/iml/files/2021/11/WeibMix3-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<h3>The graph of the component densities</h3>
<p>
The PLOTS=DENSITY option produces a graph of the data and overlays
the component and mixture distributions. In the graph, the red curve shows the density of the first Weibull component (W1(d)), the green curve shows the density of the second Weibull component (W2(d)), and the blue curve shows the density of the mixture. Technically, only the blue curve is a "true" density that integrates to unity (or 100% on a percent scale). The components are scaled densities. The integral of a component equals the mixing probability, which for these data are 0.6 and 0.4, respectively. The mixture density equals the sum of the component densities.
</p><p>
Look closely at the legend in the plot, which identifies the component curves by the parameter estimates.
Notice, that the estimates in the legend are the REGRESSION estimates, not the shape and scale estimates for the Weibull distribution. Do not be misled by the legend. If you plot the PDF <br />
<tt>density = PDF("Weibull", d, 0.74, 0.66);  /* WRONG! */ </tt><br />
you will NOT get the density curve for the first component. Instead, you need to convert the regression estimates into the shapes and scale parameters for the Weibull distribution. The following DATA step uses the transformed parameter estimates and demonstrates how to graph the component and mixture densities:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* plot the Weibull component densitites and the mixture density */</span>
<span style="color: #000080; font-weight: bold;">data</span> WeibComponents;
<span style="color: #0000ff;">retain</span> d1 d2;
<span style="color: #0000ff;">array</span> WeibScale<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span> _temporary_ <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0.7351</span>,  <span style="color: #2e8b57; font-weight: bold;">1.8820</span><span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* =exp(Intercept) */</span>
<span style="color: #0000ff;">array</span> WeibShape<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span> _temporary_ <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1.52207</span>, <span style="color: #2e8b57; font-weight: bold;">3.52965</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* =1/Scale */</span>
<span style="color: #0000ff;">array</span> MixParm<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span>   _temporary_ <span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0.6</span>, <span style="color: #2e8b57; font-weight: bold;">0.4</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> d = <span style="color: #2e8b57; font-weight: bold;">0.01</span>, <span style="color: #2e8b57; font-weight: bold;">0.05</span> to <span style="color: #2e8b57; font-weight: bold;">3.2</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">0.05</span>;
   d1 = MixParm<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span><span style="color: #006400; font-style: italic;">*pdf(&quot;Weibull&quot;, d, WeibShape[1], WeibScale[1]);</span> 
   d2 = MixParm<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span><span style="color: #006400; font-style: italic;">*pdf(&quot;Weibull&quot;, d, WeibShape[2], WeibScale[2]);</span> 
   Component = <span style="color: #a020f0;">&quot;Mixture        &quot;</span>; density = d1+d2; <span style="color: #0000ff;">output</span>;
   Component = <span style="color: #a020f0;">&quot;Weib(1.52,0.74)&quot;</span>; density = d1;    <span style="color: #0000ff;">output</span>; 
   Component = <span style="color: #a020f0;">&quot;Weib(3.53,1.88)&quot;</span>; density = d2;    <span style="color: #0000ff;">output</span>; 
<span style="color: #0000ff;">end</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Weibull Mixture Components&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgplot</span> <span style="color: #000080; font-weight: bold;">data</span>=WeibComponents;
   series <span style="color: #0000ff;">x</span>=d y=density / <span style="color: #0000ff;">group</span>=Component;
   keylegend / location=inside position=NE across=<span style="color: #2e8b57; font-weight: bold;">1</span> opaque;
   xaxis values=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span> to <span style="color: #2e8b57; font-weight: bold;">3.2</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">0.2</span><span style="color: #66cc66;">&#41;</span> grid offsetmin=<span style="color: #2e8b57; font-weight: bold;">0.05</span> offsetmax=<span style="color: #2e8b57; font-weight: bold;">0.05</span>;
   yaxis grid;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2021/11/WeibMix4.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/11/WeibMix4.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35901" srcset="https://blogs.sas.com/content/iml/files/2021/11/WeibMix4.png 640w, https://blogs.sas.com/content/iml/files/2021/11/WeibMix4-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>The density curves are the same, but the legend for this graph displays the shape and scale parameters for the Weibull distribution. If you want to reproduce the vertical scale (percent), you can multiply the densities by 100*<em>h</em>, where <em>h</em>=0.2 is the width of the histogram bins.
</p>
<p>
In general, be aware that the PLOTS=DENSITY option produces a graph in which the legend labels refer to the REGRESSION parameters. For example, if you use PROC FMM to fit a mixture of normal distributions, the parameter estimates in the legend are for the mean and the VARIANCE of the normal distributions. However, if you intend to use those estimates in other SAS functions (such as PDF, CDF, and RAND), you must take the square root of the variance to obtain the standard deviation. 
</p>

<h3>Summary</h3>
<p>
This article uses PROC FMM to fit a mixture of two Weibull distributions. The article shows how to interpret the parameter estimates from the procedure by transforming them into the shape and scale parameters for the Weibull distribution. The article also emphasizes that if you use the PLOTS=DENSITY option produces a graph, 
the legend in the graph contains the regression parameters, which are not the same as the parameters that are used for the PDF, CDF, and RAND functions.
</p><p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/11/01/fit-mixture-weibull-sas.html">Fit a mixture of Weibull distributions in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=lvkqMuObLjI:oUx69X6ZO-c:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=lvkqMuObLjI:oUx69X6ZO-c:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=lvkqMuObLjI:oUx69X6ZO-c:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=lvkqMuObLjI:oUx69X6ZO-c:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=lvkqMuObLjI:oUx69X6ZO-c:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=lvkqMuObLjI:oUx69X6ZO-c:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=lvkqMuObLjI:oUx69X6ZO-c:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=lvkqMuObLjI:oUx69X6ZO-c:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=lvkqMuObLjI:oUx69X6ZO-c:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=lvkqMuObLjI:oUx69X6ZO-c:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/11/01/fit-mixture-weibull-sas.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2017/01/ProgrammingTips-2-150x150.png" />
	</item>
		<item>
		<title>Interpret estimates for a Weibull regression model in SAS</title>
		<link>https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Wed, 27 Oct 2021 09:30:31 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Regression]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=35673</guid>

					<description><![CDATA[<p>It can be frustrating when the same probability distribution has two different parameterizations, but such is the life of a statistical programmer. I previously wrote an article about the gamma distribution, which has two common parameterizations: one that uses a scale parameter (&#946;) and another that uses a rate parameter [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html">Interpret estimates for a Weibull regression model in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
It can be frustrating when the same probability distribution has two different parameterizations, but such is the life of a statistical programmer. I previously wrote <a href="https://blogs.sas.com/content/iml/2018/10/17/parameter-estimates-for-different-parameterizations.html">an article about the gamma distribution</a>, which has two common parameterizations: one that uses a scale parameter (&beta;) and another that uses a rate parameter (c = 1/&beta;). 
</p><p>
The relationship between scale and rate parameters is straightforward, but sometimes the relationship between different parameterizations is more complicated.
Recently, a SAS programmer was using a regression procedure to fit the parameters of a Weibull distribution.
He was confused about how the output from 
a SAS regression procedure relates to a more familiar parameterization of the Weibull distribution, such as is fit by PROC UNIVARIATE. 
This article shows how to perform two-parameter Weibull regression in several 
SAS procedures, including PROC RELIABILITY, PROC LIFEREG, and PROC FMM.
The parameter estimates from regression procedures are not the usual Weibull parameters, but you can transform them into the Weibull parameters.
</p>
<p>
This article fits a two-parameter Weibull model. In a two-parameter model, the threshold parameter is assumed to be  0. A zero threshold assumes that the data can be any positive value.
</p>

<h3>Fitting a Weibull distribution in PROC UNIVARIATE</h3>
<p>
PROC UNIVARIATE is the first tool to reach for if you want to fit a Weibull distribution in SAS.
The most common parameterization of the Weibull density is <br />
<span class='MathJax_Preview'>\(f(x; \alpha, \beta) = 
\frac{\beta}{\alpha^{\beta}} (x)^{\beta -1} \exp \left(-\left(\frac{x}{\alpha}\right)^{\beta }\right)\)</span><script type='math/tex'>f(x; \alpha, \beta) = 
\frac{\beta}{\alpha^{\beta}} (x)^{\beta -1} \exp \left(-\left(\frac{x}{\alpha}\right)^{\beta }\right)</script>
<br />
where 
&alpha; is a shape parameter and &beta; is a scale parameter. This parameterization is used by most Base SAS functions and procedures, as well as many regression procedures in SAS.
The following SAS DATA step simulates data from the Weibull(&alpha;=1.5, &beta;=0.8) distribution and fits the parameters by using PROC UNIVARIATE:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* sample from a Weibull distribution */</span>
<span style="color: #0000ff;">%let</span> <span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">100</span>;
<span style="color: #000080; font-weight: bold;">data</span> Have<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">drop</span>=i<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">call</span> streaminit<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">12345</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to &amp;<span style="color: #0000ff;">N</span>;
   d = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Weibull&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">1.5</span>, <span style="color: #2e8b57; font-weight: bold;">0.8</span><span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* Shape=1.5; Scale=0.8 */</span>
   <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=Have;
   <span style="color: #0000ff;">var</span> d;
   histogram d / weibull endpoints=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">0</span> to <span style="color: #2e8b57; font-weight: bold;">2.5</span> <span style="color: #0000ff;">by</span> <span style="color: #2e8b57; font-weight: bold;">0.25</span><span style="color: #66cc66;">&#41;</span>;  <span style="color: #006400; font-style: italic;">/* fit Weib(Sigma, C) to the data */</span>
   probplot / weibull2<span style="color: #66cc66;">&#40;</span>C=<span style="color: #2e8b57; font-weight: bold;">1.383539</span> SCALE=<span style="color: #2e8b57; font-weight: bold;">0.684287</span><span style="color: #66cc66;">&#41;</span> grid; <span style="color: #006400; font-style: italic;">/* OPTIONAL: P-P plot */</span>
   ods <span style="color: #0000ff;">select</span> Histogram ParameterEstimates ProbPlot;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg2.png" alt="" width="239" height="193" class="alignnone size-full wp-image-35853" srcset="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg2.png 239w, https://blogs.sas.com/content/iml/files/2021/10/WeibullReg2-168x137.png 168w" sizes="(max-width: 239px) 100vw, 239px" />
<br />

<a href="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg1.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg1.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35865" srcset="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg1.png 640w, https://blogs.sas.com/content/iml/files/2021/10/WeibullReg1-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>


<p>The histogram of the simulated data is overlaid with a density from the fitted Weibull distribution. The parameter estimates are Shape=1.38 and Scale=0.68, which are close to the parameter values. 
PROC UNIVARIATE uses the symbols <em>c</em> and &sigma; for the shape and scale parameters, respectively.
</p>

<a href="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg3.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg3.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35862" srcset="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg3.png 640w, https://blogs.sas.com/content/iml/files/2021/10/WeibullReg3-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
<a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/procstat/procstat_univariate_details63.htm#procstat.univariate.probdist_w2">The probability-probability (P-P) plot for the Weibull distribution</a> is shown. 
In the P-P plot, a reference line is added by using the option <tt>weibull2(C=1.383539 SCALE=0.684287)</tt>. (In practice, you must run the procedure once to get those estimates, then a second time to plot the P-P plot.)
The slope of the reference line is 1/Shape = 0.72 and the intercept of the reference line is log(Scale) = -0.38. Notice that the P-P plot is plotting the quantiles of log(<em>d</em>), not of <em>d</em> itself.
</p>

<h3>Weibull regression versus univariate fit</h3>
<p>
It might seem strange to use a regression procedure to fit a univariate distribution, but as I have explained before, there are many situations for which <a href="https://blogs.sas.com/content/iml/2013/05/13/regression-for-univariate-analysis.html">a regression procedure is a good choice for performing a univariate analysis</a>.
Several SAS regression parameters can fit Weibull models. In these models, it is usually assumed that the response variable is a time until some event happens (such as failure, death, or occurrence of a disease).
<a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/statug/statug_lifereg_overview.htm">The documentation for PROC LIFEREG</a> provides an overview of fitting a model where the logarithm of the random errors follows a Weibull distribution. 
In this article, we do not use any covariates. We simply model the mean and scale of the response variable.
</p><p>
A problem with using a regression procedure is that a regression model provides estimates for intercepts, slopes, and scales. It is not always intuitive to see how those regression estimates relate to the more familiar parameters for the probability distribution. 
However, the P-P plot in the previous section shows how intercepts and slopes can be related to parameters of a distribution.  <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/statug/statug_lifereg_details04.htm#statug_lifereg004332">The documentation for the LIFEREG procedure</a> states that the Weibull scale parameter is exp(Intercept) and the Weibull shape parameter is the reciprocal of the regression scale parameter.
</p><p>
Notice how confusing this is! For the Weibull distribution, the regression model estimates a SCALE parameter for the error distribution.  But the reciprocal of that scale estimate is the Weibull SHAPE parameter, NOT the Weibull scale parameter!  (In this article, the response distribution  and the error distribution are the same.)
</p>

<h3>Weibull regression in SAS</h3>
<p>
The LIFEREG procedure includes an option to produce a probability-probability (P-P) plot, which is similar to a Q-Q plot. The LIFEREG procedure also estimates not only the regression parameters but also provides estimates for the exp(Intercept) and 1/Scale quantities.
The following statements use a Weibull regression model to fit the simulated data:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Weibull Estimates from LIFEREG Procedure&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc lifereg</span> <span style="color: #000080; font-weight: bold;">data</span>=Have;
   model d = / dist=Weibull;
   probplot;
   inset;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg4.png" alt="" width="587" height="182" class="alignnone size-full wp-image-35850" srcset="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg4.png 587w, https://blogs.sas.com/content/iml/files/2021/10/WeibullReg4-300x93.png 300w" sizes="(max-width: 587px) 100vw, 587px" />


<p>
The ParameterEstimates table shows the estimates for the Intercept (-0.38) and Scale (0.72) parameters in the Weibull regression model. 
We previously saw these numbers as the parameters of the reference line in the P-P plot from PROC UNIVARIATE. Here, they are the result of a maximum likelihood estimate for the regression model. To get from these values to the Weibull parameter estimates, you need to compute Weib_Scale = exp(Intercept) = 0.68 and 
Weib_Shape = 1/Scale = 1.38.  PROC LIFEREG estimates these quantities for you and provides standard errors and confidence intervals.
</p><p>
The graphical output of the PROBPLOT statement is equivalent to the P-P plot in PROC UNIVARIATE, except that PROC LIFEREG reverses the axes and automatically adds the reference line and a confidence band.
</p>


<a href="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg5.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg5.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35859" srcset="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg5.png 640w, https://blogs.sas.com/content/iml/files/2021/10/WeibullReg5-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<h3>Other regression procedures</h3>
<p>
Before ending this article, I want to mention two other regression procedures that perform similar computations: PROC RELIABILITY, which is in SAS/QC software, and PROC FMM in SAS/STAT software.
</p><p>
The following statements call PROC RELIABILITY to fit a regression model to the simulated data:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Weibull Estimates from RELIABILITY Procedure&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc reliability</span> <span style="color: #000080; font-weight: bold;">data</span>=Have;
   distribution Weibull;
   model d = ;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg6.png" alt="" width="398" height="194" class="alignnone size-full wp-image-35847" srcset="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg6.png 398w, https://blogs.sas.com/content/iml/files/2021/10/WeibullReg6-300x146.png 300w" sizes="(max-width: 398px) 100vw, 398px" />

<p>
The parameter estimates are similar to the estimates from PROC LIFEREG. The output also includes an estimate of the Weibull shape parameter, which is 1/EV_Scale. The output does not include an estimate for the Weibull scale parameter, which is exp(Intercept).
</p><p>
In a similar way, you can use PROC FMM to fit a Weibull model. PROC FMM Is typically used to fix a mixture distribution, but you can specify the K=1 option to fit a single response distribution, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Weibull Estimates from FMM Procedure&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc fmm</span> <span style="color: #000080; font-weight: bold;">data</span>=Have;
   model d = / dist=weibull <span style="color: #0000ff;">link</span>=<span style="color: #0000ff;">log</span> k=<span style="color: #2e8b57; font-weight: bold;">1</span>;
   ods <span style="color: #0000ff;">select</span> ParameterEstimates;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg7.png" alt="" width="474" height="130" class="alignnone size-full wp-image-35844" srcset="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg7.png 474w, https://blogs.sas.com/content/iml/files/2021/10/WeibullReg7-300x82.png 300w" sizes="(max-width: 474px) 100vw, 474px" />
<br />

<p>
The ParameterEstimates table shows the estimates for the Intercept (-0.38) and Scale (0.72) parameters in the Weibull regression model. The Weibull scale parameter is shown in the column labeled "Inverse Linked Estimate."  (The model uses a LOG link, so the inverse link is EXP.)  There is no estimate for the Weibull shape parameter. which is the reciprocal of the Scale estimate.
</p>

<h3>Summary</h3>
<p>
The easiest way to fit a Weibull distribution to univariate data is to use the UNIVARIATE procedure in Base SAS. The Weibull shape and scale parameters are directly estimated by that procedure. However, you can also fit a Weibull model by using a SAS regression procedure. If you do this, the regression parameters are the Intercept and the scale of the error distribution. You can transform these estimates into estimates for the Weibull shape and scale parameters. 
This article shows the output (and how to interpret it) for several SAS procedures that can fit a Weibull regression model.
</p><p>
Why would you want to use a regression procedure instead of PROC UNIVARIATE? One reason is that the response variable (failure or survival) might depend on additional covariates. A regression model enables you to account for additional covariates and still understand the underlying distribution of the random errors. A second reason is that the FMM procedure can fit a mixture of distributions. To make sense of the results, you must be able to interpret the regression output in terms of the usual parameters for the probability distributions. In a second article, I show <a href="https://blogs.sas.com/content/iml/2021/11/01/fit-mixture-weibull-sas.html">how to fit a mixture of Weibull distributions</a>.
</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html">Interpret estimates for a Weibull regression model in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ru963cUrvDk:zBVBV9kclZo:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ru963cUrvDk:zBVBV9kclZo:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ru963cUrvDk:zBVBV9kclZo:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Ru963cUrvDk:zBVBV9kclZo:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ru963cUrvDk:zBVBV9kclZo:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Ru963cUrvDk:zBVBV9kclZo:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ru963cUrvDk:zBVBV9kclZo:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Ru963cUrvDk:zBVBV9kclZo:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ru963cUrvDk:zBVBV9kclZo:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ru963cUrvDk:zBVBV9kclZo:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/10/27/weibull-regression-model-sas.html/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2021/10/WeibullReg5-150x150.png" />
	</item>
		<item>
		<title>An introduction to genetic algorithms in SAS</title>
		<link>https://blogs.sas.com/content/iml/2021/10/20/intro-genetic-algorithms-sas.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/10/20/intro-genetic-algorithms-sas.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Wed, 20 Oct 2021 09:29:16 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Optimization]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=35709</guid>

					<description><![CDATA[<p>A genetic algorithm (GA) is a heuristic optimization technique. The method tries to mimic natural selection and evolution by starting with a population of random candidates. Candidates are evaluated for "fitness" by plugging them into the objective function. The characteristics of the better candidates are combined to create a new [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/10/20/intro-genetic-algorithms-sas.html">An introduction to genetic algorithms in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
A genetic algorithm (GA) is a heuristic optimization technique. The method tries to mimic natural selection and evolution by starting with a population of random candidates. Candidates are evaluated for "fitness" by plugging them into the objective function. The characteristics of the better candidates are combined to create a new set of candidates. Some of the new candidates experience mutations. Eventually, over many generations, a GA can produce candidates that approximately solve the optimization problem.
This article uses SAS/IML to implement a genetic algorithm in SAS.
</p>
<p>
A previous article discusses <a href="https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html">the mutation and crossover operators</a>, which are important in implementing a genetic algorithm. In previous articles, the solution vectors were represented by column vectors. In the GA routines, candidates are <em>row</em> vectors. The population is a matrix, where each row represents an individual.
</p>
A GA is best illustrated by using an example. This article uses <a href="https://blogs.sas.com/content/iml/2021/10/11/knapsack-problem.html">the binary knapsack problem</a>. In the knapsack problem, a knapsack can hold W kilograms. There are N objects, each with a different value and weight. You want to maximize the value of the objects you put into the knapsack without exceeding the weight.
A solution to the knapsack problem is a 0/1 binary vector <strong>b</strong>. If <strong>b</strong>[i]=1, the i_th object is in the knapsack; if <strong>b</strong>[i]=0, it is not. 
Although <a href="https://blogs.sas.com/content/iml/2021/10/11/knapsack-problem.html">the knapsack problem can be solved by using a constrained linear program</a>, 
this article solves an unconstrained problem and an objective function that penalizes a candidate if it exceeds the weight limit.


<h3>The main steps in a genetic algorithm</h3>
<p>
The <em>SAS/IML User's Guide</em> provides <a href="https://documentation.sas.com/doc/en/pgmsascdc/v_006/imlug/imlug_g%20eneticalgs_sect001.htm">an overview of genetic algorithms</a>. The five main steps follow:
</p>
<ul>
	<li><strong>Encoding</strong>: Each potential solution is represented as a <em>chromosome</em>, which is a vector of values. 
For the knapsack problem, each chromosome is an N-dimensional vector of binary values.
</li>
	<li><strong>Fitness</strong>: Choose a function to assess the fitness of each candidate chromosome. This is usually the objective function for unconstrained problems, or <a href="https://blogs.sas.com/content/iml/2021/10/13/penalties-constraints-optimization.html">a penalized objective function</a> for problems that have constraints. The fitness of a candidate determines the probability that it will contribute its genes to the next generation of candidates.
</li>
	<li><strong>Selection</strong>: Choose which candidates become parents to the next generation of candidates.
</li>
	<li><strong>Crossover (Reproduction)</strong>: Choose how to produce children from parents.
</li>
	<li><strong>Mutation</strong>: Choose how to randomly mutate some children to introduce additional diversity.
</li>
</ul>

<h3>Encoding a solution vector</h3>
<p>
Each potential solution is represented as an N-dimensional vector of values. For the knapsack problem, you can choose a binary vector. In SAS/IML, you use <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect177.htm">the GASETUP function</a> to define the encoding for a problem. The SAS/IML language supports four different encodings. The knapsack problem can be encoded by using an integer vector, as follows:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Individuals are ROW vectors. Population is a matrix of stacked rows. */</span>
<span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">12345</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #006400; font-style: italic;">/* Solve the knapsack problem: max Value*b  subject to Weight*b &lt;= WtLimit */</span>
<span style="color: #006400; font-style: italic;">/* Item:  1 2 3 4 5   6   7   8   9  10  11  12  13  14  15  16  17 */</span>
Weight = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">1.5</span> <span style="color: #2e8b57; font-weight: bold;">1.5</span> <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#125;</span>`;
Value  = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">6</span> <span style="color: #2e8b57; font-weight: bold;">6</span> <span style="color: #2e8b57; font-weight: bold;">6</span> <span style="color: #2e8b57; font-weight: bold;">5</span> <span style="color: #2e8b57; font-weight: bold;">3.1</span> <span style="color: #2e8b57; font-weight: bold;">3.0</span> <span style="color: #2e8b57; font-weight: bold;">1.5</span> <span style="color: #2e8b57; font-weight: bold;">1.3</span> <span style="color: #2e8b57; font-weight: bold;">1.2</span> <span style="color: #2e8b57; font-weight: bold;">1.1</span> <span style="color: #2e8b57; font-weight: bold;">1.0</span> <span style="color: #2e8b57; font-weight: bold;">1.1</span> <span style="color: #2e8b57; font-weight: bold;">1.0</span> <span style="color: #2e8b57; font-weight: bold;">1.0</span> <span style="color: #2e8b57; font-weight: bold;">0.9</span> <span style="color: #2e8b57; font-weight: bold;">0.8</span> <span style="color: #2e8b57; font-weight: bold;">0.6</span><span style="color: #66cc66;">&#125;</span>`;
WtLimit = <span style="color: #2e8b57; font-weight: bold;">9</span>;                                 <span style="color: #006400; font-style: italic;">/* weight limit */</span>
<span style="color: #0000ff;">N</span> = nrow<span style="color: #66cc66;">&#40;</span>Weight<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* set up an encoding for the GA */</span>
id = gaSetup<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">2</span>,                 <span style="color: #006400; font-style: italic;">/* 2-&gt; integer vector encoding */</span>
             nrow<span style="color: #66cc66;">&#40;</span>weight<span style="color: #66cc66;">&#41;</span>,      <span style="color: #006400; font-style: italic;">/* size of vector */</span>
             <span style="color: #2e8b57; font-weight: bold;">123</span><span style="color: #66cc66;">&#41;</span>;              <span style="color: #006400; font-style: italic;">/* internal seed for GA */</span></pre></td></tr></table></div>




<p>
The GASETUP function returns an identifier for the problem. This identifier must be used as the first argument to subsequent calls to GA routines. It is possible to have a program that runs several GAs, each with its own identifier. The GASETUP call specifies that candidate vectors are integer vectors. Later in the program, you can tell the GA that they are binary vectors.
</p> 

<h3>Fitness, mutation, and reproduction</h3>
<p>
In previous articles, I discussed fitness, mutation, and crossover functions:
</p>
<ul>
	<li>You can  
<a href="https://blogs.sas.com/content/iml/2021/10/13/penalties-constraints-optimization.html">use a penalized objective function to evaluate the fitness of a candidate solution.</a>. In the SAS/IML language, use <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect175.htm">the GASETOBJ subroutine</a> to register the objective function and to specify whether the GA should minimize or maximize the function.
</li>
	<li>You can <a href="https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html">define a mutation subroutine</a> to control how candidates get mutated.
You can register the mutation module by using <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect174.htm">the GASETMUT subroutine</a>. When you call the GASETMUT subroutine, you also need to specify the probability that a candidate will be mutated.
</li>
	<li>You can <a href="https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html">define a crossover subroutine</a> to control how two parents produce two children.
You can register the crossover module by using 
<a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect173.htm">the GASETCRO subroutine</a>.
When you call the GASETCRO subroutine, you also need to specify the probability that a fit candidate in the current generation is chosen to reproduce and thus contribute its characteristics to the next generation.
</li>
</ul>

<p>The following SAS/IML statements define the fitness module (ObjFun), the mutation module (Mutate), and the crossover module (Cross) and register these modules with the GA system.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* just a few of the many hyperparameters */</span>
lambda = <span style="color: #2e8b57; font-weight: bold;">100</span>;       <span style="color: #006400; font-style: italic;">/* factor to penalize exceeding the weight limit */</span>
ProbMut= <span style="color: #2e8b57; font-weight: bold;">0.2</span>;       <span style="color: #006400; font-style: italic;">/* probability that the i_th site is mutated */</span>
ProbCross = <span style="color: #2e8b57; font-weight: bold;">0.3</span>;    <span style="color: #006400; font-style: italic;">/* children based on 30%-70% split of parents */</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* b is a binary column vector */</span>
start ObjFun<span style="color: #66cc66;">&#40;</span> b <span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>Weight, Value, WtLimit, lambda<span style="color: #66cc66;">&#41;</span>;
   wsum = b <span style="color: #006400; font-style: italic;">* Weight;</span>
   val  = b <span style="color: #006400; font-style: italic;">* Value;</span>
   <span style="color: #0000ff;">if</span> wsum &gt; WtLimit <span style="color: #0000ff;">then</span>                      <span style="color: #006400; font-style: italic;">/* penalize if weight limit exceeded */</span>
      val = val - lambda<span style="color: #006400; font-style: italic;">*(wsum - WtLimit)##2;</span>  <span style="color: #006400; font-style: italic;">/* subtract b/c we want to maximize value */</span>
   <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span>val<span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Mutation operator for a binary vector, b. */</span>
start Mutate<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>ProbMut<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">N</span> = ncol<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
   k = <span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>, randfun<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #a020f0;">&quot;Binomial&quot;</span>, ProbMut, <span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* how many sites? */</span>
   j = sample<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">N</span>, k, <span style="color: #a020f0;">&quot;NoReplace&quot;</span><span style="color: #66cc66;">&#41;</span>;                <span style="color: #006400; font-style: italic;">/* choose random elements */</span>
   b<span style="color: #66cc66;">&#91;</span>j<span style="color: #66cc66;">&#93;</span> = ^b<span style="color: #66cc66;">&#91;</span>j<span style="color: #66cc66;">&#93;</span>;                                   <span style="color: #006400; font-style: italic;">/* mutate these sites */</span>
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* Crossover operator for a a pair of parents. */</span>
start Cross<span style="color: #66cc66;">&#40;</span>child1, child2, parent1, parent2<span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>ProbCross<span style="color: #66cc66;">&#41;</span>;
   b = j<span style="color: #66cc66;">&#40;</span>ncol<span style="color: #66cc66;">&#40;</span>parent1<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">call</span> randgen<span style="color: #66cc66;">&#40;</span>b, <span style="color: #a020f0;">&quot;Bernoulli&quot;</span>, ProbCross<span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* 0/1 vector */</span>
   idx = loc<span style="color: #66cc66;">&#40;</span>b=<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;                          <span style="color: #006400; font-style: italic;">/* locations to cross */</span>
   child1 = parent1;
   child2 = parent2;
   <span style="color: #0000ff;">if</span> ncol<span style="color: #66cc66;">&#40;</span>idx<span style="color: #66cc66;">&#41;</span>&gt;<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;                  <span style="color: #006400; font-style: italic;">/* exchange values */</span>
      child1<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span> = parent2<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span>;
      child2<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span> = parent1<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span>;
   <span style="color: #0000ff;">end</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* register these modules so the GA can call them as needed */</span>
<span style="color: #0000ff;">call</span> gaSetObj<span style="color: #66cc66;">&#40;</span>id, <span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #a020f0;">&quot;ObjFun&quot;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* 1-&gt;maximize objective module */</span>
<span style="color: #0000ff;">call</span> gaSetCro<span style="color: #66cc66;">&#40;</span>id,
              <span style="color: #2e8b57; font-weight: bold;">1.0</span>,              <span style="color: #006400; font-style: italic;">/* hyperparameter: crossover probability */</span>
              <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #a020f0;">&quot;Cross&quot;</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* user-defined crossover module */</span>
<span style="color: #0000ff;">call</span> gaSetMut<span style="color: #66cc66;">&#40;</span>id,
              <span style="color: #2e8b57; font-weight: bold;">0.20</span>,             <span style="color: #006400; font-style: italic;">/* hyperparameter: mutation probability */</span>
              <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #a020f0;">&quot;Mutate&quot;</span><span style="color: #66cc66;">&#41;</span>;     <span style="color: #006400; font-style: italic;">/* user-defined mutation module */</span></pre></td></tr></table></div>




<h3>Fitness and selection</h3>
<p>
A genetic algorithm pits candidates against each other according to Darwin's observation that individuals who are fit are more likely to pass on their characteristics to the next generation. 
</p><p>
A genetic algorithm evolves the population across many generations. The individuals who are more fit are likely to "reproduce" and send their progeny to the next round. 
To preserve the characteristics of the very best individuals (called <em>elites</em>), some 
individuals are "cloned" and passed unchanged to the next generations. 
After many rounds, the population is fitter, on average, and the elite individuals are the best solutions to the objective function.
</p><p>
In SAS/IML, you can use <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect176.htm">the GASETSEL subroutine</a> to specify the rules for selecting elite individuals and for are selecting which individuals are eligible to reproduce. The following call specifies that 3 elite individuals are "cloned" for the next generation. For the remaining individuals, pairs are selected at random. With 95% probability, the more fit individual is selected to reproduce:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">call</span> gaSetSel<span style="color: #66cc66;">&#40;</span>id,
              <span style="color: #2e8b57; font-weight: bold;">3</span>,                <span style="color: #006400; font-style: italic;">/* hyperparameter: carry k elites directly to next generation */</span>
              <span style="color: #2e8b57; font-weight: bold;">1</span>,                <span style="color: #006400; font-style: italic;">/* dual tournament */</span>
              <span style="color: #2e8b57; font-weight: bold;">0.95</span><span style="color: #66cc66;">&#41;</span>;            <span style="color: #006400; font-style: italic;">/* best-player-wins probability */</span></pre></td></tr></table></div>



<p>
At this point, the GA problem is fully defined.
</p>

<h3>Initial population and evolution</h3>
<p>
You can use <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect170.htm">the GAINIT subroutine</a> to generate the initial population. Typically, the initial population is generated randomly. If there are bounds constraints on the solution vector, you can specify them. For example, the following call generates a very small population of 10 random individuals. The chromosomes are constrained to be binary by specifying a lower bound of 0 and an upper bound of 1 for each element of the integer vector.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* example of very small population; only 10 candidates */</span>
<span style="color: #0000ff;">call</span> gaInit<span style="color: #66cc66;">&#40;</span>id,  <span style="color: #2e8b57; font-weight: bold;">10</span>,            <span style="color: #006400; font-style: italic;">/* initial population size */</span>
            j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,  nrow<span style="color: #66cc66;">&#40;</span>weight<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span> //   <span style="color: #006400; font-style: italic;">/* lower and upper bounds of binary vector */</span>
            j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,  nrow<span style="color: #66cc66;">&#40;</span>weight<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;</pre></td></tr></table></div>




<p>
You can evolve the population by calling <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect172.htm">the GAREGEN subroutine</a>. 
The GAREGEN call selects individuals to reproduce according to the tournament rules. The selected individuals become parents. Pairs of parents produce children according to the crossover operation. Some children are mutated according to the mutation operator.
The children become the next generation and replace the parents as the "current" population.
</p><p>
At any time, you can use <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect168.htm">the GAGETMEM subroutine</a> to obtain the members of the current population.
You can use <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect169.htm">the GAGETVAL subroutine</a> to obtain the fitness scores of the current population.
Let's manually call the GAREGEN subroutine a few times and use the GAGETVAL subroutine after each call to see how the population evolves:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">call</span> gaGetVal<span style="color: #66cc66;">&#40;</span>f0, id<span style="color: #66cc66;">&#41;</span>;     <span style="color: #006400; font-style: italic;">/* initial generation is random */</span> 
<span style="color: #0000ff;">call</span> gaRegen<span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span>;          <span style="color: #006400; font-style: italic;">/* create the next generation via selection, crossover, and mutation */</span>
<span style="color: #0000ff;">call</span> gaGetVal<span style="color: #66cc66;">&#40;</span>f1, id<span style="color: #66cc66;">&#41;</span>;     <span style="color: #006400; font-style: italic;">/* evaluate fitness */</span>
<span style="color: #0000ff;">call</span> gaRegen<span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span>;          <span style="color: #006400; font-style: italic;">/* continue for additional generations ...*/</span>
<span style="color: #0000ff;">call</span> gaGetVal<span style="color: #66cc66;">&#40;</span>f2, id<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* print fitness for each generation and top candidates so far */</span>
print f0<span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Initial Fitness&quot;</span><span style="color: #66cc66;">&#93;</span> f1<span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;1st Gen Fitness&quot;</span><span style="color: #66cc66;">&#93;</span> f2<span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;2nd Gen Fitness&quot;</span><span style="color: #66cc66;">&#93;</span>;
<span style="color: #0000ff;">call</span> gaGetMem<span style="color: #66cc66;">&#40;</span>best, val, id, <span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span>;
print best<span style="color: #66cc66;">&#91;</span>f=<span style="color: #2e8b57; font-weight: bold;">1.0</span> L=<span style="color: #a020f0;">&quot;Best Members&quot;</span><span style="color: #66cc66;">&#93;</span>, val<span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Best Values&quot;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/GA3.png" alt="" width="319" height="553" class="alignnone size-full wp-image-35775" srcset="https://blogs.sas.com/content/iml/files/2021/10/GA3.png 319w, https://blogs.sas.com/content/iml/files/2021/10/GA3-173x300.png 173w" sizes="(max-width: 319px) 100vw, 319px" />


<p>
The output shows how the population evolves for three generations. Initially, only one member of the population satisfies the weight constraints of the knapsack problem. The feasible solution has a positive fitness score (9.7); the infeasible solutions are negative for this problem. 
After the selection, crossover, and mutation operations, the next generation has two feasible solutions (9.7 and 8.1). After another round, the population has six feasible solutions and the score for the best solution has increased to 11.7. The population is becoming more fit, on average.
</p><p>
After the second call to GAREGEN, the GAGETMEM call gets the candidates in positions 1:3. Recall that you specified three "elite" members in the GASETSEL call. The elite members are therefore placed at the top of the population. (The remaining individuals are not sorted according to fitness.)
The chromosomes for the elite members are shown. In this encoding, each chromosome is a 0/1 binary vector that determines which objects are placed in the knapsack and which are left out.
</p>


<h3>Evolution towards a solution</h3>
<p>
The previous section purposely used a small population of 10 individuals. Such a small population lacks genetic diversity and might not converge to an optimal solution (or at least not quickly). A more reasonable population contains 100 or more individuals.  You can use the GAINIT call a second time to reinitialize the initial population.
Now that you have experience with the GAREGEN and GAGETVAL calls, you can use those calls in a DO loop to iterate over many generations. The following statements iterate the initial population through 15 generations. For each generation, the program records the best fitness score (f[1]) and the median fitness score.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* for genetic diversity, better to have a larger population */</span>
<span style="color: #0000ff;">call</span> gaInit<span style="color: #66cc66;">&#40;</span>id,  <span style="color: #2e8b57; font-weight: bold;">100</span>,                   <span style="color: #006400; font-style: italic;">/* initial population size */</span>
            j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,  nrow<span style="color: #66cc66;">&#40;</span>weight<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span> //   <span style="color: #006400; font-style: italic;">/* lower and upper bounds of binary vector */</span>
            j<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>,  nrow<span style="color: #66cc66;">&#40;</span>weight<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;   
<span style="color: #006400; font-style: italic;">/* record the best and median scores for each generation */</span>
niter = <span style="color: #2e8b57; font-weight: bold;">15</span>;
summary = j<span style="color: #66cc66;">&#40;</span>niter,<span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span>;
summary<span style="color: #66cc66;">&#91;</span>,<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span> = t<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:niter<span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* (Iteration, Best Value, Median Value) */</span>
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to niter;
   <span style="color: #0000ff;">call</span> gaRegen<span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">call</span> gaGetVal<span style="color: #66cc66;">&#40;</span>f, id<span style="color: #66cc66;">&#41;</span>;
   summary<span style="color: #66cc66;">&#91;</span>i,<span style="color: #2e8b57; font-weight: bold;">2</span><span style="color: #66cc66;">&#93;</span> = f<span style="color: #66cc66;">&#91;</span><span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#93;</span>;   summary<span style="color: #66cc66;">&#91;</span>i,<span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#93;</span> = median<span style="color: #66cc66;">&#40;</span>f<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">end</span>;
print summary<span style="color: #66cc66;">&#91;</span>c = <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;Iteration&quot;</span> <span style="color: #a020f0;">&quot;Best Value&quot;</span> <span style="color: #a020f0;">&quot;Median Value&quot;</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>





<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/GA4.png" alt="" width="246" height="465" class="alignnone size-full wp-image-35772" srcset="https://blogs.sas.com/content/iml/files/2021/10/GA4.png 246w, https://blogs.sas.com/content/iml/files/2021/10/GA4-159x300.png 159w" sizes="(max-width: 246px) 100vw, 246px" />

<p>
The output shows that the fitness of the best candidate is monotonic. It increases from an initial value of 16.3 to the final value of 19.6, which is the optimal value.
Similarly, the median value tends to increase. Because of random mutation and the crossover operations, statistics such as the median or mean are usually not monotonic. Nevertheless, the fitness of later generations tends to be better than for earlier generations.
</p><p>
You can examine the chromosomes of the elite candidates in the final generation by using the GAGETMEM subroutine:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* print the top candidates */</span>
<span style="color: #0000ff;">call</span> gaGetMem<span style="color: #66cc66;">&#40;</span>best, f, id, <span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #2e8b57; font-weight: bold;">3</span><span style="color: #66cc66;">&#41;</span>;
print best<span style="color: #66cc66;">&#91;</span>f=<span style="color: #2e8b57; font-weight: bold;">1.0</span> L=<span style="color: #a020f0;">&quot;Best Member&quot;</span><span style="color: #66cc66;">&#93;</span>, f<span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Final Best Value&quot;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/GA5.png" alt="" width="312" height="241" class="alignnone size-full wp-image-35769" srcset="https://blogs.sas.com/content/iml/files/2021/10/GA5.png 312w, https://blogs.sas.com/content/iml/files/2021/10/GA5-300x232.png 300w" sizes="(max-width: 312px) 100vw, 312px" />


<p>
The output confirms that the best candidate is the <a href="https://blogs.sas.com/content/iml/2021/10/11/knapsack-problem.html">same binary vector as was found by using a constrained linear program in a previous article</a>.
</p>
<p>
The GA algorithm maintains an internal state, which enables you to continue iterating if the current generation is not satisfactory. In this case, the problem is solved, so you can use the GAEND subroutine to release the internal memory and resources that are associated with this GA. After you call GAEND, the identifier becomes invalid.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">call</span> gaEnd<span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span>;   <span style="color: #006400; font-style: italic;">/* free the memory and internal resources for this GA */</span></pre></td></tr></table></div>




<h3>The advantages and disadvantage of genetic algorithms</h3>
<p>
As with any tool, it is important to recognize that a GA has strengths and weaknesses.
Strengths include:
</p>
<ul>
<li>A GA is amazingly flexible. It can be used to solve a wide variety of optimization problems.
</li>
<li>A GA can provide useful suboptimal solutions. The elite members of a population might be "good enough," even if they are not optimal.
</li>
</ul>
<p>
Weaknesses include:
</p>
<ul>
<li>A GA is dependent on random operations. If you change the random number seed, you might obtain a completely different solution or no solution at all.
</li>
<li>A GA can take a long time to produce an optimal solution. It does not tell you whether a candidate is optimal, only that it is the "most fit" so far.
</li>
<li>A GA requires many heuristic choices. It is not always clear how to implement the mutation and crossover operators or how to implement the tournament that selects individuals to be parents.
</li>
</ul>


<h3>Summary</h3>
<p>
In summary, this article shows how to use low-level routines in SAS/IML software to implement a genetic algorithm.
Genetic algorithms can solve optimization problems that are intractable for traditional mathematical optimization algorithms. Like all tools, a GA has strengths and weaknesses.
By gaining experience with GAs, you can build intuition about when and how to apply this powerful
method.
</p><p>
For other ways to use genetic algorithms in SAS, see <a href="https://documentation.sas.com/doc/en/pgmsascdc/v_013/orlsoug/orlsoug_ga_toc.htm">the GA procedure in SAS/OR software</a> and <a href="https://documentation.sas.com/doc/en/pgmsascdc/v_013/ormpug/ormpug_lsosolver_toc.htm">the black-box solver in PROC OPTMODEL.</a>
</p><p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/10/20/intro-genetic-algorithms-sas.html">An introduction to genetic algorithms in SAS</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IWkS_d0PYCc:U7WcThYB6uU:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IWkS_d0PYCc:U7WcThYB6uU:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IWkS_d0PYCc:U7WcThYB6uU:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=IWkS_d0PYCc:U7WcThYB6uU:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IWkS_d0PYCc:U7WcThYB6uU:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=IWkS_d0PYCc:U7WcThYB6uU:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IWkS_d0PYCc:U7WcThYB6uU:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=IWkS_d0PYCc:U7WcThYB6uU:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IWkS_d0PYCc:U7WcThYB6uU:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IWkS_d0PYCc:U7WcThYB6uU:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/10/20/intro-genetic-algorithms-sas.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2017/01/ProgrammingTips-2-150x150.png" />
	</item>
		<item>
		<title>Crossover and mutation: An introduction to two operations in genetic algorithms</title>
		<link>https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html#comments</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Mon, 18 Oct 2021 09:22:19 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Statistical Programming]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=35649</guid>

					<description><![CDATA[<p>This article uses an example to introduce to genetic algorithms (GAs) for optimization. It discusses two operators (mutation and crossover) that are important in implementing a genetic algorithm. It discusses choices that you must make when you implement these operations. Some programmers love using genetic algorithms. Genetic algorithms are heuristic [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html">Crossover and mutation: An introduction to two operations in genetic algorithms</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This article uses an example to introduce to <em>genetic algorithms</em> (GAs) for optimization. 
It discusses two operators (mutation and crossover) that are important in implementing a genetic algorithm. It discusses choices that you must make when you implement these operations. 
</p>
<p>
Some programmers love using genetic algorithms. 
Genetic algorithms are heuristic methods that can be used to solve problems that are difficult to solve by using standard discrete or calculus-based optimization methods. A genetic algorithm tries to mimic natural selection and evolution by starting with a population of random candidates. Candidates are evaluated for "fitness" by plugging them into the objective function. The better candidates are combined to create a new set of candidates. Some of the new candidates experience mutations. Eventually, over many generations, a GA can produce candidates that approximately solve the optimization problem.
Randomness plays an important role. Re-running a GA with a different random number seed might produce a different solution.
</p>
<p>
Critics of genetic algorithms note two weaknesses of the method. First, you are not guaranteed to get the optimal solution. However, in practice, GAs often find an acceptable solution that is good enough to be used. The second complaint is that the user must make many heuristic choices about how to implement the GA. Critics correctly note that implementing a genetic algorithm is as much an art as it is a science. You must choose values for hyperparameters and define operators that are often based on a "feeling" that these choices might result in an acceptable solution. 
</p>
<p>
This article discusses two fundamental parts of a genetic algorithm: the crossover and the mutation operators. The operations are discussed by using <a href="https://blogs.sas.com/content/iml/2021/10/11/knapsack-problem.html">the binary knapsack problem</a> as an example. In the knapsack problem, a knapsack can hold W kilograms. There are N objects, each with a different value and weight. You want to maximize the value of the objects you put into the knapsack without exceeding the weight.
A solution to the knapsack problem is a 0/1 binary vector <strong>b</strong>. If <strong>b</strong>[i]=1, the i_th object is in the knapsack; if <strong>b</strong>[i]=0, it is not. 
</p>

<h3>A brief overview of genetic algorithms</h3>
<p>
The <em>SAS/IML User's Guide</em> provides <a href="https://documentation.sas.com/doc/en/pgmsascdc/v_006/imlug/imlug_g%20eneticalgs_sect001.htm">an overview of genetic algorithms</a>. The main steps in a genetic algorithm are as follows:
</p>
<ul>
	<li><strong>Encoding</strong>: Each potential solution is represented as a <em>chromosome</em>, which is a vector of values. The values can be binary, integer-valued, or real-valued.
(The values are sometimes called genes.)
For the knapsack problem, each chromosome is an N-dimensional vector of binary values.
</li>
	<li><strong>Fitness</strong>: Choose a function to assess the fitness of each candidate chromosome. This is usually the objective function for unconstrained problems, or <a href="https://blogs.sas.com/content/iml/2021/10/13/penalties-constraints-optimization.html">a penalized objective function</a> for problems that have constraints. The fitness of a candidate determines the probability that it will contribute its genes to the next generation of candidates.
</li>
	<li><strong>Selection</strong>: Choose which candidates become parents to the next generation of candidates.
</li>
	<li><strong>Crossover (Reproduction)</strong>: Choose how to produce children from parents.
</li>
	<li><strong>Mutation</strong>: Choose how to randomly mutate some children to introduce additional diversity.
</li>
</ul>

<p>
This article discusses the crossover and the mutation operators. 
</p>

<h3>The mutation operator</h3>
<p>
The mutation operator is the easiest operation to understand. In each generation, some candidates are randomly perturbed. By chance, some of the mutations might be beneficial and make the candidate more fit. Others are detrimental and make the candidate less fit.
</p>
<p>
For a binary chromosome, a mutation consists of changing the parity of some proportion of the elements. 
The simplest mutation operation is to always change <em>k</em> random elements for some hyperparameter <em>k</em> &lt; N. 
A more realistic mutation operation is to choose the number of sites randomly according to a binomial probability distribution with hyperparameter <em>p</em><sub>mut</sub>.  Then <em>k</em> is a random variable that differs for each mutation operation.
</p>
<p>
The following SAS/IML program 
chooses <em>p</em><sub>mut</sub>=0.2 and defines a subroutine that mutates a binary vector, <strong>b</strong>. In this example, there are N=17 items that you can put into the knapsack. The subroutine first uses the Binom(<em>p</em><sub>mut</sub>, N) probability distribution to obtain a random number of sites, <em>k</em>, to mutate. (But if the distribution returns 0, set <em>k</em>=1.)
The SAMPLE function then draws <em>k</em> random positions (without replacement), 
and <a href="https://blogs.sas.com/content/iml/2021/10/06/logical-negation-vectors.html">the values in those positions are changed</a>.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">12345</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">N</span> = <span style="color: #2e8b57; font-weight: bold;">17</span>;                      <span style="color: #006400; font-style: italic;">/* size of binary vector */</span>
ProbMut= <span style="color: #2e8b57; font-weight: bold;">0.2</span>;                <span style="color: #006400; font-style: italic;">/* mutation in 20% of sites */</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* Mutation operator for a binary vector, b. 
   The number of mutation sites k ~ Binom(ProbMut, N), but not less than 1. 
   Randomly sample (without replacement) k sites. 
   If an item is not in knapsack, put it in; if an item is in the sack, take it out. */</span>
start Mutate<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>ProbMut<span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">N</span> = nrow<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
   k = <span style="color: #0000ff;">max</span><span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>, randfun<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>, <span style="color: #a020f0;">&quot;Binomial&quot;</span>, ProbMut, <span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* how many sites? */</span>
   j = sample<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">N</span>, k, <span style="color: #a020f0;">&quot;NoReplace&quot;</span><span style="color: #66cc66;">&#41;</span>;                <span style="color: #006400; font-style: italic;">/* choose random elements */</span>
   b<span style="color: #66cc66;">&#91;</span>j<span style="color: #66cc66;">&#93;</span> = ^b<span style="color: #66cc66;">&#91;</span>j<span style="color: #66cc66;">&#93;</span>;                                   <span style="color: #006400; font-style: italic;">/* mutate these sites */</span>
finish;
&nbsp;
Items = <span style="color: #2e8b57; font-weight: bold;">5</span>:<span style="color: #2e8b57; font-weight: bold;">12</span>;                  <span style="color: #006400; font-style: italic;">/* choose items 5-12 */</span>
b = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;  b<span style="color: #66cc66;">&#91;</span>Items<span style="color: #66cc66;">&#93;</span> = <span style="color: #2e8b57; font-weight: bold;">1</span>;
bOrig = b;
<span style="color: #000080; font-weight: bold;">run</span> Mutate<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;
print <span style="color: #66cc66;">&#40;</span>bOrig`<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Original b&quot;</span> c=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>
      <span style="color: #66cc66;">&#40;</span>b`<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Randomly Mutated b&quot;</span> c=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/GA1.png" alt="" width="369" height="201" class="alignnone size-full wp-image-35727" srcset="https://blogs.sas.com/content/iml/files/2021/10/GA1.png 369w, https://blogs.sas.com/content/iml/files/2021/10/GA1-300x163.png 300w" sizes="(max-width: 369px) 100vw, 369px" />

<p>
In this example, the original chromosome has a 1 in locations 5:12. The binomial distribution randomly decides to mutate <em>k</em>=4 sites. The SAMPLE function randomly chooses the locations 3, 11, 15, and 17. The parity of those sites is changed. This is seen in the output, which shows that the parity of these four sites differs between the original and the mutated <strong>b</strong> vector.
</p><p>
Notice that you must choose HOW the mutation operator works, and you must choose a hyperparameter that determines how many sites get mutated. The best choices depend on the problem you are trying to solve. Typically, you should choose a small value for the probability <em>p</em><sub>mut</sub> so that only a few sites are mutated.
</p>
<p>
In the SAS/IML language, there are several built-in mutation operations that you can use. They are discussed in <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect174.htm">the documentation for the GASETMUT subroutine</a>.
</p>

<h3>The crossover operator</h3>
<p>
The crossover operator is analogous to the creation of offspring through sexual reproduction. 
You, as the programmer, must decide how the parent chromosomes, p1 and p2, will combine to create two children, c1 and c2.
There are many choices you can make. Some reasonable choices include:
</p>
<ul>
<li>Randomly choose a location <em>s</em>, 1 &le; <em>s</em> &le; N. You then split the parent chromosomes at that location and exchange and combine the left and right portions of the parents' chromosomes.
One child chromosome is <tt>c1 = p1[1:s] // p2[s+1:N]</tt> and the other is 
<tt>c2 = p2[1:s] // p1[s+1:N]</tt>. Note that each child gets some values ("genes") from each parent.
</li>
<li>Randomly choose a location <em>s</em>, 1 &le; <em>s</em> &le; N. 
Divide the first chromosome into subvectors of length <em>s</em> and N-<em>s</em>.
Divide the second chromosome into subvectors of length N-<em>s</em> and <em>s</em>.
Exchange the subvectors of the same length to form the child chromosomes.
</li>
<li>Randomly choose <em>k</em> locations. Exchange the locations between parents to form the child chromosomes.
</li>
</ul>

<p>
The following SAS/IML function implements the third crossover method. 
The method uses a hyperparameter, <em>p</em><sub>cross</sub>, which is the probability that each location in the chromosome is selected. On average, about N<em>p</em><sub>cross</sub> locations will be selected. In the following program, <em>p</em><sub>cross</sub> = 0.3, so we expect 17(0.3)=5.1 values to be exchanged between the parent chromosomes to form the children:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;">start uniform_cross<span style="color: #66cc66;">&#40;</span>child1, child2, parent1, parent2<span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>ProbCross<span style="color: #66cc66;">&#41;</span>;
   b = j<span style="color: #66cc66;">&#40;</span>nrow<span style="color: #66cc66;">&#40;</span>parent1<span style="color: #66cc66;">&#41;</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">call</span> randgen<span style="color: #66cc66;">&#40;</span>b, <span style="color: #a020f0;">&quot;Bernoulli&quot;</span>, ProbCross<span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* 0/1 vector */</span>
   idx = loc<span style="color: #66cc66;">&#40;</span>b=<span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;                    <span style="color: #006400; font-style: italic;">/* locations to cross */</span>
&nbsp;
   child1 = parent1;                  <span style="color: #006400; font-style: italic;">/* initialize children */</span>
   child2 = parent2;
   <span style="color: #0000ff;">if</span> ncol<span style="color: #66cc66;">&#40;</span>idx<span style="color: #66cc66;">&#41;</span>&gt;<span style="color: #2e8b57; font-weight: bold;">0</span> <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">do</span>;            <span style="color: #006400; font-style: italic;">/* exchange values */</span>
      child1<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span> = parent2<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span>;     <span style="color: #006400; font-style: italic;">/* child1 gets some from parent2 */</span>
      child2<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span> = parent1<span style="color: #66cc66;">&#91;</span>idx<span style="color: #66cc66;">&#93;</span>;     <span style="color: #006400; font-style: italic;">/* child2 gets some from parent1 */</span>
   <span style="color: #0000ff;">end</span>;
finish;
&nbsp;
ProbCross = <span style="color: #2e8b57; font-weight: bold;">0.3</span>;                               <span style="color: #006400; font-style: italic;">/* crossover 25% of sites */</span>
Items = <span style="color: #2e8b57; font-weight: bold;">5</span>:<span style="color: #2e8b57; font-weight: bold;">12</span>;   p1 = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;  p1<span style="color: #66cc66;">&#91;</span>Items<span style="color: #66cc66;">&#93;</span> = <span style="color: #2e8b57; font-weight: bold;">1</span>; <span style="color: #006400; font-style: italic;">/* choose items 5-12 */</span>
Items = <span style="color: #2e8b57; font-weight: bold;">10</span>:<span style="color: #2e8b57; font-weight: bold;">15</span>;  p2 = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;  p2<span style="color: #66cc66;">&#91;</span>Items<span style="color: #66cc66;">&#93;</span> = <span style="color: #2e8b57; font-weight: bold;">1</span>; <span style="color: #006400; font-style: italic;">/* choose items 10-15 */</span>
<span style="color: #000080; font-weight: bold;">run</span> uniform_cross<span style="color: #66cc66;">&#40;</span>c1, c2, p1, p2<span style="color: #66cc66;">&#41;</span>;
print <span style="color: #66cc66;">&#40;</span>p1`<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Parent1&quot;</span> c=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>, <span style="color: #66cc66;">&#40;</span>p2`<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Parent2&quot;</span> c=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>,
      <span style="color: #66cc66;">&#40;</span>c1`<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Child1&quot;</span> c=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>, <span style="color: #66cc66;">&#40;</span>c2`<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Child2&quot;</span> c=<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span>:<span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/GA2.png" alt="" width="367" height="184" class="alignnone size-full wp-image-35697" srcset="https://blogs.sas.com/content/iml/files/2021/10/GA2.png 367w, https://blogs.sas.com/content/iml/files/2021/10/GA2-300x150.png 300w, https://blogs.sas.com/content/iml/files/2021/10/GA2-164x82.png 164w" sizes="(max-width: 367px) 100vw, 367px" />

<p>
I augmented the output to show how the child chromosomes are created from their parents. For this run, the selected locations are 1, 8, 10, 12, and 14. The first child gets all values from the first parent except for the values in these five positions, which are from the second parent. The second child is formed similarly.
</p>
<p>
When the parent chromosomes resemble each other, the children will resemble the parents. However, if the parent chromosomes are very different, the children might not look like either parent.
</p><p>
Notice that you must choose HOW the crossover operator works, and you must choose a hyperparameter that determines how to split the parent chromosomes. In more sophisticated crossover operations, there might be additional hyperparameters, such as the probability that a subchromosome from a parent gets reversed in the child.
There are many heuristic choices to make, and the best choice is not knowable. 
</p><p>
In the SAS/IML language, there are many built-in crossover operations that you can use. They are discussed in <a href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_013/imlug/imlug_langref_sect173.htm">the documentation for the GASETCRO subroutine</a>.
</p>

<h3>Summary</h3>
<p>
Genetic algorithms can solve optimization problems that are intractable for traditional mathematical optimization algorithms. But the power comes at a cost. The user must make many heuristic choices about how the GA should work. The user must choose hyperparameters that control the probability that certain events happen during mutation and crossover operations. 
The algorithm uses random numbers to generate new potential solutions from previous candidates.
This article used the SAS/IML language to discuss some of the choices that are required to implement these operations.
</p>
<p>
A subsequent article discusses how to implement a genetic algorithm in SAS.
</p>

<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html">Crossover and mutation: An introduction to two operations in genetic algorithms</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=QKtC0Kn-Oag:ymmSOKkyqK0:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=QKtC0Kn-Oag:ymmSOKkyqK0:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=QKtC0Kn-Oag:ymmSOKkyqK0:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=QKtC0Kn-Oag:ymmSOKkyqK0:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=QKtC0Kn-Oag:ymmSOKkyqK0:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=QKtC0Kn-Oag:ymmSOKkyqK0:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=QKtC0Kn-Oag:ymmSOKkyqK0:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=QKtC0Kn-Oag:ymmSOKkyqK0:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=QKtC0Kn-Oag:ymmSOKkyqK0:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=QKtC0Kn-Oag:ymmSOKkyqK0:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/10/18/crossover-mutation.html/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2021/10/GA1-150x150.png" />
	</item>
		<item>
		<title>Penalties versus constraints in optimization problems</title>
		<link>https://blogs.sas.com/content/iml/2021/10/13/penalties-constraints-optimization.html</link>
					<comments>https://blogs.sas.com/content/iml/2021/10/13/penalties-constraints-optimization.html#respond</comments>
		
		<dc:creator><![CDATA[Rick Wicklin]]></dc:creator>
		<pubDate>Wed, 13 Oct 2021 09:28:05 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Optimization]]></category>
		<guid isPermaLink="false">https://blogs.sas.com/content/iml/?p=35538</guid>

					<description><![CDATA[<p>Sometimes we can learn as much from our mistakes as we do from our successes. Recently, I needed to solve an optimization problem for which the solution vector was a binary vector subject to a constraint. I was in a hurry. Without thinking much about what I was doing, I [...]</p>
<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/10/13/penalties-constraints-optimization.html">Penalties versus constraints in optimization problems</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>
Sometimes we can learn as much from our mistakes as we do from our successes.
Recently, I needed to solve an optimization problem for which the solution vector was a binary vector subject to a constraint. I was in a hurry. Without thinking much about what I was doing, I decided to eliminate the constraint by introducing a penalty term into the objective function. I then solved the unconstrained penalized optimization problem.  A few days later, I thought further about what I had done. I had reformulated a constrained optimization problem as an unconstrained penalized optimization.
Was that a wise (or valid!) thing to do? Are there issues that I should have considered before I took reformulated the problem?
</p>

<h3>The knapsack problem</h3>
<p>
I recently wrote about how to solve a famous optimization problem called 
<a href="https://blogs.sas.com/content/iml/2021/10/11/knapsack-problem.html">the <em>knapsack problem</em></a>. 
In the knapsack problem, you assume that a knapsack can hold W kilograms. There are N objects whose values and weights are represented by elements of the vectors <strong>v</strong> and  <strong>w</strong>, respectively. The goal is to maximize the value of the objects you put into the knapsack without exceeding a weight constraint, W.
</p><p>
As shown in my previous article, the knapsack problem can be solved by using integer linear programming to find a binary (0/1) solution vector that maximizes <strong>v</strong>`*<strong>b</strong> subject to the constraint 
<strong>w</strong>`*<strong>b</strong> &le; W.  
</p><p>
The problem I needed to solve was similar to the knapsack problem, so let's examine what happens if you reformulate the knapsack problem to eliminate the constraint on the weights. 
The idea is to introduce a 
penalty for any set of weights that exceed the limit. 
For example, if a candidate set of items have weight W<sub>c</sub> &gt; W, then you could subtract a positive quantity such as 
&lambda;*(W<sub>c</sub> - W)<sup>2</sup>.  If the penalty parameter &lambda; &gt; 0 is large enough, then subtracting the penalty term will not affect the optimal solution, which we are trying to maximize. (If you are minimizing an objective function, then ADD a positive penalty term.)
Notice that the penalty term destroys the linearity of the objective function.
</p>

<h3>A penalized knapsack formulation</h3>
<p>
The following SAS/IML program demonstrates how to  use an IF-THEN statement to include a penalty term in the objective function for the knapsack problem:
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
<span style="color: #006400; font-style: italic;">/* Solve the knapsack problem: max Value*b  subject to Weight*b &lt;= WtLimit */</span>
<span style="color: #006400; font-style: italic;">/* Item:  1 2 3 4 5   6   7   8   9  10  11  12  13  14  15  16  17 */</span>
Weight = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">2</span> <span style="color: #2e8b57; font-weight: bold;">3</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">4</span> <span style="color: #2e8b57; font-weight: bold;">1.5</span> <span style="color: #2e8b57; font-weight: bold;">1.5</span> <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span>   <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#125;</span>;
Value  = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">6</span> <span style="color: #2e8b57; font-weight: bold;">6</span> <span style="color: #2e8b57; font-weight: bold;">6</span> <span style="color: #2e8b57; font-weight: bold;">5</span> <span style="color: #2e8b57; font-weight: bold;">3.1</span> <span style="color: #2e8b57; font-weight: bold;">3.0</span> <span style="color: #2e8b57; font-weight: bold;">1.5</span> <span style="color: #2e8b57; font-weight: bold;">1.3</span> <span style="color: #2e8b57; font-weight: bold;">1.2</span> <span style="color: #2e8b57; font-weight: bold;">1.1</span> <span style="color: #2e8b57; font-weight: bold;">1.0</span> <span style="color: #2e8b57; font-weight: bold;">1.1</span> <span style="color: #2e8b57; font-weight: bold;">1.0</span> <span style="color: #2e8b57; font-weight: bold;">1.0</span> <span style="color: #2e8b57; font-weight: bold;">0.9</span> <span style="color: #2e8b57; font-weight: bold;">0.8</span> <span style="color: #2e8b57; font-weight: bold;">0.6</span><span style="color: #66cc66;">&#125;</span>;
WtLimit = <span style="color: #2e8b57; font-weight: bold;">9</span>;                                 <span style="color: #006400; font-style: italic;">/* weight limit */</span>
<span style="color: #0000ff;">N</span> = ncol<span style="color: #66cc66;">&#40;</span>Weight<span style="color: #66cc66;">&#41;</span>;
lambda = <span style="color: #2e8b57; font-weight: bold;">10</span>;       <span style="color: #006400; font-style: italic;">/* factor to penalize exceeding the weight limit */</span>
&nbsp;
<span style="color: #006400; font-style: italic;">/* b is a binary column vector */</span>
start ObjFun<span style="color: #66cc66;">&#40;</span> b <span style="color: #66cc66;">&#41;</span> global<span style="color: #66cc66;">&#40;</span>Weight, Value, WtLimit, lambda<span style="color: #66cc66;">&#41;</span>;
   wsum = Weight <span style="color: #006400; font-style: italic;">* b;</span>
   val  = Value <span style="color: #006400; font-style: italic;">* b;</span>
   <span style="color: #0000ff;">if</span> wsum &gt; WtLimit <span style="color: #0000ff;">then</span>                      <span style="color: #006400; font-style: italic;">/* penalize if weight limit exceeded */</span>
      val = val - lambda<span style="color: #006400; font-style: italic;">*(wsum - WtLimit)##2;</span>  <span style="color: #006400; font-style: italic;">/* subtract b/c we want to maximize value */</span>
   <span style="color: #0000ff;">return</span><span style="color: #66cc66;">&#40;</span>val<span style="color: #66cc66;">&#41;</span>;
finish;
&nbsp;
<span style="color: #006400; font-style: italic;">/* call the objective function on some possible solution vectors */</span>
Items = <span style="color: #2e8b57; font-weight: bold;">5</span>:<span style="color: #2e8b57; font-weight: bold;">12</span>;                  <span style="color: #006400; font-style: italic;">/* choose items 5-12 */</span>
b = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;  b<span style="color: #66cc66;">&#91;</span>Items<span style="color: #66cc66;">&#93;</span> = <span style="color: #2e8b57; font-weight: bold;">1</span>;
ValPen = ObjFun<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;            <span style="color: #006400; font-style: italic;">/* values penalized if too much weight */</span>
ValUnPen = Value<span style="color: #006400; font-style: italic;">*b;</span>            <span style="color: #006400; font-style: italic;">/* raw values, ignoring the weights */</span>
Wt = Weight<span style="color: #006400; font-style: italic;">*b;</span>                 <span style="color: #006400; font-style: italic;">/* total weight of items */</span>
print <span style="color: #66cc66;">&#40;</span>ValPen||ValUnpen||Wt<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Items 5:12&quot;</span> 
    c=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;Penalized Value&quot;</span> <span style="color: #a020f0;">&quot;Unpenalized Value&quot;</span> <span style="color: #a020f0;">&quot;Weight&quot;</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#93;</span>;
&nbsp;
Items = <span style="color: #2e8b57; font-weight: bold;">5</span>:<span style="color: #2e8b57; font-weight: bold;">13</span>;                  <span style="color: #006400; font-style: italic;">/* choose items 5-13 */</span>
b = j<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>,<span style="color: #2e8b57; font-weight: bold;">1</span>,<span style="color: #2e8b57; font-weight: bold;">0</span><span style="color: #66cc66;">&#41;</span>;  b<span style="color: #66cc66;">&#91;</span>Items<span style="color: #66cc66;">&#93;</span> = <span style="color: #2e8b57; font-weight: bold;">1</span>;
ValPen = ObjFun<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#41;</span>;            <span style="color: #006400; font-style: italic;">/* values penalized if too much weight */</span>
ValUnPen = Value<span style="color: #006400; font-style: italic;">*b;</span>            <span style="color: #006400; font-style: italic;">/* raw values, ignoring the weights */</span>
Wt = Weight<span style="color: #006400; font-style: italic;">*b;</span>                 <span style="color: #006400; font-style: italic;">/* total weight of items */</span>
print <span style="color: #66cc66;">&#40;</span>ValPen||ValUnpen||Wt<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span>L=<span style="color: #a020f0;">&quot;Items 5:13&quot;</span> 
    c=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;Penalized Value&quot;</span> <span style="color: #a020f0;">&quot;Unpenalized Value&quot;</span> <span style="color: #a020f0;">&quot;Weight&quot;</span><span style="color: #66cc66;">&#125;</span><span style="color: #66cc66;">&#93;</span>;</pre></td></tr></table></div>




<img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/penalopt1.png" alt="" width="309" height="184" class="alignnone size-full wp-image-35637" srcset="https://blogs.sas.com/content/iml/files/2021/10/penalopt1.png 309w, https://blogs.sas.com/content/iml/files/2021/10/penalopt1-300x179.png 300w" sizes="(max-width: 309px) 100vw, 309px" />

<p>
In the first example (Items 5:12), the weight of the items does not exceed the weight limit. Therefore, the objective function returns the actual value of the items. 
In the second example (Items 5:13), the weight of the items is 10, which exceeds the weight limit. Therefore, the objective function applies the penalty term. Instead of returning 14.3 as the value of the items, the function returns 4.3, which is 10 less because the penalty term evaluates to -10*1<sup>2</sup> for this example. 
</p>
<p>
You can visualize how the penalty works by generating many random binary vectors. The following statements generate 1,000 random binary vectors where the probability of generating a 1 is 0.3. For each random vector, the program calls the penalized objective function to obtain a (penalized) value of the items. A scatter plot displays the penalized value versus the weight of the items.
</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* generate 1,000 random binary vectors as columns of b */</span>
<span style="color: #0000ff;">call</span> randseed<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1234</span><span style="color: #66cc66;">&#41;</span>;
NumSim = <span style="color: #2e8b57; font-weight: bold;">1000</span>;
Val = j<span style="color: #66cc66;">&#40;</span>NumSim, <span style="color: #2e8b57; font-weight: bold;">1</span>, .<span style="color: #66cc66;">&#41;</span>;
b = randfun<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>||NumSim, <span style="color: #a020f0;">&quot;Bernoulli&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">0.3</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to NumSim;
   Val<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span> = ObjFun<span style="color: #66cc66;">&#40;</span>b<span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">end</span>;
Wt  = Weight <span style="color: #006400; font-style: italic;">* b;</span>
&nbsp;
<span style="color: #0000ff;">title</span>  <span style="color: #a020f0;">&quot;Value of 1,000 Random Knapsacks&quot;</span>;
title2 <span style="color: #a020f0;">&quot;b ~ Bern(0.3); lambda=10&quot;</span>;
<span style="color: #0000ff;">call</span> scatter<span style="color: #66cc66;">&#40;</span>Wt, Val<span style="color: #66cc66;">&#41;</span> grid=<span style="color: #66cc66;">&#123;</span><span style="color: #0000ff;">x</span> y<span style="color: #66cc66;">&#125;</span> <span style="color: #0000ff;">label</span>=<span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;Weight&quot;</span><span style="color: #66cc66;">&#125;</span>
             other = <span style="color: #a020f0;">&quot;refline 9 / axis=x; yaxis min=-20 label='Value';&quot;</span>;</pre></td></tr></table></div>





<a href="https://blogs.sas.com/content/iml/files/2021/10/penalopt2.png"><img loading="lazy" src="https://blogs.sas.com/content/iml/files/2021/10/penalopt2.png" alt="" width="480" height="360" class="alignnone size-full wp-image-35640" srcset="https://blogs.sas.com/content/iml/files/2021/10/penalopt2.png 640w, https://blogs.sas.com/content/iml/files/2021/10/penalopt2-300x225.png 300w" sizes="(max-width: 480px) 100vw, 480px" /></a>

<p>
For each random binary vector, the ObjFun function returns the penalized value.
The graph shows that, on average, the penalized value peaks when the weight is 9 kg, which is the maximum weight the knapsack can hold. If the collective weight of the items exceeds 10, the "value" is actually NEGATIVE for this example! Although the graph shows only 1,000 random choices for the <strong>b</strong> vector, it is representative of the behavior for the full 2<sup>17</sup> = 131,072 possible values of <strong>b</strong>.
</p><p>
Notice that the largest value achieved by these random binary vectors occurs when the sum of weights is 9, which is the constraint. However, if you use a smaller value for the penalty parameter (such as &lambda;=1) and rerun the simulation, the largest values of the penalized objective function can occur for weights that exceed the constraint.  Try it!
</p><p>
The graph provides evidence that you can maximize the unconstrained penalized objective function and find the same optimal value, provided that the coefficient of the penalty term is large enough. 
</p>

<h3>Choosing the penalty term: A little bit of math</h3>
<p>
What does "large enough" mean? In the example, I used &lambda;=10, but is 10 large enough?  Can you find a value for &lambda; that guarantees that the solution of the penalized problem satisfies the constraint?
</p><p>
Let's examine how to choose the penalty parameter, &lambda;. Suppose your goal is to maximize a function <em>f(x)</em> subject to the general constraint that <em>g(x)</em> &le; 0 for some function <em>g</em>. You can define a penalty function, <em>p(x)</em>, which has the property <em>p(x)</em> = 0 whenever <em>g(x)</em> &le; 0, 
and
<em>p(x)</em> &gt; 0 whenever <em>g(x)</em> &gt; 0. A common choice is a quadratic penalty such as 
<em>p(x)</em> = |max(0, <em>g(x)</em>)|<sup>2</sup>.
You then maximize the penalized objective function <em>q(x;&lambda;)</em> = <em>f(x)</em> - &lambda;<em>p(x)</em> for a large value of the penalty parameter, &lambda;.
</p>

<p>
Suppose that <strong>b</strong><sup>*</sup> is an optimal solution and V<sup>*</sup> is the optimal value. For an arbitrary binary vector, <strong>b</strong>, we want to know whether 
<strong>v</strong> * <strong>b</strong> can be greater than V<sup>*</sup> if 
<strong>w</strong> * <strong>b</strong> &gt; W. 
Let V<sup>*</sup> + &Delta;V be the value and W + &Delta;W be the weight of the items that are selected by <strong>b</strong>. Then the penalized objective function is <br />
V<sup>*</sup> + &Delta;V - &lambda;(&Delta;W)<sup>2</sup> <br />
and we want to choose &lambda; so large that <br />
V<sup>*</sup> + &Delta;V - &lambda;(&Delta;W)<sup>2</sup> &le; V<sup>*</sup> <br />
Solving for &lambda; gives <br />
&lambda; &ge; &Delta;V / (&Delta;W)<sup>2</sup> <br />
</p>
<p>
You can examine the data to find a lower bound for &lambda;. You want the numerator to be as large as possible and the denominator to be as small as possible.  The largest value for the numerator is the total value of the items, sum(<strong>v</strong>). The smallest value for the denominator is the minimum difference between the weights of items, which is 0.5 for this example. Therefore, for this example, &lambda; &ge; 4*sum(<strong>v</strong>) will ensure that the solution of the unconstrained penalized objective function is also a solution for the constrained problem. This bound is not tight; you can choose a smaller value with a little more effort.
</p>

<h3>Conclusions</h3>
<p>
Even if something is possible, it is not necessarily a good idea. Solving a linear problem with linear constraints can be done very efficiently. If you introduce a penalty term, the objective function is no longer linear, and solving the problem might require more computational resources. Now that I think about it, it was a mistake to reformulate my problem.
</p><p>
In summary, I should have been more careful when I reformulated a constrained optimization problem as an unconstrained penalized optimization.  The optimal solution to the penalized problem might or might not be an optimal solution to the original problem. 
By looking carefully at the mathematics of this problem, I gained a better understanding of the relationship between constrained optimization and unconstrained penalized optimization.
</p>


<p>The post <a rel="nofollow" href="https://blogs.sas.com/content/iml/2021/10/13/penalties-constraints-optimization.html">Penalties versus constraints in optimization problems</a> appeared first on <a rel="nofollow" href="https://blogs.sas.com/content/iml">The DO Loop</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=qGCYKPAOlFU:ozVi3r81e-k:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=qGCYKPAOlFU:ozVi3r81e-k:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=qGCYKPAOlFU:ozVi3r81e-k:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=qGCYKPAOlFU:ozVi3r81e-k:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=qGCYKPAOlFU:ozVi3r81e-k:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=qGCYKPAOlFU:ozVi3r81e-k:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=qGCYKPAOlFU:ozVi3r81e-k:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=qGCYKPAOlFU:ozVi3r81e-k:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=qGCYKPAOlFU:ozVi3r81e-k:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=qGCYKPAOlFU:ozVi3r81e-k:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://blogs.sas.com/content/iml/2021/10/13/penalties-constraints-optimization.html/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			<enclosure url="https://blogs.sas.com/content/iml/files/2021/10/penalopt2-150x150.png" />
	</item>
	</channel>
</rss>
