<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>The DO Loop</title>
	
	<link>http://blogs.sas.com/content/iml</link>
	<description>Statistical programming in SAS with an emphasis on SAS/IML programs</description>
	<lastBuildDate>Wed, 19 Jun 2013 09:26:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/TheDoLoop" /><feedburner:info uri="thedoloop" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>TheDoLoop</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Macros and loops in the SAS/IML language</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/8bjPF-m8ENc/</link>
		<comments>http://blogs.sas.com/content/iml/2013/06/19/macros-and-loops/#comments</comments>
		<pubDate>Wed, 19 Jun 2013 09:26:37 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[SAS Programming]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8494</guid>
		<description><![CDATA[I am not a big fan of the macro language, and I try to avoid it when I write SAS/IML programs. I find that the programs with many macros are hard to read and debug. Furthermore, the SAS/IML language supports loops and indexing, so many macro constructs can be replaced [...]]]></description>
			<content:encoded><![CDATA[<p>
I am not a big fan of the macro language, and I try to avoid it when I write SAS/IML programs. I find that the programs with many macros are hard to read and debug. Furthermore, the SAS/IML language supports loops and indexing, so many macro constructs can be replaced by standard SAS/IML syntax.
</p>
<p>
Nevertheless, many SAS customers use macro constructs as part of their daily SAS programming tasks, and that practice often continues when they write SAS/IML programmers.  A customer recently asked a question about the macro language that required knowledge of the way that macro variables are handled within a SAS/IML loop. This post shares my response.
</p><p>
Here's the crux of the customer's question. Run the following SAS/IML program and see if you can understand why it behaves as it does:
</p>


<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">proc iml;
i = 7;
call symputx(&quot;j&quot;, i);    /* 1. Put value of i into macro variable j */
y1 = &amp;j;                 /* 2. Assign y1 the value of &amp;j            */
print y1;                /* success! */
&nbsp;
y = j(1,4,.);
do i = 1 to ncol(y);     /* 3. Start processing the DO block of statements */
   call symputx(&quot;j&quot;, i); /* 4. Put value of i into macro variable j */
   y[i] = &amp;j;            /* 5. Hmmmm, what does this do inside the loop? */
end;
print y;                 /* Not what you might expect? */</pre></div></div>




<img src="http://blogs.sas.com/content/iml/files/2013/06/macroloop.png" alt="" width="76" height="128" class="aligncenter size-full wp-image-8502" />

<p>
As you can see from the output, the first use of the macro variable (outside the DO loop), works as expected. But the second does not. The customer wanted to know why the elements of <tt>y</tt> are not 
set to 1, 2, 3, 4 within the loop.
</p><p>
The key point to remember about macro variables is that SAS code never sees them. Macro variables are evaluated by the macro preprocessor <em>at parse time</em>, not at run time. The SAS/IML code never sees &amp;j, only the constant value that the preprocessor substitutes for &amp;j.
</p><p>
It is also important to remember that PROC IML is an interactive procedure. (The "I" in IML stands for interactive!) Each statement or block of statements is parsed as it is encountered, as opposed to the DATA step, which parses the entire program before beginning execution.
</p><p>
Let's examine the program step-by-step to understand why the first construct works but the second does not. The following steps refer to the numbers in the program comments:
</p>
<ol>
<li>The value of the SAS/IML scalar <tt>i</tt> is copied (as text) into the macro variable <tt>j</tt>.</li>
<li>The statement is encountered. The value of the macro variable <tt>j</tt> is substituted by the macro preprocesser. Then the statement is executed. The SAS/IML variable <tt>y1</tt> is assigned to the value 7.</li>
<li>A DO loop is encountered by the SAS/IML parser. The parser finds the matching END statement and proceeds to parse the <em>entire</em> body of the loop in order to check for syntax errors.  This parsing phase occurs exactly one time.  Because the block of statements contain a macro variable, the macro preprocessor substitutes the value of the macro variable <tt>j</tt>, which is 7.
</li>
<li>For each iteration, the value of the SAS/IML scalar <tt>i</tt> is copied (as text) into the macro variable <tt>j</tt>.</li>
<li>For each iteration, the <em>i</em>th element of the <tt>y</tt> vector is assigned the value 7.  In particular, this statement does not contain a reference to the macro varible <tt>j</tt>.</li>
</ol>

<p>
To the casual reader of the program, it looks like &amp;j will have a different value during each step of the iteration. But but it doesn't. The expression &amp;j is resolved at <em>parse time</em>. SAS/IML parses the entire body of the DO loop once, before any execution occurs, and at parse time the expression &amp;j is 7.
</p>
<p>
There is a way to get what the customer wants. The <a href="http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#n08h8unph3lz0un1ap3kqru4iym0.htm">SYMGET function</a> retrieves the value of a macro variable at run time. Therefore the following statements fill the vector <tt>y</tt> with the values 1 through 4:
</p>


<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">do i = 1 to ncol(y);
   call symputx(&quot;j&quot;, i);
   y[i] = num(symget(&quot;j&quot;));  /* get macro value at run time */
end;
print y;                     /* Yes! This is what we want! */</pre></div></div>




<img src="http://blogs.sas.com/content/iml/files/2013/06/macroloop2.png" alt="" width="77" height="59" class="aligncenter size-full wp-image-8501" />

<p>
For me, this blog post emphasizes three facts:
</p>
<ul>
<li>Always remember that macro substitution is done by a preprocessor, which operates at parse time.</li>
<li>The SAS/IML language parses an entire block of statements (between the DO and END statements) one time before executing the block.</li>
<li>Mixing macro code and SAS/IML statements can be confusing and hard to debug.  When you have the option, use SAS/IML language features instead of relying on macro language constructs.</li>
</ul>
<div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/sasprogramming/">SAS Programming</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=8bjPF-m8ENc:JxNajsARg0Q:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=8bjPF-m8ENc:JxNajsARg0Q:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=8bjPF-m8ENc:JxNajsARg0Q:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=8bjPF-m8ENc:JxNajsARg0Q:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=8bjPF-m8ENc:JxNajsARg0Q:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=8bjPF-m8ENc:JxNajsARg0Q:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=8bjPF-m8ENc:JxNajsARg0Q:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=8bjPF-m8ENc:JxNajsARg0Q:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=8bjPF-m8ENc:JxNajsARg0Q:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=8bjPF-m8ENc:JxNajsARg0Q:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/8bjPF-m8ENc" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/06/19/macros-and-loops/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/06/19/macros-and-loops/</feedburner:origLink></item>
		<item>
		<title>Repetition factors versus frequency variables</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/wdOiVWedqXE/</link>
		<comments>http://blogs.sas.com/content/iml/2013/06/17/repetition-factors-versus-frequency-variables/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 10:25:23 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8483</guid>
		<description><![CDATA[A regular reader noticed my post on initializing vectors by using repetition factors and asked whether that technique would be useful to expand data that are given in value-frequency pairs. The short answer is "no." Repetition factors are useful for defining (static) matrix literals. However, if you want to expand [...]]]></description>
			<content:encoded><![CDATA[<p>
A regular reader noticed my post on <a href="http://blogs.sas.com/content/iml/2013/02/25/epitition-factors/">initializing vectors by using repetition factors</a> and asked whether that technique would be useful to expand data that are given in value-frequency pairs.  The short answer is "no."  Repetition factors are useful for defining (static) matrix literals. However, if you want to expand data dynamically, I recommend that you use the <a href="http://support.sas.com/documentation/cdl/en/imlug/64248/HTML/default/viewer.htm#imlug_langref_sect251.htm">REPEAT function</a> or the technique in the article on <a href="http://blogs.sas.com/content/iml/2012/05/04/expand-data-by-using-frequencies/">expanding data by using frequencies.</a>
</p>
<p>
For example, the following SAS/IML statement defines a vector for which the value 2.2 is repeated two times and the value 3.3 is repeated three times. The resulting vector has five elements:
</p>


<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">proc iml;
x = {[2] 2.2  [3] 3.3};</pre></div></div>



<p>
This vector is constructed as a matrix literal. If instead you have the values and frequencies in separate vectors, then use the <tt>ExpandFreq</tt> module in <a href="http://blogs.sas.com/content/iml/2012/05/04/expand-data-by-using-frequencies/">my previous post</a>: 
</p>


<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">values = {2.2 3.3};
freq   = {2   3};
load module=(ExpandFreq); /* define or load ExpandFreg module here */
y = ExpandFreq(values, freq);</pre></div></div>





<p>
This is probably a good time to remind everyone about the <a href="https://communities.sas.com/community/support-communities/sas_iml_and_sas_iml_studio">SAS/IML Support Community</a>. You can post your SAS/IML questions there 24 hours a day. That is always better than sending me direct email. There are lots of experienced SAS/IML experts out there, so use the SAS/IML Support Community to tap into that knowledge. 
</p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/data-analysis/">Data Analysis</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=wdOiVWedqXE:o2qIR7fdJyE:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=wdOiVWedqXE:o2qIR7fdJyE:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=wdOiVWedqXE:o2qIR7fdJyE:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=wdOiVWedqXE:o2qIR7fdJyE:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=wdOiVWedqXE:o2qIR7fdJyE:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=wdOiVWedqXE:o2qIR7fdJyE:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=wdOiVWedqXE:o2qIR7fdJyE:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=wdOiVWedqXE:o2qIR7fdJyE:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=wdOiVWedqXE:o2qIR7fdJyE:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=wdOiVWedqXE:o2qIR7fdJyE:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/wdOiVWedqXE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/06/17/repetition-factors-versus-frequency-variables/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/06/17/repetition-factors-versus-frequency-variables/</feedburner:origLink></item>
		<item>
		<title>How to interpret a residual-fit spread plot</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/ZDYKsTUrOAQ/</link>
		<comments>http://blogs.sas.com/content/iml/2013/06/12/interpret-residual-fit-spread-plot/#comments</comments>
		<pubDate>Wed, 12 Jun 2013 09:22:33 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8315</guid>
		<description><![CDATA[In a previous blog post, I described how to use a spread plot to compare the distributions of several variables. Each spread plot is a graph of centered data values plotted against the estimated cumulative probability. Thus, spread plots are similar to a (rotated) plot of the empirical cumulative distribution [...]]]></description>
			<content:encoded><![CDATA[<a href="http://blogs.sas.com/content/iml/files/2013/05/rfspread1.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/rfspread1-300x198.png" alt="" width="300" height="198" class="alignleft size-medium wp-image-8324" /></a>
<p>
In a previous blog post, I described <a href="http://blogs.sas.com/content/iml/2013/06/10/compare-data-distributions/">how to use a spread plot to compare the distributions of several variables</a>. Each spread plot is a graph of centered data values plotted against the estimated cumulative probability. Thus, spread plots are similar to a (rotated) plot of the empirical cumulative distribution function. Users of the SAS regression procedures will recognize the spread plots as one of the plots that are created automatically by procedures such as PROC REG. The spread plots of the fitted and residual values appear in the middle column of the third row of the <a href="http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_reg_sect050.htm">regression diagnostics panel</a>.
</p><p>
In the SAS documentation, the residual-fit spread plot is also called an "RF plot."  This article describes how to interpret the R-F spread plot.
</p>
<h3>The residual-fit spread plot in SAS output</h3>
<p>
When I first saw the R-F spread plot in the PROC REG diagnostics panel, there were two things that I found confusing:
</p>
<ul>
<li>
The title of the left plot is "Fit&ndash;Mean."  I read the title as "fit <em>hyphen </em>mean," and I didn't know what that meant. However, the correct way to read the title is "fit <em>minus </em>mean," which is equivalent to "centered fit."  
</li><li>
The label for the horizontal axis is "Proportion Less."  I didn't know what that meant, either.  I now know that scatter plot shows empirical quantiles versus their <a href="http://en.wikipedia.org/wiki/Q%E2%80%93Q_plot#Plotting_positions" title="Wikipdia paragraph on plotting positions for quantiles"><em>plotting positions</em></a>. Recall that the <em>p</em>th empirical quantile is the data value that is greater
than (or equal to) a proportion <em>p</em> of the data values. Therefore, if a point on the scatter plot has coordinates (<em>p<sub>i</sub>, q<sub>i</sub></em>), it means that the vertical coordinate is the <em>i</em>th quantile, and approximately <em>p<sub>i</sub></em> of the other data values are less than that proportion.
</li></ul>

<h3>History of the residual-fit spread plot</h3>
<p>
The spread plot is a graph of the centered data versus the corresponding plotting position. Essentially, it is a plot of the sorted data against the corresponding rank, except that using the plotting position instead of ranks makes it possible to compare variables that have different numbers of nonmissing observations. Also, using centered data instead of raw values enables you to compare the spread of variables that have different means.
</p><p>
William S. Cleveland featured the residual-fit spread plot in his book <a href="http://www.amazon.com/Visualizing-Data-William-S-Cleveland/dp/0963488406"><em>Visualizing Data</em> (1993)</a>. He describes how to create a <em>quantile plot</em> on pp. 17&ndash;20, and describes quantile plots for fitted and residual values on p. 35&ndash;38. Then he says (p. 40):
</p>

<blockquote>
It is informative to study how influential the [explanatory] variable is in explaining the variation in the [response variable]. The fitted values and the residuals are two sets of values each of which has a distribution. If the spread of the fitted-value distribution is large compared with the spread of the residual distribution, then the [explanatory] variable is influential. If it is small, the [explanatory] variable is not as influential.... Since it is the spreads of the distributions that are of interest, the fitted values minus their overall mean are graphed.... This <em>residual-fit spread plot</em>, or <em>r-f spread plot</em>, shows [whether] the spreads of the residuals and fit values are comparable.
</blockquote>
<br />
<p>
Cleveland goes on to use the R-F spread plot about 20 times in multiple examples.
</p>

<h3>The residual-fit spread plot as a regression diagnostic</h3>
<p>
Following Cleveland's examples, the residual-fit spread plot can be used to assess the fit of a regression as follows:
</p>
<ul>
<li><strong>Compare the spread of the fit to the spread of the residuals.</strong>  This is the main idea. If the left side of the plot (the centered fitted values) is taller than the right side (the residual values), then you conclude that the spread of the residual values is small relative to the spread of the fitted values.
</li><li>
<strong>Examine the distribution of the residual values.</strong> The quantile plot of the residual values contains all of the information that a box plot does&mdash;and more. If the distribution does not appear to be normally distributed, the model might not fit the data.
</li><li>
<strong>Are there extreme values for the distribution of the residual values?</strong> These indicate outliers: observations for which the observed value is far from the fitted value.
</li><li>
<strong>Are there extreme values for the distribution of the fitted values?</strong> These might indicate influential observations that have high leverage in the model. They need to be investigated.
</li></ul>

<h3>Some examples on simulated data</h3>
<p>The best way to practice interpreting the R-F spread plots are to view some examples for which the true model is known. The following DATA step simulates two response variables:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">data</span> RegData<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">drop</span>=i<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">call</span> streaminit<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">12345</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #2e8b57; font-weight: bold;">100</span>;
   <span style="color: #0000ff;">x</span> = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span><span style="color: #66cc66;">&#41;</span>;
   y1 = <span style="color: #2e8b57; font-weight: bold;">2</span> + <span style="color: #2e8b57; font-weight: bold;">4</span>*<span style="color: #0000ff;">x</span>    + rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">0.25</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* small error */</span>
   y2 = <span style="color: #2e8b57; font-weight: bold;">2</span> + <span style="color: #2e8b57; font-weight: bold;">4</span>*<span style="color: #0000ff;">x</span>**<span style="color: #2e8b57; font-weight: bold;">2</span> + rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;    <span style="color: #006400; font-style: italic;">/* not linear in x */</span>
   <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></div></div>




<p>
For a real regression analysis, I would look at several diagnostic plots, but in the subsequent examples I will only interpret the residual-fit spread plots. I use the DIAGNOSTICS(UNPACK) option on the PLOTS= option to extract the R-F spread plot from the diagnostics panel.
</p>
<h4>Example 1: Examining the residual variation in a model</h4>
<p>
The <tt>y1</tt> variable has a small error term. The following statements display the R-F spread plot:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;y = 2 + 4*x + eps, eps ~ N(0,0.25)&quot;</span>;
ods <span style="color: #0000ff;">select</span> RFPlot;
<span style="color: #000080; font-weight: bold;">proc reg</span> <span style="color: #000080; font-weight: bold;">data</span> = RegData plots=diagnostics<span style="color: #66cc66;">&#40;</span>unpack<span style="color: #66cc66;">&#41;</span>;
   model y1 = <span style="color: #0000ff;">x</span>;
<span style="color: #000080; font-weight: bold;">quit</span>;</pre></div></div>



<a href="http://blogs.sas.com/content/iml/files/2013/05/rfspread2.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/rfspread2.png" alt="" width="400" height="300" class="aligncenter size-full wp-image-8326" /></a>

<p>
Notice that the left plot (the centered fitted values) is "taller" than the right plot (the residual values), which indicates that the residual values have a smaller spread. In terms of the model, the <tt>x</tt> variable accounts for a significant portion of the variation in the model, with only a little residual variation.  
</p><p>
You can change the standard deviation of the error term to 5 and rerun the program.  The new R-F spread plot (not shown), shows that the spread of the residual values is larger than the spread of the fitted values. The interpretation would be that considerable variation remains after accounting for the effect of the <tt>x</tt> variable.
</p>
<h4>Example 2: A misspecified model</h4>
<p>
In the previous example, the model was correctly specified. In this second example, the true model is quadratic in the <tt>x</tt> variable, but the fitted model is linear in <tt>x</tt>.
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;y = 2 + 4*x**3 + eps, eps ~ N(0,0.25)&quot;</span>;
title2 <span style="color: #a020f0;">&quot;Model is y = x&quot;</span>;
ods <span style="color: #0000ff;">select</span> RFPlot;
<span style="color: #000080; font-weight: bold;">proc reg</span> <span style="color: #000080; font-weight: bold;">data</span> = RegData plots<span style="color: #66cc66;">&#40;</span>only<span style="color: #66cc66;">&#41;</span>=diagnostics<span style="color: #66cc66;">&#40;</span>unpack<span style="color: #66cc66;">&#41;</span>;
   model y2 = <span style="color: #0000ff;">x</span>;
<span style="color: #000080; font-weight: bold;">quit</span>;</pre></div></div>




<a href="http://blogs.sas.com/content/iml/files/2013/06/rfspread4.png"><img src="http://blogs.sas.com/content/iml/files/2013/06/rfspread4.png" alt="" width="400" height="300" class="aligncenter size-full wp-image-8364" /></a>

<p>
In the R-F spread plot for the (misspecified) model, the right-hand plot is taller than the left-hand plot. This shows that there is a lot of variation that is not explained by the model. Furthermore, the residual distribution does not appear to be normally distributed. The right tail of the residual distribution is long, and the distribution is skewed. If I saw a plot like this in real data, I would look at other plots (such as the plot of residuals versus the predicted values) to see if the residuals exhibit a pattern that can be modeled. 
</p>

<h3>Closing Comments</h3>
<p>
Several SAS regression procedures produce a regression diagnostics panel automatically.
Each graph reveals information about the regression model and whether it fits the data well.  This article has described how to interpret a residual-fit plot, which is located in the last row of the diagnostics panel.    The residual-fit spread plot, which was featured prominently in Cleveland's book, <em>Visualizing Data</em>, is one tool in the arsenal of regression diagnostic plots.
</p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/data-analysis/">Data Analysis</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=ZDYKsTUrOAQ:M_sd1-WXMCg:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=ZDYKsTUrOAQ:M_sd1-WXMCg:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=ZDYKsTUrOAQ:M_sd1-WXMCg:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=ZDYKsTUrOAQ:M_sd1-WXMCg:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=ZDYKsTUrOAQ:M_sd1-WXMCg:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=ZDYKsTUrOAQ:M_sd1-WXMCg:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=ZDYKsTUrOAQ:M_sd1-WXMCg:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=ZDYKsTUrOAQ:M_sd1-WXMCg:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=ZDYKsTUrOAQ:M_sd1-WXMCg:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=ZDYKsTUrOAQ:M_sd1-WXMCg:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/ZDYKsTUrOAQ" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/06/12/interpret-residual-fit-spread-plot/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/06/12/interpret-residual-fit-spread-plot/</feedburner:origLink></item>
		<item>
		<title>Visually comparing different data distributions: The spread plot</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/IG33PUzcNHo/</link>
		<comments>http://blogs.sas.com/content/iml/2013/06/10/compare-data-distributions/#comments</comments>
		<pubDate>Mon, 10 Jun 2013 09:26:50 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Statistical Graphics]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8286</guid>
		<description><![CDATA[Suppose that you have several data distributions that you want to compare. Questions you might ask include "Which variable has the largest spread?" and "Which variables exhibit skewness?" More generally, you might be interested in visualizing how the distribution of one variable differs from the distribution of other variables. The [...]]]></description>
			<content:encoded><![CDATA[<p>
Suppose that you have several data distributions that you want to compare. Questions you might ask include "Which variable has the largest spread?" and "Which variables exhibit skewness?"
More generally, you might be interested in visualizing how the distribution of one variable differs from  the distribution of other variables.
</p><p>
The usual way to compare data distributions is to use histograms.  One technique is to display a panel of histograms, which are known as <a href="http://support.sas.com/documentation/cdl/en/procstat/65543/HTML/default/viewer.htm#procstat_univariate_examples15.htm" title="How to create comparative histograms by using PROC UNIVARIATE">comparative histograms</a>.  I have used this approach to <a href="http://blogs.sas.com/content/iml/2011/02/18/comedy-vs-drama-a-comparative-histogram-of-tvs-top-earners/">compare salaries between two categories of workers</a>.  The comparative histogram is produced automatically by PROC UNIVARIATE when the analysis includes a classification variable.
</p><p>
A related approach is to  
<a href="http://blogs.sas.com/content/iml/2011/06/21/overlaying-two-histograms-in-sas/">use transparency to overlay two histograms</a> on the same axis. This method works well for two distributions. However, for three or more distributions, an overlay of histograms can be difficult to read.
</p>
<h3>A problem of scale</h3>
<p>
If one of the variables has a range that is an order of magnitude greater than the range of another variable, the comparative histogram can lose its effectiveness. To illustrate this, consider the following comparative histogram of three widely varying quantities:
</p>

<a href="http://blogs.sas.com/content/iml/files/2013/05/spread1.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/spread1.png" alt="" width="400" height="300" class="aligncenter size-full wp-image-8296" /></a>

<p>
The variable in the top histogram has a range that is 10 times the range of the variable in the lower histogram. Consequently, all data in the bottom panel is inside of a single histogram bin. This plot does not reveal anything about the distribution of the third variable.
</p><p>
A different way to compare distributions is to plot a panel of the empirical cumulative distribution functions (CDF). The following call to PROC UNIVARIATE creates these "comparative CDF plots," as well as the comparative histograms, for simulated data:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* simulate three variables with different distributions */</span>
<span style="color: #000080; font-weight: bold;">data</span> Wide;
<span style="color: #0000ff;">call</span> streaminit<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">123</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #2e8b57; font-weight: bold;">100</span>;
   <span style="color: #0000ff;">Uniform</span> = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Uniform&quot;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">Normal</span> = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">Gamma</span> = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Gamma&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">4</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* transpose from wide to long data format to create a CLASS variable */</span>
<span style="color: #000080; font-weight: bold;">proc transpose</span> <span style="color: #000080; font-weight: bold;">data</span>=Wide name=ID out=Long<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">drop</span>=i<span style="color: #66cc66;">&#41;</span>;  <span style="color: #0000ff;">by</span> i;  <span style="color: #000080; font-weight: bold;">run</span>;
<span style="color: #000080; font-weight: bold;">data</span> Long; 
   <span style="color: #0000ff;">set</span> Long<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">rename</span>=<span style="color: #66cc66;">&#40;</span>Col1=<span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>; 
   <span style="color: #0000ff;">label</span> ID=;
<span style="color: #000080; font-weight: bold;">run</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* create comparative histograms and CDF plots */</span>
ods <span style="color: #0000ff;">select</span> Histogram CDFPlot;
<span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=Long;
   class ID;
   histogram <span style="color: #0000ff;">x</span> / nrows=<span style="color: #2e8b57; font-weight: bold;">3</span>;
   cdfplot <span style="color: #0000ff;">x</span> / nrows=<span style="color: #2e8b57; font-weight: bold;">3</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></div></div>




<a href="http://blogs.sas.com/content/iml/files/2013/05/spread2.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/spread2.png" alt="" width="400" height="300" class="aligncenter size-full wp-image-8295" /></a>

<p>
The comparative CDF plot shows the empirical distributions. The horizontal axis indicates the range of the data: [0, 11]  for the <tt>Gamma</tt> variable, [-3,3] for the <tt>Normal</tt> variable, and [-0.5, 0.5] for the <tt>Uniform</tt> variable. However, the plot is far from perfect. Although the CDF plot reveals the approximate center and scale of the data, I would find it difficult to conclude from the comparative CDF plot that the <tt>Gamma</tt> variable is skewed whereas the <tt>Normal</tt> variable is symmetric.
</p>

<h3>The spread plot for distributions</h3>
<p>
In the classic books <a href="http://www.amazon.com/Graphical-Methods-Data-Analysis-Statistics/dp/0412052717"><em>Graphical Methods for Data Analysis</em> (Chambers, et al., 1983)</a>
and 
<a href="http://www.amazon.com/Visualizing-Data-William-S-Cleveland/dp/0963488406/ref=pd_bxgy_b_img_y">Visualizing Data (Cleveland, 1993)</a>, the authors recommend using scatter plots of quantiles to visualize and compare distributions.
</p><p>
Consider rotating the CDF panel 90 degrees in the counter-clockwise direction. If you also center the data so that each variable has zero mean, then the result called a <em>spread plot</em>. The spread plot, as its name implies, is used to compare the spread (range) of distributions, as well as to give some indication of the tail behavior (long tails, outliers, and so forth) in the data.
</p><p>
I have previously blogged about <a href="http://blogs.sas.com/content/iml/2011/10/28/modeling-the-distribution-of-data-create-a-qq-plot/">how to compute empirical quantiles</a>. The following SAS/IML program centers each variable and estimates the empirical cumulative probabilities.  The SGPANEL procedure is then used to visualize the results:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;
varNames = <span style="color: #66cc66;">&#123;</span><span style="color: #a020f0;">&quot;Gamma&quot;</span> <span style="color: #a020f0;">&quot;Normal&quot;</span> <span style="color: #a020f0;">&quot;Uniform&quot;</span><span style="color: #66cc66;">&#125;</span>;
use Wide;  read all <span style="color: #0000ff;">var</span> varNames <span style="color: #0000ff;">into</span> <span style="color: #0000ff;">x</span>;  <span style="color: #0000ff;">close</span> Wide;
&nbsp;
<span style="color: #006400; font-style: italic;">/* assume nonmissing values */</span>
<span style="color: #0000ff;">x</span> = <span style="color: #0000ff;">x</span> - <span style="color: #0000ff;">mean</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;          <span style="color: #006400; font-style: italic;">/* 1. Center the variables */</span>
<span style="color: #0000ff;">N</span> = nrow<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;
F = j<span style="color: #66cc66;">&#40;</span>nrow<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>, ncol<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to ncol<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">x</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* 2. Compute quantile of each variable */</span>
   F<span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span> = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">rank</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">X</span><span style="color: #66cc66;">&#91;</span>,i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>-<span style="color: #2e8b57; font-weight: bold;">0.375</span><span style="color: #66cc66;">&#41;</span>/<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">N</span>+<span style="color: #2e8b57; font-weight: bold;">0.25</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* Blom (1958) fractional rank */</span>
<span style="color: #0000ff;">end</span>;
ID = <span style="color: #0000ff;">repeat</span><span style="color: #66cc66;">&#40;</span>varNames, <span style="color: #0000ff;">N</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* 3. ID variables */</span>
<span style="color: #0000ff;">create</span> Spread <span style="color: #0000ff;">var</span> <span style="color: #66cc66;">&#123;</span>ID <span style="color: #0000ff;">x</span> F<span style="color: #66cc66;">&#125;</span>;  append;  <span style="color: #0000ff;">close</span> Spread;
<span style="color: #000080; font-weight: bold;">quit</span>;
&nbsp;
<span style="color: #0000ff;">title</span> <span style="color: #a020f0;">&quot;Comparative Spread Plot&quot;</span>;
<span style="color: #000080; font-weight: bold;">proc sgpanel</span> <span style="color: #000080; font-weight: bold;">data</span>=Spread;
   panelby ID / rows=<span style="color: #2e8b57; font-weight: bold;">1</span> novarname;
   scatter <span style="color: #0000ff;">x</span>=f y=<span style="color: #0000ff;">x</span>;
   refline <span style="color: #2e8b57; font-weight: bold;">0</span> / axis=y;
   colaxis <span style="color: #0000ff;">label</span>=<span style="color: #a020f0;">&quot;Proportion of Data&quot;</span>;
   rowaxis <span style="color: #0000ff;">display</span>=<span style="color: #66cc66;">&#40;</span>nolabel<span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></div></div>



<a href="http://blogs.sas.com/content/iml/files/2013/05/spread3.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/spread3.png" alt="" width="400" height="300" class="aligncenter size-full wp-image-8346" /></a>

<p>
There does not appear to be a standard term for the quantity on the horizontal axis, which I have labeled "proportion of data."  I call it the estimated cumulative probability, but it is also called the <em>estimated proportion</em>, <a href="http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm"><em>the fractional rank</em></a>, the <em>estimate for cumulative proportion</em>, or simply <em>plotting positions</em>. In Chambers et al. (1983) it is called the <em>fraction of data</em>. In Cleveland (1993) it is called the <em>f-value</em>, which is short for fractional value. In some SAS output it is labeled "proportion less [than the value shown]."
(There is also <a href="http://www.jstor.org/stable/2684934">many different ways to compute the plotting positions</a>. Chambers and Cleveland each use the simpler fractional rank given by  <em>r<sub>i</sub>&ndash;0.5)/(N+1)</em>,
where <em>r<sub>i</sub></em> is the rank of the <em>i</em>th observations. I have used Blom's  1958 formula, which is used heavily in SAS software.)
</p><p>
You could create a similar plot by using PROC RANK and the DATA step.  The nice thing about the spread plot is that the range of the data is evident. The range for the centered <tt>Gamma</tt> variable is [-3,8]. Furthermore, notice that the mean value is greater than the median value, which is a standard <a href="http://www.amstat.org/publications/jse/v13n2/vonhippel.html" title="A rule of thumb that is not always true">rule of thumb for continuous skewed distributions</a>. In contrast, the <tt>Normal</tt> and <tt>Uniform</tt> variables are both visually symmetric. The <tt>Normal</tt> variable clearly has two moderate tails, whereas the <tt>Uniform</tt> variable appears to be a bounded distribution.
</p><p>
A plot that uses quantiles to compare distributions is more powerful than the technique of comparing histograms. However, the quantile plot requires more skill to interpret.
</p><p>

Although the spread plot is not a well-known display, it appears prominently in the output of several popular SAS procedures. My next blog post will discuss how to interpret a spread plot when it appears in the output of SAS regression procedures.</p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/data-analysis/">Data Analysis</a>, <a href="http://blogs.sas.com/content/iml/tag/statistical-graphics/">Statistical Graphics</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IG33PUzcNHo:iqFVy4dm2RU:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IG33PUzcNHo:iqFVy4dm2RU:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IG33PUzcNHo:iqFVy4dm2RU:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=IG33PUzcNHo:iqFVy4dm2RU:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IG33PUzcNHo:iqFVy4dm2RU:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=IG33PUzcNHo:iqFVy4dm2RU:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IG33PUzcNHo:iqFVy4dm2RU:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=IG33PUzcNHo:iqFVy4dm2RU:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IG33PUzcNHo:iqFVy4dm2RU:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=IG33PUzcNHo:iqFVy4dm2RU:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/IG33PUzcNHo" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/06/10/compare-data-distributions/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/06/10/compare-data-distributions/</feedburner:origLink></item>
		<item>
		<title>Using simulation to compute a power curve</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/YeuKqQDLwyo/</link>
		<comments>http://blogs.sas.com/content/iml/2013/06/05/simulation-power-curve/#comments</comments>
		<pubDate>Wed, 05 Jun 2013 09:25:37 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Sampling and Simulation]]></category>
		<category><![CDATA[Statistical Programming]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8423</guid>
		<description><![CDATA[Last week I showed how to use simulation to estimate the power of a statistical test. I used the two-sample t test to illustrate the technique. In my example, the difference between the means of two groups was 1.2, and the simulation estimated a probability of 0.72 that the t [...]]]></description>
			<content:encoded><![CDATA[<p>
Last week I showed <a href="http://blogs.sas.com/content/iml/2013/05/30/simulation-power/">how to use simulation to estimate the power of a statistical test</a>. I used the two-sample <em>t</em> test to illustrate the technique.  In my example,  the difference between the means of two groups was 1.2,  and the simulation estimated a probability of 0.72 that the <em>t</em> test would detect the difference.
</p><p>
Of course, the value 1.2 was completely arbitrary. You could just as well ask for the probability of detecting a difference of size 0.5, 1.0, or 2.0.  You can repeat the simulation for any value of the difference and thereby construct a power curve. That's what I do in this article, which is based on material in Chapter 5 of  <a href="http://support.sas.com/publishing/authors/wicklin.html"><em>Simulating Data with SAS</em></a>.
</p><p>
Here is the result of running the previous simulation for differences of size 0, 0.1, 0.2, ..., 2.0. 
 The points along the curve are power estimates from the simulation, along with 95% confidence intervals. The size of the error bars reflects the number of simulations used for each point estimate, which is 5,000 for this simulation. The underlying curve is the result of running PROC POWER, which can solve this problem exactly. For more complicated statistical tests, an exact power curve is not available  and simulation is the only way to estimate power.
</p>
<a href="http://blogs.sas.com/content/iml/files/2013/05/simttest.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/simttest.png" alt="" width="400" height="300" class="aligncenter size-full wp-image-8427" /></a>


<p>
The simulation takes about 30 seconds to run on a desktop PC from 2010. You can <a href='http://blogs.sas.com/content/iml/files/2013/06/simttest.txt'>download the complete SAS program that runs the simulation</a>.  The program is virtually identical to the program I discussed last week, except that there is now a loop over the parameter that controls the mean difference (<tt>do Delta = 0 to 2 by 0.1</tt>), and I have included the <tt>Delta</tt> parameter in two BY statements.
</p>
<p>
A few comments about the power curve:
</p>
<ul>
<li>In this simulation, the two populations have unit standard deviations. The graph shows that when the true difference between the means is 0.5, there is a 20% chance of detecting the difference with such a small sample size (<em>n</em><sub>1</sub>=<em>n</em><sub>2</sub>=10). The chance rises to 56% when the difference is 1. To have an 80% chance, the means must differ by about 1.35.</li>
<li>When there is no difference between the population means, the graph shows that there is a 5% chance that the <em>t</em> test will reject the null hypothesis (which is that the means are equal). This is consistent with the fact that the test is performed at the &alpha; = 5% significance level.</li>
<li>Notice that the confidence intervals for the power are smaller when the curve is flatter and larger when the curve is steeper. You can exploit this to make your simulations run faster by using fewer simulation trials in certain parameter ranges. See the Chapter 6, "Strategies for Efficient and Effective Simulation," of <em>Simulating Data with SAS</em>.</li>
</ul>
<p>
<a href='http://blogs.sas.com/content/iml/files/2013/06/simttest.txt'>Download the program</a> and experiment with changing the parameters. Change <tt>NumSamples</tt> to 1000 to make your experiments run faster. What does the curve look like when (<em>n</em><sub>1</sub>=<em>n</em><sub>2</sub>=20)? </p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/sampling-and-simulation/">Sampling and Simulation</a>, <a href="http://blogs.sas.com/content/iml/tag/statistical-programming/">Statistical Programming</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YeuKqQDLwyo:IuwVCEvDrEE:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YeuKqQDLwyo:IuwVCEvDrEE:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YeuKqQDLwyo:IuwVCEvDrEE:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=YeuKqQDLwyo:IuwVCEvDrEE:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YeuKqQDLwyo:IuwVCEvDrEE:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=YeuKqQDLwyo:IuwVCEvDrEE:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YeuKqQDLwyo:IuwVCEvDrEE:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=YeuKqQDLwyo:IuwVCEvDrEE:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YeuKqQDLwyo:IuwVCEvDrEE:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YeuKqQDLwyo:IuwVCEvDrEE:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/YeuKqQDLwyo" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/06/05/simulation-power-curve/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/06/05/simulation-power-curve/</feedburner:origLink></item>
		<item>
		<title>Passing values from PROC IML into SAS procedures</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/Ibnb2QqnP34/</link>
		<comments>http://blogs.sas.com/content/iml/2013/06/03/passing-values-into-procedures/#comments</comments>
		<pubDate>Mon, 03 Jun 2013 09:25:18 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Getting Started]]></category>
		<category><![CDATA[Tips and Techniques]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8220</guid>
		<description><![CDATA[A SAS user told me that he computed a vector of values in the SAS/IML language and wanted to use those values on a statement in a SAS procedure. The particular application involved wanting to use the values on the ESTIMATE and CONTRAST statements in a SAS regression procedure, but [...]]]></description>
			<content:encoded><![CDATA[<p>
A SAS user told me that he computed a vector of values in the SAS/IML language and wanted to use those values on a statement in a SAS procedure. The particular application involved wanting to use the values on the ESTIMATE and CONTRAST statements in a SAS regression procedure, but my solution applies in general to any statement for any SAS procedure.
</p><p>
There are two ways to pass values from PROC IML to other procedures. One solution stores the values in a macro variable at run time.
The other solution calls the SAS procedure from within the SAS/IML program and passes the SAS/IML vector as a parameter.
</p>
<p>
To give a concrete example, suppose that you have a set of numbers in a SAS/IML vector, and you want to use the numbers as a list of percentile values in PROC UNIVARIATE, as follows:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Goal: specify a list of numbers to the PCTLPTS= option */</span>
<span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=sashelp.cars noprint;
   <span style="color: #0000ff;">var</span> MPG_City;
   <span style="color: #0000ff;">output</span> out=pctl pctlpre=p pctlpts=<span style="color: #2e8b57; font-weight: bold;">2.5</span> <span style="color: #2e8b57; font-weight: bold;">10</span> <span style="color: #2e8b57; font-weight: bold;">50</span> <span style="color: #2e8b57; font-weight: bold;">90</span> <span style="color: #2e8b57; font-weight: bold;">97.5</span>; <span style="color: #006400; font-style: italic;">/* numbers go here */</span>
<span style="color: #000080; font-weight: bold;">run</span>;</pre></div></div>



<p>
In this example, we assume that the numbers on the PCTLPTS= option are the result of some SAS/IML computation. 
</p>
<h3>The macro variable solution</h3>
<p>
The first solution converts the vector into a string of values that are separated by blanks. That string can then be placed in a macro variable. The macro variable can be referenced in the SAS procedure. 
The following steps create a macro variable from values in a SAS/IML row vector:
</p>
<ol>
<li>Convert the numeric vector into a character vector. You can use the <a href="http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p12zqzvwx4dv6kn1p9crijxswolk.htm">PUTN function</a> in Base SAS or the SAS/IML <a href="http://support.sas.com/documentation/cdl/en/imlug/64248/HTML/default/viewer.htm#imlug_langref_sect043.htm">CHAR function</a> to convert a numeric vector.
</li>
<li>Use the <a href="http://support.sas.com/documentation/cdl/en/imlug/64248/HTML/default/viewer.htm#imlug_langref_sect257.htm">ROWCAT function</a> to concatenate the values into a single blank-delimited string of values. (If your values are in a column vector, transpose the vector before calling the ROWCAT function.) 
</li>
<li>
Use the <a>SYMPUT function</a> to store the string of values in a macro variable.
</li></ol>
<p>
You can then call PROC UNIVARIATE and reference the macro variable. The program follows:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;  
pctl = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">2.5</span> <span style="color: #2e8b57; font-weight: bold;">10</span> <span style="color: #2e8b57; font-weight: bold;">50</span> <span style="color: #2e8b57; font-weight: bold;">90</span> <span style="color: #2e8b57; font-weight: bold;">97.5</span><span style="color: #66cc66;">&#125;</span>;               <span style="color: #006400; font-style: italic;">/* these are the values  */</span>
s = rowcat<span style="color: #66cc66;">&#40;</span> char<span style="color: #66cc66;">&#40;</span>pctl<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">call</span> symputx<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;PctList&quot;</span>, s<span style="color: #66cc66;">&#41;</span>;                     <span style="color: #006400; font-style: italic;">/* create macro variable */</span>
<span style="color: #000080; font-weight: bold;">quit</span>;
&nbsp;
<span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=sashelp.cars noprint;
   <span style="color: #0000ff;">var</span> MPG_City;
   <span style="color: #0000ff;">output</span> out=pctl pctlpre=p pctlpts=<span style="color: #0000ff; font-weight: bold;">&amp;PctList</span>; <span style="color: #006400; font-style: italic;">/* use SAS/IML values here */</span>
<span style="color: #000080; font-weight: bold;">run</span>;</pre></div></div>



<p>
I have used the same trick to <a href="http://blogs.sas.com/content/iml/2011/06/08/calling-base-sas-functions-from-sasiml-how-to-handle-argument-lists/">pass values from SAS/IML vectors into Base SAS functions that expect a comma-delimited list.</a> Use the macro variable solution when you are finished with PROC IML and are ready to use other procedures.
</p>

<h3>An easier way: Use the SUBMIT statement</h3>
<p>
There is even an easier way that passes the values of a SAS/IML vector directly into a SAS procedure without creating a macro variable and without leaving PROC IML: use the SUBMIT and ENDSUBMIT statements.
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc iml</span>;  
pctl = <span style="color: #66cc66;">&#123;</span><span style="color: #2e8b57; font-weight: bold;">2.5</span> <span style="color: #2e8b57; font-weight: bold;">10</span> <span style="color: #2e8b57; font-weight: bold;">50</span> <span style="color: #2e8b57; font-weight: bold;">90</span> <span style="color: #2e8b57; font-weight: bold;">97.5</span><span style="color: #66cc66;">&#125;</span>;
&nbsp;
submit  pctl;
   <span style="color: #000080; font-weight: bold;">proc univariate</span> <span style="color: #000080; font-weight: bold;">data</span>=sashelp.cars noprint;
      <span style="color: #0000ff;">var</span> MPG_City;
      <span style="color: #0000ff;">output</span> out=pctl pctlpre=p pctlpts=<span style="color: #0000ff; font-weight: bold;">&amp;pctl</span>;  <span style="color: #006400; font-style: italic;">/* list from SAS/IML vector */</span>
   <span style="color: #000080; font-weight: bold;">run</span>;
endsubmit;</pre></div></div>



<p>
If you include the name of a SAS/IML vector on the SUBMIT statement, the contents of that vector are substituted for the expression  (in this case, &amp;<tt>pctl</tt>) before the SUBMIT block is sent to SAS.  Notice that there is <em>not a macro variable</em> called <tt>pctl</tt>. Although the  &amp;<tt>pctl</tt> expression <em>looks</em> like a macro variable substitution, it is actually accomplished through text substitution at run time, not through a macro preprocessor.
</p><p>
Use the SUBMIT method when you will need to compute further quantities in PROC IML.  
To learn more, watch my video about <a href="http://blogs.sas.com/content/iml/2011/10/24/video-calling-sas-procedures-from-the-sasiml-language/">calling SAS functions from the SAS/IML language</a>.
</p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/getting-started/">Getting Started</a>, <a href="http://blogs.sas.com/content/iml/tag/tips-and-techniques/">Tips and Techniques</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ibnb2QqnP34:O9JY5JFYHLk:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ibnb2QqnP34:O9JY5JFYHLk:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ibnb2QqnP34:O9JY5JFYHLk:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Ibnb2QqnP34:O9JY5JFYHLk:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ibnb2QqnP34:O9JY5JFYHLk:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Ibnb2QqnP34:O9JY5JFYHLk:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ibnb2QqnP34:O9JY5JFYHLk:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Ibnb2QqnP34:O9JY5JFYHLk:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ibnb2QqnP34:O9JY5JFYHLk:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Ibnb2QqnP34:O9JY5JFYHLk:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/Ibnb2QqnP34" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/06/03/passing-values-into-procedures/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/06/03/passing-values-into-procedures/</feedburner:origLink></item>
		<item>
		<title>Using simulation to estimate the power of a statistical test</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/Uf02bGTDuts/</link>
		<comments>http://blogs.sas.com/content/iml/2013/05/30/simulation-power/#comments</comments>
		<pubDate>Thu, 30 May 2013 09:20:56 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Efficiency]]></category>
		<category><![CDATA[Sampling and Simulation]]></category>
		<category><![CDATA[Statistical Programming]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8263</guid>
		<description><![CDATA[The power of a statistical test measures the test's ability to detect a specific alternate hypothesis. For example, educational researchers might want to compare the mean scores of boys and girls on a standardized test. They plan to use the well-known two-sample t test. The null hypothesis is that the [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://en.wikipedia.org/wiki/Statistical_power" title="Wikipedia article on statistical power">power of a statistical test</a> measures the test's ability to detect a specific alternate hypothesis. For example, educational  researchers might want to compare the mean scores of boys and girls on a standardized test. They plan to use the well-known two-sample <em>t</em> test. The null hypothesis is that the means of the two groups are equal. Will the <em>t</em> test be able to detect a difference in the two group means (if it exists) by rejecting the null hypothesis?
</p><p>
The ability to detect differences in the group means depends on the sample sizes in the study, but it is safe to say that the <em>t</em> test is unlikely to detect small differences in the students' mean scores, and it is more likely to detect larger differences. 
</p><p>
In general, the <em>power of a test</em> is the probability that the test will
reject the null hypothesis when a specified alternative is true. For the <em>t</em> test, power means the probability that the test can detect a mean difference of a specified magnitude.
In general, this can be a difficult computation because it requires knowing the
sampling distribution of the test statistic under
the alternative hypothesis.  For simple tests (such as the two-sample <em>t</em> test), the sampling distribution is known, but for more complicated  statistical tests the power computation might be  available only by using simulation methods. 
</p><p>
This article describes how to use simulation to estimate the power of the <em>t</em> test. For this simple test, we can check the simulation by using the exact answer, as provided by the POWER procedure in SAS. However, the simulation will illustrate general ideas that you can use to estimate the power of more complicated statistical tests. This article is based on material in Chapter 5 of <a href="http://support.sas.com/publishing/authors/wicklin.html"><em>Simulating Data with SAS</em></a>.
</p>
<h3>Formulation of the problem</h3>
<p>
In this simulation, I assume that each population is normally distributed. For simplicity, assume the population for Group 1 is N(0, 1) and the population for Group 2 is N(&delta;, 1), where &delta; &gt; 0 is the difference between the population means.
</p><p>
The null hypothesis for the <em>t</em> test is that &delta; = 0.  Given two samples, the t test will either reject the null hypothesis at the &alpha; = 0.05 significance level or it won't.  In the simulation approach, you simulate many samples, and estimate the probability of rejecting the null hypothesis by using the empirical proportion of simulated samples that rejected the null hypothesis. (This uses <a href="http://en.wikipedia.org/wiki/Law_of_large_numbers">the law of large numbers</a>.)
</p><p>
The following steps estimate the power for the two-sample pooled <em>t</em> test: 
</p>
<ol>
<li>Simulate data from the model for each group's population. These are the samples. The populations are chosen so that the true difference between the population means is  &delta; &gt; 0. (The null hypothesis is false.)
</li><li>
Run the TTEST procedure on the samples. For each sample, record whether the <em>t</em> test rejects the null hypothesis.  
</li><li>
Count how many times the <em>t</em> test rejects the null hypothesis. This proportion is an estimate for the power of the test.
</li></ol>

<h3>Simulating the data</h3>
<p>
The following SAS statements define parameters for the simulation and use the DATA step to simulate 5,000 simulated trials.  All of the data are in a single data set, and the <tt>SampleID</tt> variable identifies which observations belong to which simulated trial. In this simulation, each group has 10 observations and the true difference between the population means is 1.2, which is a little larger than the standard deviations of the populations.
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">%let</span> n1 = <span style="color: #2e8b57; font-weight: bold;">10</span>;             <span style="color: #006400; font-style: italic;">/* group sizes*/</span>
<span style="color: #0000ff;">%let</span> n2 = <span style="color: #2e8b57; font-weight: bold;">10</span>;
<span style="color: #0000ff;">%let</span> NumSamples = <span style="color: #2e8b57; font-weight: bold;">5000</span>;   <span style="color: #006400; font-style: italic;">/* number of simulated samples */</span>  
<span style="color: #0000ff;">%let</span> Delta = <span style="color: #2e8b57; font-weight: bold;">1.2</span>;         <span style="color: #006400; font-style: italic;">/* true size of mean difference in population */</span>
&nbsp;
<span style="color: #000080; font-weight: bold;">data</span> PowerSim<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">drop</span>=i<span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">call</span> streaminit<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">321</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> SampleID = <span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #0000ff; font-weight: bold;">&amp;NumSamples</span>;
   <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #0000ff; font-weight: bold;">&amp;n1</span>;
      c=<span style="color: #2e8b57; font-weight: bold;">1</span>;  <span style="color: #0000ff;">x</span> = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, <span style="color: #2e8b57; font-weight: bold;">0</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;      <span style="color: #006400; font-style: italic;">/* Group 1: x ~ N(0,1) */</span>
      <span style="color: #0000ff;">output</span>;
   <span style="color: #0000ff;">end</span>;
   <span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #0000ff; font-weight: bold;">&amp;n2</span>;
      c=<span style="color: #2e8b57; font-weight: bold;">2</span>;  <span style="color: #0000ff;">x</span> = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span>, <span style="color: #0000ff; font-weight: bold;">&amp;Delta</span>, <span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* Group 2: x ~ N(Delta, 1) */</span>
      <span style="color: #0000ff;">output</span>;
   <span style="color: #0000ff;">end</span>;
<span style="color: #0000ff;">end</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></div></div>




<h3>Analyzing the Simulated Data</h3>
<p>
As I have written previously, <a href="http://blogs.sas.com/content/iml/2012/07/18/simulation-in-sas-the-slow-way-or-the-by-way/">use BY-group processing to carry out efficient simulation and analysis in SAS</a>. Also, be sure to  <a href="http://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations/">suppress the display of tables and graphs during the analysis by using the the %ODSoff and macro</a>.  The following SAS statements define the <a href="http://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations/">%ODSOff and %ODSOn macros</a>, and analyze all data for the simulated trials:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">%macro</span> ODSOff<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* Call prior to BY-group processing */</span>
   ods graphics off;  ods exclude all;  ods noresults;
<span style="color: #0000ff;">%mend</span>;
&nbsp;
<span style="color: #0000ff;">%macro</span> ODSOn<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* Call after BY-group processing */</span>
   ods graphics <span style="color: #0000ff;">on</span>;  ods exclude none;  ods results;
<span style="color: #0000ff;">%mend</span>;
&nbsp;
<span style="color: #006400; font-style: italic;">/* 2. Compute (pooled) t test for each sample */</span>
%ODSOff 
<span style="color: #000080; font-weight: bold;">proc ttest</span> <span style="color: #000080; font-weight: bold;">data</span>=PowerSim; 
   <span style="color: #0000ff;">by</span> SampleID; 
   class c; 
   <span style="color: #0000ff;">var</span> <span style="color: #0000ff;">x</span>; 
   ods <span style="color: #0000ff;">output</span> ttests=TTests<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">where</span>=<span style="color: #66cc66;">&#40;</span>method=<span style="color: #a020f0;">&quot;Pooled&quot;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>; 
<span style="color: #000080; font-weight: bold;">run</span>; 
%ODSOn</pre></div></div>



<p>
The TTEST procedure creates an output data set that contains 5,000 rows, one for each simulated trial. The data set includes a variable named <tt>Probt</tt>, which gives the result of the <em>t</em> test on each trial.
</p>
<h3>Count the rejections to estimate power</h3>
<p>
The last step in the simulation is to estimate the power, which is the probability of rejecting the null hypothesis.  The following SAS statements create an indicator variable that has the value 1 if the <em>t</em> test rejected the null hypothesis at the 0.05 significance level, and the value 0 otherwise. (Alternatively, you could define a SAS format.) You can then use PROC FREQ to count the proportion of trials for which the <em>t</em> test rejected the null hypothesis.
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* Construct indicator var for obs that reject H0 at 0.05 significance */</span> 
<span style="color: #000080; font-weight: bold;">data</span> Results; 
   <span style="color: #0000ff;">set</span> TTests; 
   RejectH0 = <span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">Probt</span> &lt;= <span style="color: #2e8b57; font-weight: bold;">0.05</span><span style="color: #66cc66;">&#41;</span>; 
<span style="color: #000080; font-weight: bold;">run</span>; 
&nbsp;
<span style="color: #006400; font-style: italic;">/* 3. Compute proportion: (# that reject H0)/NumSamples and CI */</span> 
<span style="color: #000080; font-weight: bold;">proc freq</span> <span style="color: #000080; font-weight: bold;">data</span>=Results; 
   tables RejectH0 / nocum binomial<span style="color: #66cc66;">&#40;</span>level=<span style="color: #a020f0;">'1'</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">run</span>;</pre></div></div>




<a href="http://blogs.sas.com/content/iml/files/2013/05/t_simttest.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/t_simttest.png" alt="" width="209" height="263" class="aligncenter size-full wp-image-8405" /></a>
<p>
The FREQ procedure indicates that the power of the two-sample <em>t</em> test is about 72%.  The 95% confidence interval for that estimate is [0.708, 0.733].
This estimate is for the scenario of samples of sizes 10, where one sample is drawn from N(0,1) and the other is drawn from N(1.2, 1). 
</p><p>
As mentioned earlier, you can use PROC POWER to find the exact power for the two-sample <em>t</em> test, as follows:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc power</span>;
  twosamplemeans  power = .           <span style="color: #006400; font-style: italic;">/* missing ==&gt; &quot;compute this&quot; */</span>
    meandiff=<span style="color: #2e8b57; font-weight: bold;">1.2</span> stddev=<span style="color: #2e8b57; font-weight: bold;">1</span> ntotal=<span style="color: #2e8b57; font-weight: bold;">20</span>;  <span style="color: #006400; font-style: italic;">/* 20 obs in the two samples  */</span>
<span style="color: #000080; font-weight: bold;">run</span>;</pre></div></div>



<a href="http://blogs.sas.com/content/iml/files/2013/05/t_simttest2.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/t_simttest2.png" alt="" width="126" height="86" class="aligncenter size-full wp-image-8406" /></a>
<p>
The estimate from the simulation is very close to the true power of the <em>t</em> test.
</p><p>
Next week I will extend this simulation to estimate points along a power curve.
</p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/efficiency/">Efficiency</a>, <a href="http://blogs.sas.com/content/iml/tag/sampling-and-simulation/">Sampling and Simulation</a>, <a href="http://blogs.sas.com/content/iml/tag/statistical-programming/">Statistical Programming</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Uf02bGTDuts:6quApaa7HoY:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Uf02bGTDuts:6quApaa7HoY:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Uf02bGTDuts:6quApaa7HoY:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Uf02bGTDuts:6quApaa7HoY:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Uf02bGTDuts:6quApaa7HoY:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Uf02bGTDuts:6quApaa7HoY:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Uf02bGTDuts:6quApaa7HoY:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=Uf02bGTDuts:6quApaa7HoY:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Uf02bGTDuts:6quApaa7HoY:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=Uf02bGTDuts:6quApaa7HoY:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/Uf02bGTDuts" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/05/30/simulation-power/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/05/30/simulation-power/</feedburner:origLink></item>
		<item>
		<title>New heat maps in the REG procedure</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/LyG2F1jDbhM/</link>
		<comments>http://blogs.sas.com/content/iml/2013/05/28/heatmaps-in-re/#comments</comments>
		<pubDate>Tue, 28 May 2013 09:23:46 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Data Analysis]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8179</guid>
		<description><![CDATA[Has anyone noticed that the REG procedure in SAS/STAT 12.1 produces heat maps instead of scatter plots for fit plots and residual plots when the regression involves more than 5,000 observations? I wasn't aware of the change until a colleague informed me, although the change is discussed in the "Details" [...]]]></description>
			<content:encoded><![CDATA[<p>
Has anyone noticed that the REG procedure in SAS/STAT 12.1 produces heat maps instead of scatter plots for fit plots and residual plots when the regression involves more than 5,000 observations?  I wasn't aware of the change until a colleague informed me, although <a href="http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_reg_details40.htm">the change is discussed in the "Details" section</a> of the PROC REG documentation for SAS/STAT 12.1.
</p>
<p>
Here is how the fit plot looks for fewer than 5,000 observations:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #006400; font-style: italic;">/* simulate data from a linear regression model */</span>
<span style="color: #000080; font-weight: bold;">data</span> RegData;
<span style="color: #0000ff;">call</span> streaminit<span style="color: #66cc66;">&#40;</span><span style="color: #2e8b57; font-weight: bold;">1</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">do</span> i = <span style="color: #2e8b57; font-weight: bold;">1</span> to <span style="color: #2e8b57; font-weight: bold;">6000</span>;
   <span style="color: #0000ff;">x</span> = rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Uniform&quot;</span><span style="color: #66cc66;">&#41;</span>;
   y = <span style="color: #2e8b57; font-weight: bold;">1</span> + <span style="color: #2e8b57; font-weight: bold;">2</span>*<span style="color: #0000ff;">x</span> + <span style="color: #2e8b57; font-weight: bold;">3</span>*rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span><span style="color: #66cc66;">&#41;</span>;
   <span style="color: #0000ff;">if</span> i &lt; <span style="color: #2e8b57; font-weight: bold;">20</span> <span style="color: #0000ff;">then</span> y = y + <span style="color: #2e8b57; font-weight: bold;">8</span>*rand<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">&quot;Normal&quot;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* add a few outliers */</span>
   <span style="color: #0000ff;">output</span>;
<span style="color: #0000ff;">end</span>;
&nbsp;
ods <span style="color: #0000ff;">select</span> FitPlot<span style="color: #66cc66;">&#40;</span>persist<span style="color: #66cc66;">&#41;</span>;
<span style="color: #000080; font-weight: bold;">proc reg</span> <span style="color: #000080; font-weight: bold;">data</span>=RegData<span style="color: #66cc66;">&#40;</span>obs=<span style="color: #2e8b57; font-weight: bold;">4000</span><span style="color: #66cc66;">&#41;</span> plots<span style="color: #66cc66;">&#40;</span>only<span style="color: #66cc66;">&#41;</span>=FitPlot;
   model y = <span style="color: #0000ff;">x</span>;
<span style="color: #000080; font-weight: bold;">quit</span>;</pre></div></div>




<a href="http://blogs.sas.com/content/iml/files/2013/05/regheatmap1.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/regheatmap1.png" alt="" width="400" height="300" class="aligncenter size-full wp-image-8212" /></a>

<p>
With fewer than 5,000 observations, I get the usual fit plot that consists of a scatter plot overlaid with a curve of predicted value, a band for the confidence interval for the mean, and dashed lines that indicate the confidence intervals for individual predictions. (The confidence limits are barely visible. Click the graph to enlarge it.) However, watch what happens when I use more than 5,000 observations:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #000080; font-weight: bold;">proc reg</span> <span style="color: #000080; font-weight: bold;">data</span>=RegData plots<span style="color: #66cc66;">&#40;</span>only<span style="color: #66cc66;">&#41;</span>=FitPlot;
   model y = <span style="color: #0000ff;">x</span>;
<span style="color: #000080; font-weight: bold;">quit</span>;
ods <span style="color: #0000ff;">select</span> all;</pre></div></div>




<a href="http://blogs.sas.com/content/iml/files/2013/05/regheatmap2.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/regheatmap2.png" alt="" width="400" height="300" class="aligncenter size-full wp-image-8213" /></a>

<p>
The scatter plot is gone, replaced by a heat map that shows the density of the data. The predicted values are still present, although the graphical style used to draw it is different, which results in a red line. The confidence intervals are gone.
</p><p>
Overall, this is a nice feature and I think that the change is a good idea.
The reason for the change is easy to understand: scatter plots suffer from overplotting when there are many points, so it is more useful to visualize the density of the observations than the individual observations. Furthermore, although scatter plots are very fast to construct, when there are many points a heat map (which bins observations) is faster to compute and render than  a scatter plot.
</p><p>
The plot does not currently include confidence intervals, but there is no reason why these can't be added in a future release. However, the confidence interval for the mean predictions will usually be tiny for large data sets&mdash;already it is barely visible in the plot of 4,000 points.
</p>
<h3>Controlling the appearance of the heat map</h3>
<p>
Prior to SAS/STAT 12.1, the REG procedure created a fit plot as a scatter plot for small data sets (less than 5,000 points). For larger sample sizes, the procedure suppressed the fit plot.  The behavior was controlled by using MAXPOINTS= option on the PLOTS= option on the PROC REG statement.  </p><p>
In SAS/STAT 12.1, the MAXPOINTS= option accepts two arguments, and the default values are
MAXPOINTS=5000&nbsp;150000. The first argument specifies the data size for which heat maps are used instead of scatter plots. The <a href="http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_reg_syntax01.htm">documentation of the MAXPOINTS= option</a> states that
"when the number of points exceeds [the first number] but does not exceed [the second number] divided by the number of independent variables, heat maps are displayed instead of scatter plots for the fit and residual plots." In other words, if you have a regression with <em>k</em> explanatory variables, heat maps are used when the number of observations is between 5,000 and 150,000/<em>k</em>.  Of course, you can use the MAXPOINTS= option to change either or both of those values.
</p><p>
Any comments on this new behavior in PROC REG?
</p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/data-analysis/">Data Analysis</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=LyG2F1jDbhM:SJO4JgkrreA:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=LyG2F1jDbhM:SJO4JgkrreA:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=LyG2F1jDbhM:SJO4JgkrreA:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=LyG2F1jDbhM:SJO4JgkrreA:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=LyG2F1jDbhM:SJO4JgkrreA:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=LyG2F1jDbhM:SJO4JgkrreA:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=LyG2F1jDbhM:SJO4JgkrreA:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=LyG2F1jDbhM:SJO4JgkrreA:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=LyG2F1jDbhM:SJO4JgkrreA:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=LyG2F1jDbhM:SJO4JgkrreA:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/LyG2F1jDbhM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/05/28/heatmaps-in-re/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/05/28/heatmaps-in-re/</feedburner:origLink></item>
		<item>
		<title>Turn off ODS when running simulations in SAS</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/YKzBBaf1GlE/</link>
		<comments>http://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations/#comments</comments>
		<pubDate>Fri, 24 May 2013 09:24:34 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Sampling and Simulation]]></category>
		<category><![CDATA[Tips and Techniques]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8377</guid>
		<description><![CDATA[In my article "Simulation in SAS: The slow way or the BY way," I showed how to use BY-group processing rather than a macro loop in order to efficiently analyze simulated data with SAS. In the example, I analyzed the simulated data by using PROC MEANS, and I use the [...]]]></description>
			<content:encoded><![CDATA[<p>
In my article <a href="http://blogs.sas.com/content/iml/2012/07/18/simulation-in-sas-the-slow-way-or-the-by-way/">"Simulation in SAS: The slow way or the BY way,"</a> I showed how to use BY-group processing rather than a macro loop in order to efficiently analyze simulated data with SAS. In the example, I analyzed the simulated data by using PROC MEANS, and I use the NOPRINT option to suppress the ODS output that the procedure would normally produce.
</p><p>
About 50 SAS/STAT procedures support the NOPRINT option in the PROC statement. When you specify the NOPRINT option, ODS is temporarily disabled while the procedure runs. This prevents SAS from displaying tables and graphs that would otherwise be produced for each BY group.  For a simulation that computes  statistics for thousands of BY groups, suppressing the display of tables results in a substantial savings of time.
</p><p>
Newer SAS procedures do not always support a NOPRINT statement. However, you can still suppress the ODS output. The following macros encapsulate statements that turn the ODS system off and on. I call the %ODSOff macro before I start the BY-group analysis; I call the %ODSOn macro after the analysis completes.
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;"><span style="color: #0000ff;">%macro</span> ODSOff<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* Call prior to BY-group processing */</span>
ods graphics off;
ods exclude all;
ods noresults;
<span style="color: #0000ff;">%mend</span>;
&nbsp;
<span style="color: #0000ff;">%macro</span> ODSOn<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #006400; font-style: italic;">/* Call after BY-group processing */</span>
ods graphics <span style="color: #0000ff;">on</span>;
ods exclude none;
ods results;
<span style="color: #0000ff;">%mend</span>;</pre></div></div>




<p>
For example, if I were using PROC ROBUSTREG  to analyze many samples of simulated data, I might use the following pseudo-code:
</p>


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;">%ODSOff
<span style="color: #000080; font-weight: bold;">proc robustreg</span> <span style="color: #000080; font-weight: bold;">data</span>=MySimData;
   <span style="color: #0000ff;">BY</span> SampleID;
   model y = <span style="color: #0000ff;">x</span>;
   ods <span style="color: #0000ff;">output</span> ParameterEstimates = OutputStats;  <span style="color: #006400; font-style: italic;">/* &lt;== insert name of ODS table */</span>
<span style="color: #000080; font-weight: bold;">run</span>;
%ODSOn</pre></div></div>



<p>
Even though ODS is suppressed to the display destinations (such as LISTING and HTML), you can capture the statistics that result from each analysis by using an ODS OUTPUT statement, which saves an ODS table to a SAS data set. Other ways to save statistics include using an OUTPUT statement, an OUT= or OUTEST= data set, and so forth.
</p><p>

Be aware that some SAS procedures (such as PROC MIXED) write a NOTE to the SAS log as part of their normal operation.  The NOTE might say something like "NOTE: Convergence criteria met." For these procedures, you will also want to turn off notes, lest they fill the SAS log:


<div class="wp_syntax"><div class="code"><pre class="sas" style="font-family:monospace;">%ODSOff
<span style="color: #0000ff;">options</span> nonotes;  <span style="color: #006400; font-style: italic;">/* use NONOTES to suppress notes to the log */</span>
<span style="color: #000080; font-weight: bold;">proc mixed</span> ...;
model y = ...;
<span style="color: #000080; font-weight: bold;">run</span>;
<span style="color: #0000ff;">options</span> notes;   <span style="color: #006400; font-style: italic;">/* turn NOTES back on */</span>
%ODSOn</pre></div></div>



</p><p>
The material in this blog post is taken from my book <a href="http://support.sas.com/publishing/authors/wicklin.html"><em>Simulating Data with SAS</em></a>, which contains many more tips and techniques for the efficient simulation of data.</p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/sampling-and-simulation/">Sampling and Simulation</a>, <a href="http://blogs.sas.com/content/iml/tag/tips-and-techniques/">Tips and Techniques</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YKzBBaf1GlE:8dGRcSHDmoE:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YKzBBaf1GlE:8dGRcSHDmoE:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YKzBBaf1GlE:8dGRcSHDmoE:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=YKzBBaf1GlE:8dGRcSHDmoE:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YKzBBaf1GlE:8dGRcSHDmoE:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=YKzBBaf1GlE:8dGRcSHDmoE:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YKzBBaf1GlE:8dGRcSHDmoE:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=YKzBBaf1GlE:8dGRcSHDmoE:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YKzBBaf1GlE:8dGRcSHDmoE:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=YKzBBaf1GlE:8dGRcSHDmoE:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/YKzBBaf1GlE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations/</feedburner:origLink></item>
		<item>
		<title>Timing performance improvements due to vectorization</title>
		<link>http://feedproxy.google.com/~r/TheDoLoop/~3/DkURpd5w9cE/</link>
		<comments>http://blogs.sas.com/content/iml/2013/05/22/timing-performance-vectorization/#comments</comments>
		<pubDate>Wed, 22 May 2013 09:26:11 +0000</pubDate>
		<dc:creator>Rick Wicklin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[vectorization]]></category>

		<guid isPermaLink="false">http://blogs.sas.com/content/iml/?p=8161</guid>
		<description><![CDATA[Last week I discussed a program that had three nested loops that used scalar operations in the innermost loop. I mentioned that this program was not vectorized, and would therefore be slow in a matrix language such as SAS/IML, MATLAB, or R. I then went through a series of steps [...]]]></description>
			<content:encoded><![CDATA[<p>
Last week I discussed <a href="http://blogs.sas.com/content/iml/2013/05/15/vectorize-computations/">a program that had three nested loops</a> that used scalar operations in the innermost loop.  I mentioned that this program was not vectorized, and would therefore be slow in a matrix language such as SAS/IML, MATLAB, or R. I then  went through a series of steps in which I rewrote the program to be more efficient by using basic linear algebra operations such as dot products (level-1 BLAS), matrix-vector multiplication (level-2 BLAS), and matrix-matrix multiplication (level-3 BLAS). At each step, the number of loops decreases, and the efficiency increases.
</p>
<p>
Someone asked me what kind of speed-ups can be expected by vectorizing a SAS/IML program. In general, the answer depends on the size of your data and the particular operations that you are performing, but it is straightforward to run a series of examples that compare the performance at each step of the vectorization of the previous example.
</p>
<p>
In this post, I generate a square matrix of size <em>N</em> for <em>N</em>=100, 200, 400, 600, 800, and 1000.
For each size, I time how long it takes to compute an operation that is equivalent to X`*X.
The operations are as follows:
</p>
<ul>
<li>Level-0: The original program, which consists of three nested loops and scalar operations.
</li><li>Level-1: A program that contains two nested loops and dot products of vectors.
</li><li>Level-2: A program that contains one loops and matrix-vector multiplication.
</li><li>Level-3: A program that contains one matrix-matrix multiplication.
</li></ul>

<p>You can 
<a href='http://blogs.sas.com/content/iml/files/2013/05/BLASTiming.txt'>download the program that performs the experiment</a>. 
</p><p>
On my desktop computer, the results are summarized by the following plots. The first plot shows all of the times. However, the graph is dominated by the slow times that are associated with the original program that had three nested loops and only scalar operations.  The middle plot omits the times for the original program. In this view you can see that the level-1 operation is about 25 times faster than the level-0 operations for N=1000. Again, however, the graph is dominated by the slowest method, and so the last plot shows only the timings for the level-2 and level-3 operations. The level-2 operations are about 16 times faster than the level-3 operations for N=1000. The level-3 operations are almost three times faster than level-2.  
</p>
<a href="http://blogs.sas.com/content/iml/files/2013/05/BLASTiming.png"><img src="http://blogs.sas.com/content/iml/files/2013/05/BLASTiming.png" alt="" width="400" height="533" class="aligncenter size-full wp-image-8165" /></a>

<p>In summary, the slowest method (scalar operations in three nested loops) is about 1,000 times slower than the equivalent level-3 operation, which does not require any loops.  This is good motivation: the time you invest to fully vectorize your code can pay dividends by running 1,000 times faster.
</p><p>
If you are adept at interpreting logarithmic scales, you can also see 
<a href="http://blogs.sas.com/content/iml/files/2013/05/BLASTiming2.png">the results plotted on a Log10 axis</a>.
</p><p>
No matter how you visualize the results, the conclusion is the same: if you want your program to scale to handle large data, vectorize the program to take advantage of the SAS/IML matrix operations.
</p><div class="entry-utility"><span class="tag-links">tags: <a href="http://blogs.sas.com/content/iml/tag/vectorization/">vectorization</a></span></div><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=DkURpd5w9cE:HWdV1rib008:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=DkURpd5w9cE:HWdV1rib008:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=DkURpd5w9cE:HWdV1rib008:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=DkURpd5w9cE:HWdV1rib008:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=DkURpd5w9cE:HWdV1rib008:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=DkURpd5w9cE:HWdV1rib008:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=DkURpd5w9cE:HWdV1rib008:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?i=DkURpd5w9cE:HWdV1rib008:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=DkURpd5w9cE:HWdV1rib008:l6gmwiTKsz0"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=l6gmwiTKsz0" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/TheDoLoop?a=DkURpd5w9cE:HWdV1rib008:TzevzKxY174"><img src="http://feeds.feedburner.com/~ff/TheDoLoop?d=TzevzKxY174" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/TheDoLoop/~4/DkURpd5w9cE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://blogs.sas.com/content/iml/2013/05/22/timing-performance-vectorization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://blogs.sas.com/content/iml/2013/05/22/timing-performance-vectorization/</feedburner:origLink></item>
	</channel>
</rss>
