The post Discover power laws by log-transforming data appeared first on The DO Loop.

A recent issue of *Astronomy* magazine mentioned Kepler's third law of planetary motion, which states "the square of a planet's orbital period is proportional to the cube of its average distance from the Sun" (*Astronomy*, Dec 2016, p. 17). The article included a graph (shown at the right) that shows the period and distance for several planetary bodies. The graph is plotted on a log-log scale and shows that the planetary data falls on a straight line through the origin.

I sometimes see Kepler's third law illustrated by using a graph of the cubed distances versus the squared periods. In a cubed-versus-squared graph, the planets fall on a straight line with unit slope through the origin. Since power transformations and log transformations are different, I was puzzled. How can both graphs be correct?

After a moment's thought, I realized that the magazine had done something very clever. Although it is true that a graph of the squared-periods versus the cubed-distances will CONFIRM the relationship AFTER it has been discovered, the magazine's graph gives insight into how a researcher can actually DISCOVER a power-law relationship in the first place! To discover the values of the exponents in a power-law relationship, log-transform both variables and fit a regression line.

*How to discover a power law? Log-transform the data!*


Suppose that you suspect that a measured quantity Y is related by a power law to another quantity X. That is, the quantities satisfy the relationship
Y^{m} = *A* X^{n}
for some integer values of the unknown parameters *m* and *n* and constant *A*.
If you have data for X and Y, how can you discover the values of *m* and *n*?

One way is to use linear regression on the log-transformed data. Take the logarithms of both sides and simplify to obtain
log(Y) = *C* + (*n/m*) log(X)
where *C* is a constant. You can use ordinary least squares regression to estimate values of the constant *C* and the ratio *n/m*.
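To see that least squares on the log-transformed data recovers the exponent ratio, here is a minimal numerical sketch (in Python with numpy rather than SAS; the data and the constant *A* = 2 are invented for illustration):

```python
import numpy as np

# Invent data that exactly satisfy Y^2 = A * X^3, so that n/m = 3/2
A = 2.0
x = np.linspace(1, 50, 20)
y = np.sqrt(A * x**3)            # Y = sqrt(A) * X^(3/2)

# Regress log10(Y) on log10(X); the fitted slope estimates the ratio n/m
slope, intercept = np.polyfit(np.log10(x), np.log10(y), deg=1)
print(round(slope, 4))           # slope recovers n/m = 1.5
```

With noisy measurements the slope is only an estimate of *n/m*, which is why recognizing a value such as 1.49990 as the ratio 3/2 is part of the discovery.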

For example, let's examine how a modern statistician could quickly discover Kepler's third law by using logarithms and regression. A NASA site for teachers provides the period of revolution (in years) and the mean distance from the Sun (in astronomical units) for several planetary bodies. Some of the data (Uranus, Neptune, and Pluto) were not known to Kepler. The following SAS DATA step reads the data and computes the log (base 10) of the distances and periods:

    data Kepler;
    input Name $8. Period Distance;
    logDistance = log10(Distance);
    logPeriod = log10(Period);
    label logDistance = "log(Mean Distance from Sun) (AU)"
          logPeriod   = "log(Orbital Period) (Years)";
    datalines;
    Mercury   0.241   0.387
    Venus     0.616   0.723
    Earth     1       1
    Mars      1.88    1.524
    Jupiter  11.9     5.203
    Saturn   29.5     9.539
    Uranus   84.0    19.191
    Neptune 165.0    30.071
    Pluto   248.0    39.457
    ;

The graph in *Astronomy* magazine plots distances on the vertical axis and periods horizontally, but it is equally valid to flip the axes.
It seems more natural to compute a linear regression of the period as a function of the distance, and in fact this is how Kepler expressed his third law:

The proportion between the periodic times of any two planets is precisely one and a half times the proportion of the mean distances.

Consequently, the following call to PROC REG in SAS estimates the power relationship between the distance and period:

    proc reg data=Kepler plots(only)=(FitPlot ResidualPlot);
    model logPeriod = logDistance;
    run;

The linear fit is almost perfect. The R^{2} value (not shown) is about 1, and the root mean square error is 0.0005.
The table of parameter estimates is shown. The intercept is statistically insignificant. The estimate for the ratio (*n/m*) is 1.49990, which looks suspiciously like a decimal approximation for 3/2.
Thus a simple linear regression reveals the powers used in Kepler's third law: the second power of the orbital
period is proportional to the third power of the average orbital distance.

Not only can a modern statistician easily discover the power law, but it is easy to create a scatter plot that convincingly shows the nearly perfect fit. The following call to PROC SGPLOT in SAS creates the graph, which contains the same information as the graph in *Astronomy* magazine. Notice that I used custom tick labels for the log-scaled axes:

    title "Kepler's Third Law";
    title2 "The Squared Period Is Proportional to the Cubed Distance";
    proc sgplot data=Kepler;
    scatter y=logPeriod x=logDistance / datalabel=Name datalabelpos=bottomright
                                        datalabelattrs=(size=12);
    lineparm x=0 y=0 slope=1.5 / legendlabel="log(Period) = 3/2 log(Distance)"
                                 name="line";
    yaxis grid values=(-1 0 1 2 3) valuesdisplay=("0.1" "1" "10" "100" "1000")
          offsetmax=0;
    xaxis grid values=(-1 0 1 2) valuesdisplay=("0.1" "1" "10" "100");
    keylegend "line" / location=inside opaque;
    run;

This article shows how a modern statistician can discover Kepler's third law using linear regression on log-transformed data. The regression line fits the planetary data to high accuracy, as shown by the scatter plot on a log-log scale.

It is impressive that Kepler discovered the third law without having access to these modern tools. After publishing his first two laws of planetary motion in 1609, Kepler spent more than a decade trying to find the third law. Kepler said that the third law "appeared in [his] head" in 1618.

Kepler did not have the benefit of drawing a scatter plot on a log-log scale because Descartes did not create the Cartesian coordinate system until 1637. Kepler could not perform linear regression because Galton did not describe it until the 1880s.

However, Kepler did know about logarithms, which John Napier published in 1614.
According to Kevin Brown (*Reflections on Relativity*, 2016),
"Kepler was immediately enthusiastic about logarithms" when he read Napier's work in 1616.
Although historians cannot be sure, it is plausible that Kepler used logarithms to discover his third law.
For more information about Kepler, Napier, and logarithms, read
Brown's historical commentary.

tags: Data Analysis, History

The post Append data to add markers to SAS graphs appeared first on The DO Loop.

This article discusses two ways to combine data sets in order to create ODS graphics. An alternative is to use the SG annotation facility to add extra curves or markers to the graph. Personally, I prefer to use the techniques in this article for simple features, and reserve annotation for adding highly complex and non-standard features.

In a previous article, I discussed how to structure a SAS data set so that you can overlay curves on a scatter plot.

The diagram at the right shows the main idea of that article. The X and Y variables contain the original data, which are the coordinates for a scatter plot. Secondary information was appended to the end of the data. The X1 and Y1 variables contain the coordinates of a custom scatter plot smoother. The X2 and Y2 variables contain the coordinates of a different scatter plot smoother. This structure enables you to use the SGPLOT procedure to overlay two curves on the scatter plot. You use a SCATTER statement and two SERIES statements to create the graph. See the previous article for details.

In addition to overlaying curves, I sometimes want to add special markers to the scatter plot. In this article I show how to add a marker at the location of the sample mean: use PROC MEANS to create an output data set that contains the coordinates of the sample mean, then append that data set to the original data.

*Add special markers to a graph using PROC SGPLOT #SASTip*


The following statements use PROC MEANS to compute the sample mean for four variables in the SasHelp.Iris data set, which contains the measurements for 150 iris flowers. To emphasize the general syntax of this computation, I use macro variables, but that is not necessary:

    %let DSName = Sashelp.Iris;
    %let VarNames = PetalLength PetalWidth SepalLength SepalWidth;
    proc means data=&DSName noprint;
    var &VarNames;
    output out=Means(drop=_TYPE_ _FREQ_) mean= / autoname;
    run;

The AUTONAME option on the OUTPUT statement tells PROC MEANS to append the name of the statistic to the variable names. Thus the output data set contains variables with names like PetalLength_Mean and SepalWidth_Mean. As shown in the diagram in the previous section, this enables you to append the new data to the end of the old data in "wide form" as follows:

    data Wide;
    set &DSName Means;   /* add four new variables; pad with missing values */
    run;

    ods graphics / attrpriority=color subpixel;
    proc sgplot data=Wide;
    scatter x=SepalWidth y=PetalLength / legendlabel="Data";
    ellipse x=SepalWidth y=PetalLength / type=mean;
    scatter x=SepalWidth_Mean y=PetalLength_Mean / legendlabel="Sample Mean"
            markerattrs=(symbol=X color=firebrick);
    run;

The first SCATTER statement and the ELLIPSE statement use the original data. Recall that the ELLIPSE statement draws an approximate confidence ellipse for the mean of the population. The second SCATTER statement uses the sample means, which are appended to the end of the original data. The second SCATTER statement draws a red marker at the location of the sample mean.

You can use this same method to plot other sample statistics (such as the median) or to highlight special values such as the origin of a coordinate system.

In some situations it is more convenient to append the secondary data in "long form." In the long form, the secondary data set contains the same variable names as in the original data. You can use the SAS data step to create a variable that identifies the original and supplementary observations. This technique can be useful when you want to show multiple markers (sample mean, median, mode, ...) by using the GROUP= option on one SCATTER statement.

The following call to PROC MEANS does not use the AUTONAME option. Therefore the output data set contains variables that have the same name as the input data. You can use the IN= data set option to create an ID variable that identifies the data from the computed statistics:

    /* Long form. New data has same name but different group ID */
    proc means data=&DSName noprint;
    var &VarNames;
    output out=Means(drop=_TYPE_ _FREQ_) mean=;
    run;

    data Long;
    set &DSName Means(in=newdata);
    if newdata then GroupID = "Mean";
    else            GroupID = "Data";
    run;

The DATA step created the GroupID variable, which has the value "Data" for the original observations and the value "Mean" for the appended observations. This data structure is useful for calling PROC SGSCATTER, which supports the GROUP= option but does not support multiple PLOT statements, as follows:

    ods graphics / attrpriority=none;
    proc sgscatter data=Long datacontrastcolors=(steelblue firebrick)
                   datasymbols=(Circle X);
    plot (PetalLength PetalWidth)*(SepalLength SepalWidth) / group=groupID;
    run;

In conclusion, this article demonstrates a useful technique for adding markers to a graph. The technique requires that you concatenate the original data with supplementary data. Appending and merging data is a technique that is used often when creating ODS statistical graphics in SAS. It is a great technique to add to your programming toolbox.

The post Goodness-of-fit tests: A cautionary tale for large and small samples appeared first on The DO Loop.

All measures of goodness-of-fit suffer the same serious drawback. When the sample size is small, only the most aberrant behaviors will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distributions may be very small and technically unimportant to the inferential conclusions.

In short, goodness-of-fit (GOF) tests are not very informative when the sample size is very small or very large.
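Before demonstrating this in SAS, the claim is easy to check with a quick simulation. The following sketch (in Python with scipy, purely for illustration) draws gamma(4) samples, which are mildly skewed, and counts how often the Shapiro-Wilk test rejects normality at α = 0.05. The sample sizes, seed, and number of repetitions are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)

def rejection_rate(n, reps=200, alpha=0.05):
    """Fraction of gamma(4) samples of size n that a Shapiro-Wilk test rejects."""
    pvals = [stats.shapiro(rng.gamma(shape=4, size=n)).pvalue for _ in range(reps)]
    return float(np.mean([p < alpha for p in pvals]))

small, large = rejection_rate(25), rejection_rate(2000)
print(small, large)   # the test rejects few small samples but almost all large ones
```

The same mildly nonnormal population passes the test most of the time at N=25 and fails it almost always at N=2000, which is exactly the drawback that Johnson and Wichern describe.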

I thought it would be useful to create simulated data that demonstrate the statements by Johnson and Wichern. Obviously I can't show "all measures of goodness-of-fit," so this article uses tests for normality. You can construct similar examples for other GOF tests.

*All measures of goodness-of-fit suffer the same serious drawback. #StatWisdom*


As I showed last week, the distribution of a small sample can look quite different from the population distribution. A GOF test must avoid falsely rejecting the bulk of these samples, so the test necessarily rejects "only the most aberrant behaviors."

To demonstrate how GOF tests work with small samples, let's generate four samples of size N=25 from the following populations:

- A normal N(4,2) distribution. The population mean and standard deviation are 4 and 2, respectively.
- A gamma(4) distribution. The population mean and standard deviation are 4 and 2, respectively.
- A shifted exponential(2) distribution. The population mean and standard deviation are 4 and 2, respectively.
- A lognormal(1.25, 0.5) distribution. The population mean and standard deviation are 3.96 and 2.11, respectively.

The following SAS DATA step creates the four samples. The `Distribution` variable identifies the observations in each sample.
You can use the SGPANEL procedure to visualize the sample distributions and overlay a normal density estimate, as follows:

    data Rand;
    call streaminit(1234);
    N = 25;
    do i = 1 to N;
       Distribution = "Normal     ";  x = rand("Normal", 4, 2);         output;
       Distribution = "Gamma      ";  x = rand("Gamma", 4);             output;
       Distribution = "Exponential";  x = 2 + rand("Expo", 2);          output;
       Distribution = "Lognormal  ";  x = rand("Lognormal", 1.25, 0.5); output;
    end;
    run;

    proc sgpanel data=Rand;
    panelby Distribution / rows=4 layout=rowlattice onepanel novarname;
    histogram x;
    density x / type=normal;
    run;

We know that three of the four distributions are not normal, but what will the goodness-of-fit tests say? The NORMAL option in PROC UNIVARIATE computes four tests for normality for each sample. The following statements run the tests:

    ods select TestsForNormality;
    proc univariate data=Rand normal;
    class Distribution;
    var x;
    run;

The results (not shown) are that the exponential sample is rejected by the tests for normality (at the α=0.05 level), but the other samples are not. The samples are too small for the test to rule out the possibility that the gamma and lognormal samples might actually be normal. This actually makes sense: the distributions of these samples do not appear to be very different from some of the normal samples in my previous blog post.

As Johnson and Wichern said, a large sample might appear to be normal, but it might contain small deviations that cause a goodness-of-fit test to reject the hypothesis of normality. Maybe the tails are a little too fat. Perhaps there are too many or too few outliers. Maybe the values are rounded. For large samples, a GOF test has the power to detect these small deviations and therefore reject the hypothesis of normality.
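The rounding effect is easy to reproduce outside of SAS, too. The following illustrative Python sketch uses scipy's Shapiro-Wilk test and rounds to the nearest integer (coarser than the SAS example, to make the effect unmistakable) to compare a large normal sample with a rounded copy of itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1234)
x = rng.normal(loc=4, scale=2, size=5000)   # N(4,2) sample of size 5000
r = np.round(x)                             # same values, rounded to integers

p_x = stats.shapiro(x).pvalue
p_r = stats.shapiro(r).pvalue
print(p_x, p_r)   # the rounded copy is rejected decisively
```

The two variables have nearly identical moments, yet the p-value for the rounded copy is essentially zero because the test detects the discreteness.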

I will demonstrate how rounded values can make a GOF test reject an otherwise normal sample. The following DATA step creates a random sample of size N=5000. The X variable is normally distributed; the R variable is identical to X except values are rounded to the nearest tenth.

    data RandBig;
    call streaminit(1234);
    N = 5000;
    do i = 1 to N;
       x = rand("Normal", 4, 2);
       r = round(x, 0.1);   /* same, but round to nearest 0.1 */
       output;
    end;
    run;

There is little difference between the X and R variables. The means and standard deviations are almost the same. The skewness and kurtosis are almost the same. Histograms of the variables look identical. Yet because the sample size is 5000, the GOF tests *reject* the hypothesis of normality for the R variable at the α = 0.05 significance level. The following call to PROC UNIVARIATE computes the analysis for both X and R:

    ods select Moments Histogram TestsForNormality;
    proc univariate data=RandBig normal;
    var x r;
    histogram x r / Normal;
    run;

Partial results are shown. The first table is for the variable X. The goodness-of-fit tests fail to reject the null hypothesis, so we correctly accept that the X variable is normally distributed. The second table is for the variable R. The GOF tests reject the hypothesis that R is normally distributed, merely because the values in R are rounded to the nearest 0.1 unit.

Rounded values occur frequently in practice, so you could argue that the variables R and X are not substantially different, yet normality is rejected for one variable but not for the other.

And it is not just rounding that can cause GOF tests to fail. Other small and seemingly innocuous deviations from normality would be similarly detected.

In conclusion, be aware of the cautionary words of Johnson and Wichern. For small samples, goodness-of-fit tests do not reject a sample unless it exhibits "aberrant behaviors." For very large samples, the GOF tests "invariably produce statistically significant lack of fit," regardless of whether the deviations from the target distributions are practically important.

tags: Simulation, Statistical Thinking

The post Sampling variation in small random samples appeared first on The DO Loop.

*A small random sample might look quite different than the population #StatWisdom*


In this article I recreate the panel. In the following SAS DATA step I create 20 samples, each of size N. I think the original panel showed samples of size N=15 or N=20. I've used N=15, but you can change the value of the macro variable to explore other sample sizes. If you change the random number seed to 0 and rerun the program, you will get a different panel every time. The SGPANEL procedure creates a 4 x 5 panel of the resulting histograms.

    %let N = 15;
    data Normal;
    call streaminit(93779);
    do ID = 1 to 20;
       do i = 1 to &N;
          x = rand("Normal");
          output;
       end;
    end;
    run;

    title "Random Normal Samples of Size &N";
    proc sgpanel data=Normal;
    panelby ID / rows=4 columns=5 onepanel;
    histogram x;
    run;

Each sample is drawn from the standard normal distribution, but the panel of histograms reveals a diversity of shapes. About half of the ID values display the typical histogram for normal data: a peak near x=0 and a range of [-3, 3]. However, the other ID values look less typical. The histograms for ID=1, 19, and 20 seem to have fewer negative values than you might expect. The distribution is very flat (almost uniform) for ID=3, 9, 13, and 16.

Because histograms are created by binning the data, they are not always the best way to visualize a sample distribution. You can create normal quantile-quantile (Q-Q) plots to compare the empirical quantiles for the simulated data to the theoretical quantiles for normally distributed data. The following statements use PROC RANK to create the normal Q-Q plots, as explained in a previous article about how to create Q-Q plots in SAS:

    proc rank data=Normal normal=blom out=QQ;
    by ID;
    var x;
    ranks Quantile;
    run;

    title "Q-Q Plots for Random Normal Samples of Size &N";
    proc sgpanel data=QQ noautolegend;
    panelby ID / rows=4 columns=5 onepanel;
    scatter x=Quantile y=x;
    lineparm x=0 y=0 slope=1;
    colaxis label="Normal Quantiles";
    run;

The Q-Q plots show that the sample distributions are well-modeled by a standard normal distribution, although the deviation in the lower tail is apparent for ID=1, 19, and 20. This panel shows why it is important to use Q-Q plots to investigate the distribution of samples: the bins used to create histograms can make us see shapes that are not really there. The SAS documentation includes a section on how to interpret Q-Q plots.
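For readers who want the Q-Q computation outside of SAS: the NORMAL=BLOM option assigns the observation with rank *r* the normal score Φ^{-1}((r - 3/8) / (n + 1/4)). Here is a minimal Python sketch of the same computation (the function name is my own invention):

```python
import numpy as np
from scipy import stats

def blom_scores(x):
    """Normal scores via Blom's formula, as PROC RANK with NORMAL=BLOM computes."""
    x = np.asarray(x)
    n = len(x)
    ranks = stats.rankdata(x)                     # ranks 1..n (ties averaged)
    return stats.norm.ppf((ranks - 3/8) / (n + 1/4))

rng = np.random.default_rng(93779)
sample = rng.normal(size=15)
q = blom_scores(sample)
# A Q-Q plot graphs 'sample' against 'q'; normal data fall near the line y = x
```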

In conclusion, small random normal samples can display a diversity of shapes. Statisticians understand this sampling variation and routinely caution that "the standard errors are large" for statistics that are computed on a small sample. However, viewing a panel of histograms makes the concept of sampling variation more real and less abstract.

tags: Simulation, Statistical Thinking

The post Highlight forecast regions in graphs appeared first on The DO Loop.

- Use the ATTRPRIORITY=NONE option on the ODS GRAPHICS statement to make sure that the current ODS style will change line patterns, and use the STYLEATTRS statement to set the line patterns for the plot.
- Add an indicator variable to the data set that indicates which times are in the "past" (the data region) and which times are in the "future" (the forecast region).
- Use the BLOCK statement in PROC SGPLOT to add a background color that differentiates the past and future regions.
- Use the SERIES statement to plot the model forecast and use the GROUP= option to visually differentiate the past predictions (solid line) from future predictions (dashed line).

A simple "toy" example is the best way to show the essential features of the desired graph. The following DATA step creates a curve in two regions. For x ≤ 6.28, the curve is a sine curve. For x > 6.28, the curve is linear. These two domains correspond to the "Historical" and "Forecast" levels of the indicator variable `BlockID`. The graph is shown to the left.

    data Example;
    pi = constant('pi');
    BlockID = "Historical";
    do x = 0 to 6.28 by 0.01;
       y = sin(x);   output;
    end;
    BlockID = "Forecast";
    do x = 6.28 to 8 by 0.01;
       y = x - 2*pi;   output;
    end;
    run;

    ods graphics / attrpriority=none;
    title "Background Colors and Line Styles for Forecast";
    proc sgplot data=Example noautolegend;
    styleattrs DATACOLORS=(verylightgrey verylightred)  /* region */
               DATALINEPATTERNS=(solid dash)            /* line patterns */;
    block x=x block=BlockID / transparency=0.75;
    series x=x y=y / group=BlockID lineattrs=(color=black);
    run;

The graph emphasizes the forecast region by using color and a line pattern. The ATTRPRIORITY=NONE option ensures that the line patterns alternate between groups. For details, see Sanjay Matange's article about the interactions between the ATTRPRIORITY= option and the STYLEATTRS statement. For rapidly oscillating models, you might want to use the DOT line pattern instead of the DASH line pattern.

I've previously written about how to use the BLOCK statement to emphasize different regions in the domain of a graph.

Of course, this example is very simplistic. The next section shows how you can apply the ideas to a more realistic example.

Many SAS procedures create suitable graphs when you enable ODS GRAPHICS. In particular, many SAS/ETS procedures (such as PROC ARIMA) can create graphs that look similar to this example. The following classic example is taken from the PROC ARIMA documentation. The data are the log-transformed number of passengers who flew on commercial airlines in the US between 1949 and 1961. Based on these data, the ARIMA model forecasts an additional 24 months of passenger traffic.

    data seriesg;
    input x @@;
    xlog = log( x );
    date = intnx( 'month', '31dec1948'd, _n_ );
    format date monyy.;
    label xlog="log(passengers)";
    datalines;
    112 118 132 129 121 135 148 148 136 119 104 118
    115 126 141 135 125 149 170 170 158 133 114 140
    145 150 178 163 172 178 199 199 184 162 146 166
    171 180 193 181 183 218 230 242 209 191 172 194
    196 196 236 235 229 243 264 272 237 211 180 201
    204 188 235 227 234 264 302 293 259 229 203 229
    242 233 267 269 270 315 364 347 312 274 237 278
    284 277 317 313 318 374 413 405 355 306 271 306
    315 301 356 348 355 422 465 467 404 347 305 336
    340 318 362 348 363 435 491 505 404 359 310 337
    360 342 406 396 420 472 548 559 463 407 362 405
    417 391 419 461 472 535 622 606 508 461 390 432
    ;

    proc arima data=seriesg plots(only)=forecast(forecasts);
    identify var=xlog(1,12);
    estimate q=(1)(12) noint method=ml;
    forecast id=date interval=month out=forearima;
    run;

You can see that the ODS graph uses a dashed line to separate the historical (data) region from the forecast region. However, the graph uses a solid line to display all predicted values, even the forecast.

In the previous PROC ARIMA call, I used the OUT= option on the FORECAST statement to create a SAS data set that contains the predicted values and confidence region. The following DATA step adds an indicator variable to the data:

    data forecast;
    set forearima;
    if date <= '01JAN1961'd then BlockID = "Historical";
    else                         BlockID = "Forecast";
    run;

You can now create the modified graph by using the STYLEATTRS, BLOCK, and SERIES statements. In addition, a BAND statement adds the confidence limits for the predicted values. A SCATTER statement adds the data values. The XAXIS and YAXIS statements overlay a grid on the graph.

    ods graphics / attrpriority=none;
    title "ARIMA Model and Forecast";
    proc sgplot data=forecast noautolegend;
    styleattrs DATACOLORS=(verylightgrey verylightred)  /* region */
               DATALINEPATTERNS=(solid dot)             /* line patterns */;
    block x=date block=BlockID / transparency=0.75;
    band x=date lower=L95 upper=U95;
    scatter x=date y=xlog;
    series x=date y=Forecast / group=BlockID lineattrs=(color=black);
    xaxis grid display=(nolabel);
    yaxis grid;
    run;

The final graph is a customized version of the default graph that is created by using PROC ARIMA. The presentation highlights the forecast region by using a different background color and a different line style.

If you are content with this one-time modification, then you are done. If you want to create this graph every time that you run PROC ARIMA, read the SAS/STAT chapter "ODS Graphics Template Modification" or read Warren Kuhfeld's paper about how to modify the underlying template to customize the graph every time that the procedure runs.

tags: Statistical Graphics

The post Need to log-transform a distribution? There's a SAS function for that! appeared first on The DO Loop.

The presenter computed the quantity in SAS by using an expression that looked like

`y = LOG(PDF("distribution", x, params));`

In other words, he computed the PDF and then transformed the density by applying the LOG function.

There is a better way. It is more efficient and more accurate to use the LOGPDF function to compute the log-PDF directly.

*Special functions for log-transformed distributions #SASTip*


SAS provides several functions for computing with log-transformed distributions. In addition to the LOGPDF function, you can use the LOGCDF function to compute the logarithm of the cumulative distribution function.

Let's use the gamma distribution to see why the LOGPDF function is usually more efficient.
The gamma density function with unit scale parameter is given by

f(x; a) = x^{a-1} exp(-x) / Γ(a)

where Γ(a) is the value of the gamma function evaluated at a.

The log-transform of the gamma density is

log(f(x; a)) = (a-1) log(x) - x - log(Γ(a))

This formula, which is used when you compute LOGPDF("gamma", x, 2), is cheap to evaluate and does not suffer from numerical underflow when x is large. In particular, recall that
in double-precision arithmetic, exp(-x) underflows if x > 709.78, so the PDF function cannot support extremely large values of x. In contrast, the formula for the log-PDF does not experience any numerical problems when x is large. The log-PDF function is also very fast to compute.
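The same comparison is easy to make in other environments. For example, here is an illustrative Python sketch in which scipy's `gamma.logpdf` plays the role of the LOGPDF function:

```python
import numpy as np
from scipy.stats import gamma

a, x = 2, 1000                        # shape a = 2; x is far in the right tail

# The PDF underflows to zero for large x, so its log is -infinity
with np.errstate(divide='ignore'):
    log_of_pdf = np.log(gamma.pdf(x, a))

# The direct log-PDF, (a-1)*log(x) - x - log(Gamma(a)), does not underflow
log_pdf = gamma.logpdf(x, a)
print(log_of_pdf, log_pdf)            # -inf versus about -993.09
```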

The following DATA step illustrates how to use the LOGPDF function to compute the log-gamma density. It computes the log-gamma PDF two ways: by using the LOGPDF function and by using the log-transform of the gamma PDF. The results show that the numerical values are nearly the same, but that the PDF function fails for large values of x:

    data LogPDF(drop=a);
    a = 2;
    do x = 100, 700, 720, 1000;
       LogOfPDF = log( pdf("Gamma", x, a) );
       LOGPDF = logpdf("Gamma", x, a);   /* faster and more accurate */
       diff = LOGPDF - LogOfPDF;
       output;
    end;
    label LogOfPDF = "LOG(PDF(x))";
    run;

    proc print noobs label; run;

By the way, SAS provides other functions that are optimized for computing the log-transformation of several well-known functions and quantities:

- The LOGBETA function computes the logarithm of the beta function. You should use `logbeta(a,b)` instead of `log(beta(a,b))`.
- The LGAMMA function computes the logarithm of the gamma function. You should use `lgamma(a)` instead of `log(gamma(a))`.
- The LCOMB function computes the logarithm of the number of combinations of *n* objects taken *k* at a time. You should use `lcomb(n,k)` instead of `log(comb(n,k))`.
- The LFACT function computes the quantity log(*n*!) for specified values of *n*. You should use `lfact(n)` instead of `log(fact(n))`.

In general, when software provides a function for directly computing the logarithm of a quantity, you should use it. The special-purpose function is typically faster, more accurate, and will handle arguments that are beyond the domain of the non-specialized function.

tags: Numerical Analysis

The post Visualize the ages of US presidents appeared first on The DO Loop.

*Ages of US presidents. Who was oldest? Youngest? #DataViz*


- You don't have to recode the dates in the table. You can use the ANYDTDTE informat to read dates such as "Feb 22, 1732" and "Mar 4, 1797."
- You don't have to read the ages into SAS. If you have the president's birth date, SAS can use the YRDIF function to compute the president's age at any subsequent date.

- You might have heard the claim that Donald Trump will be the oldest person to be inaugurated. The graph shows that Trump will be older than Ronald Reagan was at his 1981 inauguration. However, the graph also shows that Ronald Reagan served two terms, so Reagan was older than Trump at his second inauguration. Reagan was almost 78 when he left office, which makes him the oldest person to serve as president!
- Who was the youngest president? Many people might guess John F. Kennedy. True, he was the youngest to be *elected* president, but the graph shows that Teddy Roosevelt was about nine months younger when he assumed the office after McKinley's assassination.
- In contrast, can you find the biggest age gap between an outgoing president and his successor? Hint: It was a 26-year difference in age!

The post One informat to rule them all: Read any date into SAS appeared first on The DO Loop.

The ANYDTDTE*w*. informat is a flexible alternative to older informats such as DATE*w*., MMDDYY*w*., and YYMMDD*w*. If your dates are in a specific form, the older informats work great and serve to document that all dates must be in that standard form. If the dates are not standardized or you need to read a string like "July 4, 1776", the ANYDTDTE informat is a godsend.

The following SAS DATA step shows that the ANYDTDTE*w*. informat combines several older informats into a "super informat" that attempts to convert a character string into a date. The ANYDTDTE informat can not only replace many of the older informats, but it can be used to convert a string like "Jul 4, 1776" into a date, as follows:

    data Dates;
    input @1 Style $8. @9 Value anydtdte12.;
    format Value DATE10.;
    datalines;
    DATE    04JUL1776
    MMDDYY  07041776
    MMDDYY  07/04/1776
    YYMMDD  17760704
    N/A     Jul 4, 1776
    N/A     July 4, 1776
    ;
    proc print noobs; run;

As you can see, the ANYDTDTE informat reads six different strings, but converts all of them to the SAS date value that corresponds to 04JUL1776.
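Most other languages have no single equivalent, but the underlying idea (try a list of candidate patterns until one succeeds) is easy to sketch. Here is an illustrative Python version; the list of format patterns is my own invention and is far smaller than what ANYDTDTE actually recognizes:

```python
from datetime import datetime

# Candidate patterns, tried in order; a small, invented subset of what ANYDTDTE handles
FORMATS = ["%d%b%Y", "%m%d%Y", "%m/%d/%Y", "%Y%m%d", "%b %d, %Y", "%B %d, %Y"]

def any_date(s):
    """Return the date for the first candidate format that parses the string."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s.strip(), fmt).date()
        except ValueError:
            pass
    raise ValueError(f"Cannot interpret {s!r} as a date")

strings = ["04JUL1776", "07041776", "07/04/1776", "17760704",
           "Jul 4, 1776", "July 4, 1776"]
print(sorted({str(any_date(s)) for s in strings}))   # ['1776-07-04']
```

Like ANYDTDTE, a scheme such as this must decide what an ambiguous string like "07/04/1776" means, which is the topic of the next section.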

The string 07/04/1776 can be interpreted as "April 7, 1776" or "July 4, 1776," depending upon the local convention. Europeans tend to interpret the string as DD/MM/YYYY whereas the US convention is to use MM/DD/YYYY. How does the ANYDTDTE*w*. informat guess which interpretation might be correct?

The answer is that the informat looks at the DATESTYLE SAS option. By default, the DATESTYLE option uses the LOCALE system option to guess which style to use. You can use PROC OPTIONS to see the value of these options, which are printed to the SAS log:

proc options option=(DATESTYLE LOCALE) value;
run;

Option Value Information For SAS Option DATESTYLE
    Value: MDY
    Scope: Default
    How option value set: Locale

Option Value Information For SAS Option LOCALE
    Value: EN_US
    ...

For my system, the DATESTYLE option is set to MDY, which means that the string "07/04/1776" will be interpreted as MM/DD/YYYY. If you need to read dates that obey a different convention, you can use the global OPTIONS statement to set the DATESTYLE option:

options DATESTYLE=DMY;    /* change default style convention */
/* Restore default convention:
   options DATESTYLE=Locale;
*/
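The same MDY-versus-DMY ambiguity arises in any date-parsing tool, not just SAS. As a point of comparison (a hedged sketch in Python's standard library, not SAS code), choosing between two explicit format strings plays the same role as setting DATESTYLE:

```python
from datetime import datetime

s = "07/04/1776"
mdy = datetime.strptime(s, "%m/%d/%Y")  # US convention (like DATESTYLE=MDY): July 4, 1776
dmy = datetime.strptime(s, "%d/%m/%Y")  # European convention (like DATESTYLE=DMY): April 7, 1776

print(mdy.month, mdy.day)   # 7 4
print(dmy.month, dmy.day)   # 4 7
```

The string itself cannot tell you which reading is correct; some convention (a system option in SAS, a format string here) must break the tie.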

There are two other SAS informats that are similar to the ANYDTDTE informat:

- The ANYDTDTM*w*. informat creates a SAS datetime value from various input styles.
- The ANYDTTME*w*. informat creates a SAS time value from various input styles.

Here's a tip to help you remember these seemingly cryptic names. The first part of the name is "ANYDT", which means that the input string can be ANY datetime (DT) value. The end of the name refers to the numerical value that is produced by the informat. The resulting value can be a date (DTE), a datetime (DTM), or a time (TME) value. Thus the three informats all have the mnemonic form ANYDTXXX where the XXX suffix refers to the value that is produced.


The post Visualize a torus in SAS appeared first on The DO Loop.

*Projections of a 3-D torus: Four visualization techniques for high-dimensional data. #DataViz*

A torus is the product of two circles, so it can be parameterized by two angles, usually called θ and φ. You can create a regular grid of points in the square (θ, φ) ∈ [0, 2π] x [0, 2π] and map the points into Euclidean three-space as shown in the following SAS DATA step:

data Torus;
R = 8;               /* radius to center of tube */
A = 3;               /* radius of tube */
pi = constant('pi');
step = 2*pi/50;
/* create torus as parametric image of [0, 2*pi] x [0, 2*pi] */
do theta = 0 to 2*pi-step by step;
   do phi = 0 to 2*pi-step by step;
      x = (R + A*cos(phi)) * cos(theta);
      y = (R + A*cos(phi)) * sin(theta);
      z = A*sin(phi);
      output;
   end;
end;
keep x y z;
run;

title "Projections of a Standard Torus";
proc sgscatter data=Torus;
   matrix x y z;
run;

The SGSCATTER procedure displays the projection of the torus onto the three coordinate planes. In the (x,y) plane you can see that the object has a hole, but the projections onto the other coordinate planes are not very insightful because there is a lot of overplotting.
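If you want to verify the parameterization numerically outside of SAS, here is a sketch in NumPy (the variable names mirror the DATA step). Every point it produces satisfies the implicit equation of the torus, (√(x² + y²) − R)² + z² = A²:

```python
import numpy as np

R, A = 8.0, 3.0                          # same radii as in the DATA step
angles = np.linspace(0, 2*np.pi, 50, endpoint=False)
theta, phi = np.meshgrid(angles, angles) # 50 x 50 grid of (theta, phi) pairs
x = (R + A*np.cos(phi)) * np.cos(theta)
y = (R + A*np.cos(phi)) * np.sin(theta)
z = A*np.sin(phi)

# every point lies at distance A from the tube's center circle
resid = (np.sqrt(x**2 + y**2) - R)**2 + z**2 - A**2
print(np.max(np.abs(resid)) < 1e-9)      # True
```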

One way to avoid overplotting is to visualize the torus as a 3-D cloud of points. The SAS/IML Studio environment supports a rotating 3-D scatter plot, as shown to the left. In SAS/IML Studio you can interactively rotate the 3-D cloud of points, change the marker colors, and perform other techniques in exploratory data analysis.

Alternatively, if you want to look at planar projections with PROC SGSCATTER, you can rotate the torus so that the projections onto the coordinate planes are not degenerate.

My previous article defined SAS/IML functions that create rotation matrices. The following SAS/IML program is almost identical to the program in the previous article, so I will not explain each statement. The program reads the Torus data set, rotates the points in a particular way, and writes the rotated coordinates to a SAS data set.

proc iml;
/* choose rotation matrix as product of canonical rotations */
load module=Rot3D;   /* see http://blogs.sas.com/content/iml/2016/11/07/rotations-3d-data.html */
pi = constant('pi');
Rz = Rot3D(-pi/6, "Z");   /* rotation matrix for (x,y) plane */
Rx = Rot3D(-pi/3, "X");   /* rotation matrix for (y,z) plane */
Ry = Rot3D( pi/6, "Y");   /* rotation matrix for (x,z) plane */
P = Rx*Ry*Rz;             /* cumulative rotation */

use Torus;                /* read data (points on a torus) */
read all var {x y z} into M;
close Torus;

Rot = M * P`;             /* apply rotation and write to data set */
create RotTorus from Rot[colname={"Px" "Py" "Pz"}];
append from Rot;
close;
QUIT;

You can use PROC SGSCATTER to project these rotated points onto the coordinate planes. The TRANSPARENCY= option creates semi-transparent markers that give the illusion of a ghostly see-through torus. This can be an effective technique for visualizing any point cloud, since it enables you to see regions in which overplotting occurs. The statements also use the COLORRESPONSE= option to color the markers by the (original) Z variable. The COLORRESPONSE= option requires SAS 9.4m3.

data Coords;
merge Torus RotTorus;
run;

title "Projections of a Rotated Torus";
proc sgscatter data=Coords;
matrix Px Py Pz / markerattrs=(size=12 symbol=CircleFilled)
                  transparency=0.75
                  colorresponse=Z colormodel=ThreeColorRamp;  /* COLORRESPONSE= requires 9.4m3 */
run;

The transparent markers serve a second function. The torus is composed of a sequence of circles. Therefore, the last circles (near θ = 2π) will obscure the first circles (near θ = 0) if opaque markers are used. The parametric construction is very apparent if you remove the TRANSPARENCY= option.

If you want to plot without transparency, you should sort the data, which is a standard technique in the graphics community where it is called z-ordering or depth sorting. For example, in the (Px,Py) plane you can sort by the Pz variable so that negative Pz values are plotted first and positive Pz values are plotted on top of the earlier markers. Unfortunately, you can only use this sorting trick to plot one pair of coordinates at a time. The following code demonstrates this trick for the (Px, Py) plane:

proc sort data=Coords out=SortCoords;
by Pz;
run;

title "Projection of a Rotated Torus";
title2 "Sorted by Distance Above Coordinate Plane";
proc sgplot data=SortCoords;
scatter x=Px y=Py / markerattrs=(size=12 symbol=CircleFilled)
                    colorresponse=Z colormodel=ThreeColorRamp;
run;
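Depth sorting is not specific to SAS; the idea is simply to order the markers from back to front before drawing them. A minimal sketch in NumPy (the array columns play the role of Px, Py, Pz):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))   # columns play the role of Px, Py, Pz

order = np.argsort(pts[:, 2])     # back-to-front by the depth coordinate Pz
sorted_pts = pts[order]           # draw rows in this order: far points first

# after sorting, the depth values are nondecreasing
print(bool(np.all(np.diff(sorted_pts[:, 2]) >= 0)))   # True
```

Markers drawn later land on top of markers drawn earlier, so plotting the rows of `sorted_pts` in order puts near points on top of far points, exactly as PROC SORT followed by PROC SGPLOT does above.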

In conclusion, PROC SGSCATTER enables you to visualize high-dimensional data via projections onto coordinate planes. I demonstrated the techniques on a 3-D torus, but the techniques apply generally to any high-dimensional data. Visualization tricks in this article include:

- If the projections of the data onto coordinate planes are degenerate, rotate the points.
- Use semi-transparency to reduce the effect of overplotting.
- Use the COLORRESPONSE= option to color the markers by one of the variables. This requires SAS 9.4m3.
- If you do not want to use transparency, sort the data in a direction perpendicular to the projected plane. That makes the rendering look more realistic.

Your high-dimensional point clouds might not look as cool as this torus, but if you use these visualization tips, the data will be easier to interpret and understand. The SGSCATTER procedure in SAS is an easy-to-use routine for investigating relationships among several variables.



The post Rotation matrices and 3-D data appeared first on The DO Loop.

In three dimensions there are three canonical rotation matrices:

- The matrix R_{x}(α) rotates points counterclockwise by the angle α about the X axis. Equivalently, the rotation occurs in the (y, z) plane.
- The matrix R_{y}(α) rotates points counterclockwise by the angle α in the (x, z) plane.
- The matrix R_{z}(α) rotates points counterclockwise by the angle α in the (x, y) plane.

Each of the following SAS/IML functions returns a rotation matrix. The RotPlane function takes an angle and a pair of integers. It returns the rotation matrix that corresponds to a counterclockwise rotation in the (x_{i}, x_{j}) plane. The Rot3D function has a simpler calling syntax: you specify an axis (X, Y, or Z) to get a rotation matrix in the plane that is perpendicular to the specified axis:

proc iml;
/* Rotate a vector by a counterclockwise angle in a coordinate plane.
      [ 1      0       0    ]
 Rx = [ 0    cos(a) -sin(a) ]  ==> Rotate in the (y,z)-plane
      [ 0    sin(a)  cos(a) ]

      [ cos(a)  0   -sin(a) ]
 Ry = [ 0       1    0      ]  ==> Rotate in the (x,z)-plane
      [ sin(a)  0    cos(a) ]

      [ cos(a) -sin(a)  0   ]
 Rz = [ sin(a)  cos(a)  0   ]  ==> Rotate in the (x,y)-plane
      [ 0       0       1   ]
*/
start RotPlane(a, i, j);
   R = I(3);
   c = cos(a);  s = sin(a);
   R[i,i] = c;  R[i,j] = -s;
   R[j,i] = s;  R[j,j] = c;
   return R;
finish;

start Rot3D(a, axis);   /* rotation in plane perpendicular to axis */
   if upcase(axis)="X" then
      return RotPlane(a, 2, 3);
   else if upcase(axis)="Y" then
      return RotPlane(a, 1, 3);
   else if upcase(axis)="Z" then
      return RotPlane(a, 1, 2);
   else return I(3);
finish;

store module=(RotPlane Rot3D);
quit;

NOTE: Some sources define rotation matrices by leaving the object still and rotating a camera (or observer). This is mathematically equivalent to rotating the object in the opposite direction, so if you prefer a camera-based rotation matrix, use the definitions above but specify the angle -α. Note also that some authors change the sign for the R_{y} matrix; the sign depends on whether you are rotating about the positive or negative Y axis.
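If you want to check these definitions outside of SAS, here is a sketch of the same RotPlane/Rot3D construction transcribed into NumPy (0-based indices instead of SAS/IML's 1-based indices). Any matrix it produces should be orthogonal with determinant 1, which is the defining property of a rotation:

```python
import numpy as np

def rot_plane(a, i, j):
    """Counterclockwise rotation by angle a in the (x_i, x_j) plane (0-based indices)."""
    R = np.eye(3)
    c, s = np.cos(a), np.sin(a)
    R[i, i] = c; R[i, j] = -s
    R[j, i] = s; R[j, j] = c
    return R

def rot3d(a, axis):
    """Rotation in the plane perpendicular to the given axis."""
    return {"X": rot_plane(a, 1, 2),
            "Y": rot_plane(a, 0, 2),
            "Z": rot_plane(a, 0, 1)}[axis.upper()]

Rz = rot3d(np.pi/6, "Z")
# rotation matrices are orthogonal (R R^T = I) and have determinant 1
print(np.allclose(Rz @ Rz.T, np.eye(3)), np.isclose(np.linalg.det(Rz), 1.0))
```

As a sanity check on the counterclockwise convention, a rotation by π/2 in the (x, y) plane sends the unit vector (1, 0, 0) to (0, 1, 0).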

*3-D rotation matrices are simple to construct and use in a matrix language #SASTip*

Every rotation is a composition of rotations in coordinate planes. You can compute a composition by using matrix multiplication. Let's see how rotations work by defining and rotating some 3-D data. The following SAS DATA step defines a point at the origin and 10 points along a unit vector in each coordinate direction:

data MyData;               /* define points on coordinate axes */
x = 0; y = 0; z = 0;
Axis = "O"; output;        /* origin */
Axis = "X";
do x = 0.1 to 1 by 0.1;    /* points along unit vector in x direction */
   output;
end;
x = 0; Axis = "Y";
do y = 0.1 to 1 by 0.1;    /* points along unit vector in y direction */
   output;
end;
y = 0; Axis = "Z";
do z = 0.1 to 1 by 0.1;    /* points along unit vector in z direction */
   output;
end;
run;

proc sgscatter data=MyData;
matrix X Y Z;
run;

If you use PROC SGSCATTER to visualize the data, the results (not shown) are not very enlightening. Because the data are aligned with the coordinate directions, the projection of the 3-D data onto the coordinate planes always projects 10 points onto the origin. The projected data does not look very three-dimensional.

However, you can slightly rotate the data to obtain nondegenerate projections onto the coordinate planes. The following computations form a matrix P which represents a rotation of the data by -π/6 in one coordinate plane followed by a rotation by -π/3 in another coordinate plane:

proc iml;
/* choose any 3D projection matrix as product of rotations */
load module=Rot3D;
pi = constant('pi');
Rz = Rot3D(-pi/6, "Z");   /* rotation matrix for (x,y) plane */
Rx = Rot3D(-pi/3, "X");   /* rotation matrix for (y,z) plane */
Ry = Rot3D( 0,    "Y");   /* rotation matrix for (x,z) plane */
P = Rx*Ry*Rz;             /* cumulative rotation */
print P;

For a column vector, v, the rotated vector is P*v. However, the data in the SAS data set is in row vectors, so use the transposed matrix to rotate all observations with a single multiplication, as follows:

use MyData;
read all var {x y z} into M;
read all var "Axis";
close;

RDat = M * P`;   /* rotated data */

Yes, that's it. That one line rotates the entire set of 3-D data. You can confirm the rotation by plotting the projection of the data onto the first two coordinates:

title "Rotation and Projection of Data";
Px = RDat[,1];  Py = RDat[,2];
call scatter(Px, Py) group=Axis
     option="markerattrs=(size=12 symbol=CircleFilled)";
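The row-vector trick is easy to verify numerically. In NumPy notation (a sketch, not SAS), rotating all rows of M with the single product M P` gives exactly the same result as rotating each point v separately by P v:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # any orthogonal matrix stands in for P
M = rng.normal(size=(10, 3))                  # 10 observations, one point per row

RDat = M @ Q.T                                # rotate all rows with one multiplication
row_by_row = np.array([Q @ v for v in M])     # rotate each column vector separately

print(np.allclose(RDat, row_by_row))          # True
```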

Alternatively, you can write the rotated data to a SAS data set. You can add reference axes to the plot if you write the columns of the P` matrix to the same SAS data set. The columns are the rotated unit vectors in the coordinate directions, so using the VECTOR statement to plot those coordinates adds reference axes:

create RotData from RDat[colname={"Px" "Py" "Pz"}];
append from RDat;
close;

A = P`;   /* rotation of X, Y, and Z unit vectors */
create Axes from A[colname={"Ax" "Ay" "Az"}];
append from A;
close;

Labels = "X":"Z";
create AxisLabels var "Labels";
append;
close;
QUIT;

data All;   /* merge all data sets */
merge MyData RotData Axes AxisLabels;
run;

proc sgplot data=All;
vector x=Ax y=Ay / lineattrs=(thickness=3) datalabel=Labels;
scatter x=Px y=Py / group=Axis markerattrs=(size=12 symbol=CircleFilled);
xaxis offsetmax=0.1;
yaxis offsetmax=0.1;
run;

All the data points are visible in this projection of the (rotated) data onto a plane. The use of the VECTOR statement to add coordinate axes is not necessary, but I think it's a nice touch.

This article is about rotation matrices, and I showed how to use matrices to rotate a 3-D cloud of observations. However, I don't want to give the impression that you have to use matrix operations to plot 3-D data! SAS has several "automatic" 3-D visualization methods that are more convenient and do not require that you program rotation matrices. The visualization methods include

- The G3D procedure in SAS/GRAPH software creates 3-d scatter plots.
- The SAS/IML Studio application, which is part of SAS/IML software, supports a point-and-click interface to create rotating scatter plots, and you can also use the RotatingPlot class to create a 3-d scatter plot programmatically.

I also want to mention that Sanjay Matange created a 3-D scatter plot macro that uses ODS graphics to visualize a 3-D point cloud. Sanjay also uses rotation matrices, but because he uses the DATA step and PROC FCMP, his implementation is longer and less intuitive than the equivalent operations in the SAS/IML matrix language. In his blog he says that his macro "is provided for illustration purposes."

In summary, the SAS/IML language makes it convenient to define and use rotation matrices. An application is rotating a 3-D cloud of points. In my next blog post I will present a more interesting visualization example.

