The post Ahh, that's smooth! Anti-aliasing in SAS statistical graphics appeared first on The DO Loop.

Recall that *anti-aliasing* is a graphical technique in which some pixels along a rendered curve are set to an intermediate color, which makes a curve look smoother. For example, if a curve is being drawn by using a black pen, some of the neighboring pixels along the rendered curve are set to shades of grey, which tricks the eye into seeing a smooth curve instead of a jagged, pixellated curve.
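To see why intermediate colors help, consider a toy renderer. The following Python sketch (my own illustration, not how ODS graphics actually renders) sets each pixel's ink level according to the pixel center's distance from the ideal line; rounding the ink levels to 0 or 1 produces the jagged, aliased version:

```python
import numpy as np

def rasterize_line(m, b, width, height, antialias=True):
    """Render the line y = m*x + b on a small raster (1 = full ink, 0 = background)."""
    ys, xs = np.mgrid[0:height, 0:width]             # pixel-center coordinates
    dist = np.abs(m*xs - ys + b) / np.hypot(m, 1.0)  # perpendicular distance to the line
    ink = np.clip(1.0 - dist, 0.0, 1.0)              # intermediate grays near the line
    return ink if antialias else np.round(ink)       # rounding yields the jagged version
```

Pixels that straddle a sloped line get fractional ink values, which is exactly the "shades of grey" effect described above.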

The ANTIALIAS= and ANTIALIASMAX= options were added to the ODS GRAPHICS statement in SAS 9.2. A typical usage follows:

ods graphics / antialias=on antialiasmax=5000;

The ANTIALIAS= option specifies whether to anti-alias. By default, anti-aliasing is on. Because it can be expensive to anti-alias many thousands of graphical elements, the ANTIALIASMAX= option enables you to specify the maximum number of elements (markers or curve points) that can be in a plot before anti-aliasing is disabled for that plot. The default value is ANTIALIASMAX=4000 for SAS 9.4m3. However, the default is only 600 for earlier releases, so you might want to bump up that value when you need a presentation-quality graphic that has thousands of graphical elements. If SAS disables anti-aliasing for a plot because the plot contains too many elements, the SAS log will contain a note similar to the following:

NOTE: Marker and line anti-aliasing has been disabled because the threshold has been reached. You can set ANTIALIASMAX=1000 in the ODS GRAPHICS statement to restore anti-aliasing.

A related option is turning on subpixel rendering by using the SUBPIXEL option. The SUBPIXEL option was added to the ODS GRAPHICS statement in SAS 9.4m3, but it has been available on the PROC SGPLOT statement for several 9.4 releases.

The SAS documentation for the ODS GRAPHICS statement says that the SUBPIXEL option "produces smoother curves and more precise bar spacing." There is a section in the documentation titled "Subpixel Rendering," which demonstrates the impact that subpixel rendering can have on curves and bar charts.

The documentation says that subpixel rendering "is enabled by default for image output, unless the graph contains a scatter plot or a scatter-plot matrix. In those cases, subpixel rendering is disabled by default."

For me, subpixel rendering solves a problem that I've experienced when I create a large bar chart with many categories. The number of bars, the width of the bars, and the dimensions of the graph determine whether the number of pixels between bars is uniform or whether some gaps are larger than others. Sometimes you will see small uneven gaps between the bars, as shown on the left side of the following plot. However, subpixel rendering improves the plot tremendously, as shown on the right side:

In SAS 9.4m3 and beyond, the SUBPIXEL option applies to all plot types. Prior to SAS 9.4m3, the option applied only to line charts and bar charts; see the documentation of the PROC SGPLOT statement for the specific plots that were supported.

I think the best way to learn about anti-aliasing and subpixel rendering is to try them yourself! These ODS options apply to all ODS statistical graphics, including those that are created by SAS analytical procedures. Remember, however, that prior to SAS 9.4m3 the SUBPIXEL option was available only in the PROC SGPLOT statement.

The following SAS statements enable you to play with the options and see the differences for a simple loess curve overlaid on a scatter plot:

ods graphics / reset ANTIALIAS=off;     /* anti-aliasing off */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;
run;

ods graphics / ANTIALIAS=on ANTIALIASMAX=10000 SUBPIXEL=off;  /* anti-aliasing on */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;
run;

ods graphics / ANTIALIAS=on ANTIALIASMAX=10000 SUBPIXEL=on;   /* SAS 9.4m3 */
proc sgplot data=Sashelp.ENSO;
   loess y=Pressure x=Month / smooth=0.3 degree=2;
run;

The following resources provide further information about anti-aliasing and subpixel rendering in ODS graphics:

- Dan Heath (2010) wrote a paper about creating presentation-quality ODS graphics that discusses anti-aliasing and other tips.
- The documentation for the Graph Template Language has a section "Using Anti-Aliasing" that contains additional plots that demonstrate anti-aliasing.
- The documentation for the ODS SG procedures contains a section "Subpixel Rendering."

tags: Statistical Graphics


The post Loess regression in SAS/IML appeared first on The DO Loop.

Although PROC LOESS satisfies 99.99% of SAS users who want to fit a loess model, some research statisticians might want to extend or modify the standard loess algorithm. Researchers like to ask "what if" questions like "what if I used a different weighting function?" or "what if I change the points at which the loess model is evaluated?" Although the loess algorithm is complicated, it is not hard to implement a basic version in a matrix language like SAS/IML.


Recent blog posts have provided some computational modules that you can use to implement loess regression. For example, the PairwiseNearestNbr module finds the *k* nearest neighbors to a set of evaluation points.
The functions for weighted polynomial regression compute the loess fit at a particular point.

You can
download a SAS/IML program that defines the nearest-neighbor and weighted-regression modules.
The following call to PROC IML loads the modules and defines a function that fits a loess curve at
points in a vector, *t*. Each fit uses a local neighborhood that contains *k* data values. The local weighted regression is a degree-*d* polynomial:

proc iml;
load module=(PairwiseNearestNbr PolyRegEst PolyRegScore);

/* 1-D loess algorithm. Does not handle missing values.
   Input:  t:    points at which to fit loess (column vector)
           x, y: nonmissing data (column vectors)
           k:    number of nearest neighbors used for loess fit
           d:    degree of local regression model
   Output: column vector L[i] = f(t[i]; k, d) where f is loess model */
start LoessFit(t, x, y, k, d=1);
   m = nrow(t);
   Fit = j(m, 1);                      /* Fit[i] is predicted value at t[i] */
   do i = 1 to m;
      x0 = t[i];
      run PairwiseNearestNbr(idx, dist, x0, x, k);
      XNbrs = X[idx];  YNbrs = Y[idx]; /* X and Y values of k nearest nbrs */
      /* local weight function where dist[,k] is max dist in neighborhood */
      w = 32/5 * (1 - (dist / dist[k])##3 )##3;  /* use tricubic weight function */
      b = PolyRegEst(YNbrs, XNbrs, w`, d);  /* param est for local weighted regression */
      Fit[i] = PolyRegScore(x0, b);         /* evaluate polynomial at x0 */
   end;
   return Fit;
finish;

This algorithm provides some features that are not in PROC LOESS. You can use this function to evaluate a loess fit at arbitrary X values, whereas PROC LOESS evaluates the function only at quantiles of the data. You can use this function to fit a local polynomial regression of any degree (for example, a zero-degree polynomial), whereas PROC LOESS fits only first- and second-degree polynomials. Although I hard-coded the standard tricubic weight function, you could replace the function with any other weight function.

On the other hand, PROC LOESS supports many features that are not in this proof-of-concept function, such as automatic selection of the smoothing parameter, handling missing values, and support for higher-dimensional loess fits.

Let's use polynomials of degree 0, 1, and 2 to compute three different loess fits. The LoessData data set is defined in my previous article:

use LoessData;  read all var {x y};  close;   /* read example data */
s = 0.383;                 /* specify smoothing parameter */
k = floor(nrow(x)*s);      /* num points in local neighborhood */
/* grid of points to evaluate loess curve (column vector) */
t = T( do(min(x), max(x), (max(x)-min(x))/50) );
Fit0 = LoessFit(t, x, y, k, 0);   /* loess fit with degree=0 polynomials */
Fit1 = LoessFit(t, x, y, k, 1);   /* degree=1 */
Fit2 = LoessFit(t, x, y, k, 2);   /* degree=2 */
create Sim var {x y t Fit0 Fit1 Fit2};  append;  close;
QUIT;

You can use PROC SGPLOT to overlay these loess curves on a scatter plot of the data:

title "Overlay loess curves computed in SAS/IML";
proc sgplot data=Sim;
   label Fit0="Loess (Deg=0)" Fit1="Loess (Deg=1)" Fit2="Loess (Deg=2)";
   scatter x=x y=y;
   series x=t y=Fit0 / curvelabel;
   series x=t y=Fit1 / curvelabel lineattrs=(color=red);
   series x=t y=Fit2 / curvelabel lineattrs=(color=ForestGreen);
   xaxis grid;  yaxis grid;
run;

The three curves are fairly close to each other on the interior of the data. The degree 2 curve wiggles more than the other two curves because it uses a higher-degree polynomial. The over- and undershooting becomes even more pronounced if you use cubic or quartic polynomials for the local weighted regressions.

The curious reader might wonder how these curves compare to curves that are created by PROC LOESS or by the LOESS statement in PROC SGPLOT. In the attached program I show that the IML implementation produces the same predicted values as PROC LOESS when you evaluate the models at the same set of points.

Most SAS data analysts are happy to use PROC LOESS. They don't need to write their own loess algorithm in PROC IML. However, this article shows that IML provides the computational tools and matrix computations that you need to implement sophisticated algorithms, should the need ever arise.


The post What is loess regression? appeared first on The DO Loop.

Loess regression is a nonparametric technique that uses local weighted regression to fit a smooth curve through points in a scatter plot. Loess curves can reveal trends and cycles in data that might be difficult to model with a parametric curve. Loess regression is one of several algorithms in SAS that can automatically choose a smoothing parameter that best fits the data.

In SAS, there are two ways to generate a loess curve. When you want to see statistical details for the fit, use the LOESS procedure. If you just want to overlay a smooth curve on a scatter plot, you can use the LOESS statement in PROC SGPLOT.

This article discusses the 1-D loess algorithm and shows how to control features of the loess regression by using PROC LOESS and PROC SGPLOT. You can also use PROC LOESS to fit higher dimensional data; the PROC LOESS documentation shows an example of 2-D loess, which fits a response surface as a function of two explanatory variables.

The loess algorithm, which was developed by Bill Cleveland and his colleagues in the late 1970s through the early 1990s, has had several different incarnations. Assume that you are fitting the loess model at a point *x*_{0}, which is not necessarily one of the data values. The following list describes the main steps in the loess algorithm as implemented in SAS:

1. **Choose a smoothing parameter:** The smoothing parameter, *s*, is a value in (0,1] that represents the proportion of observations to use for the local regression. If there are *n* observations, then the *k* = floor(*n*·*s*) points closest to *x*_{0} (in the X direction) form a *local neighborhood* near *x*_{0}.
2. **Find the *k* nearest neighbors to *x*_{0}:** I recently showed a SAS/IML module that can compute nearest neighbors.
3. **Assign weights to the nearest neighbors:** The loess algorithm uses a tricubic weight function to weight each point in the local neighborhood of *x*_{0}. The weight for the *i*th point in the neighborhood is w_{i} = (32/5) (1 - (d_{i}/D)^{3})^{3}, where D is the largest distance in the neighborhood and d_{i} is the distance to the *i*th point. (The weight function is zero outside of the local neighborhood.) The graph of the weight function is shown below. The weight function gives more weight to observations whose X value is close to *x*_{0} and less weight to observations that are farther away.
4. **Perform local weighted regression:** The points in the local neighborhood of *x*_{0} are used to fit and score a local weighted regression model at *x*_{0}.

These four steps implement the basic loess method.
The SAS procedures add a fifth step: **optimize the smoothing parameter** by fitting multiple loess models. You can use a criterion such as the AICC or GCV to balance the tradeoff between a tight fit and a complex model.
For details, see the documentation for selecting the smoothing parameter.
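The four main steps translate directly into a few lines of code in any matrix-aware language. The following Python sketch (an illustration with my own function name, using the unnormalized tricubic weight because the 32/5 constant cancels in the weighted regression) evaluates a basic loess fit on a vector of points:

```python
import numpy as np

def loess_fit(t, x, y, s=0.383, degree=1):
    """Evaluate a basic 1-D loess fit at each point of t (illustrative helper)."""
    k = int(np.floor(len(x) * s))        # number of points in each local neighborhood
    fit = np.empty(len(t))
    for i, x0 in enumerate(t):
        d = np.abs(x - x0)
        idx = np.argsort(d)[:k]          # step 2: k nearest neighbors of x0
        D = d[idx].max()                 # largest distance in the neighborhood
        w = (1 - (d[idx]/D)**3)**3       # step 3: tricubic weights (32/5 factor cancels)
        sw = np.sqrt(w)
        V = np.vander(x[idx], degree+1)  # step 4: local weighted polynomial regression
        b, *_ = np.linalg.lstsq(V * sw[:, None], y[idx] * sw, rcond=None)
        fit[i] = np.polyval(b, x0)       # score the local polynomial at x0
    return fit
```

Because each local fit is an ordinary weighted regression, a degree-1 loess fit reproduces data that lie exactly on a line.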

The previous section told you how to fit a loess model at a particular point *x*_{0}.
PROC LOESS provides two choices for the locations at which you can evaluate the model:

- By default, PROC LOESS evaluates the model at a data-dependent set of points, *V*, which are vertices of a *k*-d tree. Think of the points of *V* as a grid of X values. However, the grid is not linearly spaced in X, but is approximately linear in the quantiles of the data.
- You can evaluate the model at each unique X data value by using the DIRECT option on the MODEL statement.

If you want to score the model on a set of new observations, you cannot use the direct method.
When you score new observations by using the SCORE statement, PROC LOESS uses linear or cubic interpolation between the points of *V* and the new observations. You can specify the interpolation scheme by using the INTERP= option on the MODEL statement.

The MODEL statement for the LOESS procedure provides many options for controlling the loess regression model. The LOESS statement in PROC SGPLOT provides only a few frequently used options. In some instances, PROC SGPLOT uses different default values, so it is worthwhile to compare the two statements.

- **Choose a smoothing parameter:** In both procedures, you can choose a smoothing parameter by using the SMOOTH= option.
- **Fit the local weighted regression:** In both procedures, you can control the degree of the local weighted polynomial regression by using the DEGREE= option. You can choose a linear or a quadratic regression model. Both procedures use the tricubic function to determine weights in the local neighborhood.
- **Choose an optimal smoothing parameter:** PROC LOESS provides the SELECT= option for controlling the selection of the optimal smoothing parameter. PROC SGPLOT does not provide a choice: it always optimizes the AICC criterion with the PRESEARCH suboption.
- **Evaluate the fit:** Both procedures evaluate the fit at a set of data-dependent values, then use interpolation to evaluate the fit at other locations.
  - In PROC LOESS, you can use the SCORE statement to interpolate at an arbitrary set of points. You use the INTERP= option in the MODEL statement to specify whether to use linear or cubic interpolation.
  - In PROC SGPLOT, the interpolation is performed on a uniform grid of points. The default grid contains 201 points between min(x) and max(x), but you can use the MAXPOINTS= option to change that number. You use the INTERPOLATION= option to specify linear or cubic interpolation.

The following SAS DATA step creates 30 observations for X and Y variables. The call to PROC LOESS fits a loess curve to the data and creates a fit plot, a residual plot, and a panel of diagnostic plots. Only the fit plot is shown:

data LoessData;
input x y @@;
datalines;
11.7  2.3   19.9  8.1   11.8  4.6   17.1  5.1   16.5  4.8
 5.6  1.7   12.9  5.4    7.6  3.0    9.0  4.8   17.5  5.0
10.4  1.3   16.9  2.8    5.6  1.8   18.7  6.9    3.7  1.7
 7.4  2.3    2.0  2.7   14.8  5.2    3.0  0.0   16.8  4.2
15.0  6.6   19.9  5.5    1.9  1.1   14.8  5.8   12.4  3.0
14.0  6.4   11.7  3.6    8.2  2.9   18.8  6.2    0.3  1.8
;

ods graphics on;
ods select FitPlot;
proc loess data=LoessData plots=FitPlot;
   model y = x / interp=linear           /* LINEAR or CUBIC */
                 degree=1                /* 1 or 2 */
                 select=AICC(presearch); /* or SMOOTH=0.383 */
run;

For the PROC LOESS call, all options are the default values except for the PRESEARCH suboption in the SELECT= option. You can create the same fit plot by using the LOESS statement in PROC SGPLOT. The default interpolation scheme in PROC SGPLOT is cubic, so the following statements override that default option:

title "PROC SGPLOT with LOESS Statement";
proc sgplot data=LoessData noautolegend;
   loess x=x y=y / interpolation=linear  /* CUBIC or LINEAR */
                   degree=1              /* 1 or 2 */
                   ;  /* default selection or specify SMOOTH=0.383 */
   xaxis grid;  yaxis grid;
run;

The two plots are shown side by side. The one on the left was created by PROC LOESS. The one on the right was created by PROC SGPLOT.

In conclusion, SAS provides two ways to overlay a smooth loess curve on a scatter plot. You can use PROC LOESS when you want to see the details of statistical aspects of the fit and the process that optimizes the smoothing parameter. You can use the SGPLOT procedure when you care less about the details, but simply want an easy way to show a nonlinear relationship between a response and an explanatory variable.


The post The empty-space distance plot appeared first on The DO Loop.

Recently I showed an algorithm that enables you to find the distance between a set of locations and a fixed set of reference sites. The set of locations (the people) can be arbitrary, so a good way to summarize the problem is to draw a contour plot that shows the distance from every point in a region to the nearest reference site (a hospital or store). In SAS, you can create this graph by using the SPP procedure, which analyzes spatial point processes. The graph is called an empty-space plot.

Some people love big cities and the diverse opportunities that they provide. They want to know how far it is to the nearest big city. Other people hate the noise and congestion; they want to be as far away from a big city as possible. To demonstrate how to create an empty-space plot, let's look at the spatial distribution of US cities that have population more than 200,000.

The following SAS DATA step reads data from the Maps.USCity data set, which is distributed as part of SAS/GRAPH software. The BigCities data contains only cities in the contiguous US for which the population is greater than 200,000. The call to PROC SGPLOT creates a scatter plot that shows the projected longitude and latitude (measured in radians) of these 86 cities.

data BigCities;
set maps.uscity;                 /* part of SAS/GRAPH */
where statecode not in ("AK" "HI" "PR") & pop > 200000;
if pop > 500000 then Name=City;  /* label really big cities */
else Name = " ";
run;

title "Large US Cities";
title2 "Population > 200,000";
proc sgplot data=BigCities;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled) datalabel=Name;
run;

The scatter plot supports a well-known fact: cities are not settled in random locations. People tend to settle near beneficial geographic features, such as proximity to rivers, harbors, and other strategic resources. Therefore the cities are not spread uniformly throughout the country. There are clusters of cities on both coasts and there is a large empty space in the rugged northwest. The large empty space contains locations that are much farther away from a big city than would be expected for a random uniform assortment of 86 cities.

The following call to PROC SPP creates an empty-space plot, which reveals regions that are far from a big city. The F option will be explained in a moment:

proc spp data=BigCities plots(equate)=(emptyspace(obs=on) F);
   process BigCities = (x, y / area=(-0.37,-0.19,0.32,0.24)) / F;
run;

The dark blue color indicates geographic coordinates that are far from any large city. The darkest blue appears to be in the northwestern states, such as Montana, Idaho, and Wyoming. Western Texas also stands out. Those are great places to live for people who want to be far from large cities. There is also dark blue in regions that are uninhabitable (oceans and the Great Lakes) or are not part of the US (Mexico and Canada).

I think the image is beautiful, but it is useful as well. It gives a visual summary of locations that are far from the reference sites (big cities).

If you want a more statistical analysis, the F option in the PROCESS statement and in the PLOTS= option tells PROC SPP to lay down a regular grid of points and compute the distance from each grid position to the nearest big city. (By default the grid is 50 x 50.) The graph shows the cumulative distribution of these distances in blue. You can compare the distribution of these distances to the expected distribution for randomly selected locations, which is shown in red and labeled "complete spatial randomness."

You can see that the empirical distribution rises more slowly than the theoretical curve. This indicates that the observed locations are more clustered than you would expect if the cities were randomly located. For details about the F option, see the documentation for PROC SPP.
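If you want to experiment with the idea outside PROC SPP, the computation behind the empty-space plot is easy to sketch. The following Python function (a brute-force illustration; PROC SPP's internals may differ) computes the distance from each node of a regular grid to the nearest site; the empirical CDF of these distances is the F function:

```python
import numpy as np

def empty_space_distances(sites, xmin, xmax, ymin, ymax, n=50):
    """Distance from each node of an n x n grid to the nearest site."""
    gx, gy = np.meshgrid(np.linspace(xmin, xmax, n), np.linspace(ymin, ymax, n))
    grid = np.column_stack([gx.ravel(), gy.ravel()])          # n*n grid nodes
    diff = grid[:, None, :] - np.asarray(sites, float)[None, :, :]
    return np.sqrt((diff**2).sum(axis=2)).min(axis=1)         # nearest-site distances
```

Plotting these distances as a heat map reproduces the idea of the empty-space plot; sorting them and plotting the cumulative proportion gives the empirical F function.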

The main purpose of this example is to demonstrate that PROC SPP makes it easy to compute the empty-space plot, which summarizes the distance between an arbitrary location and a pattern of points. In this article the points were large US cities, but obviously you could analyze other point patterns. The SPP procedure has many options that indicate whether the pattern appears to be randomly distributed or clustered. The procedure indicates that the locations of large US cities are not random.

tags: Data Analysis, Spatial Data


The post WHERE operators in SAS: Multiple comparisons and fuzzy matching appeared first on The DO Loop.

A common use of the IN operator is to specify a list of US states and territories that should be included or excluded in an analysis. For example, the following DATA step reads the Sashelp.Zipcode data, but excludes zip codes for the states of Alaska ("AK"), Hawaii ("HI"), and US territories such as Puerto Rico ("PR"), Guam ("GU"), and so on:

data Lower48;
set Sashelp.Zipcode(where=(     /* exclude multiple states and territories */
    Statecode not in ("AK" "HI" "VI" "GU" "FM" "MP" "MH" "PW")) );
run;

In my previous article about how to use the WHERE clause in SAS/IML, my examples used scalar comparisons such as `where(sex="F")` to select only females in the data. The SAS/IML language does not support the IN operator, but there is another compact way to include or exclude multiple values. Because
SAS/IML is a matrix-vector language, many operations support vector arguments.
In particular, the WHERE clause in the SAS/IML language enables you to use the ordinary equal operator (=) and specify a vector of values on the right hand side!

For example, the following statement reads in all US zip codes in the contiguous US and creates a scatter plot of their locations:

proc iml;
excludeList = {"AK" "HI" "PR" "VI" "GU" "FM" "MP" "MH" "PW"};
use Sashelp.Zipcode where(Statecode ^= excludeList);  /* vector equiv of NOT IN */
read all var {X Y "City" "Statecode"};
close;

title "Centers of US ZIP Codes";
call scatter(X, Y) group=Statecode
     option="markerattrs=(size=2)"
     label={"Longitude" "Latitude"}
     procopt="noautolegend";

The WHERE clause skips observations for which the `Statecode` variable matches *any* of the values in the `excludeList` vector.
The scatter plot reveals the basic shape of the contiguous US. You can see that the plot does not display locations from Alaska, Hawaii, or US territories.
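For comparison, vectorized languages outside SAS express the same "not in" filter in one line. Here is a numpy analogue of the vectorized WHERE clause (the variable names mirror the IML example above):

```python
import numpy as np

Statecode = np.array(["NC", "AK", "CA", "HI", "NY", "PR"])
excludeList = np.array(["AK", "HI", "PR", "VI", "GU", "FM", "MP", "MH", "PW"])

keep = ~np.isin(Statecode, excludeList)   # True where the state matches none of the list
print(Statecode[keep].tolist())           # ['NC', 'CA', 'NY']
```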

As long as we're talking about the WHERE clause in SAS, let's discuss some string-matching operators that might not be familiar to some SAS programmers. I'll use SAS/IML for the examples, but these operators are generally supported (in scalar form) in all SAS WHERE clauses. The operators are

- The "contains" operator (**?**)
- The "not contains" operator (**^?**)
- The "begins with" operator (**=:**)
- The "sounds like" operator (**=***)

All these operators are documented in the list of WHERE clause operators in SAS/IML.


The "contains" operator (**?**) and the "not contains" operator (**^?**) match a substring that appears anywhere in the target character variable.

The "begins with" operator (**=:**) matches substrings that appear at the beginning of a target variable. For example, the following statements select observations for which the state begins with the letter "B", "C", or "D". (There are no US states that begin with "B.") Notice that the "begin with" operator is also vectorized in SAS/IML:

use Sashelp.Zipcode where(Statecode =: {"B" "C" "D"});  /* =: "begins with" */
read all var {X Y "City" "Statecode"};
close;
u = unique(Statecode);
print u;

Perhaps the most unusual operator in the WHERE clause in SAS is the "sounds like" operator (**=***), which does "fuzzy matching" of English words. The operator finds English words that are similar to the specified target words by using the SOUNDEX function in SAS. The SOUNDEX function is often used to select different names that sound alike but have different spelling, such as "John" and "Jon" or "Lynn" and "Lynne."
In the following WHERE clause, the "sounds like" operator is used to select observations for which the city sounds similar to either "Cary" or "Asheville." The selected observations are plotted in a scatter plot after eliminating duplicate rows.

use Sashelp.Zipcode where(City =* {"Cary" "Asheville"});  /* =* "sounds like" */
read all var {X Y "City" "Statecode"};
close;

start UniqueRows(x);
   cols = 1:ncol(x);               /* sort by all columns */
   call sortndx(ndx, x, cols);     /* ndx = permutation of rows that sorts matrix */
   uRows = uniqueby(x, cols, ndx); /* locate unique rows of sorted matrix */
   return ( ndx[uRows] );          /* rows in original matrix */
finish;

r = UniqueRows(City||Statecode);   /* get row numbers in x for unique rows */
call scatter(X[r], Y[r]) group=Statecode[r] datalabel=City[r]
     option="markerattrs=(symbol=CircleFilled)" procopt="noautolegend";

According to the "sounds like" operator, the name "Cary" sounds like "Carey," "Cory," and "Cherry." The name "Asheville" sounds like "Ashville," "Ashfield," and "Ash Flat."
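If you are curious how the fuzzy matching works, the classic American Soundex algorithm is short enough to sketch. The following Python version is a simplified stand-in (SAS's SOUNDEX function differs in details, such as the length of the code it returns), but it shows why these pairs match:

```python
def soundex(name):
    """Basic American Soundex: first letter plus three digits for consonant classes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),   **dict.fromkeys("l", "4"),
             **dict.fromkeys("mn", "5"),   **dict.fromkeys("r", "6")}
    s = name.lower()
    out = name[0].upper()
    prev = codes.get(s[0], "")
    for ch in s[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip vowels and collapse repeated codes
            out += code
        if ch not in "hw":          # h and w do not separate duplicate codes
            prev = code
    return (out + "000")[:4]        # pad with zeros and truncate to 4 characters
```

Under this encoding, "Cary" and "Carey" both map to C600, and "Asheville" and "Ashville" both map to A214, so the fuzzy comparison treats them as equal.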

In summary, the WHERE clause in SAS/IML works a little differently than the more-familiar version in the SAS DATA step. Both versions enable you to selectively include or exclude observations that satisfy one or more conditions. However, the SAS/IML WHERE clause is vectorized. You can specify a vector of conditions for operators, thus reproducing the functionality of the IN operator.

This article also demonstrates a few lesser-known string operators, such as "contains" (**?**), "not contains" (**^?**), "begins with" (**=:**), and "sounds like" (**=***).

tags: SAS Programming


The post Visualize a weighted regression appeared first on The DO Loop.

Technically, an "unweighted" regression should be called an "equally weighted" regression, since ordinary least squares (OLS) regression weights each observation equally. Similarly, an "unweighted mean" is really an equally weighted mean.

Recall that weights are not the same as frequencies. When talking about weights, it is often convenient to assume that the weights sum to unity. This article uses standardized weights, although you can specify any set of weights when you use a WEIGHT statement in a SAS procedure.

One way to look at a weighted regression is to assume that a weight is related to the variance of an observation. Namely, the *i*th weight, *w*_{i}, indicates that the variance of the *i*th value of the dependent variable is σ^{2}/*w*_{i}, where σ^{2} is a common variance. Notice that an observation that has a small weight (near zero) has a relatively large variance. Intuitively, the observed response is not known with much precision; a weighted analysis makes that observation less influential.


For example, the following SAS data set defines (x,y) values and weights (w) for 11 observations. Observations whose X value is close to x=10 have relatively large weights. Observations far from x=10 have small weights. The last observation (x=14) is assigned a weight of zero, which means that it will be completely excluded from the analysis.

data RegData;
input y x w;
datalines;
2.3  7.4 0.058
3.0  7.6 0.073
2.9  8.2 0.114
4.8  9.0 0.144
1.3 10.4 0.151
3.6 11.7 0.119
2.3 11.7 0.119
4.6 11.8 0.114
3.0 12.4 0.073
5.4 12.9 0.035
6.4 14.0 0
;

You can use PROC SGPLOT to visualize the observations and weights. In fact, PROC SGPLOT supports a REG statement that adds a regression line to the plot. The REG statement supports adding both weighted and unweighted regression lines:

proc sgplot data=RegData;
   scatter x=x y=y / filledoutlinedmarkers markerattrs=(size=12 symbol=CircleFilled)
                     colorresponse=w colormodel=TwoColorRamp;    /* color markers by weight */
   reg x=x y=y / nomarkers legendlabel="Unweighted Regression";  /* usual OLS regression */
   reg x=x y=y / weight=w nomarkers legendlabel="Weighted Regression";  /* weighted regression */
   xaxis grid;  yaxis grid;
run;

Each marker is colored by its weight. Dark blue markers (observations for which 8 < *x* < 12) have relatively large weights. Light blue markers have small weights and do not affect the weighted regression model very much.

You can see the effect of the weights by comparing the weighted and unweighted regression lines. The unweighted regression line (in blue) is pulled upward by the observations near *x*=13 and *x*=14. These observations have small weights, so the weighted regression line (in red) is not pulled upwards.

If you want to compute parameter estimates and other statistics, you can use the REG procedure in SAS/STAT to run the weighted and unweighted regression models. The following statements run PROC REG twice: once without a WEIGHT statement and once with a WEIGHT statement:

proc reg data=RegData plots(only)=FitPlot;
   Unweighted: model y = x;
   ods select NObs ParameterEstimates FitPlot;
quit;

proc reg data=RegData plots(only)=FitPlot;
   Weighted: model y = x;
   weight w;
   ods select NObs ParameterEstimates FitPlot;
quit;

The output (not shown) indicates that the unweighted regression model is Y = -0.23 + 0.36*X. In contrast, the weighted regression model is Y = 2.3 + 0.085*X. This confirms that the slope of the weighted regression line is smaller than the slope of the unweighted line.

You can read the SAS documentation to find the formulas that are used for a weighted OLS regression model.
The formula for the parameter estimates is a weighted version of the normal equations: **b** = (**X`WX**)^{-1}(**X`WY**), where **Y** is the vector of observed responses, **X** is the design matrix, and **W** is the diagonal matrix of weights.

In SAS/IML it is more efficient to use elementwise multiplication than to multiply with a diagonal matrix. If you work through the matrix equations, you will discover that weighted regression is easily accomplished by using the normal equations on the matrices that result from multiplying the rows of **Y** and **X** by the square root of the weights. This is implemented in the following SAS/IML module:

proc iml;
/* weighted polynomial regression for one regressor. Y, X, and w are col vectors */
start PolyRegEst(Y, X, w, deg);
   Yw = sqrt(w)#Y;               /* mult rows of Y by sqrt(w) */
   XDesign = j(nrow(X), deg+1);
   do j = 0 to deg;              /* design matrix for polynomial regression */
      Xdesign[,j+1] = X##j;
   end;
   Xw = sqrt(w)#Xdesign;         /* mult rows of X by sqrt(w) */
   b = solve(Xw`*Xw, Xw`*Yw);    /* solve normal equations */
   return b;
finish;

use RegData;  read all var {Y X w};  close;   /* read data and weights */

/* loop to perform one-variable polynomial regression for deg=0, 1, and 2 */
do deg = 0 to 2;
   b = PolyRegEst(Y, X, w, deg);
   d = char(deg,1);                         /* '0', '1', or '2' */
   labl = "Estimates (deg=" + d + ")";
   print b[L=labl rowname=("b0":("b"+d))];  /* print estimates for each model */
end;

The output shows the parameter estimates for three regression models: a "mean model" (degree 0), a linear model (degree 1), and a quadratic model (degree 2). Notice that the parameter estimates for the weighted linear regression are the same as the estimates computed by PROC REG in the previous section.

The previous section describes how to use SAS/IML to compute parameter estimates of weighted regression models, and you can also use SAS/IML to score the regression models. The scoring does not require knowing the weight variable, only the parameter estimates. The following module uses Horner's method to evaluate a polynomial on a grid of points:

/* Score regression fit on column vector x */
start PolyRegScore(x, coef);
   p = nrow(coef);
   y = j(nrow(x), 1, coef[p]);     /* initialize to coef[p] */
   do j = p-1 to 1 by -1;
      y = y # x + coef[j];
   end;
   return(y);
finish;
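Horner's method evaluates a degree-d polynomial with only d multiplications and d additions. Here is a minimal Python version of the same scheme (a hypothetical helper, shown only for illustration), with the constant coefficient first, as in the SAS/IML module:

```python
def poly_score(x, coef):
    """Evaluate the polynomial coef[0] + coef[1]*x + ... + coef[d]*x**d
    at x by using Horner's method (nested multiplication)."""
    y = coef[-1]                      # start with the leading coefficient
    for c in reversed(coef[:-1]):     # fold in coefficients from high to low
        y = y * x + c
    return y

# 2 + 3x + 4x^2 evaluated at x=2 is 2 + 6 + 16 = 24
print(poly_score(2.0, [2.0, 3.0, 4.0]))   # → 24.0
```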

You can compute predicted values for each model on a grid of points. You can then write the predicted values to a SAS data set and combine the predicted values with the original data. Lastly, you can use the SERIES statement in PROC SGPLOT to overlay the three regression models on the original data:

t = T( do(min(x), max(x), (max(x)-min(x))/25) );   /* uniform grid in range(X) */
Yhat = j(nrow(t), 3);
do d = 0 to 2;
   b = PolyRegEst(Y, X, w, d);         /* weighted regression model of degree d */
   Yhat[,d+1] = PolyRegScore(t, b);    /* score model on grid */
end;

Z = t || Yhat;                         /* write three predicted curves to data set */
create RegFit from Z[c={"t" "Pred0" "Pred1" "Pred2"}];
append from Z;
quit;

data RegAll;                           /* merge predicted curves with original data */
label w="Weight" Pred0="Weighted Mean" Pred1="Weighted Linear Fit"
      Pred2="Weighted Quadratic Fit";
merge RegData RegFit;
run;

title "Weighted Regression Models";
/* overlay weighted regression curves and data */
proc sgplot data=RegAll;
   scatter x=x y=y / filledoutlinedmarkers markerattrs=(size=12 symbol=CircleFilled)
                     colorresponse=w colormodel=TwoColorRamp;
   series x=t y=Pred0 / curvelabel;
   series x=t y=Pred1 / curvelabel;
   series x=t y=Pred2 / curvelabel;
   xaxis grid;
   yaxis grid;
run;

A visualization of the weighted regression models is shown to the left. The weighted linear fit is the same line that was shown in the earlier graph. The weighted mean and the weighted quadratic fit are the zero-degree and second-degree polynomial models, respectively. Of course, you could also create these curves in SAS by using PROC REG or by using the REG statement in PROC SGPLOT.

Weighted regression has many applications. The application featured in this article is to fit a model in which some response values are known with more precision than others. We saw that it is easy to create a weighted regression model in SAS by using PROC REG or PROC SGPLOT. It is only slightly harder to write a SAS/IML function to use matrix equations to "manually" compute the parameter estimates. No matter how you compute the model, observations that have relatively large weights are more influential in determining the parameter estimates than observations that have small weights.

tags: Data Analysis, Matrix Computations

The post Visualize a weighted regression appeared first on The DO Loop.


- Many SAS regression procedures (and PROC PLM) create effect plots.
- PROC SURVEYREG creates a hexagonal bin plot.
- PROC REG creates heat maps when a scatter plot would suffer from overplotting.
- PROC LOGISTIC creates odds-ratio plots.

Recently a SAS customer asked how to use PROC SGPLOT to produce a stacked bar chart. I showed him how to do it, but I also mentioned that the same graph could be produced with less effort by using PROC FREQ.

PROC FREQ is a workhorse procedure that can create dozens of graphs. For example, PROC FREQ can create a mosaic plot and a clustered bar chart to visualize frequencies and relative frequencies in a two-way table. Next time you use PROC FREQ, add the PLOTS=ALL option to the TABLES statement and see what you get!

One of my favorite plots for two-way categorical data is the stacked bar chart. As I told the SAS customer, it is simple to create a stacked bar chart in PROC FREQ. For example, the following statements create a horizontal bar chart that orders the categories by frequency:

proc freq data=sashelp.Heart order=freq;
   tables weight_status*smoking_status /
          plots=freqplot(twoway=stacked orient=horizontal);
run;

See the documentation for the PLOTS= option in the TABLES statement for a description of all the plots that PROC FREQ can create. PROC FREQ creates many plots that are associated with a particular analysis, such as the "deviation plot," which shows the relative deviations between the observed and expected counts when you request a chi-square analysis of a one-way table.

The ODS graphics in SAS are designed to create many—but not all—of the visualizations that are relevant to an analysis. Some attributes of a graph (for example, the title and the legend placement) are determined by a stored template and can't be modified by using the procedure syntax. Advanced GTL gurus might want to learn how to edit the ODS templates. Less ambitious users might choose to use a statistical procedure to automatically create graphs during data exploration and modeling, but then use PROC SGPLOT to create the final graph for a report.

For example, the following call to PROC FREQ writes the two-way frequency counts to a SAS data set. From the data you can create graphs that are similar to the one that PROC FREQ creates, but you can change the order and colors of the bars, alter the placement of the legend, add text, and more. The following call to PROC SGPLOT shows one possibility.

proc freq data=sashelp.heart order=freq noprint;
   tables smoking_status*weight_status / out=FreqOut(where=(percent^=.));
run;

ods graphics / height=500px width=800px;
title "Counts of Weight Categories by Smoking Status";
proc sgplot data=FreqOut;
   hbarparm category=smoking_status response=count / group=weight_status
            seglabel seglabelfitpolicy=none seglabelattrs=(weight=bold);
   keylegend / opaque across=1 position=bottomright location=inside;
   xaxis grid;
   yaxis labelpos=top;
run;

If you want to create a stacked bar chart or some other visualization of a two-way table, you might be tempted to immediately start using PROC SGPLOT. The purpose of this article is to remind you that SAS statistical procedures, including PROC FREQ, often create graphs as part of their output. Even if a statistical procedure cannot provide all the bells and whistles of PROC SGPLOT, it often is a convenient way to visualize your data during the preliminary stages of an analysis. If you need a highly customized graph for a final report, the SAS procedure can output the data for the graph. You can then use PROC SGPLOT or the GTL to create a customized graph.

tags: 9.4, Data Analysis, Statistical Graphics

The post Let PROC FREQ create graphs of your two-way tables appeared first on The DO Loop.


The answer is yes, and this problem occurs frequently in applications. Typically the first group contains the locations of randomly placed individuals and the second contains the locations of resources. For each individual, you want to find the closest resources. Examples include:

- The first group contains the positions of houses. The second group contains the locations of stores. A retailer might want to identify the stores that are closest to customers, so that shipping costs are minimized.
- The first group contains the locations of vehicle accidents on a highway. The second group contains the locations of hospitals that can treat the injured victims.
- The first group contains the location of cell phone users. The second group contains the locations of cell towers.

This article solves the problem in a straightforward way: compute all the pairwise distances between the points in each group. Distances within groups are not needed, so they are not computed.


I previously showed how to compute the pairwise distances between points in different sets. Let S be a set of *n* d-dimensional points and let R be another set of *m* points. The PairwiseDist function in SAS/IML (shown below) returns an *n* x *m* matrix, D, of distances such that D[i,j] is the distance from the i_th point in S to the j_th point in R. The PairwiseNearestNbr function uses the same algorithm that was used to find nearest neighbors, so I will not re-explain how the function works. Suffice it to say that it returns two matrices: a matrix of row numbers and a matrix of distances.

proc iml;
/* http://blogs.sas.com/content/iml/2013/03/27/compute-distance.html */
/* compute Euclidean distance between points in S and points in R.
   S is a n x d matrix, where each row is a point in d dimensions.
   R is a m x d matrix.
   The function returns the n x m matrix of distances, D, such that
   D[i,j] is the distance between S[i,] and R[j,]. */
start PairwiseDist(S, R);
   if ncol(S)^=ncol(R) then return (.);     /* different dimensions */
   n = nrow(S);  m = nrow(R);
   idx = T(repeat(1:n, m));                 /* index matrix for S */
   jdx = shape(repeat(1:m, n), n);          /* index matrix for R */
   diff = S[idx,] - R[jdx,];
   return( shape( sqrt(diff[,##]), n ) );   /* sqrt(sum of squares) */
finish;

/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  S    an (n x d) data matrix
           R    an (m x d) matrix of reference points
           k    specifies the number of nearest neighbors (k>=1)
   OUTPUT: idx  an (n x k) matrix of row numbers. idx[,j] contains
                the row numbers (in R) of the j_th closest elements to S
           dist an (n x k) matrix. dist[,j] contains the distances
                between S and the j_th closest elements in R        */
start PairwiseNearestNbr(idx, dist, S, R, k=1);
   n = nrow(S);
   idx = j(n, k, .);  dist = j(n, k, .);
   D = PairwiseDist(S, R);                  /* n x m */
   do j = 1 to k;
      dist[,j] = D[ ,><];                   /* smallest distance in each row */
      idx[,j] = D[ ,>:<];                   /* column of smallest distance in each row */
      if j < k then do;                     /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:n)||idx[,j]);
         D[ndx] = .;                        /* set elements to missing */
      end;
   end;
finish;
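The same pairwise-distance idea is easy to express in any matrix language via broadcasting. The following Python/NumPy sketch (illustrative only; it is not the SAS/IML implementation) computes the full distance matrix between a few interior points and the vertices of a square, then finds each point's nearest vertex:

```python
import numpy as np

def pairwise_dist(S, R):
    """Return the n x m matrix of Euclidean distances between the
    n rows of S and the m rows of R (points in d dimensions)."""
    diff = S[:, None, :] - R[None, :, :]   # broadcast to n x m x d
    return np.sqrt((diff**2).sum(axis=2))

R = np.array([[-1.0, -1.0],   # Pt 1: lower left corner
              [ 1.0, -1.0],   # Pt 2: lower right corner
              [ 1.0,  1.0],   # Pt 3: upper right corner
              [-1.0,  1.0]])  # Pt 4: upper left corner
S = np.array([[-0.3, -0.2],   # points inside the square
              [ 0.9,  0.6],
              [-0.5,  0.8]])

D = pairwise_dist(S, R)        # 3 x 4 distance matrix
nearest = D.argmin(axis=1)     # index (in R) of each point's closest vertex
print(nearest + 1)             # → [1 3 4]  (1-based, as SAS/IML reports)
```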

To illustrate how you can use these functions, consider two sets S and R. The set R contains the resource locations (stores, hospitals, towers, ...). For this example, R contains four points that are the vertices of a square of side length 2 centered at the origin. For each point in S, the following SAS/IML statements compute the closest vertex in R and the second-closest vertex in R:

/* R = vertices of a square */
R = {-1 -1,       /* Pt 1: lower left corner  */
      1 -1,       /* Pt 2: lower right corner */
      1  1,       /* Pt 3: upper right corner */
     -1  1};      /* Pt 4: upper left corner  */
S = {-0.3 -0.2,   /* points inside square */
      0.9  0.6,
     -0.5  0.8};
k = 2;            /* find nearest and second nearest points */
run PairwiseNearestNbr(idx, dist, S, R, k);
print idx[c={"Closest" "2nd Closest"} r=("S1":"S3")];

The first column in the table shows the indices (row numbers) of the vertices that are nearest to each point of S. The second column shows the second-closest vertices. For example, Pt 1 (the lower left corner) is the nearest vertex to the first row of S, and Pt 4 (the upper left corner) is the second nearest. In a similar way, Pt 3 is the closest vertex to the second row of S, and Pt 4 is the closest vertex to the third row of S.

The PairwiseNearestNbr module also returns the `dist` matrix, which contains the corresponding distances.
The i_th row of `dist` contains the distances from the i_th row of S to the two closest vertices.

In a similar way, you can generate random points in the square and color each point in a scatter plot according to the nearest point in the reference set. The ODS GRAPHICS statement with ATTRPRIORITY=NONE forces the ODS style to cycle through colors and symbols.

ods graphics / attrpriority=NONE width=400px height=400px;  /* cycle symbols and colors */
/* Generate 500 random points in square [-1,1] x [-1,1] */
call randseed(12345);
S = R // randfun({500 2}, "Uniform", -1, 1);
k = 2;
run PairwiseNearestNbr(idx, dist, S, R, k);
Corner = idx[,1];             /* index of nearest vertex */
title "Points colored by nearest corner";
call scatter(S[,1], S[,2]) group=Corner grid={x y} procopt="aspect=1";

The scatter plot shows what you already knew: points in the first quadrant (green Xs) are closest to the vertex at (1,1). Points in the second quadrant (brown triangles) are closest to the vertex at (-1, 1), and so forth. Any points that are equidistant from two or more vertices (such as the origin) are assigned one of the colors arbitrarily. The points were sorted (not shown) so that the order in the legend agrees with the order of the points in R.

The call to the PairwiseNearestNbr subroutine used *k*=2, which requests the closest and the second closest points. The second-closest indices are located in the second column of the `idx` matrix. If you change the title and set `Corner = idx[,2]`, you can create a scatter plot in which each point is colored by the *second-closest* vertex. That coloration is shown in the scatter plot below.
See if you can understand why each marker in the following scatter plot is colored the way it is. For example, for points in the first quadrant, the second-closest vertex is either the second vertex (1, -1) or the fourth vertex (-1, 1).

In summary, last week I showed how you can find the nearest neighbors among a single set of observations. This article solves a similar problem in which observations are split into two different groups. The functions in this article find the nearest observation in a "reference" or "resource" group for each observation in another group.

I'll conclude by mentioning that you can use these computations to compute the nearest distance between two groups of observations. Simply use *k*=1 and take the minimum distance that is computed by the PairwiseNearestNbr subroutine. This gives the smallest distance between any point in the first group and any point in the second group.

tags: Data Analysis

The post Distances between observations in two groups appeared first on The DO Loop.



An ogive is also called a cumulative histogram. You can create an ogive from a histogram by accumulating the frequencies (or relative frequencies) in each histogram bin. The
height of an ogive curve at *x* is found by summing the heights of the histogram bins to the left of *x*.

A histogram estimates the density of a distribution; the ogive estimates the cumulative distribution. Both are easy to construct by hand. Both are coarse estimates that depend on your choice of bin width and anchor position.

Ogives are not used much by professional statisticians because modern computers make it easy to compute and visualize the exact empirical cumulative distribution function (ECDF). However, if you are a student learning to analyze data by hand, an ogive is an easy way to approximate the ECDF from binned data. An ogive is also useful when you do not have access to the original data but you have a histogram that appeared in a published report. (See "How to approximate a distribution from published quantiles.")

To demonstrate the construction of an ogive, let's consider the distribution of the MPG_CITY variable in the Sashelp.Cars data set. This variable contains the reported fuel efficiency (in miles per gallon) for 428 vehicle models. The following call to PROC UNIVARIATE in Base SAS uses the OUTHIST= option in the HISTOGRAM statement to create a data set that contains the frequencies and relative frequencies of each bin. By default, the frequencies are reported for the midpoints of the intervals. To create an ogive you need the *endpoints* of each bin, so use the ENDPOINTS option as follows:

proc univariate data=sashelp.cars;
   var mpg_city;
   histogram mpg_city / grid vscale=proportion ENDPOINTS OUTHIST=OutHist;
   /* cdfplot mpg_city / vscale=proportion; */   /* optional: create an ECDF plot */
run;

The histogram shows that most vehicles get between 15 and 25 mpg in the city. The distribution is skewed to the right, with a few vehicles getting as much as 59 or 60 mpg. A few gas-guzzling vehicles get less than 15 mpg.

You can construct an ogive from the relative frequencies in the 11 histogram bins.
The height of the ogive at *x*=10 (the leftmost endpoint in the histogram) is zero. The height at *x*=15 is the height of the first bar. The height at *x*=20 is the sum of the heights of the first two histogram bars, and so on.

Each row in the OutHist data set contains a left-hand endpoint and the relative frequency (height) of the bar.
However, to construct an ogive you need to associate the bar height with the *right-hand* endpoints. This is because at the left-hand endpoint none of the density for the bin has accumulated, whereas at the right-hand endpoint all of the density has accumulated.

Consequently, to construct an ogive from the OUTHIST= data set, you can do the following:

- Associate zero with the leftmost endpoint of the bins.
- Adjust the counts and proportions in the OutHist data so that they are associated with the right-hand endpoint of each bin. You can use the LAG function to do this.
- Accumulate the relative frequencies in each bin to form the cumulative frequencies.
- Add a new observation to the OutHist data that contains the rightmost endpoint of the bins.

The following SAS DATA step carries out these adjustments:

data Ogive;
set outhist end=EOF;
ogiveX = _MinPt_;        /* left endpoint of bin */
dx = dif(ogiveX);        /* compute bin width */
prop = lag(_OBSPCT_);    /* move relative frequency to RIGHT endpoint */
if _N_=1 then prop = 0;  /* replace missing value by 0 for first obs */
cumProp + prop/100;      /* accumulate proportions */
output;
if EOF then do;          /* append RIGHT endpoint of final bin */
   ogiveX = ogiveX + dx;
   cumProp = 1;
   output;
end;
drop dx _:;              /* drop variables that begin with underscore */
run;
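The accumulation scheme is not specific to SAS. As a cross-check, this Python/NumPy sketch (illustrative only; the data are simulated, not the Sashelp.Cars values) bins a skewed sample and attaches the cumulative proportions to the right-hand endpoints of the bins:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.gamma(4.0, 5.0, 428)          # right-skewed stand-in sample

counts, edges = np.histogram(data, bins=11)   # bin counts and bin endpoints
relfreq = counts / counts.sum()               # relative frequency per bin

# Ogive: height 0 at the leftmost endpoint, then the cumulative
# proportion at each RIGHT-hand endpoint (all of that bin's density
# has accumulated by the time you reach its right edge)
ogive_x = edges
ogive_y = np.concatenate([[0.0], np.cumsum(relfreq)])

print(ogive_y[0], ogive_y[-1])           # the curve rises from 0 to 1
```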

The Ogive data set contains all the information that you need to graph an ogive. The following call to PROC SGPLOT uses a VLINE statement, which treats the endpoints of the bins as discrete values. You could also use the SERIES statement, which treats the endpoints as a continuous variable, but might not put a tick mark at each bin endpoint.

title "Cumulative Distribution of Binned Values (Ogive)";
proc sgplot data=Ogive;
   vline OgiveX / response=cumProp markers;
   /* series x=_minpt_ y=cumProp / markers; */
   xaxis grid label="Miles per Gallon (City)";
   yaxis grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
run;

You can use the graph to estimate the percentiles of the data. For example:

- The 20th percentile is approximately 17 because the curve appears to pass through the point (17, 0.20). In other words, about 20% of the vehicles get 17 mpg or less.
- The 50th percentile is approximately 19 because the curve appears to pass through the point (19, 0.50).
- The 90th percentile is approximately 27 because the curve appears to pass through the point (27, 0.90). Only 10% of the vehicles have a fuel efficiency greater than 27 mpg.

You might wonder how well the ogive approximates the empirical CDF. The following graph overlays the ogive and the ECDF for these data. You can see that the two curves agree closely at the ogive values, shown by the markers. However, there is some deviation because the ogive assumes a linear accumulation (a uniform distribution) of data values within each histogram bin. Nevertheless, this coarse piecewise linear curve that is based on binned data does a good job of showing the basic shape of the empirical cumulative distribution.

tags: Data Analysis

The post Create an ogive in SAS appeared first on The DO Loop.


The generalized Gaussian distribution
has a standardized probability density of the form f(*x*) = B exp( -|A*x*|^{α} ), where A(α) and B(α) are known functions of the exponent parameter α > 0.
When 0 < α < 2, the generalized Gaussian distribution (GGD) has tails that are heavier than the normal distribution's, yet all of its moments are finite. The distribution has applications in finance and signal processing.

The following SAS statements evaluate the GGD density function for four values of the shape parameter α. The SGPLOT procedure graphs the density functions (adapted from Nardon and Pianca (2009)):

data GenGauss;
mu=0;  sigma=1;               /* location=mu and scale=sigma */
do alpha = 0.5, 1, 2, 5;
   A = sqrt( Gamma(3/alpha) / Gamma(1/alpha) ) / sigma;   /* A(alpha, sigma) */
   B = (alpha/2) / Gamma(1/alpha) * A;   /* B(alpha, sigma) = normalizing constant */
   do x = -3 to 3 by 0.05;
      pdf = B*exp(-(A*abs(x-mu))**alpha);
      output;
   end;
end;
run;

title "Generalized Gaussian Densities";
title2 "(*ESC*){unicode mu}=0; (*ESC*){unicode sigma}=1";
proc sgplot data=GenGauss;
   series x=x y=pdf / group=alpha;
   yaxis grid label="Density of Generalized Gaussian";
   keylegend / across=1 opaque location=inside position=right;
run;
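It is reassuring to verify that the constants A(α) and B(α) make each density integrate to 1. The following Python/SciPy check (illustrative only; it merely re-implements the same density formula outside SAS) integrates the density numerically for the four shape parameters, using the symmetry of the density about 0:

```python
from math import gamma, sqrt, exp
from scipy.integrate import quad

def ggd_pdf(x, alpha, mu=0.0, sigma=1.0):
    """Standardized GGD density: f(x) = B*exp(-|A*(x-mu)|**alpha)."""
    A = sqrt(gamma(3/alpha) / gamma(1/alpha)) / sigma
    B = (alpha/2) / gamma(1/alpha) * A
    return B * exp(-abs(A*(x - mu))**alpha)

# integrate over [0, inf) and double (the density is symmetric about 0)
totals = [2*quad(ggd_pdf, 0, float("inf"), args=(a,))[0] for a in (0.5, 1, 2, 5)]
print(totals)    # each total should be (close to) 1.0
```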

The density for α=1/2 is sharply peaked and has heavy tails. The distribution for α=1 is the Laplace distribution, which also has a sharp peak at the mean. The distribution for α=2 is the standard normal distribution. For α=5, the distribution is platykurtic: it has a broad flat central region and thin tails.

This article uses only the standardized distribution that has zero mean and unit variance. However, the functions are all written to support a location parameter μ and a scale parameter σ.

Nardon and Pianca (2009) describe an algorithm for simulating random variates from the generalized Gaussian distribution: simulate from a gamma distribution, raise that variate to a power, and then randomly multiply by ±1. You can implement the simulation in the SAS DATA step or in the SAS/IML language. The following SAS/IML program defines a function for generating random GGD values:

proc iml;
start Rand_GenGauss(N, mu=0, sigma=1, alpha=0.5);
   n1 = N[1];  n2 = 1;                       /* default dimensions for returned matrix */
   if nrow(N)>1 | ncol(N)>1 then n2 = N[2];  /* support n1 x n2 matrices */
   x = j(n1,n2);  sgn = j(n1,n2);
   /* simulate for mu=0 and sigma=1, then scale and translate */
   A = sqrt( Gamma(3/alpha) / Gamma(1/alpha) );
   b = A##alpha;
   call randgen(x, "Gamma", 1/alpha, 1/b);   /* X ~ Gamma(1/alpha, 1/b) */
   call randgen(sgn, "Bernoulli", 0.5);
   sgn = -1 + 2*sgn;                         /* random {-1, 1} */
   return mu + sigma * sgn # x##(1/alpha);
finish;

/* try it out: simulate 10,000 random variates */
call randseed(12345);
x = Rand_GenGauss(10000, 0, 1, 1/2);         /* X ~ GGD(mu=0, sigma=1, alpha=1/2) */
title "Generalized Gaussian Distribution";
title2 "mu=0, sigma=1, alpha=1/2";
call histogram(x);
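The algorithm (a gamma variate, a power transformation, and a random sign) is easy to reproduce in other languages. This Python/NumPy sketch (illustrative only; it mirrors the logic of the SAS/IML function but is not part of the original program) simulates a large sample and checks that it has approximately zero mean and unit variance, as the standardized GGD requires:

```python
import math
import numpy as np

def rand_gen_gauss(n, mu=0.0, sigma=1.0, alpha=0.5, rng=None):
    """Simulate n variates from GGD(mu, sigma, alpha): draw from
    Gamma(1/alpha, 1/b), raise to the power 1/alpha, apply a random sign."""
    rng = rng if rng is not None else np.random.default_rng()
    A = math.sqrt(math.gamma(3/alpha) / math.gamma(1/alpha))
    b = A**alpha
    g = rng.gamma(shape=1/alpha, scale=1/b, size=n)  # Gamma(1/alpha, 1/b)
    sgn = rng.choice([-1.0, 1.0], size=n)            # random {-1, 1}
    return mu + sigma * sgn * g**(1/alpha)

x = rand_gen_gauss(200_000, alpha=0.5, rng=np.random.default_rng(12345))
print(x.mean(), x.var())    # approximately 0 and 1
```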

For α=1/2, most of the simulated values are near 0, which is the mean and mode. The bulk of the remaining values are in the interval [-5,5]. However, a very small percentage (about 0.5%) are more extreme than these values. For this sample, the range of the simulated data is about [-14 , 11], so you can see that this distribution is useful for simulations in which the magnitude of errors can be more extreme than normal errors.

The cumulative distribution function for the generalized Gaussian distribution does not have a closed-form solution in terms of elementary functions. However, there are a few values of α for which the equations simplify, including α=1/2. Chapeau-Blondeau and Monir (2002) note that the generalized Gaussian distribution with α=1/2 is especially useful in practice because its quantile function (inverse CDF) can be explicitly expressed in terms of the Lambert W function.

The CDF for the GGD with α=1/2 can be computed separately for *x* ≤ 0 and for *x* > 0, as shown below:

/* CDF for generalized Gaussian distribution with alpha=1/2 */
start CDF_GenGaussHalf(x);
   y = sqrt(abs(2*sqrt(30)*x));
   cdf = 0.5*(1+y)#exp(-y);     /* CDF for x <= 0 */
   idx = loc(x>0);              /* if x > 0, reflect: CDF(x) = 1 - CDF(-x) */
   if ncol(idx)>0 then do;
      cdf[idx] = 1 - cdf[idx];
   end;
   return cdf;
finish;

p = CDF_GenGaussHalf(-5);       /* P(X < -5) for X ~ GGD(0,1, alpha=1/2) */
print p;                        /* 0.0025654 */

You can use the CDF function to evaluate the probability that a random GGD observation is less than -5. The answer is 0.0026. By symmetry, the probability that a random variate is outside of [-5,5] is 0.0052, which agrees with the empirical proportion in the random sample in the previous section.
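For a quick independent check of the closed-form CDF, you can evaluate the same formula in Python (illustrative only; this duplicates the SAS/IML logic for the standardized case):

```python
from math import sqrt, exp

def cdf_gen_gauss_half(x):
    """CDF of the standardized GGD with alpha=1/2 (mu=0, sigma=1).
    For x <= 0, F(x) = 0.5*(1+y)*exp(-y), where y = sqrt(|2*sqrt(30)*x|);
    for x > 0, reflect: F(x) = 1 - F(-x)."""
    y = sqrt(abs(2*sqrt(30)*x))
    F = 0.5*(1 + y)*exp(-y)
    return 1 - F if x > 0 else F

print(cdf_gen_gauss_half(-5.0))    # about 0.00257, matching the SAS output
```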

The inverse CDF for the GGD with α=1/2 can be implemented by calling the Lambert W function. (Technically, the lower branch of the Lambert W function.) You can download functions that implement the Lambert W function in SAS/IML. The following SAS/IML statements assume that you have downloaded the Lambert W functions and stored the modules, so that they are available by using the LOAD statement. Just as the CDF was defined in two parts, the inverse CDF is defined in two parts.

/* Use the Lambert W function to compute the quantile function (inverse CDF)
   for the generalized Gaussian distribution with alpha=1/2 */
load module=(LambertWm);          /* load function for the Lambert W function */
start Quantile_GenGaussHalf(x);
   q = j(nrow(x), ncol(x), 0);
   idx = loc(x<0.5);              /* invert CDF for x < 0.5 */
   if ncol(idx)>0 then do;
      y = -2 * x[idx] / exp(1);
      q[idx] = -1/(2*sqrt(30)) * (1 + LambertWm(y))##2;
   end;
   if ncol(idx)=nrow(x)*ncol(x) then return q;   /* all found */
   idx = loc(x=0.5);              /* quantile for x=0.5 */
   if ncol(idx)>0 then q[idx] = 0;
   idx = loc(x>0.5);              /* invert CDF for x > 0.5 */
   if ncol(idx)>0 then do;
      y = -2 * (1-x[idx]) / exp(1);
      q[idx] = 1/(2*sqrt(30)) * (1 + LambertWm(y))##2;
   end;
   return q;
finish;

q = Quantile_GenGaussHalf(0.975); /* find q such that P(X < q) = 0.975 */
print q;                          /* 2.0543476 */

The output (not shown) is q = 2.05, which means that the interval [-2.05, 2.05] contains 95% of the probability for the standardized GGD when α=1/2. Compare this interval with the familiar interval [-1.96, 1.96], which contains 95% of the probability for the standard normal distribution.
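SciPy exposes both branches of the Lambert W function, so the quantile formula can be cross-checked outside SAS. The sketch below (illustrative only; `scipy.special.lambertw` with `k=-1` selects the lower branch) reproduces the 0.975 quantile:

```python
import numpy as np
from scipy.special import lambertw

def quantile_gen_gauss_half(p):
    """Quantile (inverse CDF) of the standardized GGD with alpha=1/2,
    computed from the lower branch W_{-1} of the Lambert W function."""
    if p == 0.5:
        return 0.0
    tail = p if p < 0.5 else 1 - p            # exploit symmetry about 0
    w = lambertw(-2*tail/np.e, k=-1).real     # lower branch; result is real
    q = (1 + w)**2 / (2*np.sqrt(30))
    return -q if p < 0.5 else q

print(quantile_gen_gauss_half(0.975))         # about 2.0543
```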

To learn more about the generalized Gaussian distribution, with an emphasis on the parameter value α=1/2, see the following well-written papers, which expand on the discussion in this blog post:

- F. Chapeau-Blondeau and A. Monir (2002), "Numerical Evaluation of the Lambert W Function and Application to Generation of Generalized Gaussian Noise With Exponent 1/2," *IEEE Transactions on Signal Processing*.
- Nardon and Pianca (2009), "Simulation techniques for generalized Gaussian densities," *Journal of Statistical Computation and Simulation*. (A 2006 preprint is available for download.)

tags: Simulation

The post Simulate data from a generalized Gaussian distribution appeared first on The DO Loop.
