The post Five reasons to check out the new SAS analytical documentation appeared first on The DO Loop.
The SAS analytical documentation has a new look.
Beginning with the 14.2 release of the SAS analytical products (which shipped with SAS 9.4m4 in November 2016), the HTML version of the online documentation has moved to
a new framework called the Help Center. The URL for the online documentation is easy to remember:
http://support.sas.com/documentation/
This article shows the 14.2 documentation for the SAS analytical products, as highlighted in the adjacent image. Documentation for previous releases is also available.
The 14.2 link takes you to a new page that contains links for the User's Guides for each SAS analytical product, such as SAS/STAT, SAS/ETS, SAS/IML, and so on. When you click on a User's Guide, you are taken to the new Help Center.
An example page for the SAS/STAT documentation is shown in the following image. As in previous versions of the help system, the Help Center provides drop-down lists (Overview, Getting Started, Syntax, etc.) for quick navigation within a procedure. There are also arrows (now in the upper right corner) that take you to the previous or next page in the book.
The following list describes five features of the Help Center that are either new or that extend features of the older HTML format. The locations of these features are highlighted with red rectangles in the previous image.
In summary, the new Help Center framework provides additional ways for SAS customers to learn about the syntax, options, and output of SAS analytical procedures. At the present time, only analytical products use the Help Center. The documentation for Base SAS continues to be provided in HTML and PDF formats.
Check out the SAS analytical products 14.2 documentation and let me know what you think. Do you like something that I didn't mention? Post a comment.
The post Solve mixed integer linear programming problems in SAS appeared first on The DO Loop.
Last month I discussed how to solve linear programming (LP) problems in SAS by using PROC OPTMODEL in SAS/OR or by using SAS/IML software. You can use these same tools to solve mixed integer linear programming problems. The OPTMODEL procedure has the simpler syntax. The MILPSOLVE function in SAS/IML software provides similar functionality in an interactive matrix language.
The previous article solved a two-variable LP problem. If you constrain one of the variables to be an integer, you get a MILP problem, as follows:
The graph shows the feasible region. The interior of the polygon satisfies the constraints. Because x1 must be an integer, the vertical stripes indicate the feasible values of the variables. The color of the stripes indicates the value of the objective function. The green star indicates the optimal solution, which is x = {5, 3.1}.
The following statements show one way to formulate and solve the MILP problem by using the OPTMODEL procedure in SAS/OR software:
proc optmodel;
var x1 integer >= 0;                /* information about the variables */
var x2 >= 0;
max z = 3*x1 + 5*x2;                /* define the objective function */
con c1: 3*x1 + -2*x2 <= 10;         /* specify linear constraints */
con c2: 5*x1 + 10*x2 <= 56;
con c3: 4*x1 +  2*x2 >= 7;
solve with milp;                    /* solve the MILP problem */
print x1 x2;
quit;
The OPTMODEL procedure prints two tables. The first (not shown) describes the optimization algorithm and tells you that the optimal objective value is 30.5. The second table is the solution vector, which is x = {5, 3.1}.
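Because this problem has only one integer variable, you can double-check the reported solution without an optimizer. The following DATA step is a minimal sketch (the data set and variable names are illustrative, not part of the original program): for each candidate integer value of x1 it computes the interval of feasible x2 values implied by the constraints, takes the largest feasible x2 (the objective increases with x2), and evaluates the objective.

```sas
data Enumerate;
   do x1 = 0 to 11;                              /* c2 with x2>=0 implies x1 <= 11.2 */
      x2Upper = (56 - 5*x1) / 10;                /* upper bound on x2 from c2 */
      x2Lower = max(0, (3*x1 - 10)/2,            /* lower bounds from x2>=0, c1, */
                       (7 - 4*x1)/2);            /*    and c3                    */
      if x2Lower <= x2Upper then do;             /* skip infeasible values of x1 */
         x2 = x2Upper;                           /* objective increases with x2 */
         z = 3*x1 + 5*x2;
         output;
      end;
   end;
   keep x1 x2 z;
run;

proc sort data=Enumerate;
   by descending z;                              /* best candidate first */
run;

proc print data=Enumerate(obs=3) noobs;
run;
```

The first row of the sorted table should reproduce the solution x1=5, x2=3.1 with objective value 30.5. This brute-force check works only because the problem is tiny; it is not a substitute for the MILP solver.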
The MILPSOLVE subroutine in the SAS/IML language uses matrices and vectors to specify the problem. The syntax for the MILPSOLVE subroutine is almost identical to the syntax for the LPSOLVE subroutine, so see the previous article for an explanation of each argument. The following SAS/IML program defines and solves the MILP problem:
proc iml;
/* information about variables (row or column, doesn't matter) */
colType = {I, C};            /* C, B, or I for cont, binary, int */
LowerB = {0, 0};             /* lower bound constraints on x */
UpperB = {10,10};            /* upper bound constraints on x */
/* objective function */
c = {3 5};                   /* vector for objective function c*x */
/* linear constraints */
A = {3 -2,                   /* matrix of constraint coefficients */
     5 10,
     4  2};
b = {10,                     /* RHS of constraint eqns (column vector) */
     56,
      7};
/* specify symbols for constraints:
   'L' for less than or equal
   'E' for equal
   'G' for greater than or equal */
LEG = {L, L, G};
/* control vector for optimization */
ctrl = {-1,                  /* maximize objective */
         1};                 /* print level */
CALL MILPSOLVE(rc, objVal, result, relgap, /* output variables */
               c, A, b,      /* objective and linear constraints */
               ctrl,         /* control vector */
               coltype, LEG, /*range*/, LowerB, UpperB);
print rc objVal, result[rowname={x1 x2}];
In the call to MILPSOLVE, the first four arguments are output arguments. The return code (rc) is 0, which indicates that an optimal value was found. The value of the objective function at the optimal value is returned in objVal. The optimal values of the variables are returned in the result argument.
The MILPSOLVE subroutine was introduced in SAS/IML 13.1, which was shipped with SAS 9.4m1. For additional details about the MILPSOLVE subroutine, see the documentation.
The post PUT it there! Six tips for using PUT and %PUT statements in SAS appeared first on The DO Loop.
The PUT statement supports a "named output" syntax that enables you to easily display a variable name and value. The trick is to put an equal sign immediately after the name of a variable: PUT varname=; For example, the following statement displays the text "z=" followed by the value of z:
data _null_;
x = 9.1; y = 6; z = sqrt(x**2 + y**2);
put z=;                      /* display variable and value */
run;

z=10.9
You can extend the previous tip to arrays and to sets of variables. The PUT statement enables you to display elements of an array (or multiple variables) by specifying the array name in parentheses, followed by an equal sign in parentheses, as follows:
data _null_;
array x[5];
do k = 1 to dim(x);
   x[k] = k**2;
end;
put (x[*]) (=);              /* put each element of array */
put (x1 x3 x5) (=);          /* put each variable/value */
run;

x1=1 x2=4 x3=9 x4=16 x5=25
x1=1 x3=9 x5=25
This syntax is not supported for _TEMPORARY_ arrays. However, as a workaround, you can use the CATQ function to concatenate array values into a character variable, as follows:
temp = catq('d', ',', of x[*]);   /* x can be _TEMPORARY_ array */
put temp=;
Incidentally, if you ever want to apply a format to the values, the format name goes inside the second set of parentheses, after the equal sign: put (x1 x3 x5) (=6.2);
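To see the format tip in context, here is a small self-contained sketch that reuses the array from the previous example and applies the 6.2 format inside the second set of parentheses. The choice of format is only for illustration.

```sas
data _null_;
   array x[5];                  /* creates variables x1-x5 */
   do k = 1 to dim(x);
      x[k] = k**2;
   end;
   put (x1 x3 x5) (=);          /* named output, default formatting */
   put (x1 x3 x5) (=6.2);       /* named output, each value uses the 6.2 format */
run;
```

Comparing the two lines in the log shows how the format changes the displayed values while the "name=" prefix stays the same.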
The previous tip displayed all values on a single line. Sometimes it is useful to display each value on its own line. To do that, put a slash after the equal sign, as follows:
...
put (x[*]) (=/);             /* put each element on separate lines */
...

x1=1
x2=4
x3=9
x4=16
x5=25
You can display all values of all variables by using the _ALL_ keyword, as follows:
data _null_;
x = 9.1; y = 6; z = sqrt(x**2 + y**2);
A = "SAS"; B = "Statistics";
put _ALL_;                   /* display all variables and values */
run;

x=9.1 y=6 z=10.9 A=SAS B=Statistics _ERROR_=0 _N_=1
Notice that in addition to the user-defined variables, the _ALL_ keyword also prints the values of two automatic variables named _ERROR_ and _N_.
Just as the PUT statement displays the value of an ordinary variable, you can use the %PUT statement to display the value of a macro variable. If you use the special "&=" syntax, SAS will display the name and value of a macro variable. For example, to display your SAS version, you can display the value of the SYSVLONG automatic system macro variable, as follows:
%put &=SYSVLONG;
SYSVLONG=9.04.01M4P110916
The results above are for my system, which is running SAS 9.4M4. Your SAS version might be different.
You can display the name and value of all user-defined macro variables by using the _USER_ keyword. You can display the values of all SAS automatic macro variables by using the _AUTOMATIC_ keyword.
%let N = 50;
%let NumSamples = 1e4;
%put _USER_;

GLOBAL N 50
GLOBAL NUMSAMPLES 1e4
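The two %PUT tips combine naturally. The following sketch, which assumes the same two macro variables as above, shows the "&=" syntax applied to several macro variables in one statement, followed by the keyword forms for listing whole groups of macro variables.

```sas
%let N = 50;
%let NumSamples = 1e4;

%put &=N &=NumSamples;     /* name=value pairs for selected macro variables */
%put _USER_;               /* all user-defined macro variables */
%put _AUTOMATIC_;          /* all automatic macro variables (a long list) */
```

The "&=" form is convenient when you care about one or two values; the keyword forms are better for a quick inventory while debugging.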
There you have it: six tips to make it easier to display the value of SAS variables and macro variables. Thanks to Jiangtang Hu who pointed out the %PUT &=var syntax in his blog in 2012. For additional features of the PUT and %PUT statements, see:
The post Ten posts from 2016 that deserve a second look appeared first on The DO Loop.
I've grouped the articles into three categories: statistical graphics and visualization, statistical computations, and matrix computations. If you are a SAS statistical programmer, these articles deserve a second look.
SAS ODS graphics provides an easy way to create standard graphs for data analysis. The graphs in this list are more sophisticated:
These articles show helpful statistical techniques that you should know about:
The SAS DATA step is awesome. For many programming tasks, it is an efficient and effective tool. However, advanced analytical algorithms and multivariate statistics often require matrix-vector computations, which means programming in the SAS/IML language.
There you have it, 10 articles from The DO Loop in 2016 that I think are worth a second look. Did I omit your favorite article? Leave a comment.
The post ODS OUTPUT: Store any statistic created by any SAS procedure appeared first on The DO Loop.
The preceding paragraph oversimplifies the SAS Output Delivery System (ODS), but the truth is that ODS is a powerful feature of SAS. You can use ODS to send SAS tables and graphics to various output destinations, including HTML, PDF, RTF, and PowerPoint. You can control the style and attributes of the output, thus creating a customized report. There have been hundreds of papers and books written about ODS. A very basic introduction is Olinger (2000) "ODS for Dummies."
To a statistical programmer the most useful destination is the OUTPUT destination. The OUTPUT destination sends a table or graph to a SAS data set. Consequently, you can programmatically access each element of the output.
The implications of the previous statement are monumental. I cannot overstate the importance of the OUTPUT destination, so let me say it again:
The ODS OUTPUT destination enables you to store any value that is produced by any SAS procedure. You can then read that value by using a SAS program.
The ODS OUTPUT destination answers a common question that is asked by new programmers on SAS discussion forums: "How can I get a statistic into a data set or into a macro variable?" The steps are as follows:
As an example, suppose that you intend to use PROC REG to perform a linear regression, and you want to capture the R-square value in a SAS data set. The documentation for the procedure lists all ODS tables that the procedure can create, or you can use the ODS TRACE ON statement to display the table names that are produced by PROC REG. The data are the 428 vehicles in the Sashelp.Cars data set, which is distributed with SAS:
ods trace on;                /* write ODS table names to log */
proc reg data=Sashelp.Cars plots=none;
   model Horsepower = EngineSize Weight;
quit;
ods trace off;               /* stop writing to log */

Output Added:
-------------
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.Horsepower.FitStatistics
-------------
By looking at the output, you can see that the third table contains the R-square value. By looking at the SAS log, you can see that the name of the third table is "FitStatistics."
Now that you know the name of the ODS table is "FitStatistics," use the ODS OUTPUT destination to write that table to a SAS data set, as follows:
ods output FitStatistics=Output;        /* the data set name is 'Output' */
proc reg data=Sashelp.Cars plots=none;  /* same procedure call */
   model Horsepower = EngineSize Weight;
quit;

proc print data=Output noobs;
run;
The output from PROC PRINT shows the structure of the output data set. Notice that the data set often looks different from the original displayed table. The data set contains non-printing columns (like Model and Dependent) that do not appear in the displayed table. The data set also contains columns that contain the raw numerical values and the (formatted) character values of the statistics. The columns cValue1 and nValue1 represent the same information, except that the cValue1 is a character column whereas nValue1 is a numerical column. The same applies to the cValue2 and nValue2 columns. The character values might contain formatted or rounded values.
From the previous PROC PRINT output, you can see that the numerical value of the R-square statistic is in the first row and is in the nValue2 column. You can therefore read and process that value by testing the value of the Label2 column in a DATA step. For example, the following statements use an IF-THEN statement and the SYMPUTX subroutine to create a macro variable that contains the value of the R-square statistic:
data _null_;
set Output;
if Label2="R-Square" then call symputx("RSq", nValue2);
run;

%put RSq = &RSq;

RSq = 0.6201360929
The SAS log shows that the R-square value is now contained in the RSq macro variable.
Storing the statistic in a macro variable is only one way to use the data set. You could also read the statistics into PROC IML or PROC SQL for further computation, or show the value of the statistic in a graph.
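As one illustration of the PROC SQL route, the following sketch captures the FitStatistics table and then uses the SELECT ... INTO clause to copy the R-square value into a macro variable. The data set name FitStats and the macro variable name RSq2 are arbitrary choices for this example; the Label2 and nValue2 columns are the ones described above.

```sas
ods output FitStatistics=FitStats;      /* write the ODS table to a data set */
proc reg data=Sashelp.Cars plots=none;
   model Horsepower = EngineSize Weight;
quit;

proc sql noprint;
   select nValue2 into :RSq2 trimmed    /* store numeric value in a macro variable */
   from FitStats
   where Label2 = "R-Square";           /* row that contains the R-square statistic */
quit;

%put &=RSq2;
```

The DATA step and PROC SQL approaches read the same cell of the same table, so which one you use is largely a matter of taste.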
The previous sections show how to save a single table to a SAS data set. It is just as easy to create a data set that contains multiple statistics, one for each level in a BY-group analysis.
Suppose that you want to run several regressions, one for each value of the Origin variable, which has the values "Asia," "Europe," and "USA." The following call to PROC SORT sorts the data by the Origin variable. The sorted data is stored in the CARS data set.
proc sort data=Sashelp.Cars out=Cars;
   by Origin;
run;
You can then specify Origin on the BY statement in PROC REG to carry out three regression analyses. When you run a BY-group analysis, you might not want to see all of the results displayed on the computer screen, especially if your goal is to save the results in an output data set. You can use the ODS EXCLUDE statement to suppress SAS output.
ods exclude all;                    /* suppress tables to screen */
ods output FitStatistics=Output2;   /* 'Output2' contains results for each BY group */
proc reg data=Cars plots=none;
   by Origin;
   model Horsepower = EngineSize Weight;
quit;
ods exclude none;                   /* no longer suppress tables */

proc print data=Output2 noobs;
   where Label2="R-Square";
   var Origin Label2 nValue2;
run;
The output from PROC PRINT shows the R-square statistics for each model. Notice that the BY-group variables (in this case, Origin) are added to output data sets when you run a BY-group analysis. You can now use the statistics in programs or graphs.
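For example, one simple way to use the BY-group statistics in a graph is a bar chart of the R-square value for each level of Origin. The sketch below repeats the steps above so that it runs on its own; the data set name FitByOrigin and the axis settings are illustrative choices.

```sas
proc sort data=Sashelp.Cars out=Cars;
   by Origin;
run;

ods exclude all;                         /* suppress displayed tables */
ods output FitStatistics=FitByOrigin;    /* one FitStatistics table per BY group */
proc reg data=Cars plots=none;
   by Origin;
   model Horsepower = EngineSize Weight;
quit;
ods exclude none;

title "R-Square for Each Origin";
proc sgplot data=FitByOrigin(where=(Label2="R-Square"));
   vbar Origin / response=nValue2;       /* one bar per BY group */
   yaxis label="R-Square" min=0 max=1;
run;
title;
```

Because the WHERE= data set option keeps exactly one row per BY group, each bar shows a single R-square value rather than a sum over several rows.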
Some procedures provide an alternative option for creating an output data set that contains statistics. Always check the SAS documentation to see if the procedure provides an option that writes common statistics to an output data set. For example, the documentation for the PROC REG statement states that you can use the OUTEST= option with the RSQUARE option to obtain an output data set that contains the parameter estimates and other model statistics such as the R-square value. Thus for this example, you do not need to use the ODS OUTPUT statement to direct the FitStatistics table to a data set. Instead, you can obtain the statistic as follows:
proc reg data=Cars NOPRINT outest=Output3 RSQUARE; /* statistics in 'Output3' */
   by Origin;
   model Horsepower = EngineSize Weight;
quit;

proc print data=Output3 noobs;
   format _RSQ_ 8.6;
   var Origin _RSQ_;
run;
In summary, the ODS OUTPUT statement enables you to create a data set that contains any statistic that is produced by a SAS procedure. You can use the ODS OUTPUT statement to capture a statistic and use it later in your program.
The post Is "La Quinta" Spanish for "Next to Denny's"? appeared first on The DO Loop.
Mitch Hedberg's joke resonates with travelers who drive on the US interstate system because many highway exits feature both a La Quinta Inn™ and a Denny's® restaurant within a short distance of each other. But does a statistical data analysis support this anecdotal evidence?
In 2014 John Reiser wrote a blog post that uses the Python language to scrape the web for the locations of La Quinta Inns and Denny's restaurants. He then analyzed the data to show that, yes, in general, a guest at a La Quinta Inn does not have far to travel if he wants to eat at a Denny's restaurant. This work inspired Colin Rundel and Mine Çetinkaya-Rundel to assign this analysis as a project for their students at Duke University and to write an article in CHANCE magazine (29(2), 2016) about the assignment.
Rundel and Çetinkaya-Rundel posted CSV files on the CHANCE web site that contain the longitude, latitude, and addresses of 851 La Quinta Inns and 1,634 Denny's restaurants in the contiguous US (as of Dec 2015). This article follows their presentation and shows how to analyze the La Quinta-Denny's spatial data in SAS. The analysis is a straightforward extension of the nearest-neighbor techniques shown in the article "Distances between observations in two groups." You can download the SAS program that imports the data and creates all the graphs and tables in this article.
You can use PROC HTTP in SAS to read CSV files directly from a URL. After importing the locations of La Quinta Inns and Denny's restaurants from the CHANCE web site, you can use PROC SGPLOT to plot the (unprojected) locations as longitudes and latitudes. To help visualize the locations, the adjacent graph overlays the data on an outline of the lower-48 states in the US. (Click the graph to enlarge.)
In the graph, the La Quinta Inns are represented by blue circles and the Denny's restaurants by red crosses. In the enlarged version you can see that many circles enclose a red cross, which indicates that the inn and restaurant are very close. On the other hand, there are a few La Quinta locations that seem to be far away from any Denny's. Montana, North Dakota, Louisiana, Kansas, Nevada, and southwest Texas are some geographic regions in which a blue circle is not close to a red cross.
Imagine that a husband and wife are spending their retirement by crisscrossing the US. The wife wants to sleep each night at a La Quinta Inn. The husband wants to eat breakfast each morning at a Denny's restaurant. If they both get their wishes, how far will they need to travel to get breakfast each morning?
I've previously written about how to compute the distance to the nearest neighbor between observations that are in different groups. For these data, the La Quinta Inns form one group and the Denny's restaurants form a second group. Because the coordinates for these data are longitude and latitude, you need to use the GEODIST function in Base SAS to compute the distance (in kilometers, as the crow flies) between each hotel and the nearest Denny's.
For each of the 851 hotels, you can find the distance to the nearest Denny's. The following table summarizes the distribution of distances, in kilometers:
The table shows that the closest Denny's is a mere 11 meters from the adjacent La Quinta Inn! For 25% of the inns, the nearest Denny's is within 1.3 km, which is a short walk. Fifty percent of the inns are within 5 km (an easy drive) of a Denny's, and 75% are within 17 km. It seems that the data supports a modified version of Hedberg's joke: “La Quinta” is Spanish for “often close to a Denny’s.”
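The nearest-neighbor computation can be sketched in a few steps. The downloadable program uses its own approach; the following is an alternative sketch in PROC SQL that forms all hotel-restaurant pairs, computes each distance with GEODIST, and keeps the closest restaurant for each hotel. The data set names, variable names, and coordinates are hypothetical toy values, not the CHANCE data.

```sas
/* Hypothetical toy data: the IDs and coordinates are made up for illustration */
data Hotels;
   input HotelID $ Lat Lon;
datalines;
H1 35.78 -78.64
H2 36.07 -79.79
H3 47.11 -104.71
;
run;

data Restaurants;
   input RestID $ Lat Lon;
datalines;
R1 35.80 -78.60
R2 36.10 -80.24
R3 46.81 -100.78
;
run;

proc sql;
   /* distance in kilometers (K) between every pair; coordinates in degrees (D) */
   create table Pairs as
   select h.HotelID, r.RestID,
          geodist(h.Lat, h.Lon, r.Lat, r.Lon, 'DK') as DistKm
   from Hotels as h, Restaurants as r;

   /* for each hotel, keep the pair that has the smallest distance */
   create table Nearest as
   select HotelID, RestID, DistKm
   from Pairs
   group by HotelID
   having DistKm = min(DistKm);
quit;

/* summarize the distribution of nearest-neighbor distances */
proc means data=Nearest min q1 median q3 max;
   var DistKm;
run;
```

Forming all pairs is fine for 851 hotels and 1,634 restaurants (about 1.4 million pairs), but it scales as the product of the group sizes, so very large data sets would call for a more economical approach.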
The following histogram shows the distribution of the 851 distances from each La Quinta Inn to the nearest Denny's. The histogram shows that about 65% of the inns are within 10 km and about 77% are within 20 km. As long as the husband and wife stay at these inns, they will both be happy, well-rested, and well-fed!
Clearly the couple can happily sleep at a La Quinta Inn and eat at a Denny's provided that they avoid the few La Quinta Inns that are not located near a Denny's. The scatter plot and map near the top of this article give some indication of where these inns are located, but we can identify these inns more precisely. For the sake of the couple's marital bliss, let's enumerate the inns that are farthest from a Denny's.
The adjacent table shows the 10 La Quinta Inns that are farthest from a Denny's restaurant. The most remote is the La Quinta Inn in Glendive, Montana, which is 282 km from the nearest Denny's. La Quinta Inns in Kansas, Nevada, Texas, and Louisiana are also more than 150 km away from a Denny's.
If possible, the couple should avoid these inns, but what if their travels bring them through these cities? In that case, they need to know the distance and direction to the nearest Denny's. The next graph might be helpful.
The following graph shows all the La Quinta Inns. The inns shown in red are those for which the nearest Denny's is more than 80 km away. For each of these inns, an arrow is drawn from the La Quinta Inn to the location of the nearest Denny's. (I explained this nearest-neighbor plot in a previous article.) This helps the couple know what direction they need to drive in order to reach breakfast (or perhaps lunch!). For some La Quinta locations (MT, SC, TN) the couple will need to drive into an adjacent state.
Mitch Hedberg's joke is funny because it has an element of truth. For most La Quinta Inns, the nearest Denny's restaurant is a short walk or drive away. By using nearest-neighbor computations and the GEODIST function in SAS, you can compute the distances from each inn to the nearest Denny's. You can graph the set of distances and compute statistics such as quantiles. You can visualize the co-locations of the inns and restaurants, and even direct travelers to the nearest Denny's. I agree with Rundel and Çetinkaya-Rundel that this exercise provides a fun activity in data analysis for students.
For professional SAS programmers, the exercise demonstrates how to conduct a particular kind of co-location analysis. Instead of Denny's restaurants, the professional analyst might be interested in the distance to the nearest hospital, distribution center, or cell-phone tower. SAS statistical graphics and SAS/IML provide the tools for analyzing the distance between groups of spatial data.
If you want to examine or extend my analysis of these data, you can download the SAS program.
The post The top 10 posts from The DO Loop in 2016 appeared first on The DO Loop.
Start with a juicy set of data and an interesting question. Mix in some SAS data analysis and a colorful graph that visualizes the data and tells a story. What have you got? A popular blog post! The following posts generated buzz on social media. They also show off the power of SAS analytics and visualization.
Everyone who uses SAS needs to know tricks and techniques for programming in the DATA step or for working with macros. No wonder these articles were so popular!
If you browse SAS discussion forums, you will see many questions about computing moving statistics, creating dummy variables, and running weighted analyses. I wrote some articles about these topics that resonated with readers in 2016:
Did you miss any of these popular posts? Here's your chance to read (or re-read!) one of these top 10 posts from 2016.
Next week: My picks for articles that did not make this list, but deserve a second look.
The post The contaminated normal distribution appeared first on The DO Loop.
The contaminated normal distribution was originally studied by John Tukey in the 1940s and '50s. As I say in my book Simulating Data with SAS (2013, p. 119), "the contaminated normal distribution is a specific instance of a two-component mixture distribution in which both components are normally distributed with a common mean.... This results in a distribution with heavier tails than normality. This is also a convenient way to generate data with outliers."
Specifically,
a contaminated normal distribution is a mixture of two normal distributions with mixing probabilities (1 - α) and α, where typically 0 < α ≤ 0.1.
You can write the density of a contaminated normal distribution in terms of the component densities. Let φ(x; μ, σ) denote the density of the normal distribution with mean μ and standard deviation σ. Then the contaminated normal density is
f(x) = (1 - α)φ(x; μ, σ) + α φ(x; μ, λσ)
where λ > 1 is a parameter that determines the standard deviation of the wider component.
The idea is that the "main" distribution (φ(x; μ, σ)) is slightly "contaminated" by a wider distribution. Tukey (1960) uses λ=3 as a scale multiplier. This article uses α = 0.1, which represents 10% "contamination." Tukey reports that when λ=3 and α=0.1, "the two constituents contribute equal amounts to the variance of the contaminated distribution." In the following sections, μ=0 and σ=1, so that the uncontaminated component is the standard normal distribution.
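Tukey's remark about equal contributions to the variance is easy to verify. For a mixture whose components share a common mean, the variance is the weighted sum of the component variances, so the two contributions are (1 - α)σ² and α(λσ)². The following DATA step is a minimal sketch of that arithmetic for α=0.1, λ=3, and σ=1; the data set and variable names are illustrative.

```sas
data CNVariance;
   alpha = 0.1;  lambda = 3;  sigma = 1;
   varMain    = (1 - alpha) * sigma**2;          /* contribution of the N(mu, sigma) component */
   varContam  = alpha * (lambda*sigma)**2;       /* contribution of the wider component */
   varTotal   = varMain + varContam;             /* variance of the contaminated normal */
run;

proc print data=CNVariance noobs;
run;
```

Each component contributes 0.9, so the contaminated distribution has variance 1.8 even though only 10% of the probability comes from the wider component.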
The following SAS DATA step constructs the density of a contaminated normal distribution as the linear combination of a N(0,1) and a N(0,3) density. The call to the SGPLOT procedure plots the density and the component densities:
%let alpha = 0.1;
%let lambda = 3;
data CNPDF;
label Y1="N(0,1)" Y2="N(0,3)";
do x = -3*&lambda to 3*&lambda by 0.1;
   Y1 = pdf("Normal", x, 0, 1);           /* std normal component */
   Y2 = pdf("Normal", x, 0, 1*&lambda);   /* contamination */
   CN = (1-&alpha)*Y1 + &alpha*Y2;        /* contaminated normal */
   output;
end;
run;

title "Contaminated Normal Distribution";
title2 "alpha = &alpha; lambda = &lambda";
proc sgplot data=CNPDF;
   label CN = "0.9 N(0,1) + 0.1 N(0,3)";
   series x=x y=Y1;
   series x=x y=Y2;
   series x=x y=CN / lineattrs=(thickness=3);
   xaxis grid values=(-9 to 9);
   yaxis grid;
run;
As shown in the graph, the contaminated normal distribution (shown with a thick line) has heavier tails than the "uncontaminated" normal component.
The book Simulating Data with SAS (2013, p. 120) provides an algorithm for simulating data from a contaminated normal distribution. The algorithm is a special case of simulating from a mixture distribution. You iteratively choose a component with probability α and then generate a value from whichever component is chosen, as follows:
%let alpha = 0.1;                    /* level of contamination */
%let lambda = 3;                     /* magnitude of contamination */
%let N = 100;                        /* size of sample */
data CNRand(keep=x contaminate);
call streaminit(12345);
do i = 1 to &N;
   contaminate = rand("Bernoulli", &alpha);
   if contaminate then
      x = rand("Normal", 0, &lambda);
   else
      x = rand("Normal", 0, 1);
   output;
end;
run;

proc sgplot data=CNRand;
   histogram x;
   density x / type=normal(mu=0 sigma=1) name="normal";
   density x / type=kernel name="kernel";
   fringe x / group=contaminate lineattrs=(thickness=2);
   keylegend "kernel" "normal" / location=inside position=topright across=1;
   yaxis offsetmin=0.035;
run;
The histogram shows the distribution of the simulated sample. A kernel density estimate is overlaid, as is the density for the uncontaminated N(0,1) component. A fringe plot (also called a "rug plot") is shown underneath the histogram so that you can see the actual values in the sample.
The data that are generated by the uncontaminated component are shown in one color; the data from the contaminated component are shown in a different color. Whereas the standard normal density rarely produces data values for which |x| > 3, the sample contains four values that exceed 3 in magnitude. As you can see, all four "extreme" values are generated from the contaminated component. However, the contaminated component also generates two values that are not so extreme.
You can easily compute the cumulative distribution (and therefore probabilities) of the contaminated normal (CN) distribution as a linear combination of the component CDFs. For example, if you want to know the probability that a random observation from the CN distribution exceeds 3 in magnitude, you can compute that probability as follows:
data Prob;
x = 3;
leftP = (1-&alpha)*cdf("Normal", -x, 0, 1) + &alpha*cdf("Normal", -x, 0, &lambda);
totalP = 2*leftP;            /* distribution is symmetric */
run;

proc print noobs; run;
The table shows that the probability is 0.034. Almost all of that probability comes from the contaminated component of the distribution.
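A quick Monte Carlo check can confirm the computed probability. The following sketch reuses the simulation algorithm from the previous section with a much larger sample and estimates the proportion of values that exceed 3 in magnitude. The sample size, seed, and data set name are arbitrary choices.

```sas
%let alpha = 0.1;                    /* level of contamination */
%let lambda = 3;                     /* magnitude of contamination */

data CNBig(keep=x extreme);
   call streaminit(54321);
   do i = 1 to 1e6;
      if rand("Bernoulli", &alpha) then
         x = rand("Normal", 0, &lambda);   /* draw from the wider component */
      else
         x = rand("Normal", 0, 1);         /* draw from the N(0,1) component */
      extreme = (abs(x) > 3);              /* indicator variable: 1 if |x| > 3 */
      output;
   end;
run;

proc means data=CNBig mean;
   var extreme;                            /* mean of indicator = estimated probability */
run;
```

With a million draws, the sample proportion should be close to the exact value of 0.034, although the estimate will vary slightly with the seed.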
The quantile function of the CN distribution does not have a closed-form solution. Given a probability P, the quantile of the CN is the value of x for which P = (1-α)CDF(x; μ, σ) + αCDF(x ; μ, λσ). You can use a root-finding method to find the quantile. For example, you can use the FROOT function in SAS/IML, as follows:
proc iml;
start CNQuantile(x) global (alpha, lambda, Prob);
   return Prob - ( (1-alpha)*CDF("Normal", x, 0, 1) +
                      alpha *CDF("Normal", x, 0, lambda) );
finish;

Prob = 0.01708;
alpha = 0.1;
lambda = 3;
x = froot("CNQuantile", {-9 0});   /* find quantile for Prob */
print x;
The computation finds that the quantile for 0.01708 is about -3.
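If you do not have access to SAS/IML, the same root can be found in the DATA step because the CN cumulative distribution is an increasing function of x. The following sketch uses simple bisection on the interval [-9, 0]. This is a swapped-in technique for illustration, not the method that FROOT uses, and the variable names and tolerance are arbitrary.

```sas
data CNQuantile;
   alpha = 0.1;  lambda = 3;  Prob = 0.01708;
   lo = -9;  hi = 0;                       /* CDF(lo) < Prob < CDF(hi) */
   do iter = 1 to 60 while (hi - lo > 1e-10);
      mid = (lo + hi) / 2;
      F = (1-alpha)*cdf("Normal", mid, 0, 1) +
             alpha *cdf("Normal", mid, 0, lambda);   /* CN cumulative distribution at mid */
      if F < Prob then lo = mid;           /* quantile is to the right of mid */
      else hi = mid;                       /* quantile is at or to the left of mid */
   end;
   x = (lo + hi) / 2;                      /* approximate quantile */
   keep Prob x;
run;

proc print data=CNQuantile noobs;
run;
```

Bisection halves the bracketing interval on each iteration, so about 37 iterations shrink an interval of length 9 below the 1e-10 tolerance. The result should agree with the FROOT answer of about -3.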
Not all researchers agree that a contaminated normal distribution is an appropriate model for non-Gaussian data. Gleason (1993, JASA) provides an overview of the history of the CN distribution, discusses how the parameters contribute to the elongation of the tail, and compares the CN with other long-tailed distributions.
The post Solve linear programming problems in SAS appeared first on The DO Loop.
A linear programming problem can always be written in a standard vector form. In the standard form, the unknown variables are nonnegative, which is written in vector form as x ≥ 0. Because the objective function is linear, it can be expressed as the inner product of a known vector of coefficients (c) with the unknown variable vector: c^{T}x. Because the constraints are also linear, the inequality constraints can always be written as Ax ≤ b for a known constraint matrix A and a known vector of values b.
In practice, it is often convenient to be able to specify the problem in a non-standardized form in which some of the constraints represent "greater than," others represent "less than," and others represent equality. Computer software can translate the problem into a standardized form and solve it.
As an example of how to solve a linear programming problem in SAS, let's pose a particular two-variable problem:
The graph shows the polygonal region in the plane that satisfies the constraints for this problem. The region (called the feasible region) is defined by the two axes and the three linear inequalities. The color in the interior of the region indicates the value of the objective function within the feasible region. The green star indicates the optimal solution, which is x = {5.3, 2.95}.
The theory of linear programming says that an optimal solution will always be found at a vertex of the feasible region, which in 2-D is a polygon. In this problem, the feasible polygon has only five vertices, so you could evaluate the objective function at each vertex by hand to find the optimal solution. For high-dimensional problems, however, you definitely want to use a computer!
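For this small problem, the hand computation is easy to carry out. The sketch below lists the five vertices, which were obtained by intersecting pairs of boundary lines (using the constraints c1-c3 as they appear in the PROC OPTMODEL program in the next section), checks that each vertex is feasible, and evaluates the objective function 3*x1 + 5*x2. The data set names, the labels, and the small tolerance for floating-point comparisons are illustrative choices.

```sas
data Vertices;
   length Binding $12;                     /* which two boundaries meet at the vertex */
   Binding = "c3 and x2=0";  x1 = 7/4;   x2 = 0;     output;
   Binding = "c1 and x2=0";  x1 = 10/3;  x2 = 0;     output;
   Binding = "c1 and c2";    x1 = 5.3;   x2 = 2.95;  output;
   Binding = "c2 and x1=0";  x1 = 0;     x2 = 5.6;   output;
   Binding = "c3 and x1=0";  x1 = 0;     x2 = 3.5;   output;
run;

data VertexValues;
   set Vertices;
   eps = 1e-9;                             /* tolerance for floating-point arithmetic */
   feasible = (3*x1 -  2*x2 <= 10 + eps) &
              (5*x1 + 10*x2 <= 56 + eps) &
              (4*x1 +  2*x2 >=  7 - eps);
   z = 3*x1 + 5*x2;                        /* objective function */
   drop eps;
run;

proc sort data=VertexValues;
   by descending z;                        /* best vertex first */
run;

proc print data=VertexValues noobs;
run;
```

The largest objective value, 30.65, occurs at the vertex (5.3, 2.95), which is where the constraints c1 and c2 intersect.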
The OPTMODEL procedure is part of SAS/OR software. It provides a natural programming language with which to define and solve all kinds of optimization problems. The linear programming example in this article is similar to the "Getting Started" example in the PROC OPTMODEL chapter about linear programming. The following statements use syntax that is remarkably similar to the mathematical formulation of the problem:
proc optmodel;
var x{i in 1..2} >= 0;           /* information about the variables */
max z = 3*x[1] + 5*x[2];         /* define the objective function */
con c1: 3*x[1] + -2*x[2] <= 10;  /* specify linear constraints */
con c2: 5*x[1] + 10*x[2] <= 56;
con c3: 4*x[1] + 2*x[2] >= 7;
solve;                           /* solve the LP problem */
print x;
The OPTMODEL procedure prints two tables. The first (not shown) is a table that describes the algorithm that was used to solve the problem. The second table is the solution vector, which is x = {5.3, 2.95}.
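If you want more than the solution vector, the same model can be extended with a few extra statements. The following is a minimal sketch, not part of the original program: it assumes that the EXPAND statement and the .body constraint suffix behave as described in the PROC OPTMODEL documentation for your release.

```sas
proc optmodel;
   var x{i in 1..2} >= 0;
   max z = 3*x[1] + 5*x[2];
   con c1: 3*x[1] - 2*x[2] <= 10;
   con c2: 5*x[1] + 10*x[2] <= 56;
   con c3: 4*x[1] + 2*x[2] >= 7;
   expand;                          /* display the model as OPTMODEL interprets it  */
   solve;
   print x;                         /* solution vector                              */
   print z;                         /* value of the objective at the solution       */
   print c1.body c2.body c3.body;   /* left-hand side of each constraint            */
quit;
```

For the solution x = {5.3, 2.95}, direct arithmetic gives an objective value of 30.65 and constraint bodies of 10, 56, and 27.1. The first two equal their right-hand sides, which indicates that the optimal vertex is the intersection of the c1 and c2 boundary lines.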
For an introduction to using the OPTMODEL procedure to solve linear programming problems, see the 2011 paper by Rob Pratt and Ed Hughes.
Not every SAS customer has a license for SAS/OR software, but hundreds of thousands of people have access to the SAS/IML matrix language. In addition to companies that license SAS/IML software, SAS/IML is part of the free SAS University Edition, which has been downloaded almost one million times by students, teachers, researchers, and self-learners.
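Whereas the syntax in PROC OPTMODEL closely reflects the mathematical formulation, the SAS/IML language uses matrices and vectors to specify the problem. You then pass those matrices as arguments to the LPSOLVE subroutine. The subroutine returns the solution (and other information) in output arguments. As the program in this section shows, you must provide the lower and upper bounds for the variables, the vector of coefficients for the objective function, a control vector that specifies options such as whether to maximize or minimize, the matrix of constraint coefficients, the vector of right-hand-side values for the constraints, and a vector of symbols that specifies the type of each constraint.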
The following SAS/IML program defines and solves the same LP problem as in the previous section. I've added plenty of comments so you can see how the elements in this program compare to the more compact representation in PROC OPTMODEL:
proc iml;
/* information about the variables */
LowerB = {0, 0};    /* lower bound constraints on x */
UpperB = {., .};    /* upper bound constraints on x */
/* define the objective function */
c = {3, 5};         /* vector for objective function c*x */
/* control vector for optimization */
ctrl = {-1,         /* maximize objective */
         1};        /* print some details */
/* specify linear constraints */
A = {3 -2,          /* matrix of constraint coefficients */
     5 10,
     4  2};
b = {10,            /* RHS of constraint eqns (column vector) */
     56,
      7};
LEG = {L, L, G};    /* specify symbols for constraints:
                       'L' for less than or equal
                       'E' for equal
                       'G' for greater than or equal */
/* solve the LP problem */
CALL LPSOLVE(rc, objVal, result, dual, reducost, /* output variables */
             c, A, b,        /* objective and linear constraints */
             ctrl,           /* control vector */
             LEG, /*range*/ , LowerB, UpperB);
print rc objVal, result[rowname={x1 x2}];
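In the call to LPSOLVE, the first five arguments are output arguments. The first three of these are printed: the return code (rc), the value of the objective function at the solution (objVal), and the solution vector (result).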
Although the LPSOLVE subroutine is not as simple to use as PROC OPTMODEL, it obtains the same solution. The LPSOLVE subroutine supports many features that are not mentioned here. For details, see the documentation for LPSOLVE.
The LPSOLVE subroutine was introduced in SAS/IML 13.1, which was shipped with SAS 9.4m1. The LPSOLVE subroutine replaces the older LP subroutine, which is deprecated. SAS 9.3 customers can call the LP subroutine, which works similarly.
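One way to sanity-check a solution, regardless of which solver produced it, is to substitute it back into the constraints. The following sketch hard-codes the solution x = {5.3, 2.95} reported in this article rather than calling LPSOLVE, so it runs on any SAS/IML release.

```sas
proc iml;
x = {5.3, 2.95};        /* solution reported in the article             */
c = {3, 5};             /* objective coefficients                       */
A = {3 -2,              /* constraint coefficients                      */
     5 10,
     4  2};
b = {10, 56, 7};        /* right-hand sides (types: <=, <=, >=)         */

Ax = A * x;             /* left-hand side of each constraint            */
objVal = T(c) * x;      /* objective value at x                         */
print Ax b, objVal;
quit;
```

By direct arithmetic, Ax = {10, 56, 27.1} and the objective value is 30.65. The first two constraints hold with equality and the third is satisfied with room to spare.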
]]>The post Animate snowfall in SAS appeared first on The DO Loop.
]]>Out of the bosom of the Air,
Out of the cloud-folds of her garments shaken,
Over the woodlands brown and bare,
Over the harvest-fields forsaken,
Silent, and soft, and slow
Descends the snow.
"Snow-flakes" by Henry Wadsworth Longfellow
Happy holidays to all my readers! In my last post I showed how to create a well-known fractal called the Koch snowflake. The snowflake is aptly named because it has six-fold symmetry. But as Longfellow noted, a real snowflake is not stationary, but descends "silent, and soft, and slow."
As a gift to my readers, I decided to create an animated greeting card entirely in SAS. The animated GIF uses some of the SAS techniques that I have blogged about in 2016. The falling and rotating snowflakes were created by using matrix computations in the SAS/IML language. The animated GIF was created by using ODS graphics and PROC SGPLOT.
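To give a flavor of the matrix computations involved, here is a minimal SAS/IML sketch of rotating and dropping a shape for one animation frame. It is not the program that created the greeting card: the hexagon outline, the rotation angle, and the drop distance are all hypothetical values chosen for illustration.

```sas
proc iml;
pi = constant('pi');

/* hypothetical snowflake outline: the six vertices of a regular hexagon */
t = T( do(0, 5, 1) ) * pi/3;      /* angles 0, 60, ..., 300 degrees (column) */
P = cos(t) || sin(t);             /* 6 x 2 matrix; each row is a point       */

/* rotation matrix for a counterclockwise rotation by theta */
theta = pi/12;                    /* hypothetical rotation per frame         */
R = (cos(theta) || -sin(theta)) //
    (sin(theta) ||  cos(theta));

Q = P * T(R);                     /* rotate every point (rows are points)    */
Q[,2] = Q[,2] - 0.5;              /* move the shape down by 0.5 units        */
print Q[colname={"x" "y"}];
quit;
```

Repeating the rotation and translation for each frame, and plotting each frame, is the basic idea behind an animation of falling, rotating shapes.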
If you want to learn how I created this animated GIF with SAS, read on. I've blogged about all the essential techniques in previous posts:
You can download the SAS program that creates the greeting card. Let me know if you adapt this program to create other animated images.
If you like SAS statistical programming or want to learn more about it, subscribe to this blog. In most articles I show how to use SAS for serious pursuits like statistical modeling, data analysis, optimization, and more. But programming with SAS can also be fun, and sometimes it takes a less-serious application to make people say, "Wow! That's cool! I didn't know SAS could do that!"
]]>