This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
When I’m at a social gathering, someone always asks what type of work I do. I like to keep my social life separate from my work, therefore I usually give a vague answer such as “software” (and quickly change the topic). How vague or specific is your response? How vague […]
The post What are the most common occupations in each US state? appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
This post was kindly contributed by platformadmin.com - go there to comment and to read the full post. |
This tip details how to go about removing an unwanted Authentication Domain and all associated Login objects from SAS metadata. A need for this can arise when you have been temporarily (or accidentally/unnecessarily) added a second set of inbound logins for all of your SAS users and you decide you no longer need those extra … Continue reading “Metacoda Plug-ins Tip: Removing an Unwanted Auth Domain and Logins”
This post was kindly contributed by platformadmin.com - go there to comment and to read the full post. |
The post Optimization with nonlinear constraints in SAS appeared first on The DO Loop.
]]>This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
This article shows how to perform an optimization in SAS when the parameters are restricted by nonlinear constraints. In particular, it solves an optimization problem where the parameters are constrained to lie in the annular region between two circles. The end of the article shows the path of partial solutions which converges to the solution and verifies that all partial solutions lie within the constraints of the problem.
The term “nonlinear constraints” refers to bounds placed on the parameters. There are four types of constraints in optimization problems. From simplest to most complicated, they are as follows:
This article solves a two-dimensional nonlinearly constrained optimization problem.
The constraint region will be the annular region defined by the two equations
x^{2} + y^{2} ≥ 1
x^{2} + y^{2} ≤ 9
Let f(x,y) be the objective function to be optimized. For this example,
f(x,y) = cos(π r) – Dist( (x,y), (2,0) )
where
r = sqrt(x^{2} + y^{2}) and
Dist( (x,y), (2,0) ) = sqrt( (x-2)^{2} + (y-0)^{2} )
is the distance from (x,y) to the point (2,0).
By inspection, the maximum value of f occurs at (x,y) = (2,0) because the objective function obtains its largest value (1) at that point.
The following SAS statements use a scatter plot to visualize the value of the objective function inside the constraint region. The graph is shown to the right.
/* evaluate the objective function on a regular grid */ data FuncViz; do x = -3 to 3 by 0.05; do y = -3 to 3 by 0.05; r2 = x**2 + y**2; r = sqrt(r2); ObjFunc = cos(constant('pi')*r) - sqrt( (x-2)**2+(y-0)**2 ); if 1 <= r2 <= 9 then output; /* output points in feasible region */ end; end; run; /* draw the boundary of the constraint region */ %macro DrawCircle(R); R = &R; do t = 0 to 2*constant('pi') by 0.05; xx= R*cos(t); yy= R*sin(t); output; end; %mend; data ConstrBoundary; %DrawCircle(1) %DrawCircle(3) run; ods graphics / width=400px height=360px antialias=on antialiasmax=12000 labelmax=12000; title "Objective Function and Constraint Region"; data Viz; set FuncViz ConstrBoundary; run; proc sgplot data=Viz noautolegend; xaxis grid; yaxis grid; scatter x=x y=y / markerattrs=(symbol=SquareFilled size=4) colorresponse=ObjFunc colormodel=ThreeColorRamp; polygon ID=R x=xx y=yy / lineattrs=(thickness=3); gradlegend; run; |
SAS/IML and SAS/OR each provide general-purpose tools for optimization in SAS. This section solves the problem by using SAS/IML. A solution that uses PROC OPTMODEL is more compact and is shown at the end of this article.
SAS/IML supports 10 different optimization algorithms. All support unconstrained, bound-constrained, and linearly constrained optimization. However, only two support nonlinear constraints: the quasi-Newton method (implemented by using the NLPQN subroutine) and the Nelder-Mead simplex method (implemented by using the NLPNMS subroutine).
To specify k nonlinear constraints in SAS/IML, you need to do two things:
For the current example, k = 2 and k_{eq} = 0. Therefore the module that specifies the nonlinear constraints should return two rows. The first row evaluates whether a parameter value is outside the unit circle. The second row evaluates whether a parameter value is inside the circle with radius 3. Both constraints need to be written as “greater than” inequalities. This leads to the following SAS/IML program which solves the nonlinear optimization problem:
proc iml; start ObjFunc(z); r = sqrt(z[,##]); /* z = (x,y) ==> z[,##] = x##2 + y##2 */ f = cos( constant('pi')*r ) - sqrt( (z[,1]-2)##2 + (z[,2]-0)##2 ); return f; finish; start NLConstr(z); con = {0, 0}; /* column vector for two constraints */ con[1] = z[,##] - 1; /* (x##2 + y##2) >= 1 ==> (x##2 + y##2) - 1 >= 0 */ con[2] = 9 - z[,##]; /* (x##2 + y##3) <= 3##2 ==> 9 - (x##2 + y##2) >= 0 */ return con; finish; optn= j(1,11,.); /* allocate options vector (missing ==> 'use default') */ optn[1] = 1; /* maximize objective function */ optn[2] = 4; /* include iteration history in printed output */ optn[10]= 2; /* there are two nonlinear constraints */ optn[11]= 0; /* there are no equality constraints */ ods output grad = IterPath; /* save iteration history in 'IterPath' data set */ z0 = {-2 0}; /* initial guess */ call NLPNMS(rc,xOpt,"ObjFunc",z0,optn) nlc="NLConstr"; /* solve optimization */ ods output close; |
The call to the NLPNMS subroutine generates a lot of output (because optn[2]=4). The final parameter estimates are shown. The routine estimates the optimal parameters as (x,y) = (2.000002, 0.000015), which is very close to the exact optimal value (x,y)=(2,0). The objective function at the estimated parameters is within 1.5E-5 of the true maximum.
I intentionally chose the initial guess for this problem to be (x0,y0)=(-2,0), which is diametrically opposite the true value in the annular region. I wanted to see how the optimization algorithm traverses the annular region. I knew that every partial solution would be within the feasible region, but I suspected that it might be possible for the iterates to “jump across” the hole in the middle of the annulus. The following SAS statements visualize the iteration history of the Nelder-Mead method as it iterates towards an optimal solution:
data IterPath2; set IterPath; Iteration = ' '; if _N_ IN (1,5,6,7,9) then Iteration = put(_N_,2.); /* label certain step numbers */ format Iteration 2.; run; data Path; set Viz IterPath2; run; title "Objective Function and Solution Path"; title2 "Nelder-Mead Simplex Method"; proc sgplot data=Path noautolegend; xaxis grid; yaxis grid; scatter x=x y=y / markerattrs=(symbol=SquareFilled size=4) colorresponse=ObjFunc colormodel=ThreeColorRamp; polygon ID=R x=xx y=yy / lineattrs=(thickness=3); series x=x1 y=x2 / markers datalabel=Iteration datalabelattrs=(size=8) markerattrs=(size=5); gradlegend; run; |
For this initial guess and this optimization algorithm, the iteration path “jumps across” the hole in the annulus on the seventh iteration. For other initial guesses or methods, the iteration might follow a different path. For example, if you change the initial guess and use the quasi-Newton method, the optimization requires many more iterations and follows the “ridge” near the circle of radius 2, as shown below:
z0 = {-2 -0.1}; /* initial guess */
call NLPQN(rc,xOpt,"ObjFunc",z0,optn) nlc="NLConstr"; /* solve optimization */
If you have access to SAS/OR software, PROC OPTMODEL provides a syntax that is compact and readable.
For this problem, you can use a single CONSTRAINT statement to specify both the inner and outer bounds of the annular constraints. The following call to PROC OPTMODEL uses an interior-point method to compute a solution that numerically equivalent to (x,y) = (2, 0):
proc optmodel; var x, y; max f = cos(constant('pi')*sqrt(x^2+y^2)) - sqrt((x-2)^2+(y-0)^2); constraint 1 <= x^2 + y^2 <= 9; /* define constraint region */ x = -2; y = -0.1; /* initial guess */ solve with NLP / maxiter=25; print x y; quit; |
In summary, you can use SAS software to solve nonlinear optimization problems that are unconstrained, bound-constrained, linearly constrained, or nonlinearly constrained. This article shows how to solve a nonlinearly constrained problem by using SAS/IML or SAS/OR software. In SAS/IML, you have to specify the number of nonlinear constraints and write a function that can evaluate a parameter value and return a vector of positive numbers if the parameter is in the constrained region.
The post Optimization with nonlinear constraints in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Even though I’ve worked at SAS for nearly 30 years, I still get excited when great things come together for our customers! This year we are hosting the very first hackathon at our Analytics Experience conference in San Diego – the AnalyticsX Hackathon. Analytics Experience is in its third year […]
The post Announcing the AnalyticsX Hackathon: Are you up for it? appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
The post A quantile regression analysis of chess ratings by age appeared first on The DO Loop.
]]>This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
My colleague, Robert Allison, recently published an interesting visualization of the relationship between chess ratings and age. His post was inspired by the article “Age vs Elo — Your battle against time,” which was published on the chess.com website. (“Elo” is one of the rating systems in chess.)
Robert Allison’s article indicates how to download the 2014 Elo ratings for 70,963
active players into a SAS data set.
You can use PROC SGPLOT to create a scatter plot of the rating and age of each player. Because the graph displays so many observations, I followed Robert’s advice and used semi-transparency to bring out hidden features in the data. I also colored each marker according to whether the player is male (blue) or female (light red). The results are shown below:
As pointed out in the previous articles, ratings depend on age in a nonlinear fashion. The ratings for young players tend to be less than for young adults. Ratings tend to be highest for players in their mid to late twenties. After age 30, ratings tend to decrease with age.
The purpose of this article is to use statistical regression to predict ratings from age.
The chess.com article shows a line plot that is formed by binning the players’ ages into five-year age ranges, computing the average rating in each bin, and then connecting the means in each age group. This bin-and-connect-the-means method is less powerful than regression, but approximates the nonlinear relationship between age and the average rating for that age. A more powerful statistical technique is quantile regression. In quantile regression, you model the percentiles of the response variable, such as the 25th, 50th, or 90th percentile of the distribution of ratings for a given age.
In other words, quantile regression enables you to compute separate predictions for the ratings of poor players, for good players, and for great players.
A previous article compares quantile regression with the simpler binning technique.
I used the QUANTREG procedure in SAS to performs quantile regression to model the 10th, 25th, 50th, 75th, and 90th percentiles of rating as a function of age. (Quantiles are numbers between 0 and 1, whereas percentiles are between 0 and 100, but otherwise the terms are interchangeable.) I ignored the gender of the players. The following graph is the default plot that is created by PROC QUANTREG.
The lowest curve shows the predicted rating for the 10th percentile of players (the weak players) as a function of age.
The middle curve shows the predicted rating for the 50th percentile of players (the average players).
The highest curve shows the predicted rating for the 90th percentile, who are excellent chess players.
For the shown percentiles, the predicted chess ratings increase quickly for young players, then peak for players in their mid-twenties. This happens for all percentiles, from the 10th to the 90th. For example, the median player (50th percentile) at age 29 has a rating of about 2050. In contrast, the 90th percentile for a 29-year-old is about 2356.
This plot should remind you of a children’s weight and height charts in a doctor’s office.
Just as the mother of a young child might use the doctor’s chart to estimate her child’s percentile for weight and height, so too can a chess player use this chart to estimate his percentile for rating.
If you are a chess player, find your age on the horizontal axis. Then trace a line straight up until you reach your personal Elo rating. Use the relative position of the surrounding curves to estimate your rating percentile. For example, a 40-year-old with a 1900 rating could conclude from the graph that he is between the 25th and 50th percentile, but closer to the 50th.
As I said, this graph is created by default, but we can do better. In the following section, I add separate curves for women and men. I also add grid lines to make it easier to estimate coordinates in the graph.
Chess is enjoyed by both men and women.
If you add gender to the quantile regression model, you obtain separate curves for men and women for each percentile. The following graph shows the predicted percentiles for the 10th, 50th, and 90th percentiles, for both men and women. Notice that the predicted male ratings are higher than the predicted female ratings at every age and for each quantile.
Notice also that the predicted scores for females tend to peak later in life: about 30 years for females versus 25 for males. After peaking, the female curves drop off more quickly than their male counterparts.
The previous section overlays the male and female percentile curves. This makes it easy to compare males and females, but even with only three quantiles, the graph is crowded. If you want to display five or more percentile curves, it makes sense to separate the graph into a panel that contains two graphs, one for males and one for females. Again, this is similar to the height-weight percentile charts in the pediatrician offices, which often have separate charts for boys and girls.
The following panel shows side-by-side predictions for quantiles of ratings as a function of age. If you are a chess player, you can use these charts to estimate your rating percentile. For your current age and gender, what is your percentile?
In summary, I used data for Elo ratings in chess to build a quantile regression model that predicts percentiles of ratings for a given age and gender.
Using these graphs, you can see the predicted Elo ratings for poor, good, and great chess players for every combination of age and gender. If you are a chess player yourself, you can locate your age-score value in one of the plots and estimate your percentile among your age-gender peer group.
If you are a SAS user and are interested in the details, you can
download the SAS program that creates the analysis and graphs. It uses PROC QUANTREG to predict the quantile curves, and uses tips and techniques from the articles “How to score and graph a quantile regression model in SAS” and “Plot curves for levels of two categorical variables in SAS.”
The post A quantile regression analysis of chess ratings by age appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Elections are in the news again, therefore I have been on the lookout for interesting graphs. I recently found some graphs of the Census Current Population Survey (CPS) Voting and Registration Supplement data, and tried to improve them. Follow along if you’re interested in voter data, or creating better graphs! […]
The post Survey says! …. US voters are _________. appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Hash tables are a very powerful and flexible data structure. Most SAS applications of hash tables focus on just one of their many powerful facilities: table lookup. Hash tables are a fantastic table lookup tool and their use for that should never be diminished. However, hash tables can do so […]
The post Five things you (probably) don’t know you can do with a hash table appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
I peruse many different websites to get my news, and I always keep an eye out for good (or bad) presentations of data. I recently saw a posting on reddit claiming “U.S. GDP is greater than the total of all others combined.” This news seemed too good to be true […]
The post Fact checking a reddit post about GDP appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.
]]>This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
The SGPLOT procedure in SAS makes it easy to create graphs that overlay various groups in the data. Many statements support the GROUP= option, which specifies that the graph should overlay group information. For example, you can create side-by-side bar charts and box plots, and you can overlay multiple scatter plots and series plots in the same graph. However, the GROUP= option takes only a single grouping variable! What can you do if you need to visualize combinations of two (or even three!) categorical variables? This article shows how to construct a new group variable that combines the levels of two or more existing categorical variables.
For concreteness, I will show how to overlay multiple series plots (line plots), as shown to the right. (Click to enlarge.) By using this technique, you can overlay curves that describe combinations of gender and race. Or you can plot response curves for control-vs-experimental groups and the severity of a disease. You can use this technique for other plot types, such as creating box plots that visualize a two-way ANOVA.
To motivate the discussion, let’s first see why the GROUP= option in the SERIES statement does not work for overlaying two categorical variables. The following DATA step creates two categorical variables. The G1 variable has the values 1, 2, and 3. The G2 variable has the values ‘A’ and ‘B’. For each of the six combinations of (G1, G2), the DATA step creates a curve (actually a line) of (X, Y) values. Let’s see what happens if you try to plot the lines by using a SERIES statement with the GROUP=G1 option:
data TwoGroups; do G1=1 to 3; do G2='A', 'B'; do X = 1 to 10; Y = 10 + G1 + 0.5*(G2='A') + G1*X/20; output; end; end; end; run; title "First Attempt: Does Not Work"; proc sgplot data=TwoGroups; series x=x y=y / group=G1 curvelabel; run; |
The attempt is a failure. The graph does not show six individual curves. Instead, it shows three curves (the number of categories in the G1 variables) and each “curve” is Z-shaped because the graph traces the curve for G2=’A’ on the range [1, 10] and then draws the curve for G2=’B’ without “picking up the pen.” This happens because the data are sorted by G1, then by G2, then by X.
There are two ways to handle this situation. One way is to try to convert the data from “long form” to “wide form.” For these data, which have identical X values for every curve, you can create two new variables Y_A and Y_B that contain the coordinates of the three curves for G2=’A’ and G2=’B’, respectively. You can then use two SERIES statements, each using the GROUP=G1 option. Each statement will draw three curves for a total of six. You can use the NOCYCLEATTRS option to make sure that each statement uses the same line colors and patterns. However, this approach becomes complicated if each curve is evaluated at a different set of X values. In that case, it is better to keep the data in long form.
The problem would be solved if there were one categorical variable that had six levels instead of two categorical variables that have six joint levels, so let’s write some SAS code to make that happen. First sort the data by the categorical variables and then by the X variable. (For the current example, the data are already sorted correctly.) Then write a DATA step that does either of the following options:
The first option (the CATT function) is automated and less prone to error, but for the sake of completeness both options are shown below:
/* create a new group variable by concatenating the two existing variables */ proc sort data=TwoGroups; by G1 G2 X; run; data Make2Groups; set TwoGroups; by G1 G2; /* Option 1: automatically create joint levels from original levels */ Label = catt("G1=", G1) || "; " || catt("G2=", G2); /* Option 2: create new categorical variable and use PROC FORMAT to assign values */ if first.G2 then GroupID + 1; /* GroupID = 1, 2, 3, ... numGroups */ run; /* For Option 2: use PROC FORMAT to form labels that encode the two groups. See https://blogs.sas.com/content/iml/2017/04/24/two-way-anova-viz.html */ proc format; value GroupFmt 1 = "G1=1; G2='A'" 2 = "G1=1; G2='B'" 3 = "G1=2; G2='A'" 4 = "G1=2; G2='B'" 5 = "G1=3; G2='A'" 6 = "G1=3; G2='B'"; run; |
The following call to PROC SGPLOT creates a series plot for the Label variable, which corresponds to the joint levels of the original grouping variables. The GROUPLC= option (supported in SAS 9.4M2) colors the lines according to the value of the G2 variable.
title "Two Groups, One SERIES Statement"; proc sgplot data=Make2Groups; /* format GroupID GroupFmt.; */ /* for Option 2 */ series x=x y=y / group=Label grouplc=G2 /* or group=GroupID for Option 2 */ lineattrs=(pattern=solid) curvelabel; run; |
The graph is shown at the top of this article. The new categorical variable has values that correspond to joint levels of the original two variables. The GROUP= option creates six curves when you specify the Label variable. The GROUPLC= option sets the colors of the lines according to the values of the G2 variable.
This technique generalizes to three categorical variables, but I will leave the details to the reader. You might want to use the GROUPLP= option to set the line patterns according to the value of a third categorical variable. Beyond three variables the display will begin to resemble a spaghetti plot. For many categorical variables, you might want to use panels and BY groups to visualize the curves.
This technique also generalizes to other plot types, such as box plots and scatter plots.
The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
There are many quotes with words of wisdom to help you live your life. But sometimes one quote seems to contradict another. For example, “Don’t sweat the small stuff.” … and “The devil is in the details.” When it comes to creating graphs (and perhaps living my life in general), […]
The post Choose your markers carefully! (for scatter plots, that is) appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |