The post Optimizing your way in the woods -- Orienteering for real appeared first on Operations Research with SAS.

Orienteering is the sport of finding physical control points in the woods as fast as possible using only a map and a compass. The standard competition is point-to-point, where all the controls must be visited in a prescribed order. (There is a subtle optimization problem in there as well; you need to figure out how to get to the next control, factoring in topography, obstacles, trails, fatigue, and so on.) The competition I attended a few months ago was score orienteering. We were given 25 control points and had to visit as many of them as possible within a set time limit. (I chose two hours, but there were one-hour and four-hour options as well.) Each control is worth a certain number of points, and the runner with the most points wins. You can see the locations and point values of the controls in this figure:

The race starts and ends at the control marked with the red dot.

There is a severe time penalty for not reaching the finish on time: 8 points for the first minute over the limit and 1 point for each additional minute. There were also special bonuses for visiting both the easternmost and westernmost controls (the East-West bonus, 6 points) and both the northernmost and southernmost controls (the North-South bonus, 8 points).

We had about 30 minutes to plan our routes, but most of us also had to recalculate (read *improvise*) on the go.

So how does one (with a Ph.D. in optimization) go about solving a problem like this?


As usual, the most challenging aspect of this project is getting the data right. We need to get pairwise distances between the control points. This is easy to do on a map: just measure and multiply by the scaling factor. However, there are certain complications. First of all, in reality the distance will not be symmetric; that is, reaching point A from point B can be much faster than reaching B from A if B is at the top of a hill and A is in the valley. Another complication is that the distances are uncertain. Usually the controls cannot be seen from a distance; they need to be found. This adds to the time it takes to reach them. Also, the runner might make a mistake while navigating or stop for a water break. To make it worse, this uncertainty varies from runner to runner.

An overly ambitious consultant might collect data from past races and come up with an approximation for the distribution of the travel times between control points (in both directions). Knowing that, the consultant could calculate the distribution of completion times for various course options and choose the best one according to some criterion:

- most points with at most a small chance of not finishing on time, or
- highest probability of success given the number of points targeted, or
- highest expected value of points finishing within the time limit (this assumes that the runner reoptimizes after each control).

Although such a detailed analysis would solve the problem completely, it isn't practical. This is usually the case not only for orienteering but for most practical engagements: the modeler needs to simplify the problem to make it tractable.

A typical orienteering course is characterized by its length and the number of controls it includes. I use a GPS tracker to see how much I actually run during a race and compare that to the ideal course length. The closer the two numbers are, the better I am at navigating (or the easier the course is). At the time of this race my ratio was around 1.5, so I figured I would end up running about 50% more than the ideal distance. (That wasn't very good; I've improved since.) I also knew my pace in these races: I could maintain an average speed of about 6 km/h. So before the race I estimated that I would run about 12 kilometers, which meant I could cover a course about 8 kilometers long.

This simplification assumes that the course for the score-O race is similar to the courses I've run before. Its power comes from the fact that it accumulates all the uncertainty into just two numbers: the average speed of the runner and the ratio between the ideal and the tracked distance.

Now we can take the ideal pairwise distances and use them in this adjusted setting.
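This back-of-the-envelope conversion is simple arithmetic; here is a minimal Python sketch of it (the function name and structure are mine, not from the original post):

```python
def nominal_budget_km(speed_kmh, hours, detour_ratio):
    """Turn pace, time limit, and the tracked/ideal detour ratio into a
    budget on the nominal (ideal) course length."""
    actual_km = speed_kmh * hours      # distance actually run in the time limit
    return actual_km / detour_ratio    # ideal course length that distance covers

# ~6 km/h for 2 hours with a detour ratio of 1.5:
# 12 km of running buys a nominal course of about 8 km
budget = nominal_budget_km(6, 2, 1.5)
```

With the numbers from the race this yields exactly the 8 km budget used below.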

The optimization model is a variant of the traveling salesman problem (TSP). For simplicity of exposition, we use the Miller-Tucker-Zemlin (MTZ) formulation. We have a binary variable for each control point to indicate whether the control is part of the route. Another set of binary variables indicates whether each arc is used.

```sas
set NODES;
num x {NODES};
num y {NODES};
num score {NODES} init 0;
read data &node_data nomiss into NODES=[control] x y score;

str source;
for {i in NODES: score[i] = 0} do;
   source = i;
   leave;
end;

set <str,str> ARCS;
num distance {ARCS};
read data &arc_data into ARCS=[c1 c2] distance;

/* declare decision variables */
var UseNode {NODES} binary;
var UseArc {ARCS} binary;
```

These two variables are naturally connected by the following constraints:

```sas
con LeaveNode {i in NODES}:
   sum {<(i),j> in ARCS} UseArc[i,j] = UseNode[i];
con EnterNode {i in NODES}:
   sum {<j,(i)> in ARCS} UseArc[j,i] = UseNode[i];
```

We need to make sure that the tour is contiguous; that is, it doesn't include a subtour. So we introduce an order variable:

```sas
var U {NODES} >= 0 <= card(NODES) - 1 integer;

/* if UseArc[i,j] = 1 then U[i] + 1 <= U[j] */
/* strengthened by lifting coefficient on UseArc[j,i] */
con MTZ {<i,j> in ARCS: j ne source}:
   U[i] + 1 - U[j]
   <= (U[i].ub + 1 - U[j].lb) * (1 - UseArc[i,j])
      - (card(NODES) - 2) * UseArc[j,i];

fix U[source] = 0;
```

The bonus points require a couple of extra binary variables:

```sas
/* East-West and North-South bonuses */
num xmin = min {i in NODES} x[i];
num xmax = max {i in NODES} x[i];
num ymin = min {i in NODES} y[i];
num ymax = max {i in NODES} y[i];
set BONUSES = 1..2;
set NODES_b {BONUSES} init {};
for {i in NODES: x[i] = xmin} do; NODES_b[1] = NODES_b[1] union {i}; leave; end;
for {i in NODES: x[i] = xmax} do; NODES_b[1] = NODES_b[1] union {i}; leave; end;
for {i in NODES: y[i] = ymin} do; NODES_b[2] = NODES_b[2] union {i}; leave; end;
for {i in NODES: y[i] = ymax} do; NODES_b[2] = NODES_b[2] union {i}; leave; end;
put NODES_b[*]=;
var IsBonus {BONUSES} binary;
con IsBonusCon {b in BONUSES, i in NODES_b[b]}:
   IsBonus[b] <= UseNode[i];
```

We have two objectives: maximize points given a distance budget, or minimize distance given a target point value:

```sas
max TotalScore =
   sum {i in NODES} score[i] * UseNode[i]
   + &east_west_bonus * IsBonus[1]
   + &north_south_bonus * IsBonus[2];
min TotalDistance =
   sum {<i,j> in ARCS} distance[i,j] * UseArc[i,j];
```
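To build intuition for what the model is choosing, the same selection problem can be solved by brute force for a handful of controls. The following Python sketch (with made-up coordinates and scores, not the race data) enumerates every subset and visiting order within a distance budget:

```python
from itertools import permutations
from math import hypot

# hypothetical controls: name -> (x, y, score); start/finish at (0, 0)
controls = {"A": (2, 0, 10), "B": (2, 2, 20), "C": (0, 3, 15)}

def tour_length(order):
    """Length of the closed tour: start -> controls in order -> start."""
    pts = [(0, 0)] + [controls[c][:2] for c in order] + [(0, 0)]
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def best_course(budget):
    """Max total score over all subsets and visiting orders within budget."""
    best = (0, ())
    for r in range(len(controls) + 1):
        for order in permutations(controls, r):
            if tour_length(order) <= budget:
                score = sum(controls[c][2] for c in order)
                best = max(best, (score, order))
    return best
```

This explodes combinatorially for 25 controls, which is exactly why the MILP formulation above is needed.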

The problem is modeled using PROC OPTMODEL in SAS/OR 14.3, and the optimization problems are solved with the mixed integer linear programming (MILP) solver. On average it takes about half a minute to find the course with the most points given a distance budget, and only a couple of seconds to then find the shortest course achieving that score.

I ran the model for various course lengths to get the course with the most points. Then I fixed the number of points and found the shortest course that achieves that. These are listed in the following table:

| Distance (m) | Points |
|---|---|
| 2811 | 11 |
| 3648 | 17 |
| 4844 | 23 |
| 5746 | 31 |
| 6932 | 36 |
| 7828 | 42 |
| 8565 | 49 |
| 9950 | 55 |
| 10693 | 61 |
| 11964 | 68 |
| 12761 | 75 |
| 13950 | 80 |
| 14889 | 87 |
| 15790 | 89 |
| 16856 | 97 |
| 17904 | 104 |
| 18870 | 108 |
| 19846 | 115 |
| 20893 | 122 |

You can click on each distance to get the optimal course for that distance. Because I optimized the course lengths, the table can be read from left to right to get the most points given a distance or right to left to get the minimum distance given a target point value.

Browsing those routes reveals some interesting insights. The East-West bonus wasn't worth the effort, even though it looked enticing and definitely doable within two hours. It turns out there weren't enough high-valued control points around those two controls to make the detour worthwhile. On the other hand, the North-South bonus paid off whenever a course longer than 17km (or at least 97 points) was targeted. We can also see that 1km of distance is worth about 5-7 points, an important number when deciding how far to run for a control.
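The points-per-kilometer figure can be checked directly against the table above. A quick sketch over the (distance, points) pairs (the step-to-step rates bounce around, but the overall rate lands near the middle of the quoted range):

```python
# (distance in meters, points) pairs from the optimal-course table
frontier = [(2811, 11), (3648, 17), (4844, 23), (5746, 31), (6932, 36),
            (7828, 42), (8565, 49), (9950, 55), (10693, 61), (11964, 68),
            (12761, 75), (13950, 80), (14889, 87), (15790, 89), (16856, 97),
            (17904, 104), (18870, 108), (19846, 115), (20893, 122)]

# marginal points per extra kilometer between consecutive optimal courses
rates = [1000 * (p2 - p1) / (d2 - d1)
         for (d1, p1), (d2, p2) in zip(frontier, frontier[1:])]
overall = 1000 * (frontier[-1][1] - frontier[0][1]) \
               / (frontier[-1][0] - frontier[0][0])
print(min(rates), max(rates), overall)
```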

Here are a few plots of optimal courses of a given length. First, a course that visits all the controls. It is just under 21km long:

Then let us see what I should have done, given the condition I set for myself that the nominal course length should be less than 8km:

If I had managed to complete this course I would have been tied for third place. Finally, the optimal course for my pace in the four-hour race:

This score would have won the four-hour race. It is interesting that it doesn't require either of the bonuses.

And what happened at the race? I had 24 points halfway through the race. Then I got greedy and decided to go for the East-West bonus. I lost a lot of time on the East control, had to give up the West control, and had only 30 minutes to get back to the finish. I went 8 minutes overtime and finished with only 15 points. On the upside, I ran 13km in 128 minutes, which is right on target for my 12km/120min original estimate. The nominal length of my course was 8.8km, a bit more than I projected, and that is probably why I couldn't finish on time. I should have headed back toward the finish and picked up a few more controls in that area instead. In any case, it was a good lesson in orienteering as well as in optimization.

You can find the race information here and the results here. If you are interested in participating in a similar event this year you can find the necessary information here. I will be there, hopefully armed with a better strategy this year.

Thanks to Rob Pratt for providing the code I used in this blog post. The code and the data can be downloaded from here.


The post SAS at the 2017 INFORMS Annual Meeting appeared first on Operations Research with SAS.

**Technology Workshops on Saturday, October 21:**

- "JMP Pro and the Predictive Modeling Workflow," (1:00-3:30pm, room 330A)
- "Solving Business Problems with SAS Analytics and OPTMODEL," (4:00-6:30pm, room 330A)

**Software Tutorials on Tuesday, October 24 (individual presentation time indicated):**

- "Building and Solving Optimization Models with SAS," Ed Hughes and Rob Pratt (TD71, 2:00-2:45pm, room 371F)
- "JMP Pro: Top Features That Make You Say 'Wow!'," Mia Stephens (TD71, 2:45-3:30pm, room 371F)

**Other SAS-related talks (session time indicated):**

**Sunday, October 22:**

- "Tuning a Bayesian Framework for B2B Pricing," Fang Liang, Maarten Oosten (Vistex) (SB78B, 11:00am-12:30pm, room 380B)
- "Time Series Classification using Normalized Subsequences," Iman Vasheghani Farahani, Alex Chien, Russell Edward King (North Carolina State University) (SD70, 4:30-6:00pm, room 371E)

**Monday, October 23:**

- "Internet of Things: Promises, Challenges and Analytics," Patricia Neri, Ann Olecka (Honeywell), Bill Groves (Honeywell) (MA43, 8:00-9:30am, room 360B)
- "The SAS MILP Solver: Current Status and Future Developments," Imre Pólik, Menal Güzelsoy, Amar Kumar Narisetty, Philipp M. Christophel (MB77, 11:00am-12:30pm, room 372F)

**Tuesday, October 24:**

- "Robust Operational Decisions for Multi Period Industrial Gas Pipeline Networks under Uncertainty," Pelin Çay, Camilo Mancilla (Air Products and Chemicals), Robert H. Storer (Lehigh University), Luis F. Zuluaga (Lehigh University) (TC83, 12:05-1:35pm, room 382C)
- "Robust Principal Component Analysis in Cloud-based Run-time Environment," Kyungduck Cha, Seunghyun Kong, Zohreh Asgharzadeh, Arin Chaudhuri (TD70, 2:00-3:30pm, room 371E)

**Wednesday, October 25:**

- "Pricing Advance Purchase Products," Matthew Maxwell, Jennie Hu (WA23, 7:30-9:00am, room 342E)
- "Effective Methods for Solving the Bicriteria *P*-center *P*-dispersion Problem," Golbarg Kazemi Tutunchi, Yahya Fathi (North Carolina State University) (WB58, 10:30am-12:00pm, room 362E)
- "Warmstart of Interior Point Methods for Second Order Cone Optimization via Rounding Over Optimal Jordan Frames," Sertalp Bilal Çay, Imre Pólik, Tamas Terlaky (Lehigh University) (WC83, 12:30-2:00pm, room 382C)

Two additional important notes:

- Patricia Neri is the 2017 Daniel H. Wagner Prize chair, with presentations in sessions MB40, MC40, and MD40. The winner will be announced and the winning presentation will be repeated during keynote session TK04.
- Sertalp Bilal Çay will chair session WC83.

We hope you can make it to Houston to see these and many other excellent presentations! Don't forget to stop by the SAS and JMP booths to say hello!


The post The Traveling Salesman Traverses the Plant Genome appeared first on Operations Research with SAS.

One more large number is 3 billion: the estimated number of people (over 40% of the world's population) who are currently malnourished. Technological and medical advances and a new understanding of the human genome have improved both our quality of life and our longevity. Now that same attention must be focused on our food sources.

What does this have to do with the software tools developed here at SAS? Well, researchers at places like the USDA and General Mills, to name just a couple, are using them to help solve this problem. Because grains and cereals make up over 80% of the world's food sources, these researchers are in a front-line position to help. They use JMP Genomics and SAS/OR to transform breeding programs with marker-assisted plant breeding.

The idea is to associate desirable traits in crops with locations on their genome so that we can help drive selection of new plant lines to optimize trait outcomes. A basic understanding of genetics is now common, especially with controversies surrounding genetically modified organisms (GMOs). If two blue-eyed people have a child, that child most likely will have blue eyes due to genetic code inherited from his/her parents. Cross two strains of corn that are tall, and the outcome will tend to be tall corn. These are called heritable traits. The goal then is to apply genomic knowledge of heritable traits to assist plant breeders in making the best crosses (mating of two plant lines) to produce crops that not only increase yield, but have less disease, have higher vitamin content, are easier to mill, can grow in drought-tolerant conditions, require less processing, and even *taste better* (yes, hedonic traits can and are being researched for crop improvement programs) *without* direct genetic modification.

We address this by developing tools that use the observed relationships between genetic markers to build a genetic map, which facilitates trait association/prediction, crop selection, and simulation. Unlike the human genome, the majority of which has now been mapped, we don't have this "genetic roadmap" for many plant species because of the complexity of their genomes; for example, four, six, or even eight sets of chromosomes instead of two. Instead, we typically use the inheritance patterns between genetic markers to estimate a linkage map.

Markers inherited together are correlated, meaning they are closer together on the genetic map. So we need algorithms that can group and order markers to build that map. Markers that are sufficiently correlated (that is, with a small genetic recombination distance) belong on the same chromosome. Once markers are assigned to a chromosome, we need to determine the order of those markers that produces the smallest genetic distance map.

It turns out this is directly analogous to the traveling salesman problem (TSP). Lucky for me, SAS just happens to have a group of experts in operations research that easily recognized this as an optimization problem using minimum spanning forests to determine groups of markers, and TSP algorithms in SAS/OR's OPTMODEL and OPTNET procedures to find optimal marker orders.
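The two-stage structure (group, then order) can be illustrated on a toy example. The following Python sketch is not the SAS/OR implementation; it uses a simple union-find for grouping and brute-force enumeration for the tiny ordering step, with made-up marker distances:

```python
from itertools import permutations

def group_markers(markers, dist, threshold):
    """Union-find grouping: markers whose pairwise recombination distance
    is below the threshold end up in the same linkage group."""
    parent = {m: m for m in markers}
    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m
    for a in markers:
        for b in markers:
            if a < b and dist[a, b] < threshold:
                parent[find(a)] = find(b)
    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())

def best_order(group, dist):
    """Tiny TSP-path: order the group's markers to minimize total map length."""
    return min(permutations(group),
               key=lambda p: sum(dist[min(a, b), max(a, b)]
                                 for a, b in zip(p, p[1:])))

# made-up distances: m1..m3 are tightly linked, m4 is far away
markers = ["m1", "m2", "m3", "m4"]
dist = {("m1", "m2"): 1, ("m1", "m3"): 2, ("m2", "m3"): 1,
        ("m1", "m4"): 50, ("m2", "m4"): 50, ("m3", "m4"): 50}
groups = group_markers(markers, dist, threshold=10)
```

Brute-force ordering is hopeless beyond a dozen markers per group, which is why real linkage maps need the TSP machinery described next.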

*“Hey it’s not that simple! There are also known genetic relationships that might anchor certain markers to groups/orders,”* I said; my new best friends in SAS/OR said, *“No problem, we got this!”* and applied optional node connection constraints in a side-constrained TSP solution.

The Figure above shows an estimated Linkage Map using these methods for an experimental Oat population, where colored/italicized markers were already known/anchored markers. This optimization work and JMP Genomics visualization tools for linkage and consensus maps have since been used in several research publications for genetic mapping of a variety of plants, including this SAS Global Forum paper.

With a reliable map, employing genomic selection methods (modern data mining and predictive modeling techniques for marker-assisted plant selection) becomes feasible in a plant breeding program. Using the arsenal of robust predictive models available in JMP Genomics, the goal is to find the best model to score a new plant variety for a given trait, based on genetic variability. The art of plant breeding comes in developing a new line (seed variety) that can *balance* multiple traits to give the best possible outcome. For example, just driving an increase in yield alone will often lead to the plant/product suffering in other ways.

Our solution in JMP Genomics once again uses our secret weapon found in the capabilities of SAS/OR to create a process that allows breeders to analytically evaluate all possible plant/line crosses and create a cross simulation tool that selects the potential plant crosses that will *simultaneously optimize* multiple traits at once (for example in the plot below, increasing yield while decreasing height in the selected maize plant crosses).

The figure above shows that in five generations of breeding, crossing lines 45 or 99 with line 41 would produce some of the highest predicted yields while effectively reducing the height of the plant. A traditional breeding program can take several years before gain is realized. Each year, breeders have to decide the right crosses to make under the constraints of the amount of land/resources and time available to try to improve a given crop, and then must wait an entire growing season to see how the lines perform. JMP Genomics offers an analytic solution to help breeders find optimal crosses for a set of balanced physical traits and use multi-year simulation results to dramatically speed up this process, in many cases requiring just a few hours on a computer.

Creating sustainable agriculture techniques to produce not just *more* but *healthier* food is one of our most pressing concerns. The new methods outlined above provide just one way that SAS is able to help.


The post Fun with Flags: Optimizing Arrangements of Stars with SAS appeared first on Operations Research with SAS.

Suppose we want to arrange \(n\) stars in a rectangular region. Let \(x_i\) and \(y_i\) denote the coordinates of star \(i\), for \(i=1,\dots,n\). We need some way to measure the quality of a particular arrangement. What stands out to me in the current flag is that the stars are spread out in the sense that no two stars are too close to each other. One way to capture this idea with optimization is by using a maximin model that maximizes the minimum Euclidean distance between stars. Explicitly, we want to maximize

\(\min\limits_{i,j} \sqrt{(x_i-x_j)^2+(y_i-y_j)^2}\)

subject to lower and upper bounds on \(x_i\) and \(y_i\).
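Evaluating this objective for a given arrangement is straightforward; here is a minimal Python sketch (helper name and test grid are mine, for illustration only):

```python
from itertools import combinations

def min_sq_dist(points):
    """Squared maximin objective: smallest squared Euclidean distance
    over all pairs of stars."""
    return min((xi - xj) ** 2 + (yi - yj) ** 2
               for (xi, yi), (xj, yj) in combinations(points, 2))

# a 3x2 grid with unit spacing: the closest pairs are exactly 1 apart
grid = [(x, y) for x in range(3) for y in range(2)]
```

Any candidate layout (a proposed 51-star pattern, say) can be scored this way and compared against the solver's output below.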

The following SAS macro variables record the number of stars and the width and height of the rectangular region as proportions of the overall flag height, according to the United States code:

```sas
%let n = 50;
%let width = 0.76;
%let height = 0.5385;
```

The following PROC OPTMODEL statements capture the mathematical optimization problem just formulated:

```sas
proc optmodel;
   /* declare parameters */
   num n = &n;
   set POINTS = 1..n;
   num width = &width;
   num height = &height;

   /* declare variables, objective, and constraints */
   var X {POINTS} >= 0 <= width;
   var Y {POINTS} >= 0 <= height;
   var Maxmin >= 0;
   max Objective = Maxmin;
   con MaxminCon {i in POINTS, j in POINTS: i < j}:
      Maxmin <= (X[i]-X[j])^2 + (Y[i]-Y[j])^2;

   /* call nonlinear programming solver with multistart option */
   solve with nlp / multistart;

   /* save optimal solution to SAS data set for use with PROC SGPLOT */
   create data solution from [i] X Y;
quit;
```

Note that we use squared distance instead of distance in order to avoid the non-differentiability of the square root function at 0. Also, the MULTISTART option of the nonlinear programming solver increases the likelihood of finding a globally optimal solution for this nonconvex optimization problem, which has many locally optimal solutions.

The following PROC SGPLOT statements plot the resulting solution:

```sas
proc sgplot data=solution aspect=%sysevalf(10/19);
   styleattrs wallcolor='#3C3B6E';
   scatter x=X y=Y / markerattrs=(symbol='StarFilled' color=white size=20);
   xaxis values=(0 &width) display=none;
   yaxis values=(0 &height) display=none;
run;
```

Here is the resulting plot, which is a transpose of the existing design:

The resulting objective value is 0.0116, which corresponds to the squared distance between vertically adjacent pairs of stars.

For comparison, here is the plot of the existing design (objective value 0.0103), where instead the diagonally adjacent stars are the closest:

For each new state admitted to the union, a star is added and the new flag is introduced the next July 4th. Recent proposals to add either Puerto Rico or Washington, D.C. raise the prospect of a 51-star flag. Several researchers have looked for such arrangements among a handful of prespecified patterns. See, for example, this 2012 article in *The American Mathematical Monthly*. Instead, we have a general optimization model for \(n\) stars, so we can change the value of \(n\) to 51 and rerun the code. Here is the resulting plot:

How do you like it in comparison with this popular design that has alternating rows of nine and eight stars?

The U.S. flag adopted on June 14, 1777 had 13 stars. Here's the optimized version:

The Star Spangled Banner flag that inspired Francis Scott Key to write the U.S. national anthem had 15 stars. Here's the optimized version:

Although the problem discussed here was motivated by flag design, the same problem applies to facility location. Suppose you want to build a number of facilities within some prescribed region. If two facilities are too close to each other, they will cannibalize each other's business. So it makes sense to maximize the minimum distance between facilities.

For those in the United States, enjoy the holiday tomorrow!


The post Creating Synthetic Data with SAS/OR appeared first on Operations Research with SAS.

Researchers could use synthetic data to, for example, understand the format of the real data, develop an understanding of its statistical properties, build preliminary models, or tune algorithm parameters. Analytical procedures and code developed using the synthetic data can then be passed to the data owner if the collaborators are not permitted to access the original data. Even if the collaborators are eventually granted access, synthetic data allow the work to be conducted in parallel with tasks required to access the real data, allowing an efficient path to the final analysis.

One method for producing synthetic data with SAS was presented at SAS Global Forum 2017 with this corresponding paper (Bogle and Erickson 2017), which implemented an algorithm originally proposed in another paper (Bogle and Mehrotra 2016, subscription required). The implementation uses the OPTMODEL procedure and solvers in SAS/OR to create synthetic data with statistical raw moments similar to the real data.

Most SAS users are familiar with mean, variance, and covariance. The mean is the first-order raw moment, \(\mathbf{E}[X]\). The variance is the second-order marginal central moment, \(\mathbf{E}[(X-\mathbf{E}[X])^2]\), and covariance is the second-order mixed central moment, \(\mathbf{E}[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])]\). Skewness and kurtosis are the third- and fourth-order normalized moments.

This method considers only raw moments, which keeps the optimization problem linear. The more of these moments that are matched, the more like the original data it will be. It is difficult to match the moments exactly, so the algorithm tries to keep them within desired bounds. For example, one of the fourth-order moments (meaning the sum of the exponents is four) could be \(\mathbf{E}[X^2YZ]\), which could have a value of 0.5 in the real data set. The method might try to "match" this moment by constraining that moment in the new data set to be between 0.48 and 0.52. If \(N\) observations are desired, this would ideally mean that the moment value would be constrained between those bounds, \(\displaystyle{0.48 \le \frac{1}{N} \sum_{i=1}^N x_i^2y_iz_i \le 0.52}\).

There are a couple of problems with solving a system of inequalities like this. First, it is not linear or convex, so it would be very hard to solve, especially when some variables need to take integer values. The method instead creates candidate observations and uses mathematical optimization to select which observations to include. Second, there might not be a solution that satisfies such a system of inequalities, so the method allows violations of the bounds on the moments and minimizes them.
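Computing a raw moment and checking it against its bounds is the easy part; as a hedged Python sketch (function names and toy data are mine, not from the paper's implementation):

```python
def raw_moment(rows, exponents):
    """(1/N) * sum over observations of prod(x_k ** e_k):
    e.g., exponents (2, 1, 1) gives the raw moment E[X^2 Y Z]."""
    total = 0.0
    for row in rows:
        term = 1.0
        for x, e in zip(row, exponents):
            term *= x ** e
        total += term
    return total / len(rows)

def within_bounds(rows, exponents, lo, hi):
    """The linear 'matching' constraint on a candidate synthetic data set."""
    return lo <= raw_moment(rows, exponents) <= hi

# toy data: two observations of (X, Y, Z)
rows = [(1, 2, 3), (2, 1, 1)]
```

The hard part, which the optimization handles, is choosing the rows so that many such moment constraints hold simultaneously.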

There are three basic steps to the algorithm. The first step reads in the original data, computes the upper and lower bounds on the moments, and sets up other parameters for the optimization models. The second step uses column generation with a linear program to produce candidate synthetic observations. The third step uses an integer program to select the desired number of synthetic observations from the candidate synthetic observations, while minimizing the largest scaled violation of the moment bounds. The following code is the macro that generates synthetic data by calling a different macro for each of the three steps.

```sas
%macro GENDATA(INPUTDATA, METADATA, OUTPUTDATA=SyntheticData, MOMENTORDER=3,
               NUMOBS=0, MINNUMIPCANDS=0, LPBATCHSIZE=10, LPGAP=1E-3,
               NUMCOFORTHREADS=1, MILPMAXTIME=600, RELOBJGAP=1E-4,
               ALPHA=0.95, RANDSEED=0);
   proc optmodel printlevel=0;
      call streaminit(&RANDSEED);
      %PRELIMINARYSTEP(INPUTDATA=&INPUTDATA, METADATA=&METADATA,
                       MOMENTORDER=&MOMENTORDER, ALPHA=&ALPHA);
      %LPSTEP(MOMENTORDER=&MOMENTORDER, NUMOBS=&NUMOBS,
              MINNUMIPCANDS=&MINNUMIPCANDS, LPBATCHSIZE=&LPBATCHSIZE,
              LPGAP=&LPGAP, NUMCOFORTHREADS=&NUMCOFORTHREADS);
      %IPSTEP(OUTPUTDATA=&OUTPUTDATA, MOMENTORDER=&MOMENTORDER,
              NUMOBS=&NUMOBS, MINNUMIPCANDS=&MINNUMIPCANDS,
              MILPMAXTIME=&MILPMAXTIME, RELOBJGAP=&RELOBJGAP);
   quit;
%mend GENDATA;
```

You can see that there are several parameters, which are described in the paper. Most of them control the balance of the quality of the synthetic data with the time needed to produce the synthetic data.

The macro demonstrates some useful features of PROC OPTMODEL. One option is to use PROC OPTMODEL's COFOR loop, which allows multiple LPs to be solved concurrently and can speed up the LP step. The implementation breaks the task of creating candidate observations into NUMCOFORTHREADS groups, and one SOLVE statement from each group can run concurrently. Using COFOR reduces the total time from five hours to three hours in the example below.
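COFOR itself is specific to PROC OPTMODEL, but the underlying pattern (fan out independent solves across a bounded pool of workers and collect their results) is generic. Purely as an illustration, with stand-in names and a fake "solve":

```python
from concurrent.futures import ThreadPoolExecutor

def solve_lp(batch_id):
    """Stand-in for one independent LP solve in the column-generation step."""
    return batch_id * batch_id  # pretend objective value

def solve_concurrently(batches, num_threads):
    # analogous to splitting solves into NUMCOFORTHREADS concurrent groups
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(solve_lp, batches))
```

As with COFOR, the speedup comes only when the solves are genuinely independent of one another.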

The code also calls SAS functions not specific to PROC OPTMODEL, including RAND, STD, and PROBIT. The RAND function is used to create random potential observations during the column generation process. The STD and PROBIT functions are used to determine the desired moment range based on the input data and a user-specified parameter.

The examples in the paper are based on the *Sashelp.Heart* data set. The data set is modified to fit the requirements that all variables must be numeric and there can be no missing data.

The example randomly divides the data into Test and Training data sets so that a model based on the Training data set can be tested against a model based on a synthetic data set derived from the Training data set.

The following call of the macro generates the synthetic data based on the Training data set with the appropriate METADATA data set. It allows up to four LPs to be solved concurrently, matches moments up to the third order, allows 20 minutes for the MILP solver, and uses a fairly small range for the moment bounds.

```sas
%GENDATA(INPUTDATA=Training, METADATA=Metadata, NUMCOFORTHREADS=4,
         MOMENTORDER=3, MILPMAXTIME=1200, ALPHA=0.2);
```

One way to verify that the synthetic data are similar to the real data is to compare the means, standard deviations, and covariances of the input and synthetic data sets. The paper shows how to do this using PROC MEANS and PROC CORR on a smaller synthetic data set.

In this case we want to compare the training and synthetic data by comparing how well the models they fit score the Test data set. One of the variables indicates if a person is dead, so we can use a logistic regression to predict if a person is dead or alive. Because BMI is a better health predictor than height or weight, we have added a BMI column to the data sets. The following code calls PROC LOGISTIC for the two data sets.

```sas
proc logistic data=TrainingWithBmi;
   model Dead = Male AgeAtStart Diastolic Systolic Smoking Cholesterol BMI / outroc=troc;
   score data=TestingWithBmi out=valpred outroc=vroc;
run;

proc logistic data=SyntheticDataWithBmi;
   model Dead = Male AgeAtStart Diastolic Systolic Smoking Cholesterol BMI / outroc=troc;
   score data=TestingWithBmi out=valpred outroc=vroc;
run;
```

This produces quite a bit of output. One useful comparison is the area under the curve (AUC) of the ROC curve, which plots a comparison of sensitivity and specificity of the model. The first plot is for the original training data scored with the testing data from the original data set. The second plot is for the synthetic data scored with the testing data from the original data set. You can see that the curves are fairly similar and there is little difference between the AUCs.

This synthetic data set behaves similarly to the original data set and can be safely shared without risk of exposing personally identifiable information.

What uses do you have for synthetic data?


The post Operations Research Talks at SAS Global Forum 2017 appeared first on Operations Research with SAS.

**Monday, April 3**

- 11:00 AM - 11:30 AM, Dolphin Level 3 - Oceanic 1 (Session 0681) "Using SAS/OR® Software to Optimize the Capacity Expansion Plan of a Robust Oil Products Distribution Network," Dr. Shahrzad Azizzadeh, SAS
- 1:00 PM - 1:30 PM, Dolphin Level 3 - Oceanic 1 (Session 1387) "Increasing Revenue in Only Four Months with SAS® Real-Time Decision Manager," Alvaro Velasquez, Telefonica
- 2:30 PM - 3:00 PM, Dolphin Level 1 - The Quad - Lower - Super Demo Station 9 (Super Demo SD921) "Optimization and SAS® Viya™," Manoj Chari, SAS
- 3:30 PM - 4:00 PM, Dolphin Level 5 - The Quad - Upper - Theater 2 (Session 1055) "Using the CLP Procedure to Solve the Agent-District Assignment Problem," Stephen Sloan, Kevin Gillette, Accenture
- 4:00 PM - 4:30 PM, Dolphin Level 3 - Oceanic 1 (Session 0514) "Automated Hyperparameter Tuning for Effective Machine Learning," Patrick Koch, Brett Wujek, Oleg Golovidov, Steven Gardner, SAS

**Tuesday, April 4**

- 11:00 AM - 11:30 AM, Dolphin Level 3 - Asia 4 (Session 0454) "Change Management: Best Practices for Implementing SAS® Prescriptive Analytics," Scott Shuler, SAS
- 3:30 PM - 4:30 PM, Dolphin Level 3 - Oceanic 3 (Session 0302) "Optimize My Stock Portfolio! A Case Study with Three Different Estimates of Risk," Aric LaBarr, NC State University
- 3:30 PM - 4:30 PM, Dolphin Level 3 - Oceanic 1 (Session 0851) "Optimizing Delivery Routes with SAS® Software," Ben Murphy, Zencos, and Bruce Bedford, Oberweis Dairy Inc.
- 4:00 PM - 4:30 PM, Dolphin Level 1 - The Quad - Lower - Super Demo Station 9 (Super Demo SD909) "New Features in SAS® Simulation Studio," Edward P. Hughes, SAS
- 4:00 PM - 4:30 PM, Dolphin Level 5 - Southern Hemisphere IV (Session 2016) "Optimization and SAS® Viya™," Manoj Chari, SAS

**Wednesday, April 5**

- 11:00 AM - 11:30 AM, Dolphin Level 3 - Oceanic 2 (Session 1224) "A Moment-Matching Approach for Generating Synthetic Data in SAS®," Brittany M. Bogle, The University of North Carolina at Chapel Hill, and Jared C. Erickson, SAS
- 11:30 AM - 12:00 PM, Dolphin Level 3 - Oceanic 1 (Session 0593) "Key Components and Finished Products Inventory Optimization for a Multi-Echelon Assembly System," Sherry Xu, Kansun Xia, Ruonan Qiu, SAS
- 1:00 PM - 1:30 PM, Dolphin Level 3 - Asia 1 (Session 1326) "Price Recommendation Engine for Airbnb," Praneeth Guggilla, Singdha Gutha, and Goutam Chakraborty, Oklahoma State University

That's quite a list! It's great to see so much operations research work being done with SAS, and it's even more encouraging to see so many people talking about their successes with these methods.

The post Operations Research Talks at SAS Global Forum 2017 appeared first on Operations Research with SAS.

The post SAS at the 2017 INFORMS Conference on Business Analytics and Operations Research appeared first on Operations Research with SAS.

- Want SAS Skills? Get SAS University Edition, a Powerful and Free Analytical Tool from SAS, James Harroun, 9:00am-10:45am, Trevi
- Data Discovery and Analysis with JMP 13 Pro, Mia Stephens, 11:00am-12:45pm, Verona
- Solving Business Problems with SAS Analytics and OPTMODEL, Rob Pratt and Golbarg Tutunchi, 3:00pm-4:45pm, Turin

- Analyzing Unstructured Text Data with the JMP 13 Text Explorer, Mia Stephens, 9:10am-10:00am, Octavius 23
- Building and Solving Optimization Models with SAS, Rob Pratt, 10:30am-11:20am, Octavius 23
- Strategic Scheduling Intelligence at JPMorgan Chase’s Card Production Center, Jay Carini (JPMorgan Chase), 3:00pm–3:40pm, Octavius Ballroom
- A Deep Learning Tutorial, Mustafa Kabul, 3:50pm-4:40pm, Octavius 5/6

- A Novel SVDD Based Control Chart for Monitoring Multivariate Sensor Data, Anya McGuirk, 1:50pm-2:40pm, Octavius 7/8
- SAS Visual Statistics, Tom Bohannon, 1:50pm-2:40pm, Octavius 21
- Optimizing Rates for Service Agreements by Shaping Loss Functions, Maarten Oosten, 4:40pm-5:30pm, Octavius 19

We look forward to seeing you in Vegas!

The post SAS at the 2017 INFORMS Conference on Business Analytics and Operations Research appeared first on Operations Research with SAS.

The post Solving Kakuro Puzzles with SAS/OR appeared first on Operations Research with SAS.

For example, the 16 in the top left corner indicates that the first two white cells in the top row must add up to 16. Since they must also be distinct, the only two possibilities are 7 + 9 and 9 + 7.

The puzzle is of course meant to be solved by hand by making logical deductions. But you can also use mixed integer linear programming (MILP). One natural formulation involves two sets of decision variables:

\(\begin{align*} \mathrm{V}[i,j] &= \text{the digit in cell $(i,j)$}, \\ \mathrm{X}[i,j,k] &= \begin{cases} 1 & \text{if cell $(i,j)$ contains digit $k$},\\ 0 & \text{otherwise} \end{cases} \end{align*}\)
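With these variables, the puzzle's requirements become linear constraints. The following restates the four constraint families declared in the PROC OPTMODEL code; the notation \(C_c\) for the cells covered by clue \(c\) and \(s_c\) for its required sum is ours:

\(\begin{align*} \sum_{k=1}^{9} \mathrm{X}[i,j,k] &= 1 && \text{for each white cell $(i,j)$}, \\ \sum_{(i,j) \in C_c} \mathrm{X}[i,j,k] &\le 1 && \text{for each clue $c$ and digit $k$}, \\ \sum_{k=1}^{9} k \, \mathrm{X}[i,j,k] &= \mathrm{V}[i,j] && \text{for each white cell $(i,j)$}, \\ \sum_{(i,j) \in C_c} \mathrm{V}[i,j] &= s_c && \text{for each clue $c$}. \end{align*}\)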

For more details, see Kalvelagen's post.

You can capture the clues with a SAS data set. Each clue cell is encoded as down\across, with x standing in for a missing half; a bare x is a white cell to be filled, and x\x is a black cell with no clue:

```sas
%let n = 8;

data indata;
   input (C1-C&n) ($);
   datalines;
x\x   23\x  30\x  x\x   x\x   27\x  12\x  16\x
x\16  x     x     x\x   17\24 x     x     x
x\17  x     x     15\29 x     x     x     x
x\35  x     x     x     x     x     12\x  x\x
x\x   x\7   x     x     7\8   x     x     7\x
x\x   11\x  10\16 x     x     x     x     x
x\21  x     x     x     x     x\5   x     x
x\6   x     x     x     x\x   x\3   x     x
;
```

The following PROC OPTMODEL code declares sets and parameters, reads the data, declares the variables and constraints, calls the MILP solver, and outputs the solution to a SAS data set:

```sas
proc optmodel;
   set ROWS;
   set COLS = 1..&n;
   str clue {ROWS, COLS};

   /* read two-dimensional data */
   read data indata into ROWS=[_n_];
   read data indata into [i=_n_] {j in COLS} <clue[i,j] = col("c"||j)>;

   /* create output data set for use by PROC SGPLOT */
   create data puzzle from [i j]=(ROWS cross COLS)
      label=(compress(clue[i,j],'x')) color=(clue[i,j]='x');

   /* parse clues */
   num num_clues init 0;
   set CLUES = 1..num_clues;
   set <num,num> CELLS_c {CLUES} init {};
   num clue_length {c in CLUES} = card(CELLS_c[c]);
   num clue_sum {CLUES};
   str prefix, suffix;
   num curr;
   for {i in ROWS, j in COLS: clue[i,j] ne 'x'} do;
      prefix = scan(clue[i,j],1,'\');
      suffix = scan(clue[i,j],2,'\');
      if prefix ne 'x' then do;
         num_clues = num_clues + 1;
         clue_sum[num_clues] = input(prefix, best.);
         curr = i + 1;
         do until (curr not in ROWS or clue[curr,j] ne 'x');
            CELLS_c[num_clues] = CELLS_c[num_clues] union {<curr,j>};
            curr = curr + 1;
         end;
      end;
      if suffix ne 'x' then do;
         num_clues = num_clues + 1;
         clue_sum[num_clues] = input(suffix, best.);
         curr = j + 1;
         do until (curr not in COLS or clue[i,curr] ne 'x');
            CELLS_c[num_clues] = CELLS_c[num_clues] union {<i,curr>};
            curr = curr + 1;
         end;
      end;
   end;
   set CELLS = union {c in CLUES} CELLS_c[c];
   set DIGITS = 1..9;

   /* decision variables */
   var X {CELLS, DIGITS} binary;
   var V {CELLS} integer >= 1 <= 9;

   /* constraints */
   con OneDigitPerCell {<i,j> in CELLS}:
      sum {k in DIGITS} X[i,j,k] = 1;
   con AlldiffPerClue {c in CLUES, k in DIGITS}:
      sum {<i,j> in CELLS_c[c]} X[i,j,k] <= 1;
   con VCon {<i,j> in CELLS}:
      sum {k in DIGITS} k * X[i,j,k] = V[i,j];
   con SumCon {c in CLUES}:
      sum {<i,j> in CELLS_c[c]} V[i,j] = clue_sum[c];

   /* call MILP solver with no objective */
   solve noobj;

   /* create output data set for use by PROC SGPLOT */
   create data solution from [i j]=(ROWS cross COLS) V
      label=(if <i,j> in CELLS then put(V[i,j],1.) else compress(clue[i,j],'x'))
      color=(<i,j> in CELLS);
quit;
```

Notice that the PROC OPTMODEL code is completely data-driven so that you can solve other problem instances just by changing the input data set. In this case, the MILP presolver solves the entire problem instantly:

```
NOTE: Problem generation will use 4 threads.
NOTE: The problem has 360 variables (0 free, 0 fixed).
NOTE: The problem has 324 binary and 36 integer variables.
NOTE: The problem has 312 linear constraints (216 LE, 96 EQ, 0 GE, 0 range).
NOTE: The problem has 1404 linear constraint coefficients.
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver is disabled for linear problems.
NOTE: The MILP presolver value AUTOMATIC is applied.
NOTE: The MILP presolver removed all variables and constraints.
NOTE: Optimal.
NOTE: Objective = 0.
```

Finally, a call to PROC SGPLOT generates a nice plot of the solution:

```sas
proc sgplot data=solution noautolegend;
   heatmapparm x=j y=i colorresponse=color / colormodel=(black white) outline;
   text x=j y=i text=label / colorresponse=color colormodel=(white black);
   xaxis display=none;
   yaxis display=none reverse;
run;
```

The resulting plot makes it easy to verify the correctness of the solution:

Here is the corresponding DATA step for the large instance in Kalvelagen's post:

```sas
%let n = 22;

data indata;
   input (C1-C&n) ($);
   datalines;
x\x 12\x 38\x x\x 23\x 21\x x\x x\x 16\x 29\x 30\x x\x 33\x 30\x 3\x x\x x\x 17\x 35\x x\x 21\x 3\x x\17 x x 11\14 x x x\x 17\23 x x x x\16 x x x 15\x x\8 x x 3\3 x x x\16 x x x x x 3\30 x x x x x\13 x x x x 4\22 x x x x x x\x 23\27 x x x x x x 23\11 x x 17\17 x x x\21 x x x x x x 18\x x\13 x x x 30\45 x x x x x x x x x 30\14 x x x x 20\4 x x x\35 x x x x x 44\x 3\11 x x 10\16 x x 42\10 x x 3\x 22\29 x x x x x\17 x x 24\41 x x x x x x x 14\21 x x x x x x 20\23 x x x x\x 23\x 42\17 x x 30\3 x x x\6 x x x 16\16 x x 35\21 x x x x 40\x 3\x x\38 x x x x x x 16\x x\x 17\x 28\34 x x x x x 13\12 x x 26\4 x x x\24 x x x x\24 x x x 17\35 x x x x x 7\29 x x x x x x x x\8 x x 16\x 30\41 x x x x x x x 3\29 x x x x 29\18 x x x x\x x\x x\34 x x x x x 42\11 x x x 12\11 x x x x 15\30 x x x x 19\x x\x 12\24 x x x 14\17 x x 17\16 x x x x x x\14 x x x x\7 x x x x\16 x x 16\35 x x x x x 15\3 x x 3\x 34\x x\x 28\17 x x 23\x 35\16 x x x\16 x x x x x 4\41 x x x x x x x 6\41 x x x x x x x x\x x\4 x x 24\6 x x x 14\4 x x x\10 x x x x 16\x 29\23 x x x x\x x\x x\x 41\28 x x x x x x x 16\x x\x 29\41 x x x x x x x 43\x x\x x\x 16\23 x x x 21\x 24\29 x x x x 7\17 x x 3\23 x x x 29\16 x x 16\x x\42 x x x x x x x x\34 x x x x x x x 44\35 x x x x x x\15 x x 17\x x\8 x x 17\x x\x 29\x 3\3 x x 11\24 x x x x x 17\12 x x x\10 x x x 34\24 x x x 4\16 x x x x x 17\4 x x 30\23 x x x x\x x\x x\29 x x x x 3\10 x x x x 30\12 x x x 16\33 x x x x x 6\x x\x 17\11 x x x 30\11 x x x x 12\42 x x x x x x x x\x 13\4 x x x\29 x x x x x x x 11\35 x x x x x x\23 x x x 11\7 x x x x\16 x x 7\16 x x 14\16 x x x x x 17\x 29\x x\x 16\38 x x x x x x x\x 9\x 22\27 x x x x 15\6 x x 32\23 x x x 23\12 x x 21\4 x x 35\x 6\x x\6 x x x 27\22 x x x x x x 14\42 x x x x x x x 16\3 x x x\23 x x x x 15\x 4\4 x x 30\11 x x 29\9 x x 23\x 3\16 x x x x x x\4 x x 14\11 x x x x x\45 x x x x x x x x x 20\20 x x x x\x 17\21 x x x x x x 16\16 x x x\17 x x 16\24 x x x x x x 4\x x\34 x x x x
x x\29 x x x x x\29 x x x x x\30 x x x x x x\13 x x x\4 x x x\x x\23 x x x x\16 x x x x\x x\4 x x x\8 x x
;
```

And here is the plot of the input:

The same PROC OPTMODEL and PROC SGPLOT statements shown earlier then yield the following output:

```
NOTE: Problem generation will use 4 threads.
NOTE: The problem has 4920 variables (0 free, 0 fixed).
NOTE: The problem has 4428 binary and 492 integer variables.
NOTE: The problem has 3664 linear constraints (2412 LE, 1252 EQ, 0 GE, 0 range).
NOTE: The problem has 19188 linear constraint coefficients.
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver is disabled for linear problems.
NOTE: The MILP presolver value AUTOMATIC is applied.
NOTE: The MILP presolver removed 3077 variables and 2087 constraints.
NOTE: The MILP presolver removed 12000 constraint coefficients.
NOTE: The MILP presolver added 22 constraint coefficients.
NOTE: The MILP presolver modified 11 constraint coefficients.
NOTE: The presolved problem has 1843 variables, 1577 constraints, and 7188 constraint coefficients.
NOTE: The MILP solver is called.
NOTE: The parallel Branch and Cut algorithm is used.
NOTE: The Branch and Cut algorithm is using up to 4 threads.
          Node   Active   Sols   BestInteger     BestBound      Gap    Time
             0        1      0             .             0        .       2
NOTE: The MILP presolver is applied again.
             0        0      1             0             0    0.00%       2
NOTE: Optimal.
NOTE: Objective = 0.
```

The requirement that the digits within a single clue must all be different suggests the use of the ALLDIFF constraint supported in the constraint programming solver. You can then omit the binary X[i,j,k] variables and the corresponding constraints. The resulting model is much simpler, with only one set of variables and two sets of constraints:

```sas
/* decision variables */
var V {CELLS} integer >= 1 <= 9;

/* constraints */
con AlldiffCon {c in CLUES}:
   alldiff({<i,j> in CELLS_c[c]} V[i,j]);
con SumCon {c in CLUES}:
   sum {<i,j> in CELLS_c[c]} V[i,j] = clue_sum[c];

/* call constraint programming solver */
solve;
```

For the small instance, the constraint programming solver instantly returns the solution:

```
NOTE: Problem generation will use 4 threads.
NOTE: The problem has 36 variables (0 free, 0 fixed).
NOTE: The problem has 0 binary and 36 integer variables.
NOTE: The problem has 24 linear constraints (0 LE, 24 EQ, 0 GE, 0 range).
NOTE: The problem has 72 linear constraint coefficients.
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 24 predicate constraints.
NOTE: The OPTMODEL presolver is disabled for problems with predicates.
NOTE: Required number of solutions found (1).
NOTE: The data set WORK.SOLUTION has 64 observations and 5 variables.
NOTE: PROCEDURE OPTMODEL used (Total process time):
      real time           0.10 seconds
      cpu time            0.06 seconds
```

But the large instance is more challenging and does not solve in a reasonable time, for reasons described in a paper by Helmut Simonis. One way to overcome this difficulty is to first do some precomputation, again with the constraint programming solver, to prune the domains of the decision variables. The idea is to consider each clue one at a time, using the FINDALLSOLNS option to find all solutions. Any digit that does not appear among these solutions can then be removed, by using a linear disequality (or "not equal") constraint, from the domains of the V[i,j] variables that are associated with that clue. Because these problems are independent across clues, you can use a COFOR loop to solve them concurrently. The full code is as follows:

```sas
proc optmodel;
   set ROWS;
   set COLS = 1..&n;
   str clue {ROWS, COLS};

   /* read two-dimensional data */
   read data indata into ROWS=[_n_];
   read data indata into [i=_n_] {j in COLS} <clue[i,j] = col("c"||j)>;

   /* create output data set for use by PROC SGPLOT */
   create data puzzle from [i j]=(ROWS cross COLS)
      label=(compress(clue[i,j],'x')) color=(clue[i,j]='x');

   /* parse clues */
   num num_clues init 0;
   set CLUES = 1..num_clues;
   set <num,num> CELLS_c {CLUES} init {};
   num clue_length {c in CLUES} = card(CELLS_c[c]);
   num clue_sum {CLUES};
   str prefix, suffix;
   num curr;
   for {i in ROWS, j in COLS: clue[i,j] ne 'x'} do;
      prefix = scan(clue[i,j],1,'\');
      suffix = scan(clue[i,j],2,'\');
      if prefix ne 'x' then do;
         num_clues = num_clues + 1;
         clue_sum[num_clues] = input(prefix, best.);
         curr = i + 1;
         do until (curr not in ROWS or clue[curr,j] ne 'x');
            CELLS_c[num_clues] = CELLS_c[num_clues] union {<curr,j>};
            curr = curr + 1;
         end;
      end;
      if suffix ne 'x' then do;
         num_clues = num_clues + 1;
         clue_sum[num_clues] = input(suffix, best.);
         curr = j + 1;
         do until (curr not in COLS or clue[i,curr] ne 'x');
            CELLS_c[num_clues] = CELLS_c[num_clues] union {<i,curr>};
            curr = curr + 1;
         end;
      end;
   end;
   set CELLS = union {c in CLUES} CELLS_c[c];
   set DIGITS = 1..9;

   /* decision variables */
   var V {CELLS} integer >= 1 <= 9;

   /* constraints */
   con AlldiffCon {c in CLUES}:
      alldiff({<i,j> in CELLS_c[c]} V[i,j]);
   con SumCon {c in CLUES}:
      sum {<i,j> in CELLS_c[c]} V[i,j] = clue_sum[c];
   problem Original include V AlldiffCon SumCon;

   /* prune the domains one clue at a time */
   set DOMAIN {CELLS};
   set LENGTHS_SUMS = setof {c in CLUES} <clue_length[c], clue_sum[c]>;
   set DOMAIN_ls {LENGTHS_SUMS};
   num clue_length_this, clue_sum_this;
   var W {1..clue_length_this} integer >= 1 <= 9;
   con IncreasingWCon {k in 1..clue_length_this-1}:
      W[k] < W[k+1];
   con SumWCon {c in CLUES}:
      sum {k in 1..clue_length_this} W[k] = clue_sum_this;
   problem Prune include W IncreasingWCon SumWCon;
   use problem Prune;
   reset options printlevel=0;
   cofor {<l,s> in LENGTHS_SUMS} do;
      put l= s=;
      clue_length_this = l;
      clue_sum_this = s;
      solve with clp / findallsolns;
      DOMAIN_ls[l,s]
         = setof {k in 1..clue_length_this, sol in 1.._NSOL_} W[k].sol[sol];
      put DOMAIN_ls[l,s]=;
   end;
   reset options printlevel;
   for {<i,j> in CELLS} DOMAIN[i,j] = DIGITS;
   for {c in CLUES} do;
      clue_length_this = clue_length[c];
      clue_sum_this = clue_sum[c];
      for {<i,j> in CELLS_c[c]} DOMAIN[i,j]
         = DOMAIN[i,j] inter DOMAIN_ls[clue_length_this, clue_sum_this];
   end;
   for {<i,j> in CELLS} put DOMAIN[i,j]=;

   /* solve original problem with pruned domains */
   use problem Original;
   con PruneDomain {<i,j> in CELLS, k in DIGITS diff DOMAIN[i,j]}:
      V[i,j] ne k;
   solve;

   /* create output data set for use by PROC SGPLOT */
   create data solution from [i j]=(ROWS cross COLS) V
      label=(if <i,j> in CELLS then put(V[i,j],1.) else compress(clue[i,j],'x'))
      color=(<i,j> in CELLS);
quit;
```
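To see why this pruning is so effective, consider the per-clue subproblem in isolation. The following Python sketch (an illustration of the idea, not part of the SAS program) enumerates the digits that can appear in a clue of a given length and sum, mirroring what the Prune problem computes via FINDALLSOLNS:

```python
from itertools import combinations

def feasible_digits(length, total):
    """Digits 1-9 that appear in at least one set of `length` distinct
    digits summing to `total`."""
    digits = set()
    for combo in combinations(range(1, 10), length):
        if sum(combo) == total:
            digits |= set(combo)
    return digits

# A two-cell clue summing to 16 admits only {7, 9}; one summing to 17 only {8, 9}.
# A cell covered by both clues is forced to 9 before any search begins.
print(sorted(feasible_digits(2, 16)))                           # [7, 9]
print(sorted(feasible_digits(2, 16) & feasible_digits(2, 17)))  # [9]
```

Intersecting these per-clue domains across the row and column clue of each cell is exactly the domain reduction that makes the subsequent CLP solve fast.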

The entire PROC OPTMODEL call, including the precomputation step, now takes less than a second for the large instance:

```
NOTE: Problem generation will use 4 threads.
NOTE: The problem has 492 variables (0 free, 0 fixed).
NOTE: The problem has 0 binary and 492 integer variables.
NOTE: The problem has 2878 linear constraints (0 LE, 268 EQ, 0 GE, 0 LT, 2610 NE, 0 GT, 0 range).
NOTE: The problem has 3594 linear constraint coefficients.
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 268 predicate constraints.
NOTE: The OPTMODEL presolver is disabled for problems with predicates.
NOTE: Required number of solutions found (1).
NOTE: The data set WORK.SOLUTION has 704 observations and 5 variables.
NOTE: PROCEDURE OPTMODEL used (Total process time):
      real time           0.53 seconds
      cpu time            0.96 seconds
```

It turns out that you can also significantly improve the performance without doing any precomputation, simply by using a different variable selection strategy:

```sas
solve with clp / varselect=fifo;
```

```
NOTE: Problem generation will use 4 threads.
NOTE: The problem has 492 variables (0 free, 0 fixed).
NOTE: The problem has 0 binary and 492 integer variables.
NOTE: The problem has 268 linear constraints (0 LE, 268 EQ, 0 GE, 0 range).
NOTE: The problem has 984 linear constraint coefficients.
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The problem has 268 predicate constraints.
NOTE: Required number of solutions found (1).
NOTE: The data set WORK.SOLUTION has 704 observations and 5 variables.
NOTE: PROCEDURE OPTMODEL used (Total process time):
      real time           6.55 seconds
      cpu time            6.42 seconds
```

As in similar logic puzzles, the solution is supposed to be unique. To check uniqueness with the MILP formulation, you can add one more constraint to exclude the first solution and call the MILP solver again, as described by Kalvelagen:

```sas
/* check uniqueness */
set <num,num,num> SUPPORT;
SUPPORT = {<i,j> in CELLS, k in DIGITS: X[i,j,k].sol > 0.5};
con ExcludeSolution:
   sum {<i,j,k> in SUPPORT} X[i,j,k] <= card(SUPPORT) - 1;
solve noobj;
if _solution_status_ ne 'INFEASIBLE' then put 'Multiple solutions found!';
```
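In words: if \(S\) is the support of the first solution (the triples \((i,j,k)\) with \(\mathrm{X}[i,j,k]=1\)), the added cut

\(\sum_{(i,j,k) \in S} \mathrm{X}[i,j,k] \le |S| - 1\)

forbids exactly that assignment, so the second solve either finds a different solution or proves infeasibility, which certifies that the solution is unique.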

For the constraint programming formulation, checking uniqueness is even simpler. Just use the FINDALLSOLNS option:

```sas
/* check uniqueness */
solve with clp / findallsolns;
if _NSOL_ > 1 then put 'Multiple solutions found!';
```

The post Solving Kakuro Puzzles with SAS/OR appeared first on Operations Research with SAS.

The post How good is the MILP solver in SAS/OR? appeared first on Operations Research with SAS.

Based on this test set, Hans Mittelmann compiles a regular comparison of solvers using different subsets of the MIPLIB2010 set, each meant to measure a different aspect of the solution process. At the recent INFORMS Annual Meeting in Nashville, we presented detailed results for all of these categories for the first time. Here we reprint those results for future reference, along with some commentary.

We have 96 nodes, each with 128 GB of RAM and two Intel(R) Xeon(R) E5-2630 v3 CPUs (8 cores each, running at 2.40 GHz). We ran each benchmark set with the same options that Hans Mittelmann uses. Note that our hardware is quite a bit slower than what he is using.

We are using the latest release of SAS as of November 22, 2016: SAS 9.4M4 with SAS/OR 14.2, running on Red Hat Enterprise Linux 6.5.

The time averages are 10-second shifted geometric means over all instances, with timeouts and failures counted at the time limit. Deterministic parallel mode was used for the threaded MILP tests.
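For readers unfamiliar with the metric: a shifted geometric mean damps the influence of very easy and very hard instances relative to an arithmetic mean. Each time is shifted by 10 seconds, averaged in log space, and shifted back. A small Python sketch of the calculation (the function name is ours, not part of the benchmark scripts):

```python
import math

def shifted_geomean(times, shift=10.0):
    """10-second shifted geometric mean: exp(mean(log(t + shift))) - shift."""
    mean_log = sum(math.log(t + shift) for t in times) / len(times)
    return math.exp(mean_log) - shift

# Equal times are reproduced exactly; a single slow outlier moves the mean
# far less than it would move an arithmetic average.
print(round(shifted_geomean([90.0, 90.0]), 9))   # 90.0
print(round(shifted_geomean([1.0, 3600.0]), 1))  # much smaller than the arithmetic mean (1800.5)
```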

**Benchmark set**

This set has been the backbone of all MILP benchmarking efforts for the last six years. Many of its instances require special techniques to solve.

**1 thread**

Solved: **81 out of 87**

Average time: **240s**

**4 threads**

Solved: **85 out of 87**

Average time: **121s**

**12 threads**

Solved: **82 out of 87**

Average time: **95s**

Since this is a very small set, it is easy to overtune a solver to perform well on it. To counter this, we also ran our solver with 16 other random seeds and took averages over all 17 runs:

**1 thread**

Solved: **82 out of 87**

Average time: **235s**

**4 threads**

Solved: **85 out of 87**

Average time: **113s**

**12 threads**

Solved: **84 out of 87**

Average time: **84s**

This shows that our solver isn't overtuned; if anything, the default runs were somewhat unlucky, especially the 12-thread run.

**Solvable set (12 threads)**

These instances can be solved by a commercial MILP solver within an hour on a standard machine.

Solved: **170 out of 214**

Average time: **261s**

**Feasibility set**

For these instances, only the time to find a first feasible solution is measured; the quality of that solution is not assessed.

Solved: **27 out of 33**

Average time: **114s**

**Infeasibility set**

All of these instances are infeasible; the task is to prove that infeasibility as quickly as possible.

Solved: **18 out of 19**. Only the huge and LP-infeasible zib02 instance was not solved.

Average time: **47s**

**Pathological set**

These instances are typically very hard and have some special feature: numerical instability, large size, or large search trees.

Solved: **24 out of 45**

Average time: **1956s**

You can get the slides of our INFORMS 2016 talk here.

Finally, here are all the data in a table:

| Test set | Solved | Average time (s) |
|---|---|---|
| Benchmark (1 thread) | 81 out of 87 | 240 |
| Benchmark (1 thread, average of 17 runs) | 82 out of 87 | 235 |
| Benchmark (4 threads) | 85 out of 87 | 121 |
| Benchmark (4 threads, average of 17 runs) | 85 out of 87 | 113 |
| Benchmark (12 threads) | 82 out of 87 | 95 |
| Benchmark (12 threads, average of 17 runs) | 84 out of 87 | 84 |
| Solvable (12 threads) | 170 out of 214 | 261 |
| Feasibility | 27 out of 33 | 114 |
| Infeasibility | 18 out of 19 | 47 |
| Pathological | 24 out of 45 | 1956 |

If you have difficult or interesting MILP instances and want to shape the future of mixed-integer optimization research, please submit them to MIPLIB2017. SAS/OR will participate in that project along with other leading commercial and open-source MILP developers.

The post How good is the MILP solver in SAS/OR? appeared first on Operations Research with SAS.

The post SAS at the 2016 INFORMS Annual Meeting appeared first on Operations Research with SAS.

[Location Abbreviations: MCC=Music City Center; OHN=Omni Hotel Nashville]

**Technology Workshops on Saturday, November 12:**

- SAS Global Academic Program: "SAS University Edition," 9:00-11:30am, MCC 201A
- JMP: "JMP 13 Pro," 12:00-2:30pm, MCC 202A
- SAS/OR: "Building and Solving Business Problems with SAS Analytics and OPTMODEL," 3:00-5:30pm, MCC 201A

**Software Tutorials on Tuesday, November 15 (individual presentation time indicated):**

- SAS Global Academic Program: "Analysis of a Presidential Debate Using SAS Text Analytics," Andre de Waal (in track TA94, 8:45-9:30am, MCC 5th Avenue Lobby)
- SAS/OR: "Building and Solving Optimization Models with SAS," Ed Hughes, Rob Pratt (in track TB94, 11:00-11:45am, MCC 5th Avenue Lobby)
- JMP: "Data Analysis and Discovery with JMP 13 Pro," Mia Stephens (in track TB94, 11:45am-12:30pm, MCC 5th Avenue Lobby)

*Other SAS-related talks (session time indicated):*

**Sunday, November 13:**

- "Identifying Shifting Production Bottlenecks Using Clearing Functions," Baris Kacar, Lars Moench (University of Hagen), Reha Uzsoy (North Carolina State University) (in track SA86, 8:00-9:30am, OHN Gibson Board Room)
- "The Use of Simulation for Evaluating Forecast Models," Sanjeewa Naranpanawe (in track SC33, 1:30-3:00pm, MCC 203B)
- "Unlocking Your 80%: Unearthing New Insights with Text Analytics," Christina Engelhardt (in track SD49, 4:30-6:00pm, MCC 211)

**Monday, November 14:**

- "Panel Session: IoT-enabled Data Analytics: Opportunities, Challenges and Applications," including Gul Ege (in track MA68, 8:00-9:30am, OHN Mockingbird 4)

**Tuesday, November 15:**

- "SAS/OR Value Beyond the Model," Leo Lopes (session chair) (in track TA19, 8:00-9:30am, MCC 106B)
- "Estimating Clearing Functions for Production Resources Using Simulation Optimization," Reha Uzsoy (North Carolina State University), Baris Kacar (in track TA67, 8:00-9:30am, OHN Mockingbird 3)
- "Strategies for Maintaining Sparse Dual Solutions in Large-scale Nonlinear SVM," Joshua Griffin, Alireza Yektamaram (in track TB06, 11:00am-12:30pm, MCC 102A)
- "A Hessian Free Method with Warm-starts for Deep Learning Problems," Wenwen Zhou, Joshua Griffin (in track TB06, 11:00am-12:30pm, MCC 102A)
- "An Accelerated Power Method for the Best Rank-1 Approximation to a Matrix," Jun Liu, Ruiwen Zhang, Yan Xu (session chair) (in track TB06, 11:00am-12:30pm, MCC 102A)
- "Local Search Optimization for Hyper-parameter Tuning," Yan Xu (in track TB06, 11:00am-12:30pm, MCC 102A)

**Wednesday, November 16:**

- "The SAS MILP Solver: Current Status and Future Developments," Philipp Christophel (in track WB13, 11:00am-12:30pm, MCC 104C)
- "Single-resource Capacity Control in the Presence of Cancellations, No-shows and Overbooking," Jason Chen (in track WB42, 11:00am-12:30pm, MCC 207D)
- "Pricing and Revenue Management of Function Space in Hotels," Altan Gulcu, Xiaodong Yao (session chair) (in track WB42, 11:00am-12:30pm, MCC 207D)
- "Visual Statistics--SAS," Dursun Delen (Oklahoma State University) (in track WD49, 2:45-4:15pm, MCC 211)
- "
*P*-center and*P*-dispersion Problems: A Bi-criteria Analysis," Golbarg Tutunchi, Yahya Fathi (North Carolina State University) (in track WE82, 4:30-6:00pm, OHN Broadway G)

If you will be in Nashville for the conference, we'd love to see you at our workshops, in the exhibit hall, and at our tutorials. Please consider adding some of these great SAS-related talks to your itinerary!

The post SAS at the 2016 INFORMS Annual Meeting appeared first on Operations Research with SAS.
