The post "Are You The One?" Finding the matching through SAS Constraint Programming appeared first on Operations Research with SAS.

]]>I bet you were not expecting a photo of 22 youngsters in swimsuits in a math blog, were you? The first time I heard about this reality TV show called "Are You The One?", my nerdiness and love for OR yelled at me: "There is so much math in this show!!!"

I'm sure producers did not mean it that way, but I saw mathematical optimization in it and just HAD to code it. Season 7 is premiering August 15th, and if you are as nerdy as I am, you should use my code to figure out the solution way before the all the participants do.

According to Wikipedia, "Are You The One?" is a reality TV show produced by MTV, in which a group of men and women are secretly paired into male-female couples via a matchmaking algorithm. Then, while living together, the contestants try to identify all of these "perfect matches." If they succeed, the entire group shares a prize of up to $1 million.

In each episode, the participants get two new pieces of information about the coupling:

- They get to choose one couple to go to the "truth booth" and ask if that couple is a perfect match or not.
- They all pair up in a "matching ceremony" and get to know how many couples they got right (not which ones though).

This way, after each episode they have an increasing amount of information to find their matches. If they don't find it by episode 10, they lose.

This problem can be solved through constraint programming because we're attempting to find all possible solutions (couples matches) that satisfy certain constraints (such as "Pete and Lola are NOT a couple" or "out of these given 10 couples only 2 are correct"). The more information you get as the weeks go by, the more constrained your problem is and the fewer feasible solutions you have - until you are left with only one (hopefully at most by week 10).

With SAS/OR Constraint Programming (PROC OPTMODEL or PROC CLP), we are able to discover this perfect matching much earlier than the participants.

The formulation shown in the next section uses a set of binary decision variables called Assign[b,g], with the interpretation that Assign[b,g] = 1 if boy b is assigned to girl g in the matching. The first two sets of constraints enforce a one-to-one matching between boys and girls. The Ceremony_Matches constraint makes sure that each solution respects the matching ceremony results. The Truth_Booth constraint forces Assign[b,g] = 1 or 0 (depending on the truth booth outcome) for the couple (b,g) that entered the truth booth.

Here is the SAS code I used to solve this problem. To run it for the next season, just copy all the code and update the "STEP 1" portion each week.

/***********************************************/ /******* STEP 1 - UPDATE YOUR INPUTS ***********/ /***********************************************/ /* Input number of weeks of data */ %let num_weeks = 5; /* Input weekly matches (from matching ceremony) up to current week */ /* For example, in Season 6, as of week 5, these were the couples in each ceremony */ data input_ceremony_pairs(keep=week boys_name girls_name); array girl_week{&num_weeks.} $; input boys_name $ girl_week{*}; do week = 1 to &num_weeks.; girls_name = girl_week[week]; output; end; datalines; Anthony Geles Diandra Jada Keyana Nicole Clinton Uche Uche Uche Uche Jada Dimitri Diandra Nicole Nurys Alexis Uche Ethan Jada Jada Alexis Nicole Geles Joe Zoe Audrey Zoe Zoe Zoe Kareem Alivia Alivia Alivia Diandra Alivia Keith Alexis Alexis Diandra Nurys Alexis Malcolm Nurys Nurys Geles Alivia Diandra Michael Keyana Keyana Audrey Geles Nurys Shad Audrey Geles Keyana Audrey Audrey Tyler Nicole Zoe Nicole Jada Keyana ; /* Input number of correct matches (from matching ceremony) up to current week */ /* For example, in Season 6, as of week 5, these were the correct guesses */ data input_ceremony_correct_matches; input week correct_guesses; datalines; 1 3 2 1 3 2 4 3 5 1 ; /* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */ /* For example, in Season 6, as of week 5, these were the truth booths */ data input_truth_booth; input week boys_name $ girls_name $ outcome; datalines; 1 Ethan Keyana 0 2 Anthony Geles 0 3 Malcolm Nurys 0 4 Dimitri Nicole 0 5 Clinton Uche 0 ; /***********************************************/ /****** STEP 2 - RUN WITHOUT MODIFYING CODE ****/ /***********************************************/ /* Constraint Programming Formulation */ proc optmodel; /* SETS */ set <num,str,str> CEREMONY_PAIRS; set BOYS = setof {<w,b,g> in CEREMONY_PAIRS} b; set GIRLS = setof {<w,b,g> in CEREMONY_PAIRS} g; set WEEKS = 1..&num_weeks.; set <num,str,str> TRUTH_BOOTH_PAIRS; set SOLS init 1.._NSOL_; num correctMatches{WEEKS}; num truthBoothOutcome{TRUTH_BOOTH_PAIRS}; /* READ DATA */ read data input_ceremony_pairs into CEREMONY_PAIRS=[week boys_name girls_name]; read data input_truth_booth into TRUTH_BOOTH_PAIRS=[week boys_name girls_name] truthBoothOutcome=outcome; read data input_ceremony_correct_matches into [week] correctMatches=correct_guesses; /* DECISION VAR */ /* Assign[b,g] = 1 if boy b is assigned to girl g in the matching; 0 otherwise */ var Assign{BOYS,GIRLS} BINARY; /* CONSTRAINTS */ con One_Girl_Per_Boy{b in BOYS}: sum{g in GIRLS} Assign[b,g] = 1; con One_Boy_Per_Girl{g in GIRLS}: sum{b in BOYS} Assign[b,g] = 1; /* WEEKLY CONSTRAINTS */ con Ceremony_Matches{w in WEEKS}: sum{<(w),b,g> in CEREMONY_PAIRS} Assign[b,g] = correctMatches[w]; con Truth_Booth{<w,b,g> in TRUTH_BOOTH_PAIRS}: Assign[b,g] = truthBoothOutcome[w,b,g]; /* SOLVE */ solve with clp / FINDALLSOLNS; /* CREATE OUTPUT */ num couplesCount{BOYS, GIRLS} init 0; for{s in SOLS} do; for{b in BOYS} do; for{g in GIRLS: Assign[b,g].sol[s] > 0.5} do; couplesCount[b,g] = couplesCount[b,g] + 1; leave; end; end; end; create data output_couples from [BOYS GIRLS] couplesCount; call symputx('Number_Solutions',_NSOL_); quit; |

To visualize the solution we can use the SAS SGPLOT procedure to create a heat map that illustrates the number of feasible solutions that contain a certain couple matched.

proc sort data=output_couples; by boys girls; run; proc sgplot data=output_couples noautolegend; title "Couples Occurrence Count in &Number_Solutions. Solutions"; heatmap y=boys x=girls / weight=couplesCount colormodel=(white pink red) outline x2axis; text y=Boys x=Girls text=couplesCount / textattrs=(size=10pt); xaxis display=none; x2axis display=(nolabel); yaxis display=(nolabel) reverse; run; |

For example, if we run the code above with the given data (Season 6, Week 5) we can see this diagram:

In this output we can see, for example, that Ethan and Geles are for sure not a couple (bummer) and that Tyler and Nicole seem to be the most promising couple (so far).

Now use the following data (Season 6, Week 8) and my optimization and heat map code. Be ready for a surprise!

/***********************************************/ /******* STEP 1 - UPDATE YOUR INPUTS ***********/ /***********************************************/ /* Input number of weeks of data */ %let num_weeks = 8; /* Input weekly matches (from matching ceremony) up to current week */ /* For example, in Season 6, as of week 8, these were the couples in each ceremony */ data input_ceremony_pairs(keep=week boys_name girls_name); array girl_week{&num_weeks.} $; input boys_name $ girl_week{*}; do week = 1 to &num_weeks.; girls_name = girl_week[week]; output; end; datalines; Anthony Geles Diandra Jada Keyana Nicole Keyana Keyana Alivia Clinton Uche Uche Uche Uche Jada Geles Geles Geles Dimitri Diandra Nicole Nurys Alexis Uche Diandra Diandra Diandra Ethan Jada Jada Alexis Nicole Geles Jada Zoe Alexis Joe Zoe Audrey Zoe Zoe Zoe Alexis Uche Jada Kareem Alivia Alivia Alivia Diandra Alivia Nurys Nurys Nurys Keith Alexis Alexis Diandra Nurys Alexis Zoe Jada Audrey Malcolm Nurys Nurys Geles Alivia Diandra Alivia Alexis Uche Michael Keyana Keyana Audrey Geles Nurys Uche Audrey Keyana Shad Audrey Geles Keyana Audrey Audrey Audrey Alivia Zoe Tyler Nicole Zoe Nicole Jada Keyana Nicole Nicole Nicole ; /* Input number of correct matches (from matching ceremony) up to current week */ /* For example, in Season 6, as of week 8, these were the correct guesses */ data input_ceremony_correct_matches; input week correct_guesses; datalines; 1 3 2 1 3 2 4 3 5 1 6 4 7 5 8 3 ; /* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */ /* For example, in Season 6, as of week 8, these were the truth booths */ data input_truth_booth; input week boys_name $ girls_name $ outcome; datalines; 1 Ethan Keyana 0 2 Anthony Geles 0 3 Malcolm Nurys 0 4 Dimitri Nicole 0 5 Clinton Uche 0 6 Keith Alexis 0 7 Keith Alivia 0 8 Michael Audrey 0 ; |

FYI - The participants in Season 6 did not find the perfect matching till week 10. I bet they would have loved to have SAS Constraint Programming handy.

To understand better how the number of possible matches (feasible solutions) decreases with each additional data piece (constraints), you can see the following super cool animation (thanks, Rob Pratt!):

Notice that the truth booths in weeks 4 and 6 were wasted because the status of the selected couple was already able to be deduced, so the number of possible solutions did not change. In a later update, I'll show how to optimally select which couple should go to the truth booth.

So, now that you know how to use SAS Constraint Programming, let's help these poor guys find their perfect matches in Season 7. Follow this blog and see how much faster than the contestants we'll figure out the matching!

And so it begins. After the first "truth booth" and "matching ceremony" (and lots of drama), here is the first heat map.

Teaser from Rob:

If the contestants have a choice, they should choose either Shamoy-Maria or Tomas-Morgan to go to the next truth booth, but NOT because those couples appear in the most remaining solutions.

Unfortunately the contestants did not follow Rob's suggestion. After the second truth booth and matching ceremony, we now have 287,580 solutions and actually Shamoy-Maria is now the most promising couple! Here is the heat map.

Update from Rob:

The fate button required the contestants to pair Andrew or Cam with Asia or Nutsa for the truth booth. Because the four pairs (Andrew-Asia, Andrew-Nutsa, Cam-Asia, Cam-Nutsa) all appeared in the same number of solutions (165,326) after week 1, these four choices were equally desirable to send to the truth booth, with respect to reducing the number of possible solutions. After week 2, Shamoy-Maria would be the best choice for the truth booth, but again NOT because they have the highest solution count (136,848). More details next week...

The third truth booth revealed that Shamoy and Maria are in fact a perfect match. After the third matching ceremony, there are 37,681 solutions.

Optimal truth booth recommendation from Rob:

Choosing either Cam-Kayla or Tevin-Kenya would yield the smallest number of resulting solutions in the worst case.

The "Couples Occurrence Count" table provides a way to determine which couple should go to the truth booth if you want to minimize the worst-case (maximum) resulting number of solutions. If couple (b,g) goes to the truth booth, there are two possibilities:

- If that couple is a perfect match, then the resulting number of solutions is couplesCount[b,g].
- If that couple is not a perfect match, then the resulting number of solutions is _NSOL_ - couplesCount[b,g].

So in the worst (worse?) case, the resulting number of solutions is max(couplesCount[b,g], _NSOL_ - couplesCount[b,g]), that is, the larger of these two counts. Because we want to reduce the number of solutions, the best strategy is to choose a couple that minimizes this quantity. Equivalently, we want to choose a couple that comes the closest to appearing in half the solutions. The following statements implement this idea:

/* find best pair for truth booth by minimizing the maximum number of resulting solutions */ num maxSolsTruth {b in BOYS, g in GIRLS} = max(couplesCount[b,g], _NSOL_ - couplesCount[b,g]); num min_maxSolsTruth init constant('BIG'); set <str,str> BEST_PAIR; for {b in BOYS, g in GIRLS} do; if min_maxSolsTruth > maxSolsTruth[b,g] then do; min_maxSolsTruth = maxSolsTruth[b,g]; BEST_PAIR = {<b,g>}; end; else if min_maxSolsTruth = maxSolsTruth[b,g] then do; BEST_PAIR = BEST_PAIR union {<b,g>}; end; end; put BEST_PAIR= min_maxSolsTruth; create data output_truth from [BOYS GIRLS] maxSolsTruth; |

After week 3, the PUT statement reports in the log:

BEST_PAIR={<'Cam','Kayla'>,<'Tevin','Kenya'>} 21889 |

The interpretation is that by choosing either of these two couples to go to the truth booth, the resulting number of solutions will be at most 21,889, and this number is best possible in the sense that choosing any other couple could yield a larger number of solutions.

As before, it is useful to display the results in a heat map:

proc sort data=output_truth; by boys girls; run; proc sgplot data=output_truth noautolegend; title "Maximum Number of Solutions if Couple goes to Truth Booth"; heatmap y=boys x=girls / weight=maxSolsTruth colormodel=(white pink red) outline x2axis; text y=Boys x=Girls text=maxSolsTruth / textattrs=(size=10pt); xaxis display=none; x2axis display=(nolabel); yaxis display=(nolabel) reverse; run; |

The lowest numbers appear in the lightest cells.

Here's the update after last night's episode. We are down to 12,860 solutions.

Using Rob's optimal strategy, the contestants should send Tevin and Kenya to the booth next episode such that in the worst-case scenario, the number of solutions will be down to 7,081.

If the matching ceremony yields no matches (a "blackout") beyond those confirmed by the truth booth, the prize pool gets cuts in half. So it would be good to figure out a way to avoid a blackout if possible. After the week 4 truth booth (before the matching ceremony), here was the heat map:

To avoid a blackout, you can solve an auxiliary optimization problem that maximizes the minimum number of matches. The formulation shown in the next code snippet uses a set of binary decision variables called IsCeremonyPair[b,g], with the interpretation that IsCeremonyPair[b,g] = 1 if boy b is paired with girl g in the matching ceremony. The first two sets of constraints enforce a one-to-one matching between boys and girls. The PerfectMatch constraint makes sure that any couple that is confirmed a "perfect match" by the truth booth is also paired together in the matching ceremony. The MaxMinCon constraint linearizes the maximin objective, as in the Fun with Flags post from July 2017. The following statements build and solve this auxiliary problem:

/* MATCHES_sol[s] = the set of pairs <b,g> for which Assign[b,g] = 1 in solution s */ set <str,str> MATCHES_sol {SOLS} init {}; for{s in SOLS} do; for{b in BOYS} do; for{g in GIRLS: Assign[b,g].sol[s] > 0.5} do; MATCHES_sol[s] = MATCHES_sol[s] union {<b,g>}; leave; end; end; end; /* find best ceremony pairs by maximizing the minimum number of matches (to avoid blackout) */ var IsCeremonyPair{b in BOYS, g in GIRLS} binary; con One_Girl_Per_Boy_Ceremony{b in BOYS}: sum{g in GIRLS} IsCeremonyPair[b,g] = 1; con One_Boy_Per_Girl_Ceremony{g in GIRLS}: sum{b in BOYS} IsCeremonyPair[b,g] = 1; con PerfectMatch{<w,b,g> in TRUTH_BOOTH_PAIRS: truthBoothOutcome[w,b,g] = 1}: IsCeremonyPair[b,g] = 1; var MaxMin >= 0 integer; max MinNumMatches = MaxMin; con MaxMinCon{s in SOLS}: MaxMin <= sum{<b,g> in MATCHES_sol[s]} IsCeremonyPair[b,g]; problem CeremonyProblem include IsCeremonyPair MaxMin MinNumMatches One_Girl_Per_Boy_Ceremony One_Boy_Per_Girl_Ceremony PerfectMatch MaxMinCon; use problem CeremonyProblem; solve; put (min{s in SOLS} sum{<b,g> in MATCHES_sol[s]} IsCeremonyPair[b,g])=; num couplesCountCeremony {b in BOYS, g in GIRLS} = (if IsCeremonyPair[b,g].sol > 0.5 then couplesCount[b,g] else .); create data output_couples from [BOYS GIRLS] couplesCount couplesCountCeremony; |

For week 4, the solver recommends an optimal matching ceremony that guarantees at least three matches and hence avoids a blackout. You can use the following PROC SGPLOT code to display the resulting solution:

proc sort data=output_couples; by boys girls; run; proc sgplot data=output_couples noautolegend; title "Couples Occurrence Count in &Number_Solutions. Solutions"; heatmapparm y=boys x=girls colorResponse=couplesCountCeremony / colormodel=(white pink red) outline x2axis; text y=Boys x=Girls text=couplesCount / textattrs=(size=10pt); xaxis display=none; x2axis display=(nolabel); yaxis display=(nolabel) reverse; run; |

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

It turns out that the actual matching ceremony chosen by the contestants in week 4 could have yielded a blackout. That is, there was at least one remaining solution that contained no matches from the ceremony.

Now we're down to 1,696 solutions:

Optimal truth booth recommendation from Rob:

Cam-Kayla would yield at most \(\max(884, 1696-884) = 884\) solutions, and they are the best choice. Note that Tevin-Kenya appear in more solutions together but have a higher worst-case count of \(\max(1005, 1969-1005) = 1005\) solutions.

We're getting there: 574 possible solutions left.

Optimal truth booth recommendation from Rob:

Cam-Kayla would yield at most \(\max(285, 574-285) = 289\) solutions, and they are still the best choice. Also, we can now conclude that Brett-Kayla are not a match, even though they never went to the truth booth.

The contestants voted to send Zak and Nutsa to the truth booth, and they were not a match, so \(574-36=538\) possible solutions remain.

We already saw one way to avoid a blackout by solving an auxiliary optimization problem to maximize the minimum number of matches. But there might be many such possible matching ceremonies that yield the same number of matches, and (as with the truth booth strategy) we would like to minimize the number of resulting solutions in the worst case. We can solve a different auxiliary problem to both avoid a blackout and minimize the number of solutions. This problem involves the IsCeremonyPair[b,g] variables and related constraints from before, as well as additional binary variables IsBeamCount[s,c] with the interpretation that IsBeamCount[s,c] = 1 if and only if the beam count for solution s is c. Because Shamoy and Maria are the only perfect match so far, we need at least two beams to avoid a blackout. The following statements build and solve this auxiliary problem:

/* find best ceremony pairs by minimizing the maximum number of resulting solutions */ /* minBeamCount is minimum number of beams to avoid blackout */ num minBeamCount = 1 + sum{<w,b,g> in TRUTH_BOOTH_PAIRS} truthBoothOutcome[w,b,g]; set BEAM_COUNTS = minBeamCount..card(BOYS) diff {card(BOYS) - 1}; /* IsBeamCount[s,c] = 1 iff sum{<b,g> in MATCHES_sol[s]} IsCeremonyPair[b,g] = c */ var IsBeamCount{SOLS, BEAM_COUNTS} binary; impvar NumSolsPerBeamCount{c in BEAM_COUNTS} = sum{s in SOLS} IsBeamCount[s,c]; var MinMax >= 0 integer; min MaxNumSolutions = MinMax; con MinMaxCon{c in BEAM_COUNTS}: MinMax >= NumSolsPerBeamCount[c]; con OneBeamCount{s in SOLS}: sum{c in BEAM_COUNTS} IsBeamCount[s,c] = 1; con Link{s in SOLS}: sum{<b,g> in MATCHES_sol[s]} IsCeremonyPair[b,g] = sum{c in BEAM_COUNTS} c * IsBeamCount[s,c]; problem CeremonyProblem2 include IsCeremonyPair IsBeamCount MinMax MaxNumSolutions One_Girl_Per_Boy_Ceremony One_Boy_Per_Girl_Ceremony PerfectMatch MinMaxCon OneBeamCount Link; use problem CeremonyProblem2; solve; print NumSolsPerBeamCount; create data output_couples from [BOYS GIRLS] couplesCount couplesCountCeremony; |

For week 7, the solver recommends an optimal matching ceremony that guarantees at least two matches (and hence avoids a blackout) and yields at most 142 solutions in the worst case (three beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

It turns out that the actual matching ceremony chosen by the contestants in week 7 (the results of which were left as a cliffhanger to be revealed in next week's episode) could yield up to 296 solutions, so the matching ceremony proposed here is better in the sense that it guarantees a much smaller worst-case number of solutions.

Update from Rob: The matching ceremony yielded four beams, and this implies several no-matches, four of which the contestants realized, and an inferred match (Tevin-Kenya), which the contestants did not realize and instead sent them to the truth booth. Although the truth booth result was treated as a cliffhanger to be revealed in the next episode, we already know the result. (In fairness to the contestants, there does not seem to be a simple explanation that Tevin and Kenya are a perfect match, without enumeration. For example, if you fix Assign['Tevin','Kenya'] = 0, the resulting problem is MILP infeasible but LP feasible.) There are now 128 remaining solutions, and that number will not change when the truth booth confirms the Tevin-Kenya match.

For next week, the solver recommends an optimal matching ceremony that guarantees at least three matches (and hence avoids a blackout) and yields at most 35 solutions in the worst case (five beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

As we already knew would happen, the truth booth confirmed that Tevin and Kenya are a perfect match. The matching ceremony, which could have yielded a blackout, leaves 34 solutions possible:

The contestants sent Cam and Samantha to the truth booth, but the results above already showed they were not a match, so that is two wasted truth booths in a row. The solver recommends an optimal matching ceremony that guarantees at least three matches (and hence avoids a blackout) and yields at most 8 solutions in the worst case (four or five beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

The contestants chose a different matching ceremony, which could have yielded a blackout. They got four beams, and now only five solutions remain:

By adding a PUT statement, you can write these solutions to the log:

for {s in SOLS} put s= MATCHES_SOL[s]; |

Here they are:

s=1 {<'Andrew','Cali'>,<'Brett','Nutsa'>,<'Cam','Bria'>,<'Daniel','Jasmine'>,<'Kwasi','Lauren'>,<'Lewis','Samantha'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Asia'>,<'Zak','Morgan'>} s=2 {<'Andrew','Cali'>,<'Brett','Samantha'>,<'Cam','Lauren'>,<'Daniel','Bria'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Nutsa'>,<'Zak','Morgan'>} s=3 {<'Andrew','Cali'>,<'Brett','Samantha'>,<'Cam','Lauren'>,<'Daniel','Nutsa'>,<'Kwasi','Kayla'>,<'Lewis','Bria'>,<'Moe','Asia'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Jasmine'>,<'Zak','Morgan'>} s=4 {<'Andrew','Samantha'>,<'Brett','Bria'>,<'Cam','Nutsa'>,<'Daniel','Cali'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Lauren'>,<'Zak','Morgan'>} s=5 {<'Andrew','Samantha'>,<'Brett','Nutsa'>,<'Cam','Morgan'>,<'Daniel','Lauren'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Cali'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Bria'>,<'Zak','Kayla'>}

Brett and Nutsa are headed to the final truth booth. Out of the four choices allowed by the fate button, they are the only couple whose status is unknown, so that was a good decision. From the latest heat map, we can see that there will be either two solutions (if Brett-Nutsa is a match) or three solutions (if not) left after the truth booth results. So it is not possible for the contestants to have logically deduced the correct matching in time for the final ceremony. It will take some guesswork for them to win the $1 million prize.

The final truth booth revealed that Brett-Nutsa is a perfect match, so we are down to two solutions (s=1 and s=5 above):

The contestants used solution s=1 for the final matching ceremony, and...

...they won the $1 million! Thanks for following along with us this season.

You can reproduce these results by using the following data.

/* Input number of weeks of data */ %let num_weeks = 10; /* Input weekly matches (from matching ceremony) up to current week */ data input_ceremony_pairs(keep=week boys_name girls_name); array girl_week{&num_weeks.} $; input boys_name $ girl_week{*}; do week = 1 to &num_weeks.; girls_name = girl_week[week]; output; end; datalines; Andrew Lauren Morgan Lauren Nutsa Samantha Lauren Lauren Samantha Cali Cali Brett Cali Asia Cali Kayla Nutsa Nutsa Nutsa Nutsa Bria Nutsa Cam Kayla Kayla Kayla Asia Kayla Kayla Cali Lauren Morgan Bria Daniel Nutsa Nutsa Samantha Lauren Bria Samantha Samantha Asia Lauren Jasmine Kwasi Asia Lauren Jasmine Bria Jasmine Asia Asia Jasmine Nutsa Lauren Lewis Samantha Jasmine Asia Kenya Lauren Bria Bria Bria Asia Samantha Moe Jasmine Bria Nutsa Samantha Asia Jasmine Jasmine Kayla Kayla Kayla Shamoy Maria Maria Maria Maria Maria Maria Maria Maria Maria Maria Tevin Kenya Kenya Kenya Jasmine Kenya Kenya Kenya Kenya Kenya Kenya Tomas Morgan Cali Bria Cali Cali Cali Kayla Morgan Jasmine Asia Zak Bria Samantha Morgan Morgan Morgan Morgan Morgan Cali Samantha Morgan ; /* Input number of correct matches (from matching ceremony) up to current week */ data input_ceremony_correct_matches; input week correct_guesses; datalines; 1 3 2 3 3 3 4 2 5 4 6 4 7 4 8 4 9 4 10 11 ; /* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */ data input_truth_booth; input week boys_name $ girls_name $ outcome; datalines; 1 Tomas Maria 0 2 Andrew Asia 0 3 Shamoy Maria 1 4 Brett Kenya 0 5 Zak Bria 0 6 Brett Cali 0 7 Zak Nutsa 0 8 Tevin Kenya 1 9 Cam Samantha 0 10 Brett Nutsa 1 ; |

The post "Are You The One?" Finding the matching through SAS Constraint Programming appeared first on Operations Research with SAS.

]]>The post Visiting all 30 Major League Baseball Stadiums - with Python and SAS® Viya® appeared first on Operations Research with SAS.

]]>A cross-country trip is pretty much an all-American experience, and so is baseball. Traveling around the country to see all 30 Major League Baseball (MLB) stadiums is not a new idea; there's even a social network between so-called "Ballpark Chasers" where people communicate and share their journeys. Even though I'm not a baseball fan myself, I find the idea of traveling around the country to visit all 30 MLB stadiums pretty interesting.

Since we all lack time, the natural question that might pop into your mind is, "How fast can I visit all 30 stadiums and see a MLB game in each?" This question was first asked by Cleary et al. (2000) in a mathematical context. This is where the math and baseball intersect. Finding the optimal trip is an expensive calculation. Discarding the schedule for a second and focusing only on ordering stadiums to visit results in more than \(2.65 \times 10^{32}\) different permutations. When you add the game schedule and distances between stadiums, the problem gets much bigger and more difficult quickly. See the Traveling Salesman Problem if you are interested in difficult scheduling problems, which is the main source of the "Traveling Baseball Fan Problem."

Before starting to talk about "The Optimal Trip" I should make some assumptions. The Optimal Trip is quite a subjective term, so I need to choose a measurable objective. My focus is to complete the schedule within the shortest time possible. Further, I assume that the superfan only uses land transportation (a car) between stadiums. Unlike some variations, I don't require the fan to return back to the origin, so the start and end cities will be different. Each stadium will be visited only once.

The Traveling Baseball Fan Problem (TBFP) has gained quite a bit of attention. There is a book by Ben Blatt and Eric Brewster about their 30 games in 30 days. They created an online visualization of such a tour for those interested; unfortunately the tool only shows schedules for the 2017 season. Their approach here is a heuristic, so the resulting solution is not guaranteed to be the "shortest possible tour." Since the problem is huge, one can only expect the true optimal solution to be obtained after optimization. Ben Blatt also wrote a mathematical optimization formulation for the shortest possible baseball road trip.

There are different ways to model this problem. The model I am going to use to optimize the TBFP is the network-based formulation presented in a SAS Global Forum 2014 paper by Chapman, Galati, and Pratt. Rob Pratt, one of the authors, wrote about the TBFP in this blog before.

Just to reiterate, ground rules for the optimal schedule:

- Use ground transportation only.
- Use driving distances between stadiums. The driving distances are obtained via OSRM.
- Stay until each game ends (assume each game lasts 3 hours).

The main challenge is to gather data, model the problem, and visualize results---all within the Python environment. Moreover, my aim here is to show you that the mathematical formulation for the TBFP can easily be written with our new open-source Python package **sasoptpy** and can be solved using SAS Viya on the cloud. If you are interested, check our Github repository for the package and our SAS Global Forum 2018 paper (Erickson and Cay) to learn more about it!

To provide variety, let's solve the problem for different time periods (2, 3, and 6 months) with two different objectives. The first objective is to finish the schedule in the shortest time possible. The second objective is to finish the schedule with spending the least amount of money. For this, I will assume $130 accommodation rate per day and $0.25 travel cost per mile. This objective was the main motivation of Rick and Mike's 1980 tour, as mentioned in the paper by Cleary et al. (2000).

I will use the network formulation from the aforementioned SAS Global Forum 2014 paper. For this model, I define directed arcs between pairs of games, eliminate the arcs that cannot be part of a feasible solution, and optimize the given objective.

The decision variable in this formulation is \(u\), which is defined for each arc \((g_1, g_2)\) as

\(u[g_1, g_2] = \begin{cases}1 & \text{if the fan attends games $g_1$ and $g_2$ and no game in between} \\0 & \text{otherwise}\end{cases}\)

Denote \(c[g_1,g_2]\) as the time between games \(g_1\) and \(g_2\) in days, including game duration, as follows:

\( c[g_1,g_2] = \begin{cases} \textrm{end}[g_2] - \textrm{end}[g_1] & \textrm{if } g_1 \not = \textrm{source and } g_2 \not = \textrm{sink} \\ \textrm{end} [g_2] - \textrm{start}[g_2] & \textrm{if } g_1 = \textrm{source and } g_2 \not = \textrm{sink} \\ 0 & \textrm{otherwise} \end{cases} \)

Also denote \(l[g]\) as the location of the game \(g\), \(\text{NODES}\) as the list of games including dummy nodes 'source' and 'sink', \(\text{ARCS}\) as the connections between games, and finally \(\text{STADIUMS}\) as the list of all stadiums. Now I can write the Network Formulation as follows:

\(\begin{array}{rlcll} \textrm{minimize:} & \displaystyle \sum_{(g_1, g_2) \in \text{ARCS}} c[g_1,g_2] \cdot u[g_1,g_2] \\ \textrm{subject to:} & \displaystyle \sum_{(g,g_2) \in \text{ARCS}} u[g,g_2] - \sum_{(g_1,g) \in \text{ARCS}} u[g_1,g] & = & \begin{cases} 1 & \text{if } g = \text{source,} \\ -1 & \text{if } g = \text{sink,} \\ 0 & \text{otherwise}\end{cases} & & \forall g \in \text{NODES} \\ & \displaystyle \sum_{(g_1,g_2) \in \text{ARCS}: g_2 \not = \text{sink and } l[g_2] = s} u[g_1, g_2] & = & 1 & & \forall s \in \text{STADIUMS} \end{array}\)

The solution of this optimization problem should produce a route starting at the source, finishing at the sink, and passing through all 30 ballparks. The objective here is to minimize the total schedule time. The first set of constraints ensures that inflow and outflow are equal for regular nodes. The second set of constraints ensures that the fan visits every stadium once.

For the second objective, I need to replace the objective function with the following:

\(\textrm{minimize:} \; \displaystyle 130 \cdot \sum_{(g_1, g_2) \in \text{ARCS}} c[g_1, g_2] \cdot u[g_1, g_2] + 0.25 \cdot \sum_{(g_1, g_2) \in \text{ARCS}: g_1 \not = \text{source} \text{ and } g_2 \not = \text{sink}} d[g_1, g_2] \cdot u[g_1, g_2]\)

where \(d\) is the distance between games in miles.

Now that I have my formulation ready, it is a breeze to write this problem using sasoptpy. Only part of the code is shown here for illustration purposes. See the Github repository for all of the code, including the code used for grabbing the season schedule from the MLB website, driving distances from OpenStreetMap, and exporting results.

I can write the Network Formulation to solve TBFP in Python as follows:

''' Defines the optimization problem and solves it. Parameters ---------- distance_data : pandas.DataFrame Distances between stadiums in miles. driving_data : pandas.DataFrame The driving times between stadiums in minutes. game_data : pandas.DataFrame The game schedule information for the current season. venue_data : pandas.DataFrame The information regarding each 30 MLB venues. start_date : datetime.date, optional The earliest start date for the schedule. end_date : datetime.date, optional The latest end date for the schedule. obj_type : integer, optional Objective type for the optimization problem, 0: Minimize total schedule time, 1: Minimize total cost ''' def tbfp(distance_data, driving_data, game_data, venue_data, start_date=datetime.date(2018, 3, 29), end_date=datetime.date(2018, 10, 31), obj_type=0): # Define a CAS session cas_session = CAS(your_cas_server, port=your_cas_port) m = so.Model(name='tbfp', session=cas_session) # Define sets, parameters and pre-process data (omitted) # Add variables use_arc = m.add_variables(ARCS, vartype=so.BIN, name='use_arc') # Define expressions for the objectives total_time = so.quick_sum( cost[g1,g2] * use_arc[g1,g2] for (g1, g2) in ARCS) total_distance = so.quick_sum( distance[location[g1], location[g2]] * use_arc[g1, g2] for (g1, g2) in ARCS if g1 != 'source' and g2 != 'sink') total_cost = total_time * 130 + total_distance * 0.25 # Set objectives if obj_type == 0: m.set_objective(total_time, sense=so.MIN) elif obj_type == 1: m.set_objective(total_cost, sense=so.MIN) # Balance constraint m.add_constraints(( so.quick_sum(use_arc[g, g2] for (gx,g2) in ARCS if gx==g) -\ so.quick_sum(use_arc[g1, g] for (g1,gx) in ARCS if gx==g)\ == (1 if g == 'source' else (-1 if g == 'sink' else 0) ) for g in NODES), name='balance') # Visit once constraint visit_once = so.ConstraintGroup(( so.quick_sum( use_arc[g1,g2] for (g1,g2) in ARCS if g2 != 'sink' and location[g2] == s) == 1 for s in STADIUMS), name='visit_once') m.include(visit_once) # Send the problem to SAS Viya solvers and solve the problem m.solve(milp={'concurrent': True}) # Post-process results (omitted) |

I ran this problem for the following twelve settings.

ID | Period | Objective (Min) |
---|---|---|

1 | 03/29 - 06/01 | Time |

2 | 03/29 - 06/01 | Cost |

3 | 06/01 - 08/01 | Time |

4 | 06/01 - 08/01 | Cost |

5 | 08/01 - 10/01 | Time |

6 | 08/01 - 10/01 | Cost |

7 | 03/29 - 07/01 | Time |

8 | 03/29 - 07/01 | Cost |

9 | 07/01 - 10/01 | Time |

10 | 07/01 - 10/01 | Cost |

11 | 03/29 - 10/01 | Time |

12 | 03/29 - 10/01 | Cost |

The last two settings (11 and 12) cover the entire 2018 MLB season from March 29th to October 1st. Therefore, these problems should give the optimal solutions for the best time and best cost objectives, respectively. My aim is to show how the problem size and solution time grow when the problem period is larger.

The ultimate benefit of working within the Python environment is ability to use open-source packages for many tasks. I have used the Bokeh package for plots and Folium for generating travel maps below. Bokeh is capable of web-ready interactive plots, which makes the visualization engaging for the user. Folium uses the Leaflet.js Javascript library to generate interactive maps based on OpenStreetMap maps. For interaction between the Bokeh scatter plots and the Leaflet maps, I have used a custom Javascript function provided by the Bokeh CustomJS class. You can see details of how the visualization part works in the Jupyter notebook.

The optimal solution I obtained from the 11th experiment gives the best schedule time for the 2018 season, which is just over 24 days (24 days and 3 hours). The solution starts with Diamondbacks @ Giants in AT&T Park, San Francisco on the 5th of June and ends with Royals @ Mariners in Safeco Field, Seattle on the 29th of June. This is the global best solution among the scheduled games this season. Maps and itineraries of selected solutions are shown below. Click on any of the plots and tables to see the Jupyter notebook. All times in these schedules are in EDT.

As a 22,528-mile trip, this schedule costs roughly $8,767. By changing the objective, it is possible to obtain a better cost, however the schedule takes longer significantly longer. Solution 12 is only 11,914 miles. It's 10,614 miles shorter and $2,149 cheaper compared to Solution 11 but takes 4 days longer.

Among the schedules you can still try at the writing of this post, Solution 9 gives the shortest schedule time (24 days 3 hours) and Solution 10 gives the best cost ($6,899). The latter schedule starts with Tigers @ Angels in Angel Stadium, Anaheim on the 6th of August, and ends with Orioles @ Mariners in Safeco Field, Seattle on the 4th of September. Note that, this solution is the longest schedule among all solutions I have with a little over 29 days.

Here's a list of all solutions:

The objective and the time period of the formulation heavily affect the solution time. Minimizing the cost takes longer due to unique optimal solutions. Moreover, increasing the time period from 3 months to 6 months nearly quadruples the solution time.

Ultimately, the best trip is up to you. You can define another objective of your choice, whether it be minimizing the cost, minimizing the schedule time, avoiding the risky connections (minimum time between games minus the driving time), minimizing the total driving time, or even maximizing the landmarks you have visited along the way. Whatever objective you choose, you can use *sasoptpy* and use the powerful SAS Viya mixed integer linear optimization solver to generate a trip that is perfect for you! Working in Python allows you to integrate packages you are familiar with and makes everything smoother. Do not forget to check my Jupyter notebook and see the Python files.

You can further improve this model based on your desire. Here I list a few ideas:

If you would like to see an away and a home game for every team exactly once, then you can add the following constraint

\(\displaystyle \sum_{(g, g_1) \in \text{ARCS}: \text{away}[g]= t \text{ and } \text{away}[g_1] \not = t} u[g,g_1]+\sum_{(g_1, g) \in \text{ARCS}: \text{away}[g]= t \text{ and } \text{away}[g_1] \not = t} u[g_1,g] = 1 \qquad \forall t \in \text{TEAMS}\)

You can try to avoid risky connections you have in the schedule by replacing the objective with

\(\displaystyle \text{maximize: } \sum_{(g_1,g_2) \in \text{ARCS}} ( \text{start}[g_2] - \text{end}[g_1] - c[g_1,g_2])\cdot u[g_1, g_2]\)

To prevent schedules that are too long, you should add a limit to the total schedule length, for example, 25 days:

\(\displaystyle \sum_{(g, \text{sink}) \in \text{ARCS}} \text{end}[g]\cdot u[g, \text{sink}] - \sum_{(\text{source}, g) \in \text{ARCS}} \text{start}[g]\cdot u[\text{source}, g] \leq 25\)

Share how you define the perfect schedule below if you have more ideas!

The post Visiting all 30 Major League Baseball Stadiums - with Python and SAS® Viya® appeared first on Operations Research with SAS.

]]>The post Optimizing your way in the woods -- Orienteering for real appeared first on Operations Research with SAS.

]]>Orienteering is the sport of finding some physical control points in the woods as fast as possible using only a map and a compass. The standard competition is point-to-point, where all the controls need to be visited in a prescribed order. (There is a subtle optimization problem in there as well; you need to figure out how you will get to the next control, factor in topography, obstacles, trails, fatigue, and so on.) The competition I attended a few months ago was score-orienteering. We were given 25 control points, and we had to visit as many of them in a set time limit as possible. (I chose two hours, but there were one-hour and four-hour options as well.) Each control is worth a certain number of points, and the runner with the most points wins. You can see the locations and the point values of the controls in this figure:

The race starts and ends at the control marked with the red dot.

There is a severe time penalty for not reaching the finish on time: 8 points for the first minute over the time limit and 1 point for each additional minute. There was also a special bonus for reaching both the East-West (6 points) or the North-South (8 points) controls.

We had about 30 minutes to plan our routes, but most of us also had to recalculate (read *improvise*) on the go.

So how does one (with a Ph.D. in optimization) go about solving a problem like this?

*Use optimization to find your way in the woods! #orienteering #optimization @sassoftware @orienteeringusa*

Click To Tweet

As usual, the most challenging aspect of this project is getting the data right. We need to get pairwise distances between the control points. This is easy to do on a map: just measure and multiply by the scaling factor. However, there are certain complications. First of all, in reality the distance will not be symmetric; that is, reaching point A from point B can be much faster than reaching B from A if B is at the top of a hill and A is in the valley. Another complication is that the distances are uncertain. Usually the controls cannot be seen from a distance; they need to be found. This adds to the time it takes to reach them. Also, the runner might make a mistake while navigating or stop for a water break. To make it worse, this uncertainty varies from runner to runner.

An overly ambitious consultant might collect data from past races and come up with an approximation for the distribution of the travel times between control points (in both directions). Knowing that, the consultant could calculate the distribution of completion times for various course options and choose the best one according to some criterion:

- most points with at most a small chance of not finishing on time, or
- highest probability of success given the number of points targeted, or
- highest expected value of points finishing within the time limit (this assumes that the runner reoptimizes after each control).

Although such a detailed analysis would solve the problem completely, it isn't practical. This is usually the case not only for orienteering but for most practical engagements: the modeler needs to simplify the problem to make it practical and solvable.

A usual orienteering course is characterized by its length and the number of controls it includes. I use a GPS tracker to see how much I actually run during a race and compare that to the ideal course length. The closer these two numbers are, the better I am at navigating or the easier the course is. In my case, at the time of this race, the ratio was around 1.5, so I figured I would end up running about 50% more than the ideal distance. (That wasn't very good, I've improved since.) I also knew my pace during these races: I could do an average speed of about 6km/h. So before the race I estimated I would run about 12 kilometers, and I could cover a course that is about 8 kilometers long.

This simplification assumes that the course for the score-O race is similar to the courses I've run before. Its power comes from the fact that it accumulates all the uncertainty into just two numbers: the average speed of the runner and the ratio between the ideal and the tracked distance.

Now we can take the ideal pairwise distances and use them in this adjusted setting.

The optimization model is a variant of the traveling salesman problem (TSP). For simplicity of exposition, we use the MTZ formulation. We have a binary variable for each control point to indicate that the control is part of the route. Another set of binary variables indicate whether each arc is used.

set NODES; num x {NODES}; num y {NODES}; num score {NODES} init 0; read data &node_data nomiss into NODES=[control] x y score; str source; for {i in NODES: score[i] = 0} do; source = i; leave; end; set <str,str> ARCS; num distance {ARCS}; read data &arc_data into ARCS=[c1 c2] distance; /* declare decision variables */ var UseNode {NODES} binary; var UseArc {ARCS} binary; |

These two variables are naturally connected by the following constraints:

con LeaveNode {i in NODES}: sum {<(i),j> in ARCS} UseArc[i,j] = UseNode[i]; con EnterNode {i in NODES}: sum {<j,(i)> in ARCS} UseArc[j,i] = UseNode[i]; |

We need to make sure that the tour is contiguous; that is, it doesn't include a subtour. So we introduce an order variable:

var U {NODES} >= 0 <= card(NODES) - 1 integer; /* if UseArc[i,j] = 1 then U[i] + 1 <= U[j] */ /* strengthened by lifting coefficient on UseArc[j,i] */ con MTZ {<i,j> in ARCS: j ne source}: U[i] + 1 - U[j] <= (U[i].ub + 1 - U[j].lb) * (1 - UseArc[i,j]) - (card(NODES) - 2) * UseArc[j,i]; fix U[source] = 0; |

The bonus points require a couple of extra binary variables:

/* East-West and North-South bonuses */ num xmin = min {i in NODES} x[i]; num xmax = max {i in NODES} x[i]; num ymin = min {i in NODES} y[i]; num ymax = max {i in NODES} y[i]; set BONUSES = 1..2; set NODES_b {BONUSES} init {}; for {i in NODES: x[i] = xmin} do; NODES_b[1] = NODES_b[1] union {i}; leave; end; for {i in NODES: x[i] = xmax} do; NODES_b[1] = NODES_b[1] union {i}; leave; end; for {i in NODES: y[i] = ymin} do; NODES_b[2] = NODES_b[2] union {i}; leave; end; for {i in NODES: y[i] = ymax} do; NODES_b[2] = NODES_b[2] union {i}; leave; end; put NODES_b[*]=; var IsBonus {BONUSES} binary; con IsBonusCon {b in BONUSES, i in NODES_b[b]}: IsBonus[b] <= UseNode[i]; |

We can have two objectives, minimize distance given a target point value or maximize points given a limited distance:

max TotalScore = sum {i in NODES} score[i] * UseNode[i] + &east_west_bonus * IsBonus[1] + &north_south_bonus * IsBonus[2]; min TotalDistance = sum {<i,j> in ARCS} distance[i,j] * UseArc[i,j]; |

The problem is modeled using PROC OPTMODEL in SAS/OR 14.3. The optimization problems are solved with the mixed integer linear programming (MILP) solver. It takes on average about half a minute to find the course with the most points given a distance budget, and it takes only a couple of seconds to potentially shorten the course.

I ran the model for various course lengths to get the course with the most points. Then I fixed the number of points and found the shortest course that achieves that. These are listed in the following table:

Distance (m) | Points |
---|---|

2811 | 11 |

3648 | 17 |

4844 | 23 |

5746 | 31 |

6932 | 36 |

7828 | 42 |

8565 | 49 |

9950 | 55 |

10693 | 61 |

11964 | 68 |

12761 | 75 |

13950 | 80 |

14889 | 87 |

15790 | 89 |

16856 | 97 |

17904 | 104 |

18870 | 108 |

19846 | 115 |

20893 | 122 |

You can click on each distance to get the optimal course for that distance. Because I optimized the course lengths, the table can be read from left to right to get the most points given a distance or right to left to get the minimum distance given a target point value.

Browsing those routes reveals some interesting insights. The East-West bonus wasn't worth the effort, even though it looked enticing and definitely doable within two hours. Turns out there weren't enough high-valued control points around those two to make it worth it. On the other hand the North-South bonus was worth it as long as a course longer than 17km (or at least 97 points) was targeted. We can also see that 1km of distance is worth about 5-7 points, which is an important number when one wants to decide how far to run for a control.

Here are a few plots of optimal courses of a given length. First, a course that visits all the controls. It is just under 21km long:

Then let us see what I should have done given the condition I set out to myself that the nominal course length should be less than 8km:

If I had managed to complete this course I would have been tied for third place. Finally, the optimal course for my pace in the four-hour race:

This score would have won the four-hour race. It is interesting that it doesn't require either of the bonuses.

And what happened at the race? I had 24 points halfway through the race. Then I got greedy and decided to go for the East-West bonus. I lost a lot of time on the East control, had to give up the West control, and had only 30 minutes to get back to the finish. I went 8 minutes overtime and finished with only 15 points. On the upside, I ran 13km in 128 minutes, which is right on target for my 12km/120min original estimate. The nominal length of my course was 8.8km, a bit more than I projected, and that is probably why I couldn't finish on time. I should have headed back toward the finish and picked up a few more controls in that area instead. In any case, it was a good lesson in orienteering as well as in optimization.

You can find the race information here and the results here. If you are interested in participating in a similar event this year you can find the necessary information here. I will be there, hopefully armed with a better strategy this year.

Thanks to Rob Pratt for providing the code I used in this blog post. The code and the data can be downloaded from here.

The post Optimizing your way in the woods -- Orienteering for real appeared first on Operations Research with SAS.

]]>The post SAS at the 2017 INFORMS Annual Meeting appeared first on Operations Research with SAS.

]]>**Technology Workshops on Saturday, October 21:**

- "JMP Pro and the Predictive Modeling Workflow," (1:00-3:30pm, room 330A)
- "Solving Business Problems with SAS Analytics and OPTMODEL," (4:00-6:30pm, room 330A)

**Software Tutorials on Tuesday, October 24 (individual presentation time indicated):**

- "Building and Solving Optimization Models with SAS," Ed Hughes and Rob Pratt (TD71, 2:00-2:45pm, room 371F)
- "JMP Pro: Top Features That Make You Say 'Wow!'," Mia Stephens (TD71, 2:45-3:30pm, room 371F)

**Other SAS-related talks (session time indicated):**

**Sunday, October 22:**

- "Tuning a Bayesian Framework for B2B Pricing," Fang Liang, Maarten Oosten (Vistex) (SB78B, 11:00am-12:30pm, room 380B)
- "Time Series Classification using Normalized Subsequences," Iman Vasheghani Farahani, Alex Chien, Russell Edward King (North Carolina State University) (SD70, 4:30-6:00pm, room 371E)

**Monday, October 23:**

- "Internet of Things: Promises, Challenges and Analytics," Patricia Neri, Ann Olecka (Honeywell), Bill Groves (Honeywell) (MA43, 8:00-9:30am, room 360B)
- "The SAS MILP Solver: Current Status and Future Developments," Imre Pólik, Menal Güzelsoy, Amar Kumar Narisetty, Philipp M. Christophel (MB77, 11:00am-12:30pm, room 372F)

**Tuesday, October 24:**

- "Robust Operational Decisions for Multi Period Industrial Gas Pipeline Networks under Uncertainty," Pelin Çay, Camilo Mancilla (Air Products and Chemicals), Robert H. Storer (Lehigh University), Luis F. Zuluaga (Lehigh University) (TC83, 12:05-1:35pm, room 382C)
- "Robust Principal Component Analysis in Cloud-based Run-time Environment," Kyungduck Cha, Seunghyun Kong, Zohreh Asgharzadeh, Arin Chaudhuri (TD70, 2:00-3:30pm, room 371E)

**Wednesday, October 25:**

- "Pricing Advance Purchase Products," Matthew Maxwell, Jennie Hu (WA23, 7:30-9:00am, room 342E)
- "Effective Methods for Solving the Bicriteria
*P*-center*P*-dispersion Problem," Golbarg Kazemi Tutunchi, Yahya Fathi (North Carolina State University) (WB58, 10:30am-12:00pm, room 362E) - "Warmstart of Interior Point Methods for Second Order Cone Optimization via Rounding Over Optimal Jordan Frames," Sertalp Bilal Çay, Imre Pólik, Tamas Terlaky (Lehigh University) (WC83, 12:30-2:00pm, room 382C)

Two additional important notes:

- Patricia Neri is the 2017 Daniel H. Wagner Prize chair, with presentations in sessions MB40, MC40, and MD40. The winner will be announced and the winning presentation will be repeated during keynote session TK04.
- Sertalp Bilal Çay will chair session WC83.

We hope you can make it to Houston to see these and many other excellent presentations! Don't forget to stop by the SAS and JMP booths to say hello!

The post SAS at the 2017 INFORMS Annual Meeting appeared first on Operations Research with SAS.

]]>The post The Traveling Salesman Traverses the Plant Genome appeared first on Operations Research with SAS.

]]>One more large number is 3 billion: the estimated number of people (over 40%) that are currently malnourished. Technological/medical advances and new understanding of the human genome have improved both our quality of life and longevity. Now that same attention must be focused on our food sources.

What does this have to do with the software tools developed here at SAS? Well, researchers at places like the USDA and General Mills, to name just a couple, are using it to help solve this problem. Because grains and cereals make up over 80% of the world food sources they are in front-line position to help. They use JMP Genomics and SAS/OR to transform breeding programs with marker-assisted plant breeding.

The idea is to associate desirable traits in crops with locations on their genome so that we can help drive selection of new plant lines to optimize trait outcomes. A basic understanding of genetics is now common, especially with controversies surrounding genetically modified organisms (GMOs). If two blue-eyed people have a child, that child most likely will have blue eyes due to genetic code inherited from his/her parents. Cross two strains of corn that are tall, and the outcome will tend to be tall corn. These are called heritable traits. The goal then is to apply genomic knowledge of heritable traits to assist plant breeders in making the best crosses (mating of two plant lines) to produce crops that not only increase yield, but have less disease, have higher vitamin content, are easier to mill, can grow in drought-tolerant conditions, require less processing, and even *taste better* (yes, hedonic traits can and are being researched for crop improvement programs) *without* direct genetic modification.

By developing tools that use the observed relationships between genetic markers to build a genetic map that facilitates trait association/prediction, crop selection, and simulation. Unlike the human genome, of which the majority has now been mapped, we don't have this “genetic roadmap” for many plant species due to strong complexities in their genome; for example, four, six, or even eight sets of chromosomes instead of two. Instead, we typically use the inheritance patterns between genetic markers to estimate a linkage map.

Markers inherited together are correlated, meaning they are closer on the genetic map. So we need algorithms that can group and order markers to build a genetic map. Markers that are correlated (a small genetic recombination distance) to a certain degree belong on the same chromosome. Once markers are assigned a chromosome, we need to determine the order of those markers that will produce the smallest genetic distance map.

It turns out this is directly analogous to the traveling salesman problem (TSP). Lucky for me, SAS just happens to have a group of experts in operations research that easily recognized this as an optimization problem using minimum spanning forests to determine groups of markers, and TSP algorithms in SAS/OR's OPTMODEL and OPTNET procedures to find optimal marker orders.

*“Hey it’s not that simple! There are also known genetic relationships that might anchor certain markers to groups/orders,”* I said; my new best friends in SAS/OR said, *“No problem, we got this!”* and applied optional node connection constraints in a side-constrained TSP solution.

The Figure above shows an estimated Linkage Map using these methods for an experimental Oat population, where colored/italicized markers were already known/anchored markers. This optimization work and JMP Genomics visualization tools for linkage and consensus maps have since been used in several research publications for genetic mapping of a variety of plants, including this SAS Global Forum paper.

With a reliable map, employing genomic selection methods (modern data mining and predictive modeling techniques for marker-assisted plant selection) becomes feasible in a plant breeding program. Using the arsenal of robust predictive models available in JMP Genomics, the goal is to find the best model to score a new plant variety for a given trait, based on genetic variability. The art of plant breeding comes in developing a new line (seed variety) that can *balance* multiple traits to give the best possible outcome. For example, just driving an increase in yield alone will often lead to the plant/product suffering in other ways.

Our solution in JMP Genomics once again uses our secret weapon found in the capabilities of SAS/OR to create a process that allows breeders to analytically evaluate all possible plant/line crosses and create a cross simulation tool that selects the potential plant crosses that will *simultaneously optimize* multiple traits at once (for example in the plot below, increasing yield while decreasing height in the selected maize plant crosses).

The figure above shows that in five generations of breeding, crossing lines 45 or 99 with line 41 would produce some of the highest predicted yields while effectively reducing the height of the plant. A traditional breeding program can take several years before gain is realized. Each year, breeders have to decide the right crosses to make under the constraints of the amount of land/resources and time available to try to improve a given crop, and then must wait an entire growing season to see how the lines perform. JMP Genomics offers an analytic solution to help breeders find optimal crosses for a set of balanced physical traits and use multi-year simulation results to dramatically speed up this process, in many cases requiring just a few hours on a computer.

Creating sustainable agriculture techniques to produce not just *more* but *healthier *food is one of our most pressing concerns. The new methods outlined above provide just one way that SAS is able to help.

The post The Traveling Salesman Traverses the Plant Genome appeared first on Operations Research with SAS.

]]>The post Fun with Flags: Optimizing Arrangements of Stars with SAS appeared first on Operations Research with SAS.

]]>Suppose we want to arrange \(n\) stars in a rectangular region. Let \(x_i\) and \(y_i\) denote the coordinates of star \(i\), for \(i=1,\dots,n\). We need some way to measure the quality of a particular arrangement. What stands out to me in the current flag is that the stars are spread out in the sense that no two stars are too close to each other. One way to capture this idea with optimization is by using a maximin model that maximizes the minimum Euclidean distance between stars. Explicitly, we want to maximize

\(\min\limits_{i,j} \sqrt{(x_i-x_j)^2+(y_i-y_j)^2}\)

subject to lower and upper bounds on \(x_i\) and \(y_i\).

The following SAS macro variables record the number of stars and the width and height of the rectangular region as proportions of the overall flag height, according to the United States code:

%let n = 50; %let width = 0.76; %let height = 0.5385; |

The following PROC OPTMODEL statements capture the mathematical optimization problem just formulated:

proc optmodel; /* declare parameters */ num n = &n; set POINTS = 1..n; num width = &width; num height = &height; /* declare variables, objective, and constraints */ var X {POINTS} >= 0 <= width; var Y {POINTS} >= 0 <= height; var Maxmin >= 0; max Objective = Maxmin; con MaxminCon {i in POINTS, j in POINTS: i < j}: Maxmin <= (X[i]-X[j])^2 + (Y[i]-Y[j])^2; /* call nonlinear programming solver with multistart option */ solve with nlp / multistart; /* save optimal solution to SAS data set for use with PROC SGPLOT */ create data solution from [i] X Y; quit; |

Note that we use squared distance instead of distance in order to avoid the non-differentiability of the square root function at 0. Also, the MULTISTART option of the nonlinear programming solver increases the likelihood of finding a globally optimal solution for this nonconvex optimization problem with many locally optimally solutions.

The following PROC SGPLOT statements plot the resulting solution:

proc sgplot data=solution aspect=%sysevalf(10/19); styleattrs wallcolor='#3C3B6E'; scatter x=X y=Y / markerattrs=(symbol='StarFilled' color=white size=20); xaxis values=(0 &width) display=none; yaxis values=(0 &height) display=none; run; |

Here is the resulting plot, which is a transpose of the existing design:

The resulting objective value is 0.0116, which corresponds to the squared distance between vertically adjacent pairs of stars.

For comparison, here is the plot of the existing design (objective value 0.0103), where instead the diagonally adjacent stars are the closest:

For each new state admitted to the union, a star is added and the new flag is introduced the next July 4th. Recent proposals to add either Puerto Rico or Washington, D.C. raise the prospect of a 51-star flag. Several researchers have looked for such arrangements among a handful of prespecified patterns. See, for example, this 2012 article in *The American Mathematical Monthly*. Instead, we have a general optimization model for \(n\) stars, so we can change the value of \(n\) to 51 and rerun the code. Here is the resulting plot:

How do you like it in comparison with this popular design that has alternating rows of nine and eight stars?

The U.S. flag adopted on June 14, 1777 had 13 stars. Here's the optimized version:

The Star Spangled Banner flag that inspired Francis Scott Key to write the U.S. national anthem had 15 stars. Here's the optimized version:

Although the problem discussed here was motivated by flag design, the same problem applies to facility location. Suppose you want to build a number of facilities within some prescribed region. If two facilities are too close to each other, they will cannibalize each other's business. So it makes sense to maximize the minimum distance between facilities.

For those in the United States, enjoy the holiday tomorrow!

The post Fun with Flags: Optimizing Arrangements of Stars with SAS appeared first on Operations Research with SAS.

]]>The post Creating Synthetic Data with SAS/OR appeared first on Operations Research with SAS.

]]>Researchers could use synthetic data to, for example, understand the format of the real data, develop an understanding of its statistical properties, build preliminary models, or tune algorithm parameters. Analytical procedures and code developed using the synthetic data can then be passed to the data owner if the collaborators are not permitted to access the original data. Even if the collaborators are eventually granted access, synthetic data allow the work to be conducted in parallel with tasks required to access the real data, allowing an efficient path to the final analysis.

One method for producing synthetic data with SAS was presented at SAS Global Forum 2017 with this corresponding paper (Bogle and Erickson 2017), which implemented an algorithm originally proposed in another paper (Bogle and Mehrotra 2016, subscription required). The implementation uses the OPTMODEL procedure and solvers in SAS/OR to create synthetic data with statistical raw moments similar to the real data.

Most SAS users are familiar with mean, variance, and covariance. The mean is the first-order raw moment, \(\mathbf{E}[X]\). The variance is the second-order marginal central moment, \(\mathbf{E}[(X-\mathbf{E}[X])^2]\), and covariance is the second-order mixed central moment, \(\mathbf{E}[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])]\). Skewness and kurtosis are the third- and fourth-order normalized moments.

This method considers only raw moments, which keeps the optimization problem linear. The more of these moments that are matched, the more like the original data it will be. It is difficult to match the moments exactly, so the algorithm tries to keep them within desired bounds. For example, one of the fourth-order moments (meaning the sum of the exponents is four) could be \(\mathbf{E}[X^2YZ]\), which could have a value of 0.5 in the real data set. The method might try to "match" this moment by constraining that moment in the new data set to be between 0.48 and 0.52. If \(N\) observations are desired, this would ideally mean that the moment value would be constrained between those bounds, \(\displaystyle{0.48 \le \frac{1}{N} \sum_{i=1}^N x_i^2y_iz_i \le 0.52}\).

There are a couple of problems with solving a system of inequalities like this. First, it is not linear or convex, so it would be very hard to solve, especially when some variables need to take integer values. This macro instead creates candidate observations and uses mathematical optimization to select which observations to include. Second, there might not be a solution that solves such a system of inequalities, so the method allows violations to the bounds on the moments and minimizes them.

There are three basic steps to the algorithm. The first step reads in the original data, computes the upper and lower bounds on the moments, and sets up other parameters for the optimization models. The second step uses column generation with a linear program to produce candidate synthetic observations. The third step uses an integer program to select the desired number of synthetic observations from the candidate synthetic observations, while minimizing the largest scaled violation of the moment bounds. The following code is the macro that generates synthetic data by calling a different macro for each of the three steps.

%macro GENDATA(INPUTDATA, METADATA, OUTPUTDATA=SyntheticData, MOMENTORDER=3, NUMOBS=0, MINNUMIPCANDS=0, LPBATCHSIZE=10, LPGAP=1E-3, NUMCOFORTHREADS=1, MILPMAXTIME=600, RELOBJGAP=1E-4, ALPHA=0.95, RANDSEED=0); proc optmodel printlevel=0; call streaminit(&RANDSEED); %PRELIMINARYSTEP(INPUTDATA=&INPUTDATA, METADATA=&METADATA, MOMENTORDER=&MOMENTORDER, ALPHA=&ALPHA); %LPSTEP(MOMENTORDER=&MOMENTORDER, NUMOBS=&NUMOBS, MINNUMIPCANDS=&MINNUMIPCANDS, LPBATCHSIZE=&LPBATCHSIZE, LPGAP=&LPGAP, NUMCOFORTHREADS=&NUMCOFORTHREADS); %IPSTEP(OUTPUTDATA=&OUTPUTDATA, MOMENTORDER=&MOMENTORDER, NUMOBS=&NUMOBS, MINNUMIPCANDS=&MINNUMIPCANDS, MILPMAXTIME=&MILPMAXTIME, RELOBJGAP=&RELOBJGAP); quit; %mend GENDATA; |

You can see that there are several parameters, which are described in the paper. Most of them control the balance of the quality of the synthetic data with the time needed to produce the synthetic data.

The macro demonstrates some useful features of PROC OPTMODEL. One option for the macro is to use PROC OPTMODEL's COFOR loop. This allows multiple LPs to be solved concurrently, which can speed up the LP step. The implementation breaks the assignment of creating candidate observations into NUMCOFORTHREADS groups, and one solve statement from each group can be run concurrently. The use of COFOR reduces the total time from five hours to three hours in the example below.

The code also calls SAS functions not specific to PROC OPTMODEL, including RAND, STD, and PROBIT. The RAND function is used to create random potential observations during the column generation process. The STD and PROBIT functions are used to determine the desired moment range based on the input data and a user-specified parameter.

The examples in the paper are based on the *Sashelp.Heart* data set. The data set is modified to fit the requirements that all variables must be numeric and there can be no missing data.

The example randomly divides the data into Test and Training data sets so that a model based on the Training data set can be tested against a model based on a synthetic data set derived from the Training data set.

The following call of the macro generates the synthetic data based on the Training data set with the appropriate METADATA data set. It allows up to four LPs to be solved concurrently, matches moments up to the third order, allows 20 minutes for the MILP solver, and uses a fairly small range for the moment bounds.

%GENDATA(INPUTDATA=Training, METADATA=Metadata, NUMCOFORTHREADS=4, MOMENTORDER=3, MILPMAXTIME=1200, ALPHA=0.2); |

One way to verify that the synthetic data are similar to the real data is to compare the means, standard deviations, and covariances of the input and synthetic data sets. The paper shows how to do this using PROC MEANS and PROC CORR on a smaller synthetic data set.

In this case we want to compare the training and synthetic data by comparing how well the models they fit score the Test data set. One of the variables indicates if a person is dead, so we can use a logistic regression to predict if a person is dead or alive. Because BMI is a better health predictor than height or weight, we have added a BMI column to the data sets. The following code calls PROC LOGISTIC for the two data sets.

proc logistic data=TrainingWithBmi; model Dead = Male AgeAtStart Diastolic Systolic Smoking Cholesterol BMI / outroc=troc; score data=TestingWithBmi out=valpred outroc=vroc; run; proc logistic data=SyntheticDataWithBmi; model Dead = Male AgeAtStart Diastolic Systolic Smoking Cholesterol BMI / outroc=troc; score data=TestingWithBmi out=valpred outroc=vroc; run; |

This produces quite a bit of output. One useful comparison is the area under the curve (AUC) of the ROC curve, which plots a comparison of sensitivity and specificity of the model. The first plot is for the original training data scored with the testing data from the original data set. The second plot is for the synthetic data scored with the testing data from the original data set. You can see that the curves are fairly similar and there is little difference between the AUCs.

This synthetic data set behaves similarly to the original data set and can be safely shared without risk of exposing personally identifiable information.

What uses do you have for synthetic data?

The post Creating Synthetic Data with SAS/OR appeared first on Operations Research with SAS.

]]>The post Operations Research Talks at SAS Global Forum 2017 appeared first on Operations Research with SAS.

]]>**Monday, April 3**

- 11:00 AM - 11:30 AM, Dolphin Level 3 - Oceanic 1 (Session 0681) "Using SAS/OR® Software to Optimize the Capacity Expansion Plan of a Robust Oil Products Distribution Network," Dr. Shahrzad Azizzadeh, SAS
- 1:00 PM - 1:30 PM, Dolphin Level 3 - Oceanic 1 (Session 1387) "Increasing Revenue in Only Four Months with SAS® Real-Time Decision Manager," Alvaro Velasquez, Telefonica
- 2:30 PM - 3:00 PM, Dolphin Level 1 - The Quad - Lower - Super Demo Station 9 (Super Demo SD921) "Optimization and SAS® Viya," Manoj Chari, SAS
- 3:30 PM - 4:00 PM, Dolphin Level 5 - The Quad - Upper - Theater 2 (Session 1055) "Using the CLP Procedure to Solve the Agent-District Assignment Problem," Stephen Sloan, Kevin Gillette, Accenture
- 4:00 PM - 4:30 PM, Dolphin Level 3 - Oceanic 1 (Session 0514) "Automated Hyperparameter Tuning for Effective Machine Learning," Patrick Koch, Brett Wujek, Oleg Golovidov, Steven Gardner, SAS

**Tuesday, April 4**

- 11:00 AM - 11:30 AM, Dolphin Level 3 - Asia 4 (Session 0454) "Change Management: Best Practices for Implementing SAS® Prescriptive Analytics," Scott Shuler, SAS
- 3:30 PM - 4:30 PM, Dolphin Level 3 - Oceanic 3 (Session 0302) "Optimize My Stock Portfolio! A Case Study with Three Different Estimates of Risk," Aric LaBarr, NC State University
- 3:30 PM - 4:30 PM, Dolphin Level 3 - Oceanic 1 (Session 0851) "Optimizing Delivery Routes with SAS® Software," Ben Murphy, Zencos, and Bruce Bedford, Oberweis Dairy Inc.
- 4:00 PM - 4:30 PM, Dolphin Level 1 - The Quad - Lower - Super Demo Station 9 (Super Demo SD909) "New Features in SAS® Simulation Studio," Edward P. Hughes, SAS
- 4:00 PM - 4:30 PM, Dolphin Level 5 - Southern Hemisphere IV (Session 2016) "Optimization and SAS® Viya," Manoj Chari, SAS

**Wednesday, April 5**

- 11:00 AM - 11:30 AM, Dolphin Level 3 - Oceanic 2 (Session 1224) "A Moment-Matching Approach for Generating Synthetic Data in SAS®," Brittany M. Bogle, The University of North Carolina at Chapel Hill, and Jared C. Erickson, SAS
- 11:30 AM - 12:00 PM, Dolphin Level 3 - Oceanic 1 (Session 0593) "Key Components and Finished Products Inventory Optimization for a Multi-Echelon Assembly System," Sherry Xu, Kansun Xia, Ruonan Qiu, SAS
- 1:00 PM - 1:30 PM, Dolphin Level 3 - Asia 1 (Session 1326) "Price Recommendation Engine for Airbnb," Praneeth Guggilla, Singdha Gutha, and Goutam Chakraborty, Oklahoma State University

That's quite a list! It's great to see so much operations research work being done with SAS, and it's even more encouraging to see so many people talking about their successes with these methods.

The post Operations Research Talks at SAS Global Forum 2017 appeared first on Operations Research with SAS.

]]>The post SAS at the 2017 INFORMS Conference on Business Analytics and Operations Research appeared first on Operations Research with SAS.

]]>- Want SAS Skills? Get SAS University Edition, a Powerful and Free Analytical Tool from SAS, James Harroun, 9:00am-10:45am, Trevi
- Data Discovery and Analysis with JMP 13 Pro, Mia Stephens, 11:00am-12:45pm, Verona
- Solving Business Problems with SAS Analytics and OPTMODEL, Rob Pratt and Golbarg Tutunchi, 3:00pm-4:45pm, Turin

- Analyzing Unstructured Text Data with the JMP 13 Text Explorer, Mia Stephens, 9:10am-10:00am, Octavius 23
- Building and Solving Optimization Models with SAS, Rob Pratt, 10:30am-11:20am, Octavius 23
- Strategic Scheduling Intelligence at JPMorgan Chase’s Card Production Center, Jay Carini (JPMorgan Chase), 3:00pm–3:40pm, Octavius Ballroom
- A Deep Learning Tutorial, Mustafa Kabul, 3:50pm-4:40pm, Octavius 5/6

- A Novel SVDD Based Control Chart for Monitoring Multivariate Sensor Data, Anya McGuirk, 1:50pm-2:40pm, Octavius 7/8
- SAS Visual Statistics, Tom Bohannon, 1:50pm-2:40pm, Octavius 21
- Optimizing Rates for Service Agreements by Shaping Loss Functions, Maarten Oosten, 4:40pm-5:30pm, Octavius 19

We look forward to seeing you in Vegas!

The post SAS at the 2017 INFORMS Conference on Business Analytics and Operations Research appeared first on Operations Research with SAS.

]]>The post Solving Kakuro Puzzles with SAS/OR appeared first on Operations Research with SAS.

]]>For example, the 16 in the top left corner indicates that the first two white cells in the top row must add up to 16. Since they must also be distinct, the only two possibilities are 7 + 9 and 9 + 7.

The puzzle is of course meant to be solved by hand by making logical deductions. But you can also use mixed integer linear programming (MILP). One natural formulation involves two sets of decision variables:

\(\begin{align*} \mathrm{V}[i,j] &= \text{the digit in cell $(i,j)$}, \\ \mathrm{X}[i,j,k] &= \begin{cases} 1 & \text{if cell $(i,j)$ contains digit $k$},\\ 0 & \text{otherwise} \end{cases} \end{align*}\)

For more details, see Kalvelagen's post.

You can capture the clues with a SAS data set:

%let n = 8; data indata; input (C1-C&n) ($); datalines; x\x 23\x 30\x x\x x\x 27\x 12\x 16\x x\16 x x x\x 17\24 x x x x\17 x x 15\29 x x x x x\35 x x x x x 12\x x\x x\x x\7 x x 7\8 x x 7\x x\x 11\x 10\16 x x x x x x\21 x x x x x\5 x x x\6 x x x x\x x\3 x x ; |

The following PROC OPTMODEL code declares sets and parameters, reads the data, declares the variables and constraints, calls the MILP solver, and outputs the solution to a SAS data set:

proc optmodel; set ROWS; set COLS = 1..&n; str clue {ROWS, COLS}; /* read two-dimensional data */ read data indata into ROWS=[_n_]; read data indata into [i=_n_] {j in COLS} <clue[i,j] = col("c"||j)>; /* create output data set for use by PROC SGPLOT */ create data puzzle from [i j]=(ROWS cross COLS) label=(compress(clue[i,j],'x')) color=(clue[i,j]='x'); /* parse clues */ num num_clues init 0; set CLUES = 1..num_clues; set <num,num> CELLS_c {CLUES} init {}; num clue_length {c in CLUES} = card(CELLS_c[c]); num clue_sum {CLUES}; str prefix, suffix; num curr; for {i in ROWS, j in COLS: clue[i,j] ne 'x'} do; prefix = scan(clue[i,j],1,'\'); suffix = scan(clue[i,j],2,'\'); if prefix ne 'x' then do; num_clues = num_clues + 1; clue_sum[num_clues] = input(prefix, best.); curr = i + 1; do until (curr not in ROWS or clue[curr,j] ne 'x'); CELLS_c[num_clues] = CELLS_c[num_clues] union {<curr,j>}; curr = curr + 1; end; end; if suffix ne 'x' then do; num_clues = num_clues + 1; clue_sum[num_clues] = input(suffix, best.); curr = j + 1; do until (curr not in COLS or clue[i,curr] ne 'x'); CELLS_c[num_clues] = CELLS_c[num_clues] union {<i,curr>}; curr = curr + 1; end; end; end; set CELLS = union {c in CLUES} CELLS_c[c]; set DIGITS = 1..9; /* decision variables */ var X {CELLS, DIGITS} binary; var V {CELLS} integer >= 1 <= 9; /* constraints */ con OneDigitPerCell {<i,j> in CELLS}: sum {k in DIGITS} X[i,j,k] = 1; con AlldiffPerClue {c in CLUES, k in DIGITS}: sum {<i,j> in CELLS_c[c]} X[i,j,k] <= 1; con VCon {<i,j> in CELLS}: sum {k in DIGITS} k * X[i,j,k] = V[i,j]; con SumCon {c in CLUES}: sum {<i,j> in CELLS_c[c]} V[i,j] = clue_sum[c]; /* call MILP solver with no objective */ solve noobj; /* create output data set for use by PROC SGPLOT */ create data solution from [i j]=(ROWS cross COLS) V label=(if <i,j> in CELLS then put(V[i,j],1.) else compress(clue[i,j],'x')) color=(<i,j> in CELLS); quit; |

Notice that the PROC OPTMODEL code is completely data-driven so that you can solve other problem instances just by changing the input data set. In this case, the MILP presolver solves the entire problem instantly:

NOTE: Problem generation will use 4 threads. NOTE: The problem has 360 variables (0 free, 0 fixed). NOTE: The problem has 324 binary and 36 integer variables. NOTE: The problem has 312 linear constraints (216 LE, 96 EQ, 0 GE, 0 range). NOTE: The problem has 1404 linear constraint coefficients. NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range). NOTE: The OPTMODEL presolver is disabled for linear problems. NOTE: The MILP presolver value AUTOMATIC is applied. NOTE: The MILP presolver removed all variables and constraints. NOTE: Optimal. NOTE: Objective = 0.

Finally, a call to PROC SGPLOT generates a nice plot of the solution:

proc sgplot data=solution noautolegend; heatmapparm x=j y=i colorresponse=color / colormodel=(black white) outline; text x=j y=i text=label / colorresponse=color colormodel=(white black); xaxis display=none; yaxis display=none reverse; run; |

The resulting plot makes it easy to verify the correctness of the solution:

Here is the corresponding DATA step for the large instance in Kalvelagen's post:

%let n = 22; data indata; input (C1-C&n) ($); datalines; x\x 12\x 38\x x\x 23\x 21\x x\x x\x 16\x 29\x 30\x x\x 33\x 30\x 3\x x\x x\x 17\x 35\x x\x 21\x 3\x x\17 x x 11\14 x x x\x 17\23 x x x x\16 x x x 15\x x\8 x x 3\3 x x x\16 x x x x x 3\30 x x x x x\13 x x x x 4\22 x x x x x x\x 23\27 x x x x x x 23\11 x x 17\17 x x x\21 x x x x x x 18\x x\13 x x x 30\45 x x x x x x x x x 30\14 x x x x 20\4 x x x\35 x x x x x 44\x 3\11 x x 10\16 x x 42\10 x x 3\x 22\29 x x x x x\17 x x 24\41 x x x x x x x 14\21 x x x x x x 20\23 x x x x\x 23\x 42\17 x x 30\3 x x x\6 x x x 16\16 x x 35\21 x x x x 40\x 3\x x\38 x x x x x x 16\x x\x 17\x 28\34 x x x x x 13\12 x x 26\4 x x x\24 x x x x\24 x x x 17\35 x x x x x 7\29 x x x x x x x x\8 x x 16\x 30\41 x x x x x x x 3\29 x x x x 29\18 x x x x\x x\x x\34 x x x x x 42\11 x x x 12\11 x x x x 15\30 x x x x 19\x x\x 12\24 x x x 14\17 x x 17\16 x x x x x x\14 x x x x\7 x x x x\16 x x 16\35 x x x x x 15\3 x x 3\x 34\x x\x 28\17 x x 23\x 35\16 x x x\16 x x x x x 4\41 x x x x x x x 6\41 x x x x x x x x\x x\4 x x 24\6 x x x 14\4 x x x\10 x x x x 16\x 29\23 x x x x\x x\x x\x 41\28 x x x x x x x 16\x x\x 29\41 x x x x x x x 43\x x\x x\x 16\23 x x x 21\x 24\29 x x x x 7\17 x x 3\23 x x x 29\16 x x 16\x x\42 x x x x x x x x\34 x x x x x x x 44\35 x x x x x x\15 x x 17\x x\8 x x 17\x x\x 29\x 3\3 x x 11\24 x x x x x 17\12 x x x\10 x x x 34\24 x x x 4\16 x x x x x 17\4 x x 30\23 x x x x\x x\x x\29 x x x x 3\10 x x x x 30\12 x x x 16\33 x x x x x 6\x x\x 17\11 x x x 30\11 x x x x 12\42 x x x x x x x x\x 13\4 x x x\29 x x x x x x x 11\35 x x x x x x\23 x x x 11\7 x x x x\16 x x 7\16 x x 14\16 x x x x x 17\x 29\x x\x 16\38 x x x x x x x\x 9\x 22\27 x x x x 15\6 x x 32\23 x x x 23\12 x x 21\4 x x 35\x 6\x x\6 x x x 27\22 x x x x x x 14\42 x x x x x x x 16\3 x x x\23 x x x x 15\x 4\4 x x 30\11 x x 29\9 x x 23\x 3\16 x x x x x x\4 x x 14\11 x x x x x\45 x x x x x x x x x 20\20 x x x x\x 17\21 x x x x x x 16\16 x x x\17 x x 16\24 x x x x x x 4\x x\34 x x x x x x\29 x x x x x\29 x x x x x\30 x x x x x x\13 x x x\4 x x x\x x\23 x x x x\16 x x x x\x x\4 x x x\8 x x ; |

And here is the plot of the input:

The same PROC OPTMODEL and PROC SGPLOT statements shown earlier then yield the following output:

NOTE: Problem generation will use 4 threads. NOTE: The problem has 4920 variables (0 free, 0 fixed). NOTE: The problem has 4428 binary and 492 integer variables. NOTE: The problem has 3664 linear constraints (2412 LE, 1252 EQ, 0 GE, 0 range). NOTE: The problem has 19188 linear constraint coefficients. NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range). NOTE: The OPTMODEL presolver is disabled for linear problems. NOTE: The MILP presolver value AUTOMATIC is applied. NOTE: The MILP presolver removed 3077 variables and 2087 constraints. NOTE: The MILP presolver removed 12000 constraint coefficients. NOTE: The MILP presolver added 22 constraint coefficients. NOTE: The MILP presolver modified 11 constraint coefficients. NOTE: The presolved problem has 1843 variables, 1577 constraints, and 7188 constraint coefficients. NOTE: The MILP solver is called. NOTE: The parallel Branch and Cut algorithm is used. NOTE: The Branch and Cut algorithm is using up to 4 threads. Node Active Sols BestInteger BestBound Gap Time 0 1 0 . 0 . 2 NOTE: The MILP presolver is applied again. 0 0 1 0 0 0.00% 2 NOTE: Optimal. NOTE: Objective = 0.

The requirement that the digits within a single clue must all be different suggests the use of the ALLDIFF constraint supported in the constraint programming solver. You can then omit the binary X[i,j,k] variables and the corresponding constraints. The resulting model is much simpler, with only one set of variables and two sets of constraints:

/* decision variables */ var V {CELLS} integer >= 1 <= 9; /* constraints */ con AlldiffCon {c in CLUES}: alldiff({<i,j> in CELLS_c[c]} V[i,j]); con SumCon {c in CLUES}: sum {<i,j> in CELLS_c[c]} V[i,j] = clue_sum[c]; /* call constraint programming solver */ solve; |

For the small instance, the constraint programming solver instantly returns the solution:

NOTE: Problem generation will use 4 threads. NOTE: The problem has 36 variables (0 free, 0 fixed). NOTE: The problem has 0 binary and 36 integer variables. NOTE: The problem has 24 linear constraints (0 LE, 24 EQ, 0 GE, 0 range). NOTE: The problem has 72 linear constraint coefficients. NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range). NOTE: The problem has 24 predicate constraints. NOTE: The OPTMODEL presolver is disabled for problems with predicates. NOTE: Required number of solutions found (1). NOTE: The data set WORK.SOLUTION has 64 observations and 5 variables. NOTE: PROCEDURE OPTMODEL used (Total process time): real time 0.10 seconds cpu time 0.06 seconds

But the large instance is more challenging and does not solve in a reasonable time, for reasons described in a paper by Helmut Simonis. One way to overcome this difficulty is to first do some precomputation, again with the constraint programming solver, to prune the domains of the decision variables. The idea is to consider each clue one at a time, using the FINDALLSOLNS option to find all solutions. Any digit that does not appear among these solutions can then be removed, by using a linear disequality (or "not equal") constraint, from the domains of the V[i,j] variables that are associated with that clue. Because these problems are independent across clues, you can use a COFOR loop to solve them concurrently. The full code is as follows:

proc optmodel; set ROWS; set COLS = 1..&n; str clue {ROWS, COLS}; /* read two-dimensional data */ read data indata into ROWS=[_n_]; read data indata into [i=_n_] {j in COLS} <clue[i,j] = col("c"||j)>; /* create output data set for use by PROC SGPLOT */ create data puzzle from [i j]=(ROWS cross COLS) label=(compress(clue[i,j],'x')) color=(clue[i,j]='x'); /* parse clues */ num num_clues init 0; set CLUES = 1..num_clues; set <num,num> CELLS_c {CLUES} init {}; num clue_length {c in CLUES} = card(CELLS_c[c]); num clue_sum {CLUES}; str prefix, suffix; num curr; for {i in ROWS, j in COLS: clue[i,j] ne 'x'} do; prefix = scan(clue[i,j],1,'\'); suffix = scan(clue[i,j],2,'\'); if prefix ne 'x' then do; num_clues = num_clues + 1; clue_sum[num_clues] = input(prefix, best.); curr = i + 1; do until (curr not in ROWS or clue[curr,j] ne 'x'); CELLS_c[num_clues] = CELLS_c[num_clues] union {<curr,j>}; curr = curr + 1; end; end; if suffix ne 'x' then do; num_clues = num_clues + 1; clue_sum[num_clues] = input(suffix, best.); curr = j + 1; do until (curr not in COLS or clue[i,curr] ne 'x'); CELLS_c[num_clues] = CELLS_c[num_clues] union {<i,curr>}; curr = curr + 1; end; end; end; set CELLS = union {c in CLUES} CELLS_c[c]; set DIGITS = 1..9; /* decision variables */ var V {CELLS} integer >= 1 <= 9; /* constraints */ con AlldiffCon {c in CLUES}: alldiff({<i,j> in CELLS_c[c]} V[i,j]); con SumCon {c in CLUES}: sum {<i,j> in CELLS_c[c]} V[i,j] = clue_sum[c]; problem Original include V AlldiffCon SumCon; /* prune the domains one clue at a time */ set DOMAIN {CELLS}; set LENGTHS_SUMS = setof {c in CLUES} <clue_length[c], clue_sum[c]>; set DOMAIN_ls {LENGTHS_SUMS}; num clue_length_this, clue_sum_this; var W {1..clue_length_this} integer >= 1 <= 9; con IncreasingWCon {k in 1..clue_length_this-1}: W[k] < W[k+1]; con SumWCon {c in CLUES}: sum {k in 1..clue_length_this} W[k] = clue_sum_this; problem Prune include W IncreasingWCon SumWCon; use problem Prune; reset options printlevel=0; cofor {<l,s> in LENGTHS_SUMS} do; put l= s=; clue_length_this = l; clue_sum_this = s; solve with clp / findallsolns; DOMAIN_ls[l,s] = setof {k in 1..clue_length_this, sol in 1.._NSOL_} W[k].sol[sol]; put DOMAIN_ls[l,s]=; end; reset options printlevel; for {<i,j> in CELLS} DOMAIN[i,j] = DIGITS; for {c in CLUES} do; clue_length_this = clue_length[c]; clue_sum_this = clue_sum[c]; for {<i,j> in CELLS_c[c]} DOMAIN[i,j] = DOMAIN[i,j] inter DOMAIN_ls[clue_length_this, clue_sum_this]; end; for {<i,j> in CELLS} put DOMAIN[i,j]=; /* solve original problem with pruned domains */ use problem Original; con PruneDomain {<i,j> in CELLS, k in DIGITS diff DOMAIN[i,j]}: V[i,j] ne k; solve; /* create output data set for use by PROC SGPLOT */ create data solution from [i j]=(ROWS cross COLS) V label=(if <i,j> in CELLS then put(V[i,j],1.) else compress(clue[i,j],'x')) color=(<i,j> in CELLS); quit; |

The entire PROC OPTMODEL call, including the precomputation step, now takes less than a second for the large instance:

NOTE: Problem generation will use 4 threads. NOTE: The problem has 492 variables (0 free, 0 fixed). NOTE: The problem has 0 binary and 492 integer variables. NOTE: The problem has 2878 linear constraints (0 LE, 268 EQ, 0 GE, 0 LT, 2610 NE, 0 GT, 0 range). NOTE: The problem has 3594 linear constraint coefficients. NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range). NOTE: The problem has 268 predicate constraints. NOTE: The OPTMODEL presolver is disabled for problems with predicates. NOTE: Required number of solutions found (1). NOTE: The data set WORK.SOLUTION has 704 observations and 5 variables. NOTE: PROCEDURE OPTMODEL used (Total process time): real time 0.53 seconds cpu time 0.96 seconds

It turns out that you can also significantly improve the performance without doing any precomputation, simply by using a different variable selection strategy:

solve with clp / varselect=fifo; |

NOTE: Problem generation will use 4 threads. NOTE: The problem has 492 variables (0 free, 0 fixed). NOTE: The problem has 0 binary and 492 integer variables. NOTE: The problem has 268 linear constraints (0 LE, 268 EQ, 0 GE, 0 range). NOTE: The problem has 984 linear constraint coefficients. NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range). NOTE: The problem has 268 predicate constraints. NOTE: Required number of solutions found (1). NOTE: The data set WORK.SOLUTION has 704 observations and 5 variables. NOTE: PROCEDURE OPTMODEL used (Total process time): real time 6.55 seconds cpu time 6.42 seconds

As in similar logic puzzles, the solution is supposed to be unique. To check uniqueness with the MILP formulation, you can add one more constraint to exclude the first solution and call the MILP solver again, as described by Kalvelagen:

/* check uniqueness */ set <num,num,num> SUPPORT; SUPPORT = {<i,j> in CELLS, k in DIGITS: X[i,j,k].sol > 0.5}; con ExcludeSolution: sum {<i,j,k> in SUPPORT} X[i,j,k] <= card(SUPPORT) - 1; solve noobj; if _solution_status_ ne 'INFEASIBLE' then put 'Multiple solutions found!'; |

For the constraint programming formulation, checking uniqueness is even simpler. Just use the FINDALLSOLNS option:

/* check uniqueness */ solve with clp / findallsolns; if _NSOL_ > 1 then put 'Multiple solutions found!'; |

The post Solving Kakuro Puzzles with SAS/OR appeared first on Operations Research with SAS.

]]>