The post SAS® Optimization Web App Using REST APIs with Embedded SAS® Visual Analytics Reports appeared first on Operations Research with SAS.

In this article, I will provide a demo of a web app that connects SAS Visual Analytics and other SAS® Viya services through REST APIs. The web interface provides an interactive platform for demonstrating small-scale MILP (Mixed Integer Linear Programming) models to non-technical audiences, along with the ability to perform scenario analysis by changing certain parameters. The purpose of this demo tool is to enable users with little to no knowledge of SAS programming to invoke the power of SAS Optimization by simply clicking a few buttons.

Since this is a web application project, I will not describe every component; instead, I will provide only the sample code for the optimization model. You may clone my GitHub repository to recreate the project.

Anyone with moderate to good front-end coding experience knows the importance of finding good JavaScript libraries for developing dynamic web applications. The server-side library is restaf-server, which authenticates the connection to a SAS Viya server. It serves up the entry point to the app (the first page) and provides the session context to the restaf library.

The restaf library is UI framework agnostic, which means it can be seamlessly integrated into any front-end framework such as Angular, React, etc. It provides a user-friendly way to make REST API calls to SAS Viya. A very simple web application showing how to run a small optimization problem and fetch a SAS Visual Analytics report to visualize output using restaf can be found here. In this blog we use the concepts from that simple example to build a much richer application.

The UI framework used in this project is ReactJS, chosen for its component-based rendering. ReactJS is widely regarded as the UI framework of choice among front-end developers and is reportedly used by major companies such as Facebook, Uber, and Netflix. In fact, it was started by a software engineer at Facebook and has steadily gained popularity. The styling of the web page, including the sidebar and the fonts, is done with MaterialUI. A few other libraries are also used in this web application, notably react-bootstrap-table, which helped me create the editable table that lets the user change input parameters and perform scenario analysis. When you watch the video below, this may look like a multi-page app. It is not: it is a single-page app, and I am rendering different React components onto the same HTML element. Pretty cool, huh? I am using React Router, another JavaScript library, to enable routing between the different components.

See below a flowchart of the underlying technologies:

In this section, I will describe in words the workflow of the Facility Location demo, walk you through each of the interactive components, and demonstrate how the user can do a scenario analysis as well.

But first, check out the video on our YouTube channel for this demo!

Now, let me give you a quick overview of the optimization problem I am trying to demonstrate using the web app. It is a facility location problem for a car manufacturing company. The company has 25 existing facilities, each of which is eligible to produce up to 5 types of cars. The problem is to optimally assign cars to be manufactured at the eligible facilities. A variable cost is incurred per car manufactured at a facility, and a one-time fixed cost is incurred for each facility depending on whether it is kept open or closed. Each facility can manufacture at most one type of car. Capacity and demand are also provided in terms of numbers of cars.
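To make the trade-off concrete, here is a minimal Python sketch that brute-forces a tiny, entirely hypothetical instance of this kind of facility location problem (3 facilities, 2 car types, made-up costs). The real app solves a much larger MILP with SAS Optimization; this is only for intuition:

```python
from itertools import product

# Hypothetical toy instance -- not the app's real 25-facility data.
facilities = ["F1", "F2", "F3"]
cars = ["Sedan", "SUV"]
fixed_cost = {"F1": 50, "F2": 30, "F3": 40}          # cost of keeping a site open
capacity = {"F1": 12, "F2": 10, "F3": 9}             # cars a site can build
demand = {"Sedan": 10, "SUV": 8}                     # cars required per type
var_cost = {("Sedan", "F1"): 2, ("Sedan", "F2"): 3, ("Sedan", "F3"): 5,
            ("SUV", "F1"): 4, ("SUV", "F2"): 6, ("SUV", "F3"): 3}

def total_cost(assignment):
    """Total cost of a car-type -> facility assignment, or None if infeasible."""
    # Each facility may build at most one type of car.
    if len(set(assignment.values())) < len(assignment):
        return None
    # A facility's capacity must cover the demand assigned to it.
    if any(demand[car] > capacity[site] for car, site in assignment.items()):
        return None
    open_sites = set(assignment.values())
    return (sum(fixed_cost[site] for site in open_sites)
            + sum(var_cost[car, site] * demand[car]
                  for car, site in assignment.items()))

# Brute-force every assignment and keep the cheapest feasible one.
feasible = []
for combo in product(facilities, repeat=len(cars)):
    assignment = dict(zip(cars, combo))
    cost = total_cost(assignment)
    if cost is not None:
        feasible.append((cost, assignment))
best_cost, best_assign = min(feasible, key=lambda pair: pair[0])
print(best_cost, best_assign)
```

Brute force is only viable at toy scale; with 25 facilities and 5 car types the search space explodes, which is exactly why the app hands the model to a MILP solver.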

On mounting this component, a SAS Visual Analytics report created by the user on the server to analyze the input data is rendered onto the page. In this report, we can interactively see which facilities are eligible to produce a particular type of car. The bubbles on the map show the different locations of the existing facilities. By clicking on each of the bars, we can filter out the facilities for that type of car. The color of each bubble represents the variable cost for producing a car at that facility and the size of the bubble shows the fixed cost incurred for keeping the facility open. The goal is to select the smallest green bubble for each of the cars.

In this component, the user is able to set a few parameters and run the optimization code based on them. They can select the objective function, which can be to minimize fixed costs, variable costs, or total costs. They may also change the demand for each type of car and/or assign a fixed facility to manufacture it using the editable table. When the user hits the 'Optimize' button, a series of promise-based API calls is submitted to the Viya server. First, the updated table from the client side is passed to the back end as a JSON object, and a table.upload action is used to upload the table to the user's own caslib. Then a string of CASL statements, which includes the code for the optimization model, is submitted using the runCASL action. Finally, the results are retrieved from the Viya server using REST APIs and the user is redirected to the output report.

The MILP model is coded as a JavaScript string and later passed into restAF functions. Anyone with some OPTMODEL experience will easily recognize the syntax in the following code block. It is written as a JavaScript function that takes the objective type and scenario name as input from the form on the web page. This piece of code, along with a few data pre-processing and post-processing steps, is submitted as a string of CASL statements, and the runCASL action is executed using REST APIs.

function optCode(ObjType, appEnv, scenario) {
  let pgm = `
    set PRODUCTS;
    set FACILITIES init {};
    set <str, str> FIXED;
    num demand {PRODUCTS};
    num fixed_cost {FACILITIES};
    num close_cost {FACILITIES};
    num x {FACILITIES};
    num y {FACILITIES};
    num var_cost {PRODUCTS, FACILITIES};
    num viable_flg {PRODUCTS, FACILITIES};
    num Obj_Type = ${ObjType};
    num capacity {FACILITIES};
    str soln_type init '${scenario}';

    read data demandTable into PRODUCTS=[Product_Name] demand;
    read data demandTable (where = (facility_name ne 'None'))
       into FIXED=[Product_Name facility_name];
    read data siteTable into FACILITIES=[facility_name]
       close_cost fixed_cost x y capacity;
    read data costTable into [Product_Name facility_name] var_cost viable_flg;

    var Assign {PRODUCTS, FACILITIES} binary;
    var Units {PRODUCTS, FACILITIES} >= 0;
    var Close {FACILITIES} binary;

    min VarCost = sum {i in PRODUCTS, j in FACILITIES} var_cost[i,j]*Units[i,j];
    min FixedCost = sum {j in FACILITIES}
       (fixed_cost[j]*(1-Close[j]) + close_cost[j]*Close[j]);
    min TotalCost = VarCost + FixedCost;

    /* Constraints */
    /* Demand for each product must be satisfied */
    con Demand_Con {i in PRODUCTS}:
       sum {j in FACILITIES} Units[i,j] >= demand[i];
    /* Each facility has a capacity constraint */
    con Capacity_Con2 {j in FACILITIES}:
       sum {i in PRODUCTS} Units[i,j] <= capacity[j]*(1-Close[j]);
    /* If product i is assigned to site j, then facility j must not be closed */
    con If_Used_Then_Not_Closed {j in FACILITIES}:
       sum {i in PRODUCTS} Assign[i,j] = (1-Close[j]);
    /* Rule out non-viable assignments */
    con Viable_Assignments {i in PRODUCTS, j in FACILITIES}:
       Assign[i,j] <= viable_flg[i,j];
    con Viable_Num_Assignments {i in PRODUCTS, j in FACILITIES}:
       Units[i,j] <= capacity[j]*Assign[i,j];

    /* Fix facility assignments for what-if analysis */
    for {<i,j> in FIXED} fix Units[i,j] = min(demand[i], capacity[j]);

    /* Solve the MILP with the selected objective */
    if Obj_Type = 1 then do; solve obj VarCost with milp; end;
    else if Obj_Type = 2 then do; solve obj FixedCost with milp; end;
    else if Obj_Type = 3 then do; solve obj TotalCost with milp; end;

    /* Clean up the solution */
    for {i in PRODUCTS, j in FACILITIES} Assign[i,j] = round(Assign[i,j]);
    for {j in FACILITIES} Close[j] = round(Close[j]);

    num siteTotalCost {j in FACILITIES} =
       sum {i in PRODUCTS} var_cost[i,j]*Units[i,j].sol
       + (fixed_cost[j]*(1-Close[j].sol) + close_cost[j]*Close[j].sol);
    num variableCosts {i in PRODUCTS, j in FACILITIES} =
       var_cost[i,j]*Units[i,j].sol;

    /* Create output tables */
    create data optimal_cost
       from [product site] = {i in PRODUCTS, j in FACILITIES: Assign[i,j] = 1}
       var_cost variableCosts Assign.sol Units.sol;
    create data optimal_facilities
       from [site] = {j in FACILITIES}
       x y Close.sol siteTotalCost soln_type capacity;
  `;
  return pgm;
}

export default optCode;

Finally, using React Router I am directly routing the user to the output SAS Visual Analytics report. This report helps visualize the output of the optimization model for each scenario, by showing which facilities to close and which ones to keep open. It also shows the assignment of the products to the facilities, along with the total cost and the variable cost for each scenario. The tree map on the left shows the different products, and the size of each rectangle represents the number of units of each product that is produced. It is connected to the geographic map on the right and enables the user to filter out the facilities for each product. In the next section, I will describe how we can create new scenarios and compare them side by side.

To create a new scenario, the user needs to go back to the 'Input Optimization Parameters' component using the burger menu. A new scenario may be created and the user can either change the number of units required for a product or fix a facility for a product. For the example in the video, I increased the number of units required for Sedan from 200 to 400 and fixed its assignment to Site2. When we analyze the output solutions in the SAS Visual Analytics report, we can see that the total and variable costs for the new scenario increase. The 'Solutions Comparison' tab in the report enables the user to visualize the decisions made by each solution side-by-side. We can see that the variable cost for Sedan has increased due to the increase in number of units. The new solution also requires Site2 to be open, while the optimal solution recommended that it be closed. Similarly, the user can create many more scenarios by giving each scenario a unique name and analyze them in the output report.

Now that I've shown you how easy it is to combine various SAS products using REST APIs, you can clone my repository and try building one for yourself. The code is sufficiently commented to enable users with some front-end experience to customize it for a different optimization problem. Happy coding!


The post Batter-pitcher matchup analysis in the 2019 Major League Baseball playoffs appeared first on Operations Research with SAS.

This article demonstrates how you can use SAS to quantify past years' batter versus pitcher statistics. You can use these historical trends to highlight the matchups to watch out for this postseason. The article also lays the groundwork for solving a lineup optimization problem. This problem is of great interest to fantasy buffs, hobbyists, as well as actual MLB managers!

One abundant source of Major League Baseball data is the event files archived by season at Retrosheet.org. The code below assumes that the files for each year of interest YYYY have been downloaded and extracted into a subdirectory called YYYY. Each event file (one per team, per year) contains game event logs for all home games of a particular team in a particular season. Events recorded per game include information about starting rosters, substitutions, umpires, and game conditions. In addition, they keep a record of every play-by-play event in the game.

Because of the intricacies of baseball score keeping, a detailed notation is used to describe what happened in a baseball play. For the purpose of this blog, however, you can greatly simplify the game data by simply determining whether a particular *at bat* ended up with a *hit* or with *no hit*. Using a regular expression, you can parse this result string to determine which at bats resulted in a hit, and which did not.

Note that many events, such as walks and hit-by-pitch, count neither for nor against batting average; these events are ignored in this analysis.
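As a quick sanity check, the hit/no-hit classification can be sketched in Python using the same prefix patterns that appear in the SAS macro later in this post (the example event strings are illustrative):

```python
import re

# The same prefix patterns as the SAS macro, anchored at the start of the
# Retrosheet play result string.
HIT = re.compile(r"(S\d)|(D\d)|(DGR)|(T\d)|(HR)")    # single/double/GR double/triple/HR
NO_HIT = re.compile(r"([1-9])|(FC)|(E)|(K)")         # fielded out/FC/error/strikeout

def classify(event):
    """Return 1 for a hit, 0 for a no-hit at bat, None for ignored events."""
    # Check the hit pattern first, mirroring the if/else order in the SAS code.
    if HIT.match(event):
        return 1
    if NO_HIT.match(event):
        return 0
    return None  # walks ("W"), hit-by-pitch ("HP"), etc. are not at bats

examples = ["S8/G", "D7/L", "DGR", "T9/F", "HR/F", "8/F", "FC5", "E6", "K", "W", "HP"]
print([classify(e) for e in examples])
```

The order of the checks matters: testing the hit pattern before the no-hit pattern matches the if/else-if logic of the SAS data step.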

To do this, you can parse the raw data as follows:

As you read through the file:

- keep track of the current pitcher for the home and away teams (see /* 1 */ and /* 2 */ below)
  - do this by checking for a "start" or "sub" event where the position is listed as a pitcher (position code = 1)
  - store the pitcher id in the appropriate variable
- identify which plays result in a *hit* or *no hit* (see /* 3 */ through /* 6 */ below)
  - do this by parsing the result string that corresponds to each play
- record the batter, pitcher, and outcome of each applicable play as an observation in the output data set (see /* 7 */ below)

%macro loadYears(YEAR_START, YEAR_END, DATASET, LEAGUE=MLB, DIR=C:/data/baseball);
   /** Specify N or NL to import only National League home team data files **/
   %if &LEAGUE = NL %then %let LEAGUE=N;
   /** Specify A or AL to import only American League home team data files **/
   %else %if &LEAGUE = AL %then %let LEAGUE=A;
   %else %let LEAGUE=*;

   data fileList;
      length filename $ 100;
      do YR=&YEAR_START to &YEAR_END;
         filename=CATS("&DIR/",YR,"/",YR,"*.EV&LEAGUE");
         output;
      end;
      drop YR;
   run;

   data &DATASET;
      length pitcher $8 batter $8 outcome 4 event $30;
      if _N_ = 1 then do;
         /** Prefixes that indicate a hit **/
         hitPrefix0 = "S\d";     /* a single */
         hitPrefix1 = "D\d";     /* a double */
         hitPrefix2 = "DGR";     /* a ground rule double */
         hitPrefix3 = "T\d";     /* a triple */
         hitPrefix4 = "HR";      /* a home run */
         /** Prefixes that indicate no hit **/
         noHitPrefix0 = "[1-9]"; /* an out in the field */
         noHitPrefix1 = "FC";    /* a fielder's choice */
         noHitPrefix2 = "E";     /* a fielding error */
         noHitPrefix3 = "K";     /* a strikeout */
         /** Build regular expressions that check the prefixes **/
         expHit = "/(" || hitPrefix0 || ")|(" || hitPrefix1 || ")|(" ||
                  hitPrefix2 || ")|(" || hitPrefix3 || ")|(" || hitPrefix4 || ")/";
         expNoHit = "/(" || noHitPrefix0 || ")|(" || noHitPrefix1 || ")|(" ||
                    noHitPrefix2 || ")|(" || noHitPrefix3 || ")/";
         retain reHit;   /* regular expression matching a hit event */
         retain reNoHit; /* regular expression matching a no-hit event */
         reHit = prxparse(expHit);
         reNoHit = prxparse(expNoHit);
         /** Error checking will tell if a regular expression is invalid **/
         if missing(reHit) then do;
            putlog "ERROR: Invalid expression " expHit;
            stop;
         end;
         if missing(reNoHit) then do;
            putlog "ERROR: Invalid expression " expNoHit;
            stop;
         end;
      end;
      set fileList;
      infile dummy fileVar=filename end=done DLM=',' DSD MISSOVER;
      do while(^done);
         format pitcherHome $10. pitcherAway $10. pitcher $10.;
         /** The current home and away pitchers carry over between observations **/
         retain pitcherHome pitcherAway;
         input eventType $ col1 $ col2 $ col3 $ col4 $ col5 $ col6 $;
         /** Events of interest are given in the formats:
             start, playerID, playerName, isHomeTeam, battingOrder, positionCode
             sub,   playerID, playerName, isHomeTeam, battingOrder, positionCode
             play,  inningNumber, batterIsHomeTeam, batterPlayerID,
                    ballStrikeCount, pitchResultSequence, playResult **/
         if eventType in ("start", "sub") and col3 = "0" and col5 = "1" then do; /* 1 */
            /** pitcher entering for away team **/
            pitcherAway = col1;
         end;
         else if eventType in ("start", "sub") and col3 = "1" and col5 = "1" then do; /* 2 */
            /** pitcher entering for home team **/
            pitcherHome = col1;
         end;
         else if eventType = "play" then do; /* 3 */
            batter = col3;
            if col2 = "0" then pitcher = pitcherHome; /* 4 */
            else pitcher = pitcherAway;
            event = col6;
            if prxmatch(reHit, event) = 1 then do; /* 5 */
               outcome = 1;
               output;
            end;
            else if prxmatch(reNoHit, event) = 1 then do; /* 6 */
               outcome = 0;
               output;
            end;
         end;
      end;
      keep batter pitcher outcome event; /* 7 */
   run;
%mend loadYears;

After producing a data set of hit/no-hit outcomes, you can aggregate these data to produce one observation per batter-pitcher pair. For example, you can aggregate the number of at bats "ab" and the batting average "avg". To limit the analysis to players in the 2019 playoffs, here is a roster data file. You can use this file to create a data set containing the 25-player active roster of each playoff team:

data playoffRosters;
   infile "rosters.txt";
   input playerId $ team $;
run;

Next, the following macro aggregates batter-pitcher matchup data for a particular list of MLB players.

%macro dataPrep(outcomes, bvp, playerList);
   %global nBatters nPitchers;

   /* Aggregate per batter-pitcher pair */
   proc sql;
      create table &bvp.Tmp as
      select batter, pitcher,
             sum(outcome)/count(outcome) as avg,
             count(outcome) as ab
      from &outcomes
      group by batter, pitcher;
   quit;

   /* Aggregate stats per batter and per pitcher.
      Keep only batters and pitchers in playerList. */
   proc sql;
      create table battersTmp as
      select a.batter as batter,
             SUM(a.ab) as ab,
             SUM(a.avg*a.ab)/SUM(a.ab) as avg
      from &bvp.Tmp as a
      join &playerList as b on a.batter EQ b.playerid
      group by batter
      order by ab desc;
   quit;

   proc sql;
      create table pitchersTmp as
      select a.pitcher,
             SUM(a.ab) as ab,
             SUM(a.avg*a.ab)/SUM(a.ab) as avg
      from &bvp.Tmp as a
      join &playerList as b on a.pitcher EQ b.playerid
      group by pitcher
      order by ab desc;
   quit;

   /* Join together */
   proc sql;
      create table &bvp as
      select a.*,
             b.ab as batter_ab, b.avg as batter_avg,
             c.ab as pitcher_ab, c.avg as pitcher_avg
      from &bvp.Tmp a
      join battersTmp b on a.batter = b.batter
      join pitchersTmp c on a.pitcher = c.pitcher;
      select count(distinct batter) into :nBatters from &bvp;
      select count(distinct pitcher) into :nPitchers from &bvp;
   quit;

   proc sql;
      drop table battersTmp;
      drop table pitchersTmp;
      drop table &bvp.Tmp;
   quit;
%mend dataPrep;
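Note that the per-batter `avg` above is an at-bat-weighted mean of the pair averages, not a simple mean: `SUM(avg*ab)/SUM(ab)` recovers total hits over total at bats. A quick sanity check with hypothetical numbers:

```python
# Toy per-pair rows (batter, pitcher, avg, ab) as produced by the first query;
# the numbers here are hypothetical.
pairs = [
    ("b1", "p1", 0.250, 8),  # 2 hits in 8 at bats
    ("b1", "p2", 0.500, 4),  # 2 hits in 4 at bats
]

# Mirrors SUM(a.avg*a.ab)/SUM(a.ab): weighting by at bats recovers
# total hits over total at bats (4 hits in 12 at bats here).
ab_total = sum(ab for _, _, _, ab in pairs)
avg_weighted = sum(avg * ab for _, _, avg, ab in pairs) / ab_total
print(ab_total, round(avg_weighted, 3))
```

A simple mean of the two pair averages would give .375, overweighting the small 4-at-bat sample; the weighted version correctly yields .333.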

Now, for the analysis: you can invoke the macros

%let LEAGUE=MLB;
%loadYears(2010, 2018, outcomes, LEAGUE=&LEAGUE);
%dataPrep(outcomes, bvp, playoffRosters);

and inspect the data

proc sort data=bvp;
   by descending ab;
run;

proc print data=bvp (obs=3);
run;

This indicates that the batter-pitcher pairs with the most at bats from 2010 to 2018 are Michael Brantley (branm003) against Justin Verlander (verlj001), Asdrubal Cabrera (cabra002) against Max Scherzer (schem001), and Nick Markakis (markn001) against Max Scherzer (schem001).

Next, you can use the following statements to see all the historical batting averages of the key batter-pitcher matchups. To avoid noise from tiny sample sets, you can filter out matchups with fewer than 10 at bats. In the generated heat map visualization, red corresponds to higher batting average and blue corresponds to lower batting average. Since the National League and American League teams play each other infrequently, it is helpful to show them one at a time. Here are the National League matchups:

%let LEAGUE=NL;
%loadYears(2010, 2018, outcomes, LEAGUE=&LEAGUE);
%dataPrep(outcomes, bvp, playoffRosters);

proc sgplot data=bvp(where=(ab GE 10)) noautolegend;
   title "&LEAGUE Matchup Batting Averages 2010-2018";
   heatmap y=batter x=pitcher / weight=avg
      colormodel=(blue lightBlue pink red) outline x2axis;
   xaxis display=none;
   x2axis display=ALL;
   yaxis display=ALL reverse;
run;

and the American League matchups:

%let LEAGUE=AL;
%loadYears(2010, 2018, outcomes, LEAGUE=&LEAGUE);
%dataPrep(outcomes, bvp, playoffRosters);

proc sgplot data=bvp(where=(ab GE 10)) noautolegend;
   title "&LEAGUE Matchup Batting Averages 2010-2018";
   heatmap y=batter x=pitcher / weight=avg
      colormodel=(blue lightBlue pink red) outline x2axis;
   xaxis display=none;
   x2axis display=ALL;
   yaxis display=ALL reverse;
run;

Noteworthy matchups from the heat maps include:

- Anthony Rendon (renda001) has a .583 batting average in 12 at bats against Wade Miley (milew001)
- Paul Goldschmidt (goldp001) has a .048 batting average in 21 at bats against Pedro Baez (baezp001)
- Alex Bregman (brega001) has a .500 batting average in 10 at bats against Blake Snell (snelb001)
- Yuli Gurriel (gurry001) has a .136 batting average in 10 at bats against James Paxton (paxtj001)

So, how have these actually panned out? Alex Bregman faced Blake Snell in game 2 of the ALDS. Bregman had 1 hit in 2 at bats including a home run off Snell! Goldschmidt never got to face Baez, as the Dodgers didn't make it out of the NLDS. Yuli Gurriel was hitless in 4 at bats against James Paxton in the ALCS. Anthony Rendon won't get to face Wade Miley since the Astros have not included Miley on the World Series roster.

Baseball managers have often criticized batter-pitcher matchup data because of their poor predictive power. This is typically due to very small sample sizes. In a future post, I will analyze this data using collaborative filtering, a machine learning technique that will help address the problem of small sample sizes. From there, I will explore ways to optimally choose a lineup of batters to face a particular pitcher. On an introductory level, this series of posts will demonstrate an end-to-end approach to lineup optimization.


The post Bringing Analytics to the Soccer Transfer Season appeared first on Operations Research with SAS.

Major European football (soccer) leagues came to an end after an intense year. Manchester City claimed the Premier League title after a high-intensity title race against Liverpool. Remarkably, Liverpool's 97 points would have been enough to win 25 of the last 27 PL titles. Luckily, the Reds deserved and won the Champions League, redeeming themselves after losing last year's final.

Football is over on the pitch for the season, but the battle of the transfer season is just beginning. Money shapes major football leagues all over the world, despite rare successes by lower-budget teams such as Leicester City in 2016. Teams across Europe are changing the outcome of their domestic leagues with massive transfer budgets. Transfer season is when teams shape their potential for the next season. Sports analytics is often used to analyze teams on the pitch, but it can be brought to the transfer season as well. We have a chance to analyze the upcoming transfer season using mathematical optimization and the capabilities of SAS Viya.

In football, players can move between clubs during the transfer season. If they are out of contract, clubs can acquire them and sign a contract. Otherwise, the current contract needs to be terminated before any transfer. In this case, the purchasing team pays an amount called the *transfer fee*.

"How should we allocate our transfer budget to maximize the benefit we gain?" This is the ultimate question that every team needs to answer. (Teams often try to answer "Which player should we get to make our fans happy?", but no one truly knows what could make fans happy.) Maximizing benefit under a limited resource is known as the knapsack problem in combinatorial optimization. Given a set of items and their values, the knapsack problem is to find the selection of items to pack within a weight limit that maximizes total value. We can ask a similar question here: given a set of players with their values and ratings, choose which players to transfer to maximize the total team rating within a budget limit.
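For intuition, here is a minimal 0-1 knapsack solver in Python; the rating gains and transfer fees below are hypothetical, not drawn from the SoFIFA data used later:

```python
def knapsack(values, weights, budget):
    """0-1 knapsack by dynamic programming over integer budget levels."""
    best = [0] * (budget + 1)
    for value, weight in zip(values, weights):
        # Iterate budgets downward so each item is used at most once.
        for b in range(budget, weight - 1, -1):
            best[b] = max(best[b], best[b - weight] + value)
    return best[budget]

# Hypothetical players: rating gains as values, transfer fees (in millions) as weights.
gains = [9, 7, 6, 4]
fees = [60, 40, 30, 20]
print(knapsack(gains, fees, budget=90))
```

With a budget of 90 the best choice is the three cheaper players (fees 40 + 30 + 20) for a total gain of 17, beating the single expensive star; the transfer model below layers positions and eligibility on top of this same core idea.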

Even though writing a detailed mathematical model of the problem is challenging, I will show how a simple model can be written to benefit from the capabilities of optimization. Before we dive any further, note that we are solving a simplified problem under the following assumptions to make things easier:

- We consider only the starting lineup to measure team ratings
- Teams can transfer any player as long as their current value is paid
- We only focus on acquiring players, not selling them
- Teams use the same formation for the next year
- Players can play only at the positions listed for them in the data set

One of the most challenging stages of any analytical problem is to obtain clean data. At this point, we are lucky to have a great web resource: sofifa.com. SoFIFA has more data than we need for this problem. By using parallel web requests, we managed to create a database of 12,000 players sorted by their overall rating. The web scraper is available on GitHub and the data are available as a CSV file.

*As an important side note, since these models are being run on data based on the football game FIFA, not on real player metrics, they are a better reflection of the players in the computer game, not the players in real life. However, these same concepts can be applied to real player data if you have access to it.*

Our aim is to maximize the sum of player ratings in the starting lineup of teams. We will solve the problem separately for each team. For each position, we filter the list of players who have a better rating than what the team currently has. Then, the increase in the total rating is used to measure the performance of the transfer for the team.

Let us define \(P\) as the set of all players, \(J\) as the set of team positions, and \(E \subseteq P \times J\) as the set of eligible player-position pairs. The following parameters are used to define the problem:

- \(c_j\): Current rating of the player at position \(j\)
- \(o_i\): Overall rating of player \(i\)
- \(B\): Team budget
- \(v_i\): Transfer value of player \(i\)

The main decision variable \(x_{ij}\) is binary, indicating whether player \(i\) is transferred for position \(j\). We also have an auxiliary variable \(r_j\) for the final rating of position \(j\) in the formation.

The objective function can be written as the summation of the final ratings:

\(\max \sum_{j \in J} r_j\)

Our first constraint is the budget for the transfer. The total value of players transferred cannot exceed the team budget:

\(\sum_{(i,j) \in E} v_i \, x_{ij} \le B\)

The next constraint defines the final rating for each position. It accounts for a transferred player \(i\) replacing the current player at position \(j\):

\(r_j = c_j + \sum_{i:\,(i,j) \in E} (o_i - c_j) \, x_{ij} \quad \forall j \in J\)

The following two constraints satisfy conditions that at most one player is transferred for a given position, and the one player cannot be transferred for two different positions:

\(\sum_{i:\,(i,j) \in E} x_{ij} \le 1 \quad \forall j \in J\)

\(\sum_{j:\,(i,j) \in E} x_{ij} \le 1 \quad \forall i \in P\)

We model this problem using *sasoptpy*, an open-source Python interface of SAS Optimization.

m = so.Model(name='optimal_team', session=session)
rating = m.add_variables(POSITIONS, name='rating')
transfer = m.add_variables(ELIG, name='transfer', vartype=so.BIN)
m.set_objective(
    so.quick_sum(rating[j] for j in POSITIONS),
    name='total_rating', sense=so.MAX)
m.add_constraint(
    so.quick_sum(transfer[i, j] * value[i] for (i, j) in ELIG) <= budget,
    name='budget_con')
m.add_constraints((
    rating[j] == overall[member[j]] + so.quick_sum(
        transfer[i, j] * (overall[i] - overall[member[j]])
        for (i, j2) in ELIG if j == j2)
    for j in POSITIONS), name='transfer_con')
m.add_constraints((
    so.quick_sum(transfer[i, j] for (i2, j) in ELIG if i == i2) <= 1
    for i in PLAYERS), name='only_one_position')
m.add_constraints((
    so.quick_sum(transfer[i, j] for (i, j2) in ELIG if j == j2) <= 1
    for j in POSITIONS), name='only_one_transfer')
m.solve()

Notice that it is very easy to model this problem using the Python interface. Our open-source optimization modeling package sasoptpy uses the runOptmodel action under the hood, as shown in examples in the documentation. If you are familiar with PROC OPTMODEL, you can write the SAS code and run it on SAS Viya directly.

We have run the optimal transfer problem for the top six teams in the Premier League standings: Manchester City, Liverpool, Chelsea, Tottenham, Arsenal, and Manchester United. The current team and budget information were obtained from SoFIFA at the time of execution. We filtered out all players older than 33, since most players reach their peak before that age and then steadily lose performance.

See the table below for a comparison between optimal transfers for each team. The positions of the transfers are given in the following figures below the table.

As mentioned above, we do not consider the likelihood of the transfer itself. We consider what money could buy if teams are able to get players at their current valuation.

Manchester City increases its total team rating from 944 to 972, a gain of 28 points, if they spend all of their current transfer budget of €170M. With a rather limited budget of €90M, Liverpool can increase its total rating by 17 points, whereas Manchester United's total team rating can increase by 36 points with their massive budget of €175M.

The efficiency column is calculated by dividing the change in total rating by the total money spent in millions of euros. We expect the efficiency of a transfer window to be higher when a few players have significantly lower ratings than the rest of the team and can be replaced with rather cheap alternatives. Arsenal has the highest efficiency and can increase its total rating by 0.31 points per million euros by purchasing 4 players.

The reason the total rating of Liverpool does not increase as much as Arsenal's, despite their close transfer budgets, can be explained by the variation in player ratings. For Arsenal, the rating of the right back (RB) increases 9 points (from 73 to 82) with a transfer worth €17M. Liverpool's lowest rating in the current team is 80. Player values tend to increase sharply as ratings increase:

Therefore, it is clear why some teams have an advantage in the transfer season. For these teams, it is easy to improve the team by replacing the weakest player. Consider these two extremes: Manchester City has to spend €170M to improve its total rating by 28 points, whereas Arsenal increases its total rating the same amount by spending €90M only.
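The efficiency figures are just the rating gain divided by the spend; plugging in the two extremes quoted above:

```python
# Efficiency = total rating increase per million euros spent,
# using the figures quoted in the text for the two extreme teams.
spend = {"Manchester City": (28, 170), "Arsenal": (28, 90)}
efficiency = {team: round(gain / cost, 2) for team, (gain, cost) in spend.items()}
print(efficiency)
```

The same 28-point gain costs Manchester City nearly twice as much per point, which is exactly the advantage a team with one cheap-to-fix weak spot enjoys.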

Here's how the old and new lineups look for each team. New transfers are colored red while existing players are in blue:

Next, let us have a look at how the budget affects the decisions. We will vary Liverpool's transfer budget from €0 to €200M in increments of €10M to see how it changes the outcome.

As seen below in detail, efficiency (total rating increase per million euros) decreases as we pay more money for a relatively lower change, as expected.

It seems Liverpool gets the best value for its money if the Reds transfer Thiago Emiliano da Silva for the CB position. Notice that efficiency converges to a 0.16 total rating increase per million euros spent as we keep increasing the budget.

We have looked only at the current ratings of the players up to this point. The next problem we solve includes "potential" ratings of the new transfers. Naturally, young players have significantly higher potential than older players. We need to replace the rating constraint as follows:

\(r_j = \bar{c}_j + \sum_{i:\,(i,j) \in E} (\bar{o}_i - \bar{c}_j) \, x_{ij} \quad \forall j \in J\)

where \(\bar{o}_i\) is the potential rating of player \(i\), and \(\bar{c}_j\) is the potential rating of the current player at position \(j\) in the team.

For players under 25 years old, the optimal solution is to replace Henderson and Matip with Melo and de Ligt for €36M and €44M, respectively. These changes increase the potential rating by 18 points:

*Edit*: An earlier version of the blog post compared potential ratings of new transfers to current ratings of the current team. After fixing the problem, results have changed slightly.

*Edit #2*: We have updated results after fixing a filtering issue with the CSV database.

Based on reader suggestions, we had a look at the optimal squad under a €150M budget. Our objective is to maximize the potential rating and build a full team. I chose a 4-4-2 formation for illustration purposes. The optimal squad costs €148.3M and has a potential rating of 982:

I hope you enjoyed this brief analysis of potential transfers for top Premier League teams using Python and SAS Viya. I would be happy to answer any questions, especially about how the model is constructed and how easily it is implemented in Python using sasoptpy. As usual, all the code for the problem is available on GitHub.

The post Bringing Analytics to the Soccer Transfer Season appeared first on Operations Research with SAS.

]]>The post "Are You The One?" Finding the matching through SAS Constraint Programming appeared first on Operations Research with SAS.

I bet you were not expecting a photo of 22 youngsters in swimsuits in a math blog, were you? The first time I heard about this reality TV show called "Are You The One?", my nerdiness and love for OR yelled at me: "There is so much math in this show!!!"

I'm sure the producers did not mean it that way, but I saw mathematical optimization in it and just HAD to code it. Season 7 premieres August 15th, and if you are as nerdy as I am, you can use my code to figure out the solution way before all the participants do.

According to Wikipedia, "Are You The One?" is a reality TV show produced by MTV, in which a group of men and women are secretly paired into male-female couples via a matchmaking algorithm. Then, while living together, the contestants try to identify all of these "perfect matches." If they succeed, the entire group shares a prize of up to $1 million.

In each episode, the participants get two new pieces of information about the coupling:

- They get to choose one couple to go to the "truth booth" and ask if that couple is a perfect match or not.
- They all pair up in a "matching ceremony" and get to know how many couples they got right (not which ones though).

This way, after each episode they have more information with which to find their matches. If they don't find all the matches by episode 10, they lose.

This problem can be solved through constraint programming because we're attempting to find all possible solutions (matchings) that satisfy certain constraints (such as "Pete and Lola are NOT a couple" or "out of these 10 given couples, only 2 are correct"). The more information you get as the weeks go by, the more constrained your problem is and the fewer feasible solutions you have, until you are left with only one (hopefully by week 10 at the latest).
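To make this concrete, here is a small Python sketch (not from the original post) that brute-forces a hypothetical three-couple version of the game, using the two kinds of clues above as filters. All names besides Pete and Lola are made up:

```python
from itertools import permutations

boys = ["Pete", "Max", "Leo"]
girls = ["Lola", "Ann", "Zoe"]

# Every possible matching: boy i is paired with girls[perm[i]]
solutions = [dict(zip(boys, p)) for p in permutations(girls)]
assert len(solutions) == 6  # 3! matchings before any clues

# Truth booth clue: Pete and Lola are NOT a couple
solutions = [s for s in solutions if s["Pete"] != "Lola"]

# Ceremony clue: the guess below got exactly 1 couple right
guess = {"Pete": "Ann", "Max": "Lola", "Leo": "Zoe"}
solutions = [s for s in solutions
             if sum(s[b] == g for b, g in guess.items()) == 1]

print(len(solutions))  # prints 2
```

Two clues already cut the six candidate matchings down to two; the real show does the same with 11 couples, so a solver that enumerates all feasible solutions is the natural tool.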

With SAS/OR Constraint Programming (PROC OPTMODEL or PROC CLP), we are able to discover this perfect matching much earlier than the participants.

The formulation shown in the next section uses a set of binary decision variables called Assign[b,g], with the interpretation that Assign[b,g] = 1 if boy b is assigned to girl g in the matching. The first two sets of constraints enforce a one-to-one matching between boys and girls. The Ceremony_Matches constraint makes sure that each solution respects the matching ceremony results. The Truth_Booth constraint forces Assign[b,g] = 1 or 0 (depending on the truth booth outcome) for the couple (b,g) that entered the truth booth.

Here is the SAS code I used to solve this problem. To run it for the next season, just copy all the code and update the "STEP 1" portion each week.

/***********************************************/
/******* STEP 1 - UPDATE YOUR INPUTS ***********/
/***********************************************/

/* Input number of weeks of data */
%let num_weeks = 5;

/* Input weekly matches (from matching ceremony) up to current week */
/* For example, in Season 6, as of week 5, these were the couples in each ceremony */
data input_ceremony_pairs(keep=week boys_name girls_name);
   array girl_week{&num_weeks.} $;
   input boys_name $ girl_week{*};
   do week = 1 to &num_weeks.;
      girls_name = girl_week[week];
      output;
   end;
   datalines;
Anthony Geles Diandra Jada Keyana Nicole
Clinton Uche Uche Uche Uche Jada
Dimitri Diandra Nicole Nurys Alexis Uche
Ethan Jada Jada Alexis Nicole Geles
Joe Zoe Audrey Zoe Zoe Zoe
Kareem Alivia Alivia Alivia Diandra Alivia
Keith Alexis Alexis Diandra Nurys Alexis
Malcolm Nurys Nurys Geles Alivia Diandra
Michael Keyana Keyana Audrey Geles Nurys
Shad Audrey Geles Keyana Audrey Audrey
Tyler Nicole Zoe Nicole Jada Keyana
;

/* Input number of correct matches (from matching ceremony) up to current week */
/* For example, in Season 6, as of week 5, these were the correct guesses */
data input_ceremony_correct_matches;
   input week correct_guesses;
   datalines;
1 3
2 1
3 2
4 3
5 1
;

/* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */
/* For example, in Season 6, as of week 5, these were the truth booths */
data input_truth_booth;
   input week boys_name $ girls_name $ outcome;
   datalines;
1 Ethan Keyana 0
2 Anthony Geles 0
3 Malcolm Nurys 0
4 Dimitri Nicole 0
5 Clinton Uche 0
;

/***********************************************/
/****** STEP 2 - RUN WITHOUT MODIFYING CODE ****/
/***********************************************/

/* Constraint Programming Formulation */
proc optmodel;
   /* SETS */
   set <num,str,str> CEREMONY_PAIRS;
   set BOYS = setof {<w,b,g> in CEREMONY_PAIRS} b;
   set GIRLS = setof {<w,b,g> in CEREMONY_PAIRS} g;
   set WEEKS = 1..&num_weeks.;
   set <num,str,str> TRUTH_BOOTH_PAIRS;
   set SOLS init 1.._NSOL_;
   num correctMatches{WEEKS};
   num truthBoothOutcome{TRUTH_BOOTH_PAIRS};

   /* READ DATA */
   read data input_ceremony_pairs into CEREMONY_PAIRS=[week boys_name girls_name];
   read data input_truth_booth into TRUTH_BOOTH_PAIRS=[week boys_name girls_name]
      truthBoothOutcome=outcome;
   read data input_ceremony_correct_matches into [week] correctMatches=correct_guesses;

   /* DECISION VAR */
   /* Assign[b,g] = 1 if boy b is assigned to girl g in the matching; 0 otherwise */
   var Assign{BOYS,GIRLS} BINARY;

   /* CONSTRAINTS */
   con One_Girl_Per_Boy{b in BOYS}: sum{g in GIRLS} Assign[b,g] = 1;
   con One_Boy_Per_Girl{g in GIRLS}: sum{b in BOYS} Assign[b,g] = 1;

   /* WEEKLY CONSTRAINTS */
   con Ceremony_Matches{w in WEEKS}:
      sum{<(w),b,g> in CEREMONY_PAIRS} Assign[b,g] = correctMatches[w];
   con Truth_Booth{<w,b,g> in TRUTH_BOOTH_PAIRS}:
      Assign[b,g] = truthBoothOutcome[w,b,g];

   /* SOLVE */
   solve with clp / FINDALLSOLNS;

   /* CREATE OUTPUT */
   num couplesCount{BOYS, GIRLS} init 0;
   for{s in SOLS} do;
      for{b in BOYS} do;
         for{g in GIRLS: Assign[b,g].sol[s] > 0.5} do;
            couplesCount[b,g] = couplesCount[b,g] + 1;
            leave;
         end;
      end;
   end;
   create data output_couples from [BOYS GIRLS] couplesCount;
   call symputx('Number_Solutions',_NSOL_);
quit;

To visualize the solution we can use the SAS SGPLOT procedure to create a heat map that illustrates the number of feasible solutions that contain a certain couple matched.

proc sort data=output_couples;
   by boys girls;
run;

proc sgplot data=output_couples noautolegend;
   title "Couples Occurrence Count in &Number_Solutions. Solutions";
   heatmap y=boys x=girls / weight=couplesCount colormodel=(white pink red) outline x2axis;
   text y=Boys x=Girls text=couplesCount / textattrs=(size=10pt);
   xaxis display=none;
   x2axis display=(nolabel);
   yaxis display=(nolabel) reverse;
run;

For example, if we run the code above with the given data (Season 6, Week 5) we can see this diagram:

In this output we can see, for example, that Ethan and Geles are for sure not a couple (bummer) and that Tyler and Nicole seem to be the most promising couple (so far).

Now use the following data (Season 6, Week 8) and my optimization and heat map code. Be ready for a surprise!

/***********************************************/
/******* STEP 1 - UPDATE YOUR INPUTS ***********/
/***********************************************/

/* Input number of weeks of data */
%let num_weeks = 8;

/* Input weekly matches (from matching ceremony) up to current week */
/* For example, in Season 6, as of week 8, these were the couples in each ceremony */
data input_ceremony_pairs(keep=week boys_name girls_name);
   array girl_week{&num_weeks.} $;
   input boys_name $ girl_week{*};
   do week = 1 to &num_weeks.;
      girls_name = girl_week[week];
      output;
   end;
   datalines;
Anthony Geles Diandra Jada Keyana Nicole Keyana Keyana Alivia
Clinton Uche Uche Uche Uche Jada Geles Geles Geles
Dimitri Diandra Nicole Nurys Alexis Uche Diandra Diandra Diandra
Ethan Jada Jada Alexis Nicole Geles Jada Zoe Alexis
Joe Zoe Audrey Zoe Zoe Zoe Alexis Uche Jada
Kareem Alivia Alivia Alivia Diandra Alivia Nurys Nurys Nurys
Keith Alexis Alexis Diandra Nurys Alexis Zoe Jada Audrey
Malcolm Nurys Nurys Geles Alivia Diandra Alivia Alexis Uche
Michael Keyana Keyana Audrey Geles Nurys Uche Audrey Keyana
Shad Audrey Geles Keyana Audrey Audrey Audrey Alivia Zoe
Tyler Nicole Zoe Nicole Jada Keyana Nicole Nicole Nicole
;

/* Input number of correct matches (from matching ceremony) up to current week */
/* For example, in Season 6, as of week 8, these were the correct guesses */
data input_ceremony_correct_matches;
   input week correct_guesses;
   datalines;
1 3
2 1
3 2
4 3
5 1
6 4
7 5
8 3
;

/* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */
/* For example, in Season 6, as of week 8, these were the truth booths */
data input_truth_booth;
   input week boys_name $ girls_name $ outcome;
   datalines;
1 Ethan Keyana 0
2 Anthony Geles 0
3 Malcolm Nurys 0
4 Dimitri Nicole 0
5 Clinton Uche 0
6 Keith Alexis 0
7 Keith Alivia 0
8 Michael Audrey 0
;

FYI - The participants in Season 6 did not find the perfect matching until week 10. I bet they would have loved to have SAS Constraint Programming handy.

To understand better how the number of possible matches (feasible solutions) decreases with each additional data piece (constraints), you can see the following super cool animation (thanks, Rob Pratt!):

Notice that the truth booths in weeks 4 and 6 were wasted because the status of the selected couple could already be deduced, so the number of possible solutions did not change. In a later update, I'll show how to optimally select which couple should go to the truth booth.

So, now that you know how to use SAS Constraint Programming, let's help these poor guys find their perfect matches in Season 7. Follow this blog and see how much faster than the contestants we'll figure out the matching!

And so it begins. After the first "truth booth" and "matching ceremony" (and lots of drama), here is the first heat map.

Teaser from Rob:

If the contestants have a choice, they should choose either Shamoy-Maria or Tomas-Morgan to go to the next truth booth, but NOT because those couples appear in the most remaining solutions.

Unfortunately the contestants did not follow Rob's suggestion. After the second truth booth and matching ceremony, we now have 287,580 solutions and actually Shamoy-Maria is now the most promising couple! Here is the heat map.

Update from Rob:

The fate button required the contestants to pair Andrew or Cam with Asia or Nutsa for the truth booth. Because the four pairs (Andrew-Asia, Andrew-Nutsa, Cam-Asia, Cam-Nutsa) all appeared in the same number of solutions (165,326) after week 1, these four choices were equally desirable to send to the truth booth, with respect to reducing the number of possible solutions. After week 2, Shamoy-Maria would be the best choice for the truth booth, but again NOT because they have the highest solution count (136,848). More details next week...

The third truth booth revealed that Shamoy and Maria are in fact a perfect match. After the third matching ceremony, there are 37,681 solutions.

Optimal truth booth recommendation from Rob:

Choosing either Cam-Kayla or Tevin-Kenya would yield the smallest number of resulting solutions in the worst case.

The "Couples Occurrence Count" table provides a way to determine which couple should go to the truth booth if you want to minimize the worst-case (maximum) resulting number of solutions. If couple (b,g) goes to the truth booth, there are two possibilities:

- If that couple is a perfect match, then the resulting number of solutions is couplesCount[b,g].
- If that couple is not a perfect match, then the resulting number of solutions is _NSOL_ - couplesCount[b,g].

So in the worst case, the resulting number of solutions is max(couplesCount[b,g], _NSOL_ - couplesCount[b,g]), that is, the larger of these two counts. Because we want to reduce the number of solutions, the best strategy is to choose a couple that minimizes this quantity. Equivalently, we want to choose the couple that comes closest to appearing in half the solutions. The following statements implement this idea:

/* find best pair for truth booth by minimizing the maximum number of resulting solutions */
num maxSolsTruth {b in BOYS, g in GIRLS} = max(couplesCount[b,g], _NSOL_ - couplesCount[b,g]);
num min_maxSolsTruth init constant('BIG');
set <str,str> BEST_PAIR;
for {b in BOYS, g in GIRLS} do;
   if min_maxSolsTruth > maxSolsTruth[b,g] then do;
      min_maxSolsTruth = maxSolsTruth[b,g];
      BEST_PAIR = {<b,g>};
   end;
   else if min_maxSolsTruth = maxSolsTruth[b,g] then do;
      BEST_PAIR = BEST_PAIR union {<b,g>};
   end;
end;
put BEST_PAIR= min_maxSolsTruth;
create data output_truth from [BOYS GIRLS] maxSolsTruth;

After week 3, the PUT statement reports in the log:

BEST_PAIR={<'Cam','Kayla'>,<'Tevin','Kenya'>} 21889

The interpretation is that by choosing either of these two couples to go to the truth booth, the resulting number of solutions will be at most 21,889, and this number is best possible in the sense that choosing any other couple could yield a larger number of solutions.
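The same minimax rule can be sketched in a few lines of plain Python; the couple counts below are made up for illustration, not taken from the show:

```python
# Hypothetical couple counts out of n_sol = 100 remaining solutions
couples_count = {("Cam", "Kayla"): 48,
                 ("Tevin", "Kenya"): 70,
                 ("Zak", "Nutsa"): 10}
n_sol = 100

# Worst case after the truth booth: the larger of the "match" and
# "no match" outcomes, max(count, n_sol - count)
worst_case = {pair: max(c, n_sol - c) for pair, c in couples_count.items()}

# Best pick: the couple closest to appearing in half of the solutions
best_pair = min(worst_case, key=worst_case.get)
print(best_pair, worst_case[best_pair])  # prints ('Cam', 'Kayla') 52
```

Cam-Kayla wins here not because their count (48) is highest, but because it is nearest to 50, so either truth booth answer eliminates close to half of the solutions.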

As before, it is useful to display the results in a heat map:

proc sort data=output_truth;
   by boys girls;
run;

proc sgplot data=output_truth noautolegend;
   title "Maximum Number of Solutions if Couple goes to Truth Booth";
   heatmap y=boys x=girls / weight=maxSolsTruth colormodel=(white pink red) outline x2axis;
   text y=Boys x=Girls text=maxSolsTruth / textattrs=(size=10pt);
   xaxis display=none;
   x2axis display=(nolabel);
   yaxis display=(nolabel) reverse;
run;

The lowest numbers appear in the lightest cells.

Here's the update after last night's episode. We are down to 12,860 solutions.

Using Rob's optimal strategy, the contestants should send Tevin and Kenya to the booth next episode so that, in the worst case, the number of solutions will be down to 7,081.

If the matching ceremony yields no matches (a "blackout") beyond those confirmed by the truth booth, the prize pool gets cut in half. So it would be good to figure out a way to avoid a blackout if possible. After the week 4 truth booth (before the matching ceremony), here was the heat map:

To avoid a blackout, you can solve an auxiliary optimization problem that maximizes the minimum number of matches. The formulation shown in the next code snippet uses a set of binary decision variables called IsCeremonyPair[b,g], with the interpretation that IsCeremonyPair[b,g] = 1 if boy b is paired with girl g in the matching ceremony. The first two sets of constraints enforce a one-to-one matching between boys and girls. The PerfectMatch constraint makes sure that any couple that is confirmed a "perfect match" by the truth booth is also paired together in the matching ceremony. The MaxMinCon constraint linearizes the maximin objective, as in the Fun with Flags post from July 2017. The following statements build and solve this auxiliary problem:

/* MATCHES_sol[s] = the set of pairs <b,g> for which Assign[b,g] = 1 in solution s */
set <str,str> MATCHES_sol {SOLS} init {};
for{s in SOLS} do;
   for{b in BOYS} do;
      for{g in GIRLS: Assign[b,g].sol[s] > 0.5} do;
         MATCHES_sol[s] = MATCHES_sol[s] union {<b,g>};
         leave;
      end;
   end;
end;

/* find best ceremony pairs by maximizing the minimum number of matches (to avoid blackout) */
var IsCeremonyPair{b in BOYS, g in GIRLS} binary;
con One_Girl_Per_Boy_Ceremony{b in BOYS}: sum{g in GIRLS} IsCeremonyPair[b,g] = 1;
con One_Boy_Per_Girl_Ceremony{g in GIRLS}: sum{b in BOYS} IsCeremonyPair[b,g] = 1;
con PerfectMatch{<w,b,g> in TRUTH_BOOTH_PAIRS: truthBoothOutcome[w,b,g] = 1}:
   IsCeremonyPair[b,g] = 1;
var MaxMin >= 0 integer;
max MinNumMatches = MaxMin;
con MaxMinCon{s in SOLS}: MaxMin <= sum{<b,g> in MATCHES_sol[s]} IsCeremonyPair[b,g];
problem CeremonyProblem include
   IsCeremonyPair MaxMin MinNumMatches
   One_Girl_Per_Boy_Ceremony One_Boy_Per_Girl_Ceremony PerfectMatch MaxMinCon;
use problem CeremonyProblem;
solve;
put (min{s in SOLS} sum{<b,g> in MATCHES_sol[s]} IsCeremonyPair[b,g])=;
num couplesCountCeremony {b in BOYS, g in GIRLS} =
   (if IsCeremonyPair[b,g].sol > 0.5 then couplesCount[b,g] else .);
create data output_couples from [BOYS GIRLS] couplesCount couplesCountCeremony;

For week 4, the solver recommends an optimal matching ceremony that guarantees at least three matches and hence avoids a blackout. You can use the following PROC SGPLOT code to display the resulting solution:

proc sort data=output_couples;
   by boys girls;
run;

proc sgplot data=output_couples noautolegend;
   title "Couples Occurrence Count in &Number_Solutions. Solutions";
   heatmapparm y=boys x=girls colorResponse=couplesCountCeremony /
      colormodel=(white pink red) outline x2axis;
   text y=Boys x=Girls text=couplesCount / textattrs=(size=10pt);
   xaxis display=none;
   x2axis display=(nolabel);
   yaxis display=(nolabel) reverse;
run;

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

It turns out that the actual matching ceremony chosen by the contestants in week 4 could have yielded a blackout. That is, there was at least one remaining solution that contained no matches from the ceremony.

Now we're down to 1,696 solutions:

Optimal truth booth recommendation from Rob:

Cam-Kayla would yield at most \(\) solutions, and they are the best choice. Note that Tevin-Kenya appear in more solutions together but have a higher worst-case count of \(\) solutions.

We're getting there: 574 possible solutions left.

Optimal truth booth recommendation from Rob:

Cam-Kayla would yield at most \(\) solutions, and they are still the best choice. Also, we can now conclude that Brett-Kayla are not a match, even though they never went to the truth booth.

The contestants voted to send Zak and Nutsa to the truth booth, and they were not a match, so \(\) possible solutions remain.

We already saw one way to avoid a blackout by solving an auxiliary optimization problem to maximize the minimum number of matches. But there might be many such possible matching ceremonies that yield the same number of matches, and (as with the truth booth strategy) we would like to minimize the number of resulting solutions in the worst case. We can solve a different auxiliary problem to both avoid a blackout and minimize the number of solutions. This problem involves the IsCeremonyPair[b,g] variables and related constraints from before, as well as additional binary variables IsBeamCount[s,c] with the interpretation that IsBeamCount[s,c] = 1 if and only if the beam count for solution s is c. Because Shamoy and Maria are the only perfect match so far, we need at least two beams to avoid a blackout. The following statements build and solve this auxiliary problem:

/* find best ceremony pairs by minimizing the maximum number of resulting solutions */
/* minBeamCount is minimum number of beams to avoid blackout */
num minBeamCount = 1 + sum{<w,b,g> in TRUTH_BOOTH_PAIRS} truthBoothOutcome[w,b,g];
set BEAM_COUNTS = minBeamCount..card(BOYS) diff {card(BOYS) - 1};

/* IsBeamCount[s,c] = 1 iff sum{<b,g> in MATCHES_sol[s]} IsCeremonyPair[b,g] = c */
var IsBeamCount{SOLS, BEAM_COUNTS} binary;
impvar NumSolsPerBeamCount{c in BEAM_COUNTS} = sum{s in SOLS} IsBeamCount[s,c];
var MinMax >= 0 integer;
min MaxNumSolutions = MinMax;
con MinMaxCon{c in BEAM_COUNTS}: MinMax >= NumSolsPerBeamCount[c];
con OneBeamCount{s in SOLS}: sum{c in BEAM_COUNTS} IsBeamCount[s,c] = 1;
con Link{s in SOLS}:
   sum{<b,g> in MATCHES_sol[s]} IsCeremonyPair[b,g]
   = sum{c in BEAM_COUNTS} c * IsBeamCount[s,c];
problem CeremonyProblem2 include
   IsCeremonyPair IsBeamCount MinMax MaxNumSolutions
   One_Girl_Per_Boy_Ceremony One_Boy_Per_Girl_Ceremony PerfectMatch
   MinMaxCon OneBeamCount Link;
use problem CeremonyProblem2;
solve;
print NumSolsPerBeamCount;
create data output_couples from [BOYS GIRLS] couplesCount couplesCountCeremony;

For week 7, the solver recommends an optimal matching ceremony that guarantees at least two matches (and hence avoids a blackout) and yields at most 142 solutions in the worst case (three beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

It turns out that the actual matching ceremony chosen by the contestants in week 7 (the results of which were left as a cliffhanger to be revealed in next week's episode) could yield up to 296 solutions, so the matching ceremony proposed here is better in the sense that it guarantees a much smaller worst-case number of solutions.

Update from Rob: The matching ceremony yielded four beams, and this implies several no-matches, four of which the contestants realized, and an inferred match (Tevin-Kenya), which the contestants did not realize and instead sent them to the truth booth. Although the truth booth result was treated as a cliffhanger to be revealed in the next episode, we already know the result. (In fairness to the contestants, there does not seem to be a simple explanation that Tevin and Kenya are a perfect match, without enumeration. For example, if you fix Assign['Tevin','Kenya'] = 0, the resulting problem is MILP infeasible but LP feasible.) There are now 128 remaining solutions, and that number will not change when the truth booth confirms the Tevin-Kenya match.

For next week, the solver recommends an optimal matching ceremony that guarantees at least three matches (and hence avoids a blackout) and yields at most 35 solutions in the worst case (five beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

As we already knew would happen, the truth booth confirmed that Tevin and Kenya are a perfect match. The matching ceremony, which could have yielded a blackout, leaves 34 solutions possible:

The contestants sent Cam and Samantha to the truth booth, but the results above already showed they were not a match, so that is two wasted truth booths in a row. The solver recommends an optimal matching ceremony that guarantees at least three matches (and hence avoids a blackout) and yields at most 8 solutions in the worst case (four or five beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

The contestants chose a different matching ceremony, which could have yielded a blackout. They got four beams, and now only five solutions remain:

By adding a PUT statement, you can write these solutions to the log:

for {s in SOLS} put s= MATCHES_sol[s];

Here they are:

s=1 {<'Andrew','Cali'>,<'Brett','Nutsa'>,<'Cam','Bria'>,<'Daniel','Jasmine'>,<'Kwasi','Lauren'>,<'Lewis','Samantha'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Asia'>,<'Zak','Morgan'>}
s=2 {<'Andrew','Cali'>,<'Brett','Samantha'>,<'Cam','Lauren'>,<'Daniel','Bria'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Nutsa'>,<'Zak','Morgan'>}
s=3 {<'Andrew','Cali'>,<'Brett','Samantha'>,<'Cam','Lauren'>,<'Daniel','Nutsa'>,<'Kwasi','Kayla'>,<'Lewis','Bria'>,<'Moe','Asia'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Jasmine'>,<'Zak','Morgan'>}
s=4 {<'Andrew','Samantha'>,<'Brett','Bria'>,<'Cam','Nutsa'>,<'Daniel','Cali'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Lauren'>,<'Zak','Morgan'>}
s=5 {<'Andrew','Samantha'>,<'Brett','Nutsa'>,<'Cam','Morgan'>,<'Daniel','Lauren'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Cali'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Bria'>,<'Zak','Kayla'>}

Brett and Nutsa are headed to the final truth booth. Out of the four choices allowed by the fate button, they are the only couple whose status is unknown, so that was a good decision. From the latest heat map, we can see that there will be either two solutions (if Brett-Nutsa is a match) or three solutions (if not) left after the truth booth results. So it is not possible for the contestants to have logically deduced the correct matching in time for the final ceremony. It will take some guesswork for them to win the $1 million prize.
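You can verify the two-or-three count directly from the five solutions listed earlier with a couple of lines of Python, keeping just Brett's partner in each solution:

```python
# Brett's partner in each of the five remaining solutions (from the log above)
brett_partner = {1: "Nutsa", 2: "Samantha", 3: "Samantha", 4: "Bria", 5: "Nutsa"}

# Split the solutions by the possible truth booth outcomes for Brett-Nutsa
with_nutsa = [s for s, g in brett_partner.items() if g == "Nutsa"]
without_nutsa = [s for s, g in brett_partner.items() if g != "Nutsa"]

print(len(with_nutsa), len(without_nutsa))  # prints 2 3
```

Either outcome leaves more than one solution, which is exactly why the contestants cannot deduce the full matching before the final ceremony.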

The final truth booth revealed that Brett-Nutsa is a perfect match, so we are down to two solutions (s=1 and s=5 above):

The contestants used solution s=1 for the final matching ceremony, and...

...they won the $1 million! Thanks for following along with us this season.

You can reproduce these results by using the following data.

/* Input number of weeks of data */
%let num_weeks = 10;

/* Input weekly matches (from matching ceremony) up to current week */
data input_ceremony_pairs(keep=week boys_name girls_name);
   array girl_week{&num_weeks.} $;
   input boys_name $ girl_week{*};
   do week = 1 to &num_weeks.;
      girls_name = girl_week[week];
      output;
   end;
   datalines;
Andrew Lauren Morgan Lauren Nutsa Samantha Lauren Lauren Samantha Cali Cali
Brett Cali Asia Cali Kayla Nutsa Nutsa Nutsa Nutsa Bria Nutsa
Cam Kayla Kayla Kayla Asia Kayla Kayla Cali Lauren Morgan Bria
Daniel Nutsa Nutsa Samantha Lauren Bria Samantha Samantha Asia Lauren Jasmine
Kwasi Asia Lauren Jasmine Bria Jasmine Asia Asia Jasmine Nutsa Lauren
Lewis Samantha Jasmine Asia Kenya Lauren Bria Bria Bria Asia Samantha
Moe Jasmine Bria Nutsa Samantha Asia Jasmine Jasmine Kayla Kayla Kayla
Shamoy Maria Maria Maria Maria Maria Maria Maria Maria Maria Maria
Tevin Kenya Kenya Kenya Jasmine Kenya Kenya Kenya Kenya Kenya Kenya
Tomas Morgan Cali Bria Cali Cali Cali Kayla Morgan Jasmine Asia
Zak Bria Samantha Morgan Morgan Morgan Morgan Morgan Cali Samantha Morgan
;

/* Input number of correct matches (from matching ceremony) up to current week */
data input_ceremony_correct_matches;
   input week correct_guesses;
   datalines;
1 3
2 3
3 3
4 2
5 4
6 4
7 4
8 4
9 4
10 11
;

/* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */
data input_truth_booth;
   input week boys_name $ girls_name $ outcome;
   datalines;
1 Tomas Maria 0
2 Andrew Asia 0
3 Shamoy Maria 1
4 Brett Kenya 0
5 Zak Bria 0
6 Brett Cali 0
7 Zak Nutsa 0
8 Tevin Kenya 1
9 Cam Samantha 0
10 Brett Nutsa 1
;

The post "Are You The One?" Finding the matching through SAS Constraint Programming appeared first on Operations Research with SAS.

The post Visiting all 30 Major League Baseball Stadiums - with Python and SAS® Viya® appeared first on Operations Research with SAS.

A cross-country trip is pretty much an all-American experience, and so is baseball. Traveling around the country to see all 30 Major League Baseball (MLB) stadiums is not a new idea; there's even a social network for so-called "Ballpark Chasers," where people communicate and share their journeys. Even though I'm not a baseball fan myself, I find the idea of traveling around the country to visit all 30 MLB stadiums pretty interesting.

Since we all lack time, the natural question that might pop into your mind is, "How fast can I visit all 30 stadiums and see an MLB game in each?" This question was first asked in a mathematical context by Cleary et al. (2000). This is where math and baseball intersect. Finding the optimal trip is an expensive calculation. Setting the schedule aside for a moment and focusing only on the order in which to visit the stadiums already gives \(30!\), more than \(2 \times 10^{32}\), different permutations. When you add the game schedule and distances between stadiums, the problem gets much bigger and more difficult quickly. See the Traveling Salesman Problem if you are interested in difficult routing problems; it is the basis of the "Traveling Baseball Fan Problem."
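A quick Python check illustrates that factorial growth in the number of orderings:

```python
import math

# Number of ways to order n stadiums in a tour
# (ignoring the game schedule and distances entirely)
for n in (5, 10, 30):
    print(n, math.factorial(n))

# 30 stadiums alone give roughly 2.65e32 possible orderings
assert math.factorial(30) > 2e32
```

Even before the game schedule enters the picture, exhaustive enumeration is hopeless, which is why an optimization solver is needed.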

Before starting to talk about "The Optimal Trip" I should make some assumptions. The Optimal Trip is quite a subjective term, so I need to choose a measurable objective. My focus is to complete the schedule within the shortest time possible. Further, I assume that the superfan only uses land transportation (a car) between stadiums. Unlike some variations, I don't require the fan to return back to the origin, so the start and end cities will be different. Each stadium will be visited only once.

The Traveling Baseball Fan Problem (TBFP) has gained quite a bit of attention. There is a book by Ben Blatt and Eric Brewster about their 30 games in 30 days. They created an online visualization of such a tour for those interested; unfortunately, the tool only shows schedules for the 2017 season. Their approach is a heuristic, so the resulting solution is not guaranteed to be the shortest possible tour. Since the problem is huge, the true optimal solution can only be obtained through optimization. Ben Blatt also wrote a mathematical optimization formulation for the shortest possible baseball road trip.

There are different ways to model this problem. The model I am going to use to optimize the TBFP is the network-based formulation presented in a SAS Global Forum 2014 paper by Chapman, Galati, and Pratt. Rob Pratt, one of the authors, wrote about the TBFP in this blog before.

Just to reiterate, ground rules for the optimal schedule:

- Use ground transportation only.
- Use driving distances between stadiums. The driving distances are obtained via OSRM.
- Stay until each game ends (assume each game lasts 3 hours).

The main challenge is to gather data, model the problem, and visualize results---all within the Python environment. Moreover, my aim here is to show you that the mathematical formulation for the TBFP can easily be written with our new open-source Python package **sasoptpy** and solved using SAS Viya in the cloud. If you are interested, check our GitHub repository for the package and our SAS Global Forum 2018 paper (Erickson and Cay) to learn more!

To provide variety, let's solve the problem for different time periods (2, 3, and 6 months) with two different objectives. The first objective is to finish the schedule in the shortest time possible. The second objective is to finish the schedule while spending the least amount of money. For this, I will assume a $130 accommodation rate per day and a $0.25 travel cost per mile. This objective was the main motivation of Rick and Mike's 1980 tour, as mentioned in the paper by Cleary et al. (2000).
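Under these assumptions, the cost of a candidate schedule is a simple linear function of its length and mileage; here is a minimal sketch (the trip numbers in the example are made up):

```python
def trip_cost(days, miles, day_rate=130.0, mile_rate=0.25):
    """Total trip cost: accommodation per day plus driving cost per mile."""
    return day_rate * days + mile_rate * miles

# Example: a hypothetical 30-day, 12,000-mile trip
print(trip_cost(30, 12000))  # prints 6900.0
```

Note the trade-off this objective encodes: a longer but shorter-mileage route can beat a faster route once lodging dominates the cost.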

I will use the network formulation from the aforementioned SAS Global Forum 2014 paper. For this model, I define directed arcs between pairs of games, eliminate the arcs that cannot be part of a feasible solution, and optimize the given objective.
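The arc-elimination step boils down to a time-feasibility test: an arc from game \(g_1\) to game \(g_2\) can survive only if the fan can watch all of \(g_1\) (assumed 3 hours) and still drive to \(g_2\) before it starts. A minimal sketch of that test, with made-up game times and driving duration:

```python
from datetime import datetime, timedelta

GAME_DURATION = timedelta(hours=3)

def arc_is_feasible(start1, start2, driving_time):
    """Keep arc (g1, g2) only if leaving at the end of g1 and driving
    to g2's stadium gets the fan there before g2 starts."""
    return start1 + GAME_DURATION + driving_time <= start2

g1 = datetime(2018, 6, 5, 13, 0)   # hypothetical 1:00 PM game
g2 = datetime(2018, 6, 5, 19, 30)  # hypothetical 7:30 PM game the same day
print(arc_is_feasible(g1, g2, timedelta(hours=3)))  # prints True
print(arc_is_feasible(g1, g2, timedelta(hours=4)))  # prints False
```

Pruning these impossible arcs up front keeps the network (and the MILP built on it) much smaller.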

The decision variable in this formulation is \(x_{g_1,g_2}\), which is defined for each arc \((g_1,g_2)\) as

\( x_{g_1,g_2} = \begin{cases} 1 & \text{if the fan travels from game } g_1 \text{ to game } g_2 \\ 0 & \text{otherwise} \end{cases} \)

Denote \(c_{g_1,g_2}\) as the time between games \(g_1\) and \(g_2\) in days, including game duration, as follows:

\( c_{g_1,g_2} = s_{g_2} - s_{g_1}, \)

where \(s_g\) denotes the start time of game \(g\) in days (the assumed 3-hour game duration is accounted for when computing the total schedule time).

Also denote \(L_g\) as the location of game \(g\), \(G\) as the list of games including dummy nodes 'source' and 'sink', \(A\) as the set of connections (arcs) between games, and finally \(S\) as the list of all stadiums. Now I can write the Network Formulation as follows:

\[
\begin{aligned}
& \text{minimize} && \sum_{(g_1,g_2) \in A} c_{g_1,g_2} \, x_{g_1,g_2} \\
& \text{subject to} && \sum_{(g,g_2) \in A} x_{g,g_2} - \sum_{(g_1,g) \in A} x_{g_1,g} = \begin{cases} 1 & \text{if } g = \text{source} \\ -1 & \text{if } g = \text{sink} \\ 0 & \text{otherwise} \end{cases} && \forall g \in G \\
& && \sum_{(g_1,g_2) \in A \,:\, L_{g_2} = s} x_{g_1,g_2} = 1 && \forall s \in S \\
& && x_{g_1,g_2} \in \{0,1\} && \forall (g_1,g_2) \in A
\end{aligned}
\]

The solution of this optimization problem should produce a route starting at the source, finishing at the sink, and passing through all 30 ballparks. The objective here is to minimize the total schedule time. The first set of constraints ensures that inflow and outflow are equal for regular nodes. The second set of constraints ensures that the fan visits every stadium once.

For the second objective, I need to replace the objective function with the following:

\( 130 \sum_{(g_1,g_2) \in A} c_{g_1,g_2} \, x_{g_1,g_2} \;+\; 0.25 \sum_{(g_1,g_2) \in A} d_{L_{g_1},L_{g_2}} \, x_{g_1,g_2} \)

where \(d_{L_{g_1},L_{g_2}}\) is the driving distance in miles between the stadiums of games \(g_1\) and \(g_2\).

Now that I have my formulation ready, it is a breeze to write this problem using sasoptpy. Only part of the code is shown here for illustration purposes. See the Github repository for all of the code, including the code used for grabbing the season schedule from the MLB website, driving distances from OpenStreetMap, and exporting results.

I can write the Network Formulation to solve TBFP in Python as follows:

'''
Defines the optimization problem and solves it.

Parameters
----------
distance_data : pandas.DataFrame
    Distances between stadiums in miles.
driving_data : pandas.DataFrame
    The driving times between stadiums in minutes.
game_data : pandas.DataFrame
    The game schedule information for the current season.
venue_data : pandas.DataFrame
    The information regarding each of the 30 MLB venues.
start_date : datetime.date, optional
    The earliest start date for the schedule.
end_date : datetime.date, optional
    The latest end date for the schedule.
obj_type : integer, optional
    Objective type for the optimization problem,
    0: Minimize total schedule time, 1: Minimize total cost
'''
def tbfp(distance_data, driving_data, game_data, venue_data,
         start_date=datetime.date(2018, 3, 29),
         end_date=datetime.date(2018, 10, 31),
         obj_type=0):

    # Define a CAS session
    cas_session = CAS(your_cas_server, port=your_cas_port)
    m = so.Model(name='tbfp', session=cas_session)

    # Define sets, parameters and pre-process data (omitted)

    # Add variables
    use_arc = m.add_variables(ARCS, vartype=so.BIN, name='use_arc')

    # Define expressions for the objectives
    total_time = so.quick_sum(
        cost[g1, g2] * use_arc[g1, g2] for (g1, g2) in ARCS)
    total_distance = so.quick_sum(
        distance[location[g1], location[g2]] * use_arc[g1, g2]
        for (g1, g2) in ARCS if g1 != 'source' and g2 != 'sink')
    total_cost = total_time * 130 + total_distance * 0.25

    # Set objectives
    if obj_type == 0:
        m.set_objective(total_time, sense=so.MIN)
    elif obj_type == 1:
        m.set_objective(total_cost, sense=so.MIN)

    # Balance constraint
    m.add_constraints((
        so.quick_sum(use_arc[g, g2] for (gx, g2) in ARCS if gx == g) -
        so.quick_sum(use_arc[g1, g] for (g1, gx) in ARCS if gx == g)
        == (1 if g == 'source' else (-1 if g == 'sink' else 0))
        for g in NODES), name='balance')

    # Visit once constraint
    visit_once = so.ConstraintGroup((
        so.quick_sum(
            use_arc[g1, g2] for (g1, g2) in ARCS
            if g2 != 'sink' and location[g2] == s) == 1
        for s in STADIUMS), name='visit_once')
    m.include(visit_once)

    # Send the problem to SAS Viya solvers and solve the problem
    m.solve(milp={'concurrent': True})

    # Post-process results (omitted)

I ran this problem for the following twelve settings.

ID | Period | Objective (Min) |
---|---|---|
1 | 03/29 - 06/01 | Time |
2 | 03/29 - 06/01 | Cost |
3 | 06/01 - 08/01 | Time |
4 | 06/01 - 08/01 | Cost |
5 | 08/01 - 10/01 | Time |
6 | 08/01 - 10/01 | Cost |
7 | 03/29 - 07/01 | Time |
8 | 03/29 - 07/01 | Cost |
9 | 07/01 - 10/01 | Time |
10 | 07/01 - 10/01 | Cost |
11 | 03/29 - 10/01 | Time |
12 | 03/29 - 10/01 | Cost |

The last two settings (11 and 12) cover the entire 2018 MLB season from March 29th to October 1st. Therefore, these problems should give the optimal solutions for the best time and best cost objectives, respectively. My aim is to show how the problem size and solution time grow when the problem period is larger.

The ultimate benefit of working within the Python environment is the ability to use open-source packages for many tasks. I have used the Bokeh package for plots and Folium for generating the travel maps below. Bokeh is capable of web-ready interactive plots, which makes the visualization engaging for the user. Folium uses the Leaflet.js JavaScript library to generate interactive maps based on OpenStreetMap maps. For interaction between the Bokeh scatter plots and the Leaflet maps, I have used a custom JavaScript function provided by the Bokeh CustomJS class. You can see details of how the visualization part works in the Jupyter notebook.

The optimal solution I obtained from the 11th experiment gives the best schedule time for the 2018 season, which is just over 24 days (24 days and 3 hours). The solution starts with Diamondbacks @ Giants in AT&T Park, San Francisco on the 5th of June and ends with Royals @ Mariners in Safeco Field, Seattle on the 29th of June. This is the global best solution among the scheduled games this season. Maps and itineraries of selected solutions are shown below. Click on any of the plots and tables to see the Jupyter notebook. All times in these schedules are in EDT.

At 22,528 miles, this schedule costs roughly $8,767. By changing the objective, it is possible to obtain a better cost; however, the schedule takes significantly longer. Solution 12 is only 11,914 miles. It is 10,614 miles shorter and $2,149 cheaper than Solution 11 but takes 4 days longer.
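
As a quick sanity check on these figures, the cost expression in the model above (`total_time * 130 + total_distance * 0.25`) reproduces the reported cost of Solution 11 if the schedule time is measured in days; the time unit and the $130/day, $0.25/mile interpretation are my assumptions here, not stated explicitly in the model:

```python
# Sanity check of the cost expression total_time * 130 + total_distance * 0.25,
# assuming total_time is measured in days (an assumption, not stated in the post).
def trip_cost(days, miles, per_day=130.0, per_mile=0.25):
    """Cost of a schedule lasting `days` days and covering `miles` miles."""
    return days * per_day + miles * per_mile

# Solution 11: 24 days and 3 hours, 22,528 miles
cost_11 = trip_cost(24 + 3 / 24, 22528)
print(round(cost_11, 2))  # close to the reported "roughly $8,767"
```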

Among the schedules you can still attempt as of the writing of this post, Solution 9 gives the shortest schedule time (24 days 3 hours) and Solution 10 gives the best cost ($6,899). The latter schedule starts with Tigers @ Angels in Angel Stadium, Anaheim on the 6th of August, and ends with Orioles @ Mariners in Safeco Field, Seattle on the 4th of September. Note that this solution, at a little over 29 days, is the longest schedule among all the solutions I have.

Here's a list of all solutions:

The objective and the time period of the formulation heavily affect the solution time. Minimizing the cost takes longer due to unique optimal solutions. Moreover, increasing the time period from 3 months to 6 months nearly quadruples the solution time.

Ultimately, the best trip is up to you. You can define another objective of your choice, whether it be minimizing the cost, minimizing the schedule time, avoiding risky connections (minimum time between games minus the driving time), minimizing the total driving time, or even maximizing the number of landmarks you visit along the way. Whatever objective you choose, you can use *sasoptpy* with the powerful SAS Viya mixed integer linear optimization solver to generate a trip that is perfect for you! Working in Python lets you integrate the packages you are familiar with and makes everything smoother. Do not forget to check out my Jupyter notebook and the Python files.

You can further improve this model in several ways. Here are a few ideas:

If you would like to see an away and a home game for every team exactly once, you can add the following constraint:

\(\)

You can try to avoid risky connections you have in the schedule by replacing the objective with

\(\)

To prevent schedules that are too long, you should add a limit to the total schedule length, for example, 25 days:

\(\)

Share how you define the perfect schedule below if you have more ideas!

The post Visiting all 30 Major League Baseball Stadiums - with Python and SAS® Viya® appeared first on Operations Research with SAS.

]]>The post Optimizing your way in the woods -- Orienteering for real appeared first on Operations Research with SAS.

]]>Orienteering is the sport of finding physical control points in the woods as fast as possible using only a map and a compass. The standard competition is point-to-point, where all the controls need to be visited in a prescribed order. (There is a subtle optimization problem in there as well; you need to figure out how to get to the next control, factoring in topography, obstacles, trails, fatigue, and so on.) The competition I attended a few months ago was score-orienteering. We were given 25 control points, and we had to visit as many of them as possible within a set time limit. (I chose two hours, but there were one-hour and four-hour options as well.) Each control is worth a certain number of points, and the runner with the most points wins. You can see the locations and the point values of the controls in this figure:

The race starts and ends at the control marked with the red dot.

There is a severe time penalty for not reaching the finish on time: 8 points for the first minute over the time limit and 1 point for each additional minute. There was also a special bonus for reaching both East-West controls (6 points) or both North-South controls (8 points).

We had about 30 minutes to plan our routes, but most of us also had to recalculate (read *improvise*) on the go.

So how does one (with a Ph.D. in optimization) go about solving a problem like this?

As usual, the most challenging aspect of this project is getting the data right. We need to get pairwise distances between the control points. This is easy to do on a map: just measure and multiply by the scaling factor. However, there are certain complications. First of all, in reality the distance will not be symmetric; that is, reaching point A from point B can be much faster than reaching B from A if B is at the top of a hill and A is in the valley. Another complication is that the distances are uncertain. Usually the controls cannot be seen from a distance; they need to be found. This adds to the time it takes to reach them. Also, the runner might make a mistake while navigating or stop for a water break. To make it worse, this uncertainty varies from runner to runner.

An overly ambitious consultant might collect data from past races and come up with an approximation for the distribution of the travel times between control points (in both directions). Knowing that, the consultant could calculate the distribution of completion times for various course options and choose the best one according to some criterion:

- most points with at most a small chance of not finishing on time, or
- highest probability of success given the number of points targeted, or
- highest expected value of points finishing within the time limit (this assumes that the runner reoptimizes after each control).

Although such a detailed analysis would solve the problem completely, it isn't practical. This is usually the case not only for orienteering but for most practical engagements: the modeler needs to simplify the problem to make it practical and solvable.

A usual orienteering course is characterized by its length and the number of controls it includes. I use a GPS tracker to see how much I actually run during a race and compare that to the ideal course length. The closer these two numbers are, the better I am at navigating or the easier the course is. In my case, at the time of this race, the ratio was around 1.5, so I figured I would end up running about 50% more than the ideal distance. (That wasn't very good, I've improved since.) I also knew my pace during these races: I could do an average speed of about 6km/h. So before the race I estimated I would run about 12 kilometers, and I could cover a course that is about 8 kilometers long.
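
The back-of-the-envelope budget above can be written down directly; the two numbers that absorb all the uncertainty are the average speed and the ratio between the tracked and the ideal distance:

```python
def course_budget(time_limit_h, avg_speed_kmh, detour_ratio):
    """Return (expected distance actually run, nominal course length budget)."""
    actual_km = time_limit_h * avg_speed_kmh   # how far I can run in the time limit
    nominal_km = actual_km / detour_ratio      # ideal course length I can cover
    return actual_km, nominal_km

# Two-hour race, 6 km/h average pace, running ~50% more than the ideal distance
actual, nominal = course_budget(2, 6, 1.5)
print(actual, nominal)  # 12 km run, covering an 8 km nominal course
```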

This simplification assumes that the course for the score-O race is similar to the courses I've run before. Its power comes from the fact that it accumulates all the uncertainty into just two numbers: the average speed of the runner and the ratio between the ideal and the tracked distance.

Now we can take the ideal pairwise distances and use them in this adjusted setting.

The optimization model is a variant of the traveling salesman problem (TSP). For simplicity of exposition, we use the MTZ formulation. We have a binary variable for each control point to indicate that the control is part of the route. Another set of binary variables indicate whether each arc is used.

```sas
set NODES;
num x {NODES};
num y {NODES};
num score {NODES} init 0;
read data &node_data nomiss into NODES=[control] x y score;

str source;
for {i in NODES: score[i] = 0} do;
   source = i;
   leave;
end;

set <str,str> ARCS;
num distance {ARCS};
read data &arc_data into ARCS=[c1 c2] distance;

/* declare decision variables */
var UseNode {NODES} binary;
var UseArc {ARCS} binary;
```

These two variables are naturally connected by the following constraints:

```sas
con LeaveNode {i in NODES}:
   sum {<(i),j> in ARCS} UseArc[i,j] = UseNode[i];
con EnterNode {i in NODES}:
   sum {<j,(i)> in ARCS} UseArc[j,i] = UseNode[i];
```

We need to make sure that the tour is contiguous; that is, it doesn't include a subtour. So we introduce an order variable:

```sas
var U {NODES} >= 0 <= card(NODES) - 1 integer;
/* if UseArc[i,j] = 1 then U[i] + 1 <= U[j] */
/* strengthened by lifting coefficient on UseArc[j,i] */
con MTZ {<i,j> in ARCS: j ne source}:
   U[i] + 1 - U[j]
   <= (U[i].ub + 1 - U[j].lb) * (1 - UseArc[i,j])
      - (card(NODES) - 2) * UseArc[j,i];
fix U[source] = 0;
```
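
To see why the order variable rules out subtours: the MTZ constraint forces U[i] + 1 <= U[j] whenever arc (i,j) is used, and summing that inequality around any cycle that avoids the source yields a contradiction. Arcs into the source are exempt, so one full tour through the source remains feasible. A tiny brute-force check (in Python, as a stand-in for the OPTMODEL formulation) illustrates this:

```python
from itertools import product

def mtz_feasible(nodes, arcs, source):
    """Brute-force: is there an integer order U (0 <= U <= |N| - 1, U[source] = 0)
    with U[i] + 1 <= U[j] for every arc (i, j) not entering the source?"""
    others = [n for n in nodes if n != source]
    for values in product(range(len(nodes)), repeat=len(others)):
        u = dict(zip(others, values))
        u[source] = 0
        if all(u[i] + 1 <= u[j] for (i, j) in arcs if j != source):
            return True
    return False

# A cycle among non-source nodes (a subtour) admits no valid ordering...
print(mtz_feasible(['s', 'a', 'b', 'c'], [('a', 'b'), ('b', 'c'), ('c', 'a')], 's'))
# ...while a tour through the source does.
print(mtz_feasible(['s', 'a', 'b'], [('s', 'a'), ('a', 'b'), ('b', 's')], 's'))
```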

The bonus points require a couple of extra binary variables:

```sas
/* East-West and North-South bonuses */
num xmin = min {i in NODES} x[i];
num xmax = max {i in NODES} x[i];
num ymin = min {i in NODES} y[i];
num ymax = max {i in NODES} y[i];
set BONUSES = 1..2;
set NODES_b {BONUSES} init {};
for {i in NODES: x[i] = xmin} do;
   NODES_b[1] = NODES_b[1] union {i};
   leave;
end;
for {i in NODES: x[i] = xmax} do;
   NODES_b[1] = NODES_b[1] union {i};
   leave;
end;
for {i in NODES: y[i] = ymin} do;
   NODES_b[2] = NODES_b[2] union {i};
   leave;
end;
for {i in NODES: y[i] = ymax} do;
   NODES_b[2] = NODES_b[2] union {i};
   leave;
end;
put NODES_b[*]=;
var IsBonus {BONUSES} binary;
con IsBonusCon {b in BONUSES, i in NODES_b[b]}:
   IsBonus[b] <= UseNode[i];
```

We can have two objectives: minimize distance given a target point value, or maximize points given a limited distance:

```sas
max TotalScore =
   sum {i in NODES} score[i] * UseNode[i]
   + &east_west_bonus * IsBonus[1]
   + &north_south_bonus * IsBonus[2];
min TotalDistance =
   sum {<i,j> in ARCS} distance[i,j] * UseArc[i,j];
```

The problem is modeled using PROC OPTMODEL in SAS/OR 14.3. The optimization problems are solved with the mixed integer linear programming (MILP) solver. It takes on average about half a minute to find the course with the most points given a distance budget, and it takes only a couple of seconds to potentially shorten the course.

I ran the model for various course lengths to get the course with the most points. Then I fixed the number of points and found the shortest course that achieves that. These are listed in the following table:

Distance (m) | Points |
---|---|
2811 | 11 |
3648 | 17 |
4844 | 23 |
5746 | 31 |
6932 | 36 |
7828 | 42 |
8565 | 49 |
9950 | 55 |
10693 | 61 |
11964 | 68 |
12761 | 75 |
13950 | 80 |
14889 | 87 |
15790 | 89 |
16856 | 97 |
17904 | 104 |
18870 | 108 |
19846 | 115 |
20893 | 122 |

You can click on each distance to get the optimal course for that distance. Because I optimized the course lengths, the table can be read from left to right to get the most points given a distance or right to left to get the minimum distance given a target point value.

Browsing those routes reveals some interesting insights. The East-West bonus wasn't worth the effort, even though it looked enticing and definitely doable within two hours. It turns out there weren't enough high-valued control points around those two to make it worthwhile. On the other hand, the North-South bonus was worth it as long as a course longer than 17km (or at least 97 points) was targeted. We can also see that 1km of distance is worth about 5-7 points, which is an important number when one wants to decide how far to run for a control.
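
The points-per-kilometer figure can be read straight off the table of optimized courses above; the sketch below computes the overall slope from those distance/points pairs:

```python
# Distance (m) / points pairs from the table of optimized courses above
courses = [(2811, 11), (3648, 17), (4844, 23), (5746, 31), (6932, 36),
           (7828, 42), (8565, 49), (9950, 55), (10693, 61), (11964, 68),
           (12761, 75), (13950, 80), (14889, 87), (15790, 89), (16856, 97),
           (17904, 104), (18870, 108), (19846, 115), (20893, 122)]

# Overall marginal value of distance: extra points per extra kilometer
d0, p0 = courses[0]
d1, p1 = courses[-1]
points_per_km = (p1 - p0) / ((d1 - d0) / 1000.0)
print(round(points_per_km, 1))  # about 6 points per extra kilometer, on average
```

The marginal value fluctuates from row to row, which is why the post quotes a 5-7 point range rather than a single number.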

Here are a few plots of optimal courses of a given length. First, a course that visits all the controls. It is just under 21km long:

Then let us see what I should have done given the condition I set out to myself that the nominal course length should be less than 8km:

If I had managed to complete this course I would have been tied for third place. Finally, the optimal course for my pace in the four-hour race:

This score would have won the four-hour race. It is interesting that it doesn't require either of the bonuses.

And what happened at the race? I had 24 points halfway through the race. Then I got greedy and decided to go for the East-West bonus. I lost a lot of time on the East control, had to give up the West control, and had only 30 minutes to get back to the finish. I went 8 minutes overtime and finished with only 15 points. On the upside, I ran 13km in 128 minutes, which is right on target for my 12km/120min original estimate. The nominal length of my course was 8.8km, a bit more than I projected, and that is probably why I couldn't finish on time. I should have headed back toward the finish and picked up a few more controls in that area instead. In any case, it was a good lesson in orienteering as well as in optimization.

You can find the race information here and the results here. If you are interested in participating in a similar event this year you can find the necessary information here. I will be there, hopefully armed with a better strategy this year.

Thanks to Rob Pratt for providing the code I used in this blog post. The code and the data can be downloaded from here.

The post Optimizing your way in the woods -- Orienteering for real appeared first on Operations Research with SAS.

]]>The post SAS at the 2017 INFORMS Annual Meeting appeared first on Operations Research with SAS.

]]>**Technology Workshops on Saturday, October 21:**

- "JMP Pro and the Predictive Modeling Workflow," (1:00-3:30pm, room 330A)
- "Solving Business Problems with SAS Analytics and OPTMODEL," (4:00-6:30pm, room 330A)

**Software Tutorials on Tuesday, October 24 (individual presentation time indicated):**

- "Building and Solving Optimization Models with SAS," Ed Hughes and Rob Pratt (TD71, 2:00-2:45pm, room 371F)
- "JMP Pro: Top Features That Make You Say 'Wow!'," Mia Stephens (TD71, 2:45-3:30pm, room 371F)

**Other SAS-related talks (session time indicated):**

**Sunday, October 22:**

- "Tuning a Bayesian Framework for B2B Pricing," Fang Liang, Maarten Oosten (Vistex) (SB78B, 11:00am-12:30pm, room 380B)
- "Time Series Classification using Normalized Subsequences," Iman Vasheghani Farahani, Alex Chien, Russell Edward King (North Carolina State University) (SD70, 4:30-6:00pm, room 371E)

**Monday, October 23:**

- "Internet of Things: Promises, Challenges and Analytics," Patricia Neri, Ann Olecka (Honeywell), Bill Groves (Honeywell) (MA43, 8:00-9:30am, room 360B)
- "The SAS MILP Solver: Current Status and Future Developments," Imre Pólik, Menal Güzelsoy, Amar Kumar Narisetty, Philipp M. Christophel (MB77, 11:00am-12:30pm, room 372F)

**Tuesday, October 24:**

- "Robust Operational Decisions for Multi Period Industrial Gas Pipeline Networks under Uncertainty," Pelin Çay, Camilo Mancilla (Air Products and Chemicals), Robert H. Storer (Lehigh University), Luis F. Zuluaga (Lehigh University) (TC83, 12:05-1:35pm, room 382C)
- "Robust Principal Component Analysis in Cloud-based Run-time Environment," Kyungduck Cha, Seunghyun Kong, Zohreh Asgharzadeh, Arin Chaudhuri (TD70, 2:00-3:30pm, room 371E)

**Wednesday, October 25:**

- "Pricing Advance Purchase Products," Matthew Maxwell, Jennie Hu (WA23, 7:30-9:00am, room 342E)
- "Effective Methods for Solving the Bicriteria *P*-center *P*-dispersion Problem," Golbarg Kazemi Tutunchi, Yahya Fathi (North Carolina State University) (WB58, 10:30am-12:00pm, room 362E)
- "Warmstart of Interior Point Methods for Second Order Cone Optimization via Rounding Over Optimal Jordan Frames," Sertalp Bilal Çay, Imre Pólik, Tamas Terlaky (Lehigh University) (WC83, 12:30-2:00pm, room 382C)

Two additional important notes:

- Patricia Neri is the 2017 Daniel H. Wagner Prize chair, with presentations in sessions MB40, MC40, and MD40. The winner will be announced and the winning presentation will be repeated during keynote session TK04.
- Sertalp Bilal Çay will chair session WC83.

We hope you can make it to Houston to see these and many other excellent presentations! Don't forget to stop by the SAS and JMP booths to say hello!

The post SAS at the 2017 INFORMS Annual Meeting appeared first on Operations Research with SAS.

]]>The post The Traveling Salesman Traverses the Plant Genome appeared first on Operations Research with SAS.

]]>One more large number is 3 billion: the estimated number of people (over 40%) that are currently malnourished. Technological/medical advances and new understanding of the human genome have improved both our quality of life and longevity. Now that same attention must be focused on our food sources.

What does this have to do with the software tools developed here at SAS? Well, researchers at places like the USDA and General Mills, to name just a couple, are using them to help solve this problem. Because grains and cereals make up over 80% of the world's food sources, these researchers are in a front-line position to help. They use JMP Genomics and SAS/OR to transform breeding programs with marker-assisted plant breeding.

The idea is to associate desirable traits in crops with locations on their genome so that we can help drive selection of new plant lines to optimize trait outcomes. A basic understanding of genetics is now common, especially with controversies surrounding genetically modified organisms (GMOs). If two blue-eyed people have a child, that child most likely will have blue eyes due to genetic code inherited from his/her parents. Cross two strains of corn that are tall, and the outcome will tend to be tall corn. These are called heritable traits. The goal then is to apply genomic knowledge of heritable traits to assist plant breeders in making the best crosses (mating of two plant lines) to produce crops that not only increase yield, but have less disease, have higher vitamin content, are easier to mill, can grow in drought-tolerant conditions, require less processing, and even *taste better* (yes, hedonic traits can and are being researched for crop improvement programs) *without* direct genetic modification.

SAS helps by developing tools that use the observed relationships between genetic markers to build a genetic map that facilitates trait association/prediction, crop selection, and simulation. Unlike the human genome, the majority of which has now been mapped, we don't have this “genetic roadmap” for many plant species due to strong complexities in their genomes; for example, four, six, or even eight sets of chromosomes instead of two. Instead, we typically use the inheritance patterns between genetic markers to estimate a linkage map.

Markers inherited together are correlated, meaning they are closer on the genetic map. So we need algorithms that can group and order markers to build a genetic map. Markers that are correlated (a small genetic recombination distance) to a certain degree belong on the same chromosome. Once markers are assigned a chromosome, we need to determine the order of those markers that will produce the smallest genetic distance map.

It turns out this is directly analogous to the traveling salesman problem (TSP). Lucky for me, SAS just happens to have a group of experts in operations research that easily recognized this as an optimization problem using minimum spanning forests to determine groups of markers, and TSP algorithms in SAS/OR's OPTMODEL and OPTNET procedures to find optimal marker orders.
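
To make the analogy concrete, here is a small Python sketch (not the SAS/OR implementation) of the two steps. Markers whose pairwise recombination distance is below a threshold fall into the same group, which is equivalent to cutting the long edges out of a minimum spanning tree to form a spanning forest; a simple nearest-neighbor heuristic then stands in for the exact TSP solve that orders the markers within each group:

```python
def group_markers(markers, dist, threshold):
    """Connected components of the graph with edges dist <= threshold
    (the same groups you get by cutting MST edges longer than threshold)."""
    parent = {m: m for m in markers}
    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m
    for i in markers:
        for j in markers:
            if i < j and dist(i, j) <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())

def order_markers(group, dist):
    """Nearest-neighbor ordering within a group (a TSP-path heuristic)."""
    order = [group[0]]
    remaining = set(group[1:])
    while remaining:
        nxt = min(remaining, key=lambda m: dist(order[-1], m))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy example: markers at known positions, distance = |pos_i - pos_j|
pos = {'A': 0, 'B': 1, 'C': 2, 'D': 10, 'E': 11}
d = lambda i, j: abs(pos[i] - pos[j])
print(group_markers(list(pos), d, 3))     # two linkage groups
print(order_markers(['A', 'C', 'B'], d))  # recovers the positional order
```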

*“Hey it’s not that simple! There are also known genetic relationships that might anchor certain markers to groups/orders,”* I said; my new best friends in SAS/OR said, *“No problem, we got this!”* and applied optional node connection constraints in a side-constrained TSP solution.

The Figure above shows an estimated Linkage Map using these methods for an experimental Oat population, where colored/italicized markers were already known/anchored markers. This optimization work and JMP Genomics visualization tools for linkage and consensus maps have since been used in several research publications for genetic mapping of a variety of plants, including this SAS Global Forum paper.

With a reliable map, employing genomic selection methods (modern data mining and predictive modeling techniques for marker-assisted plant selection) becomes feasible in a plant breeding program. Using the arsenal of robust predictive models available in JMP Genomics, the goal is to find the best model to score a new plant variety for a given trait, based on genetic variability. The art of plant breeding comes in developing a new line (seed variety) that can *balance* multiple traits to give the best possible outcome. For example, just driving an increase in yield alone will often lead to the plant/product suffering in other ways.

Our solution in JMP Genomics once again uses our secret weapon found in the capabilities of SAS/OR to create a process that allows breeders to analytically evaluate all possible plant/line crosses and create a cross simulation tool that selects the potential plant crosses that will *simultaneously optimize* multiple traits at once (for example in the plot below, increasing yield while decreasing height in the selected maize plant crosses).

The figure above shows that in five generations of breeding, crossing lines 45 or 99 with line 41 would produce some of the highest predicted yields while effectively reducing the height of the plant. A traditional breeding program can take several years before gain is realized. Each year, breeders have to decide the right crosses to make under the constraints of the amount of land/resources and time available to try to improve a given crop, and then must wait an entire growing season to see how the lines perform. JMP Genomics offers an analytic solution to help breeders find optimal crosses for a set of balanced physical traits and use multi-year simulation results to dramatically speed up this process, in many cases requiring just a few hours on a computer.

Creating sustainable agriculture techniques to produce not just *more* but *healthier *food is one of our most pressing concerns. The new methods outlined above provide just one way that SAS is able to help.

The post The Traveling Salesman Traverses the Plant Genome appeared first on Operations Research with SAS.

]]>The post Fun with Flags: Optimizing Arrangements of Stars with SAS appeared first on Operations Research with SAS.

]]>Suppose we want to arrange \(n\) stars in a rectangular region. Let \(x_i\) and \(y_i\) denote the coordinates of star \(i\), for \(i = 1, \dots, n\). We need some way to measure the quality of a particular arrangement. What stands out to me in the current flag is that the stars are spread out in the sense that no two stars are too close to each other. One way to capture this idea with optimization is by using a maximin model that maximizes the minimum Euclidean distance between stars. Explicitly, we want to maximize

\(\min\limits_{1 \le i < j \le n} \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}\)

subject to lower and upper bounds on \(x_i\) and \(y_i\).

The following SAS macro variables record the number of stars and the width and height of the rectangular region as proportions of the overall flag height, according to the United States code:

```sas
%let n = 50;
%let width = 0.76;
%let height = 0.5385;
```

The following PROC OPTMODEL statements capture the mathematical optimization problem just formulated:

```sas
proc optmodel;
   /* declare parameters */
   num n = &n;
   set POINTS = 1..n;
   num width = &width;
   num height = &height;

   /* declare variables, objective, and constraints */
   var X {POINTS} >= 0 <= width;
   var Y {POINTS} >= 0 <= height;
   var Maxmin >= 0;
   max Objective = Maxmin;
   con MaxminCon {i in POINTS, j in POINTS: i < j}:
      Maxmin <= (X[i]-X[j])^2 + (Y[i]-Y[j])^2;

   /* call nonlinear programming solver with multistart option */
   solve with nlp / multistart;

   /* save optimal solution to SAS data set for use with PROC SGPLOT */
   create data solution from [i] X Y;
quit;
```

Note that we use squared distance instead of distance in order to avoid the non-differentiability of the square root function at 0. Also, the MULTISTART option of the nonlinear programming solver increases the likelihood of finding a globally optimal solution for this nonconvex optimization problem with many locally optimal solutions.

The following PROC SGPLOT statements plot the resulting solution:

```sas
proc sgplot data=solution aspect=%sysevalf(10/19);
   styleattrs wallcolor='#3C3B6E';
   scatter x=X y=Y / markerattrs=(symbol='StarFilled' color=white size=20);
   xaxis values=(0 &width) display=none;
   yaxis values=(0 &height) display=none;
run;
```

Here is the resulting plot, which is a transpose of the existing design:

The resulting objective value is 0.0116, which corresponds to the squared distance between vertically adjacent pairs of stars.

For comparison, here is the plot of the existing design (objective value 0.0103), where instead the diagonally adjacent stars are the closest:
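
Both objective values can be checked by brute force. Assuming the plotted designs are staggered grids of alternating 6- and 5-star rows scaled to fill the whole 0.76 x 0.5385 region (my reading of the plots, consistent with the variable bounds in the model), the closest pair in the existing design is a diagonal one, and in its transpose a vertical one:

```python
from itertools import combinations

def staggered_grid(width, height, transpose=False):
    """50 stars in alternating rows of 6 and 5, scaled to fill the region."""
    if transpose:
        return [(y, x) for (x, y) in staggered_grid(height, width)]
    dx, dy = width / 10.0, height / 8.0   # 11 half-column positions, 9 rows
    pts = []
    for r in range(9):
        offset = 0 if r % 2 == 0 else 1   # rows of 6 start at the edge
        for c in range(6 - r % 2):
            pts.append(((offset + 2 * c) * dx, r * dy))
    return pts

def min_sq_dist(pts):
    """Minimum squared Euclidean distance over all pairs of points."""
    return min((p[0]-q[0])**2 + (p[1]-q[1])**2 for p, q in combinations(pts, 2))

existing = min_sq_dist(staggered_grid(0.76, 0.5385))
transposed = min_sq_dist(staggered_grid(0.76, 0.5385, transpose=True))
print(round(existing, 4), round(transposed, 4))  # 0.0103 and 0.0116
```

This reproduces both objective values quoted above, including which pair of stars is binding in each design.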

For each new state admitted to the union, a star is added and the new flag is introduced the next July 4th. Recent proposals to add either Puerto Rico or Washington, D.C. raise the prospect of a 51-star flag. Several researchers have looked for such arrangements among a handful of prespecified patterns. See, for example, this 2012 article in *The American Mathematical Monthly*. Instead, we have a general optimization model for \(n\) stars, so we can change the value of \(n\) to 51 and rerun the code. Here is the resulting plot:

How do you like it in comparison with this popular design that has alternating rows of nine and eight stars?

The U.S. flag adopted on June 14, 1777 had 13 stars. Here's the optimized version:

The Star Spangled Banner flag that inspired Francis Scott Key to write the U.S. national anthem had 15 stars. Here's the optimized version:

Although the problem discussed here was motivated by flag design, the same problem applies to facility location. Suppose you want to build a number of facilities within some prescribed region. If two facilities are too close to each other, they will cannibalize each other's business. So it makes sense to maximize the minimum distance between facilities.

For those in the United States, enjoy the holiday tomorrow!

The post Fun with Flags: Optimizing Arrangements of Stars with SAS appeared first on Operations Research with SAS.

]]>The post Creating Synthetic Data with SAS/OR appeared first on Operations Research with SAS.

]]>Researchers could use synthetic data to, for example, understand the format of the real data, develop an understanding of its statistical properties, build preliminary models, or tune algorithm parameters. Analytical procedures and code developed using the synthetic data can then be passed to the data owner if the collaborators are not permitted to access the original data. Even if the collaborators are eventually granted access, synthetic data allow the work to proceed in parallel with the tasks required to access the real data, providing an efficient path to the final analysis.

One method for producing synthetic data with SAS was presented at SAS Global Forum 2017 with this corresponding paper (Bogle and Erickson 2017), which implemented an algorithm originally proposed in another paper (Bogle and Mehrotra 2016, subscription required). The implementation uses the OPTMODEL procedure and solvers in SAS/OR to create synthetic data with statistical raw moments similar to the real data.

Most SAS users are familiar with mean, variance, and covariance. The mean is the first-order raw moment, \(E[X]\). The variance is the second-order marginal central moment, \(E[(X - E[X])^2]\), and covariance is the second-order mixed central moment, \(E[(X - E[X])(Y - E[Y])]\). Skewness and kurtosis are the third- and fourth-order normalized moments.

This method considers only raw moments, which keeps the optimization problem linear. The more of these moments that are matched, the more like the original data it will be. It is difficult to match the moments exactly, so the algorithm tries to keep them within desired bounds. For example, one of the fourth-order moments (meaning the sum of the exponents is four) could be \(\), which could have a value of 0.5 in the real data set. The method might try to "match" this moment by constraining that moment in the new data set to be between 0.48 and 0.52. If \(\) observations are desired, this would ideally mean that the moment value would be constrained between those bounds, \(\).
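
To make the moment-matching idea concrete: a raw moment of order k is the average, over observations, of a product of variables whose exponents sum to k. The sketch below is illustrative only; the variable values and the 0.48-0.52 bounds are hypothetical, echoing the example above:

```python
def raw_moment(data, exponents):
    """Average over observations of prod_j x_j**e_j; sum(exponents) is the
    order of the raw moment (4 in the example below: 2 + 1 + 1)."""
    total = 0.0
    for row in data:
        p = 1.0
        for x, e in zip(row, exponents):
            p *= x ** e
        total += p
    return total / len(data)

# Hypothetical candidate synthetic observations with three variables:
# does the fourth-order moment E[x1^2 * x2 * x3] stay within the bounds?
synthetic = [[1.0, 1.0, 0.5], [0.8, 1.0, 0.75], [1.2, 0.5, 0.75]]
m = raw_moment(synthetic, [2, 1, 1])
print(0.48 <= m <= 0.52)  # this candidate set keeps the moment within bounds
```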

There are a couple of problems with solving a system of inequalities like this. First, it is not linear or convex, so it would be very hard to solve, especially when some variables need to take integer values. This macro instead creates candidate observations and uses mathematical optimization to select which observations to include. Second, there might not be a solution that solves such a system of inequalities, so the method allows violations to the bounds on the moments and minimizes them.

There are three basic steps to the algorithm. The first step reads in the original data, computes the upper and lower bounds on the moments, and sets up other parameters for the optimization models. The second step uses column generation with a linear program to produce candidate synthetic observations. The third step uses an integer program to select the desired number of synthetic observations from the candidate synthetic observations, while minimizing the largest scaled violation of the moment bounds. The following code is the macro that generates synthetic data by calling a different macro for each of the three steps.

```sas
%macro GENDATA(INPUTDATA, METADATA, OUTPUTDATA=SyntheticData, MOMENTORDER=3,
               NUMOBS=0, MINNUMIPCANDS=0, LPBATCHSIZE=10, LPGAP=1E-3,
               NUMCOFORTHREADS=1, MILPMAXTIME=600, RELOBJGAP=1E-4,
               ALPHA=0.95, RANDSEED=0);
   proc optmodel printlevel=0;
      call streaminit(&RANDSEED);
      %PRELIMINARYSTEP(INPUTDATA=&INPUTDATA, METADATA=&METADATA,
                       MOMENTORDER=&MOMENTORDER, ALPHA=&ALPHA);
      %LPSTEP(MOMENTORDER=&MOMENTORDER, NUMOBS=&NUMOBS,
              MINNUMIPCANDS=&MINNUMIPCANDS, LPBATCHSIZE=&LPBATCHSIZE,
              LPGAP=&LPGAP, NUMCOFORTHREADS=&NUMCOFORTHREADS);
      %IPSTEP(OUTPUTDATA=&OUTPUTDATA, MOMENTORDER=&MOMENTORDER,
              NUMOBS=&NUMOBS, MINNUMIPCANDS=&MINNUMIPCANDS,
              MILPMAXTIME=&MILPMAXTIME, RELOBJGAP=&RELOBJGAP);
   quit;
%mend GENDATA;
```

You can see that there are several parameters, which are described in the paper. Most of them control the balance of the quality of the synthetic data with the time needed to produce the synthetic data.

The macro demonstrates some useful features of PROC OPTMODEL. One option for the macro is to use PROC OPTMODEL's COFOR loop. This allows multiple LPs to be solved concurrently, which can speed up the LP step. The implementation breaks the task of creating candidate observations into NUMCOFORTHREADS groups, and one solve statement from each group can be run concurrently. The use of COFOR reduces the total time from five hours to three hours in the example below.

The code also calls SAS functions not specific to PROC OPTMODEL, including RAND, STD, and PROBIT. The RAND function is used to create random potential observations during the column generation process. The STD and PROBIT functions are used to determine the desired moment range based on the input data and a user-specified parameter.
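For instance, these functions can be called directly inside PROC OPTMODEL, as in this toy example (the values are arbitrary and not from the macro):

```sas
proc optmodel;
   /* Toy example of calling DATA step functions inside OPTMODEL. */
   call streaminit(12345);        /* seed the random-number stream */
   num draw = rand('UNIFORM');    /* uniform draw in (0,1) */
   num sd = std(1, 4, 2, 8, 5);   /* sample standard deviation */
   num z = probit(0.975);         /* standard normal quantile, about 1.96 */
   print draw sd z;
quit;
```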

The examples in the paper are based on the *Sashelp.Heart* data set. The data set is modified to meet the requirements that all variables be numeric and that no values be missing.
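One way to perform that preparation is a short DATA step. This is a sketch under my own assumptions about which character variables to recode; it is not the paper's exact code.

```sas
/* Sketch: recode character variables to numeric and drop incomplete rows.
   The recodings below are assumptions, not the paper's exact choices. */
data HeartNumeric;
   set sashelp.heart;
   Male = (Sex = 'Male');        /* character -> 0/1 indicator */
   Dead = (Status = 'Dead');
   drop _character_;             /* keep only numeric variables */
   if nmiss(of _numeric_) = 0;   /* discard rows with any missing value */
run;
```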

The example randomly divides the data into Test and Training data sets so that a model based on the Training data set can be tested against a model based on a synthetic data set derived from the Training data set.
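A typical way to make such a split looks like the following; the 70/30 fraction, the seed, and the input data set name are illustrative, not the paper's exact choices.

```sas
/* Sketch: random 70/30 split into Training and Testing. */
data Training Testing;
   set HeartNumeric;              /* assumed name of the cleaned data set */
   call streaminit(42);           /* seed for reproducibility */
   if rand('UNIFORM') < 0.7 then output Training;
   else output Testing;
run;
```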

The following call of the macro generates the synthetic data based on the Training data set with the appropriate METADATA data set. It allows up to four LPs to be solved concurrently, matches moments up to the third order, allows 20 minutes for the MILP solver, and uses a fairly small range for the moment bounds.

%GENDATA(INPUTDATA=Training, METADATA=Metadata, NUMCOFORTHREADS=4,
         MOMENTORDER=3, MILPMAXTIME=1200, ALPHA=0.2);

One way to verify that the synthetic data are similar to the real data is to compare the means, standard deviations, and covariances of the input and synthetic data sets. The paper shows how to do this using PROC MEANS and PROC CORR on a smaller synthetic data set.
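That comparison can be sketched as follows; the data set names are assumed to match those used elsewhere in this example.

```sas
/* Compare first and second moments of the real and synthetic data. */
proc means data=Training mean std;
run;
proc means data=SyntheticData mean std;
run;
proc corr data=Training cov;
run;
proc corr data=SyntheticData cov;
run;
```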

In this case we want to compare the training and synthetic data by comparing how well the models they fit score the Test data set. One of the variables indicates whether a person is dead, so we can use logistic regression to predict whether a person is dead or alive. Because BMI is a better health predictor than height or weight alone, we have added a BMI column to the data sets. The following code calls PROC LOGISTIC for the two data sets.
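Since Sashelp.Heart records Height in inches and Weight in pounds, the added column can be computed with the usual 703 conversion factor. This is a sketch of the step for one of the data sets; the paper's exact code may differ.

```sas
/* Sketch: derive BMI from height (inches) and weight (pounds). */
data TrainingWithBmi;
   set Training;
   BMI = 703 * Weight / Height**2;
run;
```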

proc logistic data=TrainingWithBmi;
   model Dead = Male AgeAtStart Diastolic Systolic Smoking Cholesterol BMI
         / outroc=troc;
   score data=TestingWithBmi out=valpred outroc=vroc;
run;

proc logistic data=SyntheticDataWithBmi;
   model Dead = Male AgeAtStart Diastolic Systolic Smoking Cholesterol BMI
         / outroc=troc;
   score data=TestingWithBmi out=valpred outroc=vroc;
run;

This produces quite a bit of output. One useful comparison is the area under the curve (AUC) of the ROC curve, which plots the model's sensitivity against 1 − specificity. The first plot is for the original training data scored with the testing data from the original data set. The second plot is for the synthetic data scored with the same testing data. You can see that the curves are fairly similar and there is little difference between the AUCs.

This synthetic data set behaves similarly to the original data set and can be safely shared without risk of exposing personally identifiable information.

What uses do you have for synthetic data?

The post Creating Synthetic Data with SAS/OR appeared first on Operations Research with SAS.

]]>