Operations Research

We are joining forces with the SAS Data Science blog!

Udo Sglavo — Wed, 10 Feb 2021 14:30:41 +0000

It's been a great ride

Since 2014, the Operations Research Blog has covered a broad range of topics related to real applications of operations research. Paraphrasing Principal Product Manager for Optimization Ed Hughes' first post - this blog covered how OR methods could be applied to organizational and business planning problems as well as our individual lives. It demonstrated how other SAS analytical products, business intelligence, and reporting capabilities interact with and enhance operations research techniques. Topics covered everything from computing an optimal blackjack strategy with SAS/OR to improving hidden Markov models with black-box optimization.

Life moves on and so are we

The term data science isn’t new, but it's gotten more and more press over the last several years. As developing methods to handle big data became more important, so did the prominence of data science. Data science encompasses mathematicians, statisticians, computer scientists, and many other related fields. The SAS Data Science Blog evolved from discussing trends that drove innovation and challenges that expanded the boundaries of what we thought was possible to one that features the perspectives of SAS data scientists as they share the technical methods used to solve many of the challenging problems facing organizations today.

While data scientists typically are concerned with machine learning techniques such as supervised and unsupervised learning, many realized that broadening their horizon helps them address business challenges as well. We believe that an advanced analytics ecosystem must feature optimization capabilities at SAS, or it is not complete. In particular, in times of COVID-19 and its specific challenges, it has become clear that optimization techniques are fundamental for data scientists.

Come visit our new home

You will still be able to access all our previous posts. Look for new operations research posts and the different ways optimization can be used with SAS Viya at our new home. We hope you will join us on the SAS Data Science blog site to discover how operations research will move forward helping data scientists in the future.

The post We are joining forces with the SAS Data Science blog! appeared first on Operations Research with SAS.

Why venue optimization is critical and how it works

Sertalp B. Cay — Mon, 09 Nov 2020 14:04:53 +0000

NCSU used SAS Venue Optimization for their seating layout on the October 17th game versus Duke.
(Image by ACC Digital Network @ YouTube)

A year ago we could not imagine stadiums being empty during the most exciting sports events, but it is a common sight now. The entertainment sector is one of the hardest hit sectors because of the COVID-19 pandemic [1]. Social distancing requirements made it impossible to have viewers in stadiums and indoor venues.

The main discussion around venues started with the Champions League game (football/soccer) between Atalanta and Valencia on February 19th, an event that accelerated the spread of the pandemic in Italy [2]. The spread of the virus in Lombardy was quite noticeable, especially compared to the rest of the country. Extreme social distancing rules were implemented after this incident, preventing fans from being in the stands all over the world. Obviously, stadiums and event centers are some of the most difficult places to practice social distancing unless enough precautions are taken. Many leagues were either postponed or cancelled in April, even though most of the leagues eventually continued without fans despite the profit loss [3]. Countries are still trying to decide whether to have fans in venues, and the rules keep changing as number fluctuate. Belgium banned fans in March, allowed them in June, and banned them again in October because of increasing numbers of patients.

There is a growing uncertainty about the future of viewers in venues. Back in April, we had an internal discussion about the issue in our Sports Analytics discussion group when Brian Miles asked us a question. The question was whether we can utilize analytics and optimization to figure out how many people can be accepted into venues while adhering to social distancing. I quickly wrote an optimization model for a small section, and the result from that experiment holds true even after we eventually developed a full solution:

With 6 feet social distancing restrictions, you can fill seated venues up to 30% utilization.
If you want to allow people to have clear access to exits, it is between 11 to 15%.

This is the cold hard fact about venues right now. This number depends on the seat and section dimensions and also what kind of regulations you have, but the results we have seen so far are usually around this number. Our belief is that with all the financial troubles sports clubs are going through, if clubs are allowed to have fans in stadiums, then they should optimize it. You might think it is not a difficult problem to solve on the surface, but it gets complicated when details are added:

Whether to accept people as individuals or as groups
Restrictions on seats that are close to the stairs and exits
Different rules for section places; for example, NC State University (NCSU) has student sections at the end zones
Expected demand for various group sizes
Maximizing number of people may not be equivalent to maximizing the revenue

Because of the difficult nature of the problem, we collaborated with NCSU to build plans for their Carter-Finley Stadium.

Photo by Ethan Hyman @ The News & Observer

Ticket sales are complicated by nature, but there are more aspects to consider during this pandemic: If fans can buy any ticket they want in the venue (and can have any number of people in their group as they want), it will lead to results that are far from the optimal values. The one question that every venue manager should answer is this: "How can I maximize my revenue (or fans in stands) under ever-changing COIVD-19 restrictions?" More than anything, venues need a way to build seating plans in an automated way.

Enter SAS Venue Optimization.

Group Selections

SAS Venue Optimization is powered by SAS Optimization under the hood. I will show how we built the optimization model below.

Let me start with a simple problem: consider a section with 2 rows and 5 seats in each row.

The total number of different groups we can sell in this small section is astonishing: there are 30 different configurations. In each row you can sell:

5 individual seats (S1, S2, S3, S4, S5)
4 groups of pairs (S1-S2, S2-S3, S3-S4, S4-S5)
3 groups of triplets (S1-S2-S3, S2-S3-S4, S3-S4-S5)
2 groups of 4 (S1-S2-S3-S4, S2-S3-S4-S5)
1 group of 5 (S1-S2-S3-S4-S5)

In total, we have 5+4+3+2+1=15 configurations per row, 30 in total.

Assume that the social distancing limit is equal to twice the width of a seat and depth of one row. The optimal strategy for this small section is clear; we can sell either the entire first row or the entire second row. A suboptimal solution would be selling R1S1-R1S2 and R2S4-R2S5. Consider this: North Carolina State University's Carter-Finley stadium has over 65,000 seats, and the total number of configurations is over 268,000. The question you need to answer is this: "Which one of these 268,000 groups should I choose?" For each group you can either sell or block: There are $2^{268,000}$ combinations.

Of course the total combinations are much less than this number; we cannot really sell S1-S2 and S1-S2-S3 at the same time. Because of social distancing, we cannot sell the following combinations either:

S1-S2 and S3-S4
S1-S2 and S4-S5
R1S1 and R2S1
R1S1 and R2S2

The problem boils down to finding all these combinations you cannot sell at the same time and choosing the ones that maximize your objective.

SAS Venue Optimization can show the safety bubble for individual and group seats.

Let set $G$ be the set of all possible groups in the section, let $d$ be the distance parameter between each pair of groups, and let $t$ be the social distancing limit. We can define the set $P$ of conflicting pairs of groups as follows:

$P = \{g_1 \in G, g_2 \in G: g_1 < g_2\text{ and }d(g_1,g_2) \lt t\}$

Using conflicting pairs, we can write the optimization model as follows, where $n_g$ is the number of people in group $g$ and $u_g$ is a binary decision variable that indicates whether group $g$ is selected:

$\begin{align} &\text{maximize} & \sum_{g \in G} n_g u_g \\ &\text{subject to} & u_{g_i} + u_{g_j} &\le 1 && \text{for $(g_i, g_j) \in P$} \\ && u_g &\in \{0,1\} && \text{for $g \in G$}\end{align}$

The objective (1) is to maximize the total number of people. The conflict constraint (2) prevents selecting both groups in a conflicting pair. Constraint (3) forces $u_g$ to be binary.

The major challenge with this formulation is that there are too many conflicting pairs. We can strengthen the formulation by finding maximal cliques. This is where the network solver comes into play. Using the network solver, we can generate the set of maximal cliques $C$ such that at most one group in each clique can be selected. For $c\in C$ , let $G_c$ be the groups that appear in clique $c$ . Then we can call the mixed integer linear optimization solver to choose which groups to sell, in order to maximize the objective:

$\begin{align} &\text{maximize} &\sum_{g \in G} n_g u_g \nonumber \\ &\text{subject to} &\sum_{g \in G_c} u_g &\le 1 && \text{for $c \in C$} \\ &&u_g &\in \{0,1\} && \text{for $g \in G$} \nonumber \end{align}$

The clique constraint (4) enforces selection of at most one group per clique.

This intermediate operation to find maximal cliques reduces the optimization time roughly 60x.

Venue-wide Restrictions

We call this optimization model through the SAS Optimization Python interface, namely the runOptmodel CAS action. This action has a nice feature called "groupBy" that can split tables into groups at run time. By using it, we are able to run scenarios for all 82 sections of NCSU's Carter-Finley Stadium in parallel.

NCSU came up with a challenging question: What if there is a lower limit on the number of people for a certain group type? Consider that the forecasted demand for groups of 2 is 500. So the solution should have at least 500 groups of 2 to meet this demand. For this purpose, we split the problem into two parts. In the first part, we find all cliques in each section separately and in parallel. Then we combine all these cliques into a single optimization problem and use the DECOMP algorithm which uses decomposition to exploit the decomposed structure of this combined instance.

You can find a simplified version of our optimization model in our GitHub repository.

Performance

As I mentioned, Carter-Finley has 268K group combinations. In our test server with 96 CPU cores, the problem is solved between 15 seconds and 7 minutes, depending on what kinds of constraints are active. Our original model was taking 30 minutes to solve before we used the maximal cliques, where most of time was being spent on writing the problem.

To solve a seating layout problem of this size, you need to use optimization or accept the fact that your solution is suboptimal at best. It is why having an optimization method is critical for venues.

SAS Venue Optimization

Once we were convinced that we can solve these problems efficiently, we started offering SAS Venue Optimization in a few ways. It can be deployed to the customer's existing Viya services, offered as SaaS, or offered as RaaS. I am really happy with two aspects of how this project turned out:

SAS Viya and SAS Optimization provide a great Python interface to develop custom apps, and we were able to connect the pieces quite easily
How everyone in our sports analytics channel worked on full cylinders in a cross-divisional team effort to come up with an offering that can help sports clubs and venues

We turned this idea into a product with impressive speed. If you would like to learn more, see our post on LinkedIn.

Flow

SAS Optimization is the biggest piece in this puzzle and does the heavy lifting to solve massive problem instances. The remainder of the tool is written in Python for the most part: initial data processing input, web server, parameter selection, optimization calls, and export utility are all in Python. We used Flask and Gunicorn to create a web application that is responsive and can work with multiple clients. The web application benefits from several JavaScript technologies, including Vue.js and Leaflet. We also used EventSource technology to give a visual feedback to the user about the progress of the optimization.

The application depends on a SAS Viya installation and can be deployed very quickly using a Docker container. Usually after spending a few days with customers to standardize their input data, we can deploy SAS Venue Optimization within 30 minutes.

Seating App

If you are planning to attend SAS Forum UK & Ireland, SAS Forum Nordics or Beyond Tomorrow by SAS CEMEA virtual events this November, there is a surprise for you!

We prepared a smaller version of the venue optimization for an interactive corner. Here, you can try to beat SAS Venue Optimization to find (or get closer to) the optimal solution.

The app tells you if you are satisfying the social distance rule and how close you are to the optimal result when you submit your solution. You can see a sneak peek below:

If you are attending any of these events, let me know about your experience! This was the first time I have built something like this, so I am especially curious. You can reach me here or on Twitter @sertalpbilal. I am looking forward to see how close you can get to the optimal!

As I have mentioned, a ready-to-use and simplified version of SAS Venue Optimization is in our GitHub repository.

References

[1] https://www.mckinsey.com/featured-insights/coronavirus-leading-through-the-crisis/charting-the-path-to-the-next-normal/covid-19-recovery-in-hardest-hit-sectors-could-take-more-than-5-years
[2] https://www.wsj.com/articles/the-soccer-match-that-kicked-off-italys-coronavirus-disaster-11585752012
[3] https://www.dw.com/en/coronavirus-sports-cancellations/a-52569936

The post Why venue optimization is critical and how it works appeared first on Operations Research with SAS.

Back to School Optimization

Subbu Pazhani — Mon, 26 Oct 2020 15:12:22 +0000

Public and private schools are struggling to figure out how to bring face-to-face instruction to students during this pandemic. Health risks to students and teachers, parents struggling with child-care options and/or support for virtual learning, and schools’ capacities and budget limitations make this problem a severe logistical challenge. Schools need to abide by CDC social distancing regulations as well as by each individual state’s guidance to minimize exposure and risk. In North Carolina, three different plans were considered by the state government:

Plan A: All students back to school with minimal social distancing
Plan B: Schools must limit the number of students and staff to ensure six feet of separation when stationary
Plan C: Remote-only learning

In July 2020, Governor Cooper announced guidance for all schools to operate under Plans B or C. On September 17th, he issued a statement allowing elementary schools to open under Plan A, reiterating that Plan A might not be right for all schools and each district is in control of their reopening plan.

Most school districts are considering Plan B, which presents specific challenges to school administrators. One of the challenges is to find the best possible schedule for groups of students assigned to classrooms and matching teachers to those groups such that state and federal mandates are respected. Given the number of possible combinations of schedules, this is not a trivial problem to solve. Fortunately, operations research is the best analytical tool to support decision makers with data insights and scenario recommendations for this case.

In this article, we consider Plan B and propose a mixed integer linear programming (MILP) model that maximizes the number of student hours of in-class instruction, given restricted room capacities due to social distancing requirements.

Problem Description

In this section, we describe the structure of the problem. Let $S = \{1,2,\dots,\text{n}_{S}\}$ be the set of schools in a school district. The set of schools includes elementary, middle, and high school grade levels in that school district. We assume that the students are not transferrable between schools and the schools do not share room spaces. Thus, the optimization model can be defined independently of the school index and is executed in parallel for the schools in $S$ . Let $G = \{1,2,\dots,\text{n}_{G}\}$ be the set of grades, and $R = \{1,2,\dots,\text{n}_{R}\}$ be the set of rooms in a school $s \in S$ . Due to the social distancing requirement, the rooms have a maximum occupancy restriction that is lower than its original capacity. Let $\text{capacity}_{r}$ be the restricted capacity of room $r \in R$ , and let $\text{population}_{g}$ be the number of students enrolled in grade $g \in G$ .

The scheduling time horizon is divided into time blocks $B = \{1,2,\dots,\text{n}_{B}\}$ . We consider three different time horizon scenarios (all under Plan B): monthly rotating, weekly rotating, and daily rotating. Figure 1 shows the different time horizon scenarios. Under the monthly rotation scenario, students attend full day school during certain weeks of a month and do remote learning for the rest of the time. For example (as shown in Figure 1 below), 1^st grade might be scheduled to come on the third and fourth week of each month. Under the weekly rotation scenario, students attend full day school during certain days of a week, every week. For example, 3^rd grade might be scheduled to come to school on Tuesdays and Thursdays. Under the daily rotation scenario, students attend school daily, but for a predefined block of time. For example, 5^th grade might be scheduled to attend school between 8 and 10am each day.

Figure 1: Time horizon scenarios

The set $B$ of time blocks for monthly, weekly, and daily plans are defined as {Week1, Week2, Week3, Week4}, {Monday, Tuesday, Wednesday, Thursday, Friday}, and {8am, 9am, 10am, 11am, 12pm, 1pm}, respectively. Let $\text{duration}_{b}$ be the duration of each time block $b \in B$ . In this study, we set $\text{duration}_{b}$ to be 1. It is defined as a parameter to give flexibility for the user to change if needed.

The proposed model is developed to create a schedule by assigning students in grades to time blocks and rooms. The following assumptions are considered in the development of the proposed model:

Students in a grade cannot be split between time blocks. If a grade $g$ is assigned to a time block $b$ , then all the students in that grade $g \ \text{(population}_{g}\text{)}$ should be accommodated in time block $b$ .
Each grade in a school $s$ should attend an equal number of time blocks. Say, in an elementary school, if 1^st grade attends two time blocks, then all other grades (K, 2^nd-5^th grades) should attend two time blocks.
In the daily rotation scenario, the assignment of time blocks for grades should be consecutive as transportation of students multiple times a day is not feasible.
Under the daily rotation scenario, a common break or transition time is added to the schedule to enable transportation, cleaning/sanitation of classrooms, other logistics, etc. Let $\text{transitionwindow}$ be the transition time, defined in number of time blocks, in the daily rotation.
A fraction of students, denoted by $\text{virtualpercent}$ , attend full online / remote learning sessions. This fraction of students is not included in the scheduling.

Input Data

This section describes the input data tables required for the optimization model. The input data include three tables: the schools table, the rooms table, and the time blocks table. The schools table contains the population (enrollment) data for each grade at the school. The rooms table contains data on the rooms available in the school along with its restricted capacity (capacity column) and room size (room_size_sqft column).

Figures 2 and 3 shows sample school and room data for an elementary school (School1). We can use the restricted capacity from the capacity column or we can calculate it using the room size and square feet per student guideline. For example, suppose you want to analyze a scenario with 60 square feet per student. In the sample data (Figure 3), room R1 is 923 square feet and can accommodate $\lfloor 923 / 60\rfloor = 15$ students.

Figure 2: Schools table

Figure 3: Rooms table

The data in the time blocks table depend on the selection of the time horizon scenario. Figure 4 shows a time blocks table for the monthly rotation scenario. It has four time blocks: {Week_1, Week_2, Week_3, Week_4}.

Figure 4: Time blocks table

Figures 5 (a) and (b) show the time blocks tables for the weekly and daily rotation scenarios, respectively.

Figure 5: Weekly and Daily rotation scenarios

Model Formulation

Considering the student enrollment, and a finite number of rooms with restricted capacity, the problem can be formulated as a mixed integer linear programming (MILP) model, where the objective is to maximize the in-person instructional hours for the students. The decision variables for the model are shown below:

Decision Variables

$\text{NumStudents}_{g, r, b} \ge 0$ is the number of students in grade $g$ , room $r$ , and time block $b$
$\text{AssignGrRmBl}_{g,r,b} \in \{0,1\}$ is a binary variable indicating if room $r$ is in use at time block $b$ for grade $g$
$\text{AssignGrBl}_{g,b} \in \{0,1\}$ is a binary variable indicating if grade $g$ is assigned to time block $b$
$\text{AssignGrRm}_{g,r} \in \{0,1\}$ is a binary variable indicating if grade $g$ is assigned to room $r$
$\text{AvgNumStudents} \ge 0$ is the average number of student hours per time block in any grade
$\text{ConsecutiveGrBl}_{g,b} \in \{0,1\}$ is a binary variable indicating if block $b$ is the starting time block for grade $g$
Note that $\text{ConsecutiveGrBl}_{g,b}$ is used only in the daily rotation scenario where assignment of time blocks for grades must be consecutive.

Objective Function

The objective function $\text{(M1)}$ is to maximize the instructional hours for the students and can be written as:

$\text{(M1) maximize TotalStudentHours}=\sum\limits_{g,r,b} \text{duration}_b \ \text{NumStudents}_{g,r,b}$

Constraints

Constraint (1) is the room capacity constraint. This constraint ensures that the number of students in grade $g$ assigned to room $r$ in time block $b$ should not exceed the capacity of the room $r$ . Recall that the capacity of each room has already been modified to allow for social distancing restrictions. When $\text{AssignGrRmBl}_{g,r,b}$ is zero, the right-hand side of the constraint becomes zero and forces $\text{NumStudents}_{g, r, b}$ to be zero.

$\text{NumStudents}_{g,r,b} \leq \text{capacity}_r \ \text{AssignGrRmBl}_{g,r,b} \quad \forall~g,r,b$ $\text{(1)}$

The room block assignment constraint (2) ensures that a room $r$ in time block $b$ should be assigned to at most one grade.

$\sum_{g} \text{AssignGrRmBl}_{g,r,b} \leq 1 \quad \forall~r,b$ $\text{(2)}$

Constraint (3) ensures that the number of students assigned in a grade $g$ and time block $b$ across all rooms should be equal to the enrollment $(\text{population}_{g})$ in that grade attending in-class instruction.

$\sum_r \text{NumStudents}_{g,r,b} = (1 - \text{virtualpercent}) \ \times \ \text{population}_{g} \ \times \ \text{AssignGrBl}_{g,b} \quad \forall~g,b$ $\text{(3)}$

Constraint (4) ensures that the average number of student hours per block in a grade $g$ is the same across all grades.

$\sum_{r,b} \frac{\text{NumStudents}_{g,r,b}}{\text{n}_{b} \ \times \ \text{population}_{g} \ \times \ (1-\text{virtualpercent})} = \text{AvgNumStudents} \quad \forall~g$ $\text{(4)}$

Constraints (5) and (6) compute grade-block assignment and grade-room assignment variables from the grade, room, and block assignment variable.

$\text{AssignGrRmBl}_{g,r,b} \leq \text{AssignGrBl}_{g,b} \quad \forall~g,r,b$ $\text{(5)}$

$\text{AssignGrRmBl}_{g,r,b} \leq \text{AssignGrRm}_{g,r} \quad \forall~g,r,b$ $\text{(6)}$

Constraints specific to the daily rotation scenario

When students are scheduled to attend part-day classes, the constraints (7) through (11) ensure that the students in a grade $g$ are assigned to consecutive time blocks.

$\text{ConsecutiveGrBl}_{g,1} = \text{AssignGrBl}_{g,1} \quad \forall~g$ $\text{(7)}$

$\text{ConsecutiveGrBl}_{g,b} \geq \text{AssignGrBl}_{g,b} - \text{AssignGrBl}_{g,b-1} \quad \forall~g, b \in \{2,\dots,\text{n}_{B}\}$ $\text{(8)}$

$\sum_{b} \text{ConsecutiveGrBl}_{g,b} \leq 1 \quad \forall~g$ $\text{(9)}$

$\text{ConsecutiveGrBl}_{g,b} \leq \text{AssignGrBl}_{g,b} \quad \forall~g,b \in \{2,\dots,\text{n}_{B}\}$ $\text{(10)}$

$\text{ConsecutiveGrBl}_{g,b} \leq 1 - \text{AssignGrBl}_{g,b-1} \quad \forall~g,b \in \{2,\dots,\text{n}_{B}\}$ $\text{(11)}$

Constraint (12) ensures that there is a break for $\text{transitionwindow}$ time blocks for cleaning and/or transportation.

$\text{AssignGrBl}_{g1, b-\text{transitionwindow}} \leq 1- \text{ConsecutiveGrBl}_{g,b} \quad \forall~g,g1 \in G, \text{and } b \in \{1+\text{transitionwindow},\dots,\text{n}_{B}\}$ $\text{(12)}$

We will now discuss an optional constraint that tightens constraint (12) and improves computational time of the model. Let $Q = \{1,2,\dots,\text{n}_{G}\}$ be the reordered set of grades, where $\text{population}_{1} \leq \text{population}_{2} \le \dots \leq \text{population}_{n_{G}}$ . Let $\text{cum_population}_{q}= \sum_{i=1}^{q} \text{population}_{i}$ be the cumulative sum of enrollments for the smallest $q$ grades. Let $\text{totcapacity}$ be the total capacity of the school across all rooms. Let $\text{MaxGradesBlock} = \text{max} \{q \in \text{Q: } \text{cum_population}_{q} \leq \text{totcapacity} \}$ be the maximum number of grades in a time block. Constraint (13) ensures that no grades are assigned to the break period.

$\sum_{g1 \in G} \text{AssignGrBl}_{g1, b-\text{transitionwindow}} \leq \text{MaxGradesBlock} \ \times \ (1- \text{ConsecutiveGrBl}_{g,b}) \quad \forall~g,b \in \{1+\text{transitionwindow},\dots,\text{n}_{B}\}$ $\text{(13)}$

Constraints (14) and (15) ensure that the students assigned to room $r$ in grade $g$ do not change rooms. There is no reason for a grade to switch rooms during the day, because room capacities per time block and number of students per grade are constant. These constraints force grades to stay in the same rooms once they have been assigned. Note that the constraint (14), which ensures students to stay in the same room, is valid only if the $\text{transitionwindow}$ is at least one period. If the $\text{transitionwindow}$ is zero, it might be optimal to move students to a different room in a time block when a new grade is starting.

$\text{AssignGrRmBl}_{g,r,b} + 1 \geq \text{AssignGrBl}_{g,b} + \text{AssignGrRm}_{g,r} \quad \forall~g,r,b$ $\text{(14)}$

$\text{AssignGrRmBl}_{g,r,b} \leq \text{NumStudents}_{g,r,b} \quad \forall~g,r,b$ $\text{(15)}$

Special case

When solving for the daily rotation scenario and $\text{transitionwindow}$ = 0, the assigned time blocks for two grades can overlap. For example, in the optimal solution 1^st grade could attend school between 8am - 11am and 2^nd grade could attend between 10am - 1pm. It might be beneficial for 1^st grade students to change rooms at 10am in order to accommodate the 2^nd grade. Therefore, when solving the daily rotation scenario and $\text{transitionwindow}$ = 0, we drop constraints (12) and (13) because they are not relevant, and we drop constraint (14) because we want to allow students to change rooms. We also add a secondary objective $\text{(M2)}$ to minimize the number of room changes across all grades.

$\text{(M2) minimize RoomChanges}=\sum\limits_{g,r} \text{AssignGrRm}_{g,r}$

We solve a bi-criteria problem using a sequential approach – solve first for (M1), and then solve for (M2) with a constraint on (M1). Let $\text{primary_objective_value}$ be the objective value obtained when solving for (M1). Constraint (16) is added when we solve for minimizing room changes (M2):

$\sum\limits_{g,r,b} \text{duration}_{b} \ \text{NumStudents}_{g,r,b} \geq \text{primary_objective_value}$ $\text{(16)}$

Solving the Model

The model is coded and solved using the runOptmodel CAS action in SAS Optimization. We start by defining the index sets, parameters, and read data into the index sets and parameters by using the READ DATA statements. The decision variables are then declared by using the index sets.

proc cas;
   loadactionset 'optimization';
   run;
   source pgm;
 
   /*************************************************/
   /* Define sets                                   */
   /*************************************************/
   set  ROOMS;
   set  GRADES;
   set  BLOCKS;
 
   /*************************************************/
   /* Define inputs                                 */
   /*************************************************/
   str block_id {BLOCKS};
   num duration {BLOCKS};
   num capacity {ROOMS};
   num population {GRADES};
 
   /*************************************************/
   /* Read data                                     */
   /*************************************************/
   read data &caslib..input_room_data into ROOMS=[roomID] capacity;
   read data &caslib..input_school_data into GRADES=[grade] population; 
   read data &caslib..input_block_data into BLOCKS=[block] block_id duration;
 
   /*************************************************/
   /* Decision Variables                            */
   /*************************************************/
   var NumStudents {GRADES, ROOMS, BLOCKS} >= 0;
   var AssignGrRmBl {GRADES, ROOMS, BLOCKS} binary;
   var AssignGrBl {GRADES, BLOCKS} binary;
   var AssignGrRm {GRADES, ROOMS} binary;
   var AvgNumStudents >= 0;
   var ConsecutiveGrBl {GRADES, BLOCKS} binary;

The parameters needed to compute maxGradesinBlock are calculated in a DATA step prior to calling PROC CAS. As discussed, the maxGradesinBlock parameter is used in constraint (13) to improve the computation time.

   /* Number of grades in a block */ 
   num totcapacity = sum {r in ROOMS} capacity[r];
   num cum_population{GRADES};
   num grade_count{GRADES};
 
   read data &caslib..output_grades_in_block into [grade] cum_population grade_count;
   num maxGradesinBlock = max {q in GRADES: cum_population[q] <= totcapacity} grade_count[q];

The objective of the problem is to maximize the instructional hours for the students and is written as follows:

   /*************************************************/
   /* Objective Functions                           */
   /*************************************************/
   /* Objective function returns the total student hours (weighted by duration of the block) in the classroom */
   max TotalStudentsHours = sum {g in GRADES, r in ROOMS, b in BLOCKS} duration[b] * NumStudents[g, r, b];

The RoomChanges objective function is used only under the special case and is written as follows:

   /* Adding a small penalty for choosing different rooms */
   min RoomChanges = sum {g in GRADES, r in ROOMS} AssignGrRm[g, r];

Next, we show the sets of constraints in the model. The constraints are shown in the order discussed in the model formulation section. As you can observe, both $\text{virtualpercent}$ and $\text{transitionwindow}$ parameters are defined as macro variables ( $\text{virtual_percent}$ , $\text{transition_window}$ ) and can be modified by the user.

   /*************************************************/
   /* Constraints                                   */
   /*************************************************/
   /********************* Room Capacity related constraints *********************/  
   /* Students assigned to a grade,block,room should not exceed the capacity the room. */
   con GradeRoomBlockAssignment {g in GRADES, r in ROOMS, b in BLOCKS}:
      NumStudents[g, r, b] <= capacity[r] * AssignGrRmBl[g, r, b];
 
   /* Room/block assigned to a grade should be less than or equal to 1. - A room can be assigned to a max. of one grade in a time block */
   con RoomBlockAssignment {r in ROOMS, b in BLOCKS}:
      sum {g in GRADES} AssignGrRmBl[g, r, b] <= 1;
 
   /********************* Number of students constraints *********************/  
   /* Number of students in a grade, block across all rooms should be equal to the population in that grade */
   con GradePop {g in GRADES, b in BLOCKS}:
      sum {r in ROOMS} NumStudents[g, r, b] = (1 - (&virtual_percent. / 100)) * population[g] * AssignGrBl[g, b];
 
   /********************* Student Equality constraints *********************/  
   /* Computing average number students attending in a grade and Ensuring that the AvgNumStu is same across all grades*/
   con GradeAvg {g in GRADES}:  
      sum {r in ROOMS, b in BLOCKS} NumStudents[g, r, b] / (card(BLOCKS)* population[g] * (1 - (&virtual_percent. / 100))) = AvgNumStudents;
 
   /********************* Deriving Grade-Block and Grade-Room variables constraints **********************/  
   /* Deducing grade, block assignment from grade,block,room assignment. */
   con GradeBlockAssignment {g in GRADES, r in ROOMS, b in BLOCKS}:
      AssignGrRmBl[g, r, b] <= AssignGrBl[g, b];
 
   /* Deducing grade, room assignment from grade,block,room assignment. */
   con GradeRoomAssignment {g in GRADES, r in ROOMS, b in BLOCKS}:
      AssignGrRmBl[g, r, b] <= AssignGrRm[g, r];
 
   /********************* Constraints specific to Plan 4 *********************/
   /* Constraint that ensures a grade is assigned only to continuous blocks */
   con ConsFirstBlock {g in GRADES}:
      ConsecutiveGrBl[g, 1] = AssignGrBl[g, 1];
 
   con ConsPattern {g in GRADES, b in 2..card(BLOCKS)}:
      ConsecutiveGrBl[g, b] >= AssignGrBl[g, b] - AssignGrBl[g, b-1];
 
   con ConsPatternRes {g in GRADES}:
      sum {b in BLOCKS} ConsecutiveGrBl[g, b] <= 1;
 
   /* If grade g is assigned to block b-1 or if grade g is not assigned to block b, then force ConsecutiveGrBl[g, b] to be 0 */
   con ConsAddCuts {g in GRADES, b in 2..card(BLOCKS)}:
      ConsecutiveGrBl[g, b] <= AssignGrBl[g, b];
 
   con ConsAddCuts1 {g in GRADES, b in 2..card(BLOCKS)}:
      ConsecutiveGrBl[g, b] <= 1 - AssignGrBl[g, b-1];
 
   /* Breaks in between for cleaning */
   con Breakconstraint {g in GRADES, g1 in GRADES, b in (1+&transition_window.) ..card(BLOCKS)}:
      AssignGrBl[g1, (b-&transition_window.)] <= 1 - ConsecutiveGrBl[g, b];
 
   con Breakconstraint1 {g in GRADES, b in (1+&transition_window.) ..card(BLOCKS)}:
      sum {g1 in GRADES} AssignGrBl[g1, (b-&transition_window.)] <= maxGradesinBlock * (1 - ConsecutiveGrBl[g, b]);
 
   /* Constraint that ensures students assigned to a room r in a grade g does not change rooms */
   con NoRoomChanges {g in GRADES, r in ROOMS, b in BLOCKS}:
      AssignGrRmBl[g, r, b] + 1 >= AssignGrBl[g, b] + AssignGrRm[g, r];
 
   con NoRoomChanges2 {g in GRADES, r in ROOMS, b in BLOCKS}:
      AssignGrRmBl[g, r, b] <= NumStudents[g, r, b];
 
   /* Constrain each of the objective functions to prevent worse solutions */
   num primary_objective_value init 0;
 
   con PrimaryObjConstraint:
      sum {g in GRADES, r in ROOMS, b in BLOCKS} duration[b] * NumStudents[g, r, b] >= primary_objective_value;

Next, we discuss the solve statements. We have two solve cases: one for the monthly and weekly rotation scenarios and the other for the daily rotation scenario. The plan_num macro variable is used to define the rotation scenario: plan_num=2 is the monthly rotation scenario, plan_num=3 is the weekly rotation scenario, and plan_num=4 is the daily rotation scenario. For the monthly and weekly rotation scenarios, we drop the constraints that are specific to the daily rotation scenario.
In the special case of daily rotation scenario with $\text{transitionwindow}$ =0, we drop the constraints named Breakconstraint, Breakconstraint1, and NoRoomChanges. We also solve for objective $\text{M2}$ ( $\text{minimize RoomChanges}$ ), given the optimal value of objective $\text{M1}$ ( $\text{maximize TotalStudentHours}$ ) is preserved.

We are using the MILP solver to solve the problem, allowing a relative objective gap of 1%.

   /*************************************************/
   /* Solve                                         */
   /*************************************************/
   if &plan_num. = 2 or &plan_num. = 3 then do;
      drop Breakconstraint;
      drop Breakconstraint1;
      drop ConsFirstBlock;
      drop ConsPattern;
      drop ConsPatternRes;
      drop ConsAddCuts;
      drop ConsAddCuts1;
      drop NoRoomChanges;
      drop PrimaryObjConstraint;
 
      solve obj TotalStudentsHours with milp / loglevel=3 relobjgap=0.01;  
   end;
 
   if &plan_num. = 4 then do;
      drop PrimaryObjConstraint;
 
      if &transition_window. = 0 then do;
         drop Breakconstraint; 
         drop Breakconstraint1;
         drop NoRoomChanges;
      end;
 
      solve obj TotalStudentsHours with milp / loglevel=3 relobjgap=0.01;
 
      /* Cleaning step - before primalin */
      for {g in GRADES, r in ROOMS, b in BLOCKS} AssignGrRmBl[g, r, b] = round(AssignGrRmBl[g, r, b]);
      for {g in GRADES, b in BLOCKS} AssignGrBl[g, b] = round(AssignGrBl[g, b]);
      for {g in GRADES, r in ROOMS} AssignGrRm[g, r] = round(AssignGrRm[g, r]);
      for {g in GRADES, b in BLOCKS} ConsecutiveGrBl[g, b] = round(ConsecutiveGrBl[g, b]);
 
      if &transition_window. = 0 then do;
         /* Solve for secondary objective only if primary objective solve was successful */
         if _NSOL_ > 0 then
            do;
            primary_objective_value = TotalStudentsHours.sol;
            restore PrimaryObjConstraint;
            solve obj RoomChanges with milp / primalin loglevel=3 relobjgap=0.01;
         end;
      end;
   end;

After we solve the problem, we use a CREATE DATA statement to create the output table with all the required data for the output visualization. As mentioned previously, the optimization model can be run independent of the school index and can be executed in parallel for the schools in $S$ (i.e., for the entire school district). We use the groupBy option in the runOptmodel statement to execute the model for multiple schools in parallel. Note that if the input data have only school, the runOptmodel statement is still relevant and is executed for that one school. The nGroupByTasks = ‘ALL’ option in the runOptmodel statement enables the model to use all the threads on each worker node when optimizing the BY groups in parallel.

   /*************************************************/
   /* Create output data                            */
   /*************************************************/
   num total_capacity = 6 * sum {r in ROOMS} capacity[r];
   num total_population = sum {g in GRADES} population[g];
   num StudentHoursDay = (TotalStudentsHours.sol * 6) / card(BLOCKS);
   num AvgHoursPerStuWeek = (StudentHoursDay*5) / total_population;
   num totHoursStuWeek = (StudentHoursDay*5);
   num num_blocks_scen = card(BLOCKS);
 
   num num_students {g in GRADES, r in ROOMS, b in BLOCKS} = round(NumStudents[g,r,b].sol);
   num grade_room_block_assignment {g in GRADES, r in ROOMS, b in BLOCKS} = round(AssignGrRmBl[g,r,b].sol);
   num AssignGrBlSol {g in GRADES, b in BLOCKS} = round(AssignGrBl[g,b].sol);
 
   create data &caslib..output_full_assignment from [grade roomID block] = {g in GRADES, r in ROOMS, b in BLOCKS} 
      num_students 
      grade_room_block_assignment
      AssignGrBl = AssignGrBlSol[g,b]
      block_id[b] 
      duration[b] 
      population[g]
      capacity[r]
      total_capacity
      StudentHoursDay
      AvgHoursPerStuWeek
      num_blocks_scen
      total_population
      totHoursStuWeek;
 
   endsource;
   runOptmodel / code=pgm groupBy='School_Name' nGroupByTasks='ALL';
   run;
quit;

Figure 6 shows the optimization log from executing the model for an elementary school under the weekly rotation scenario. The data for this problem are available in GitHub. First, the optimization log shows the details of this problem instance such as the numbers of variables and constraints, before and after the presolver step.. Next, you can see the iteration log with information about each iteration until convergence criteria are reached (we specified a relative objective gap of 1%). For this problem instance, the solver was able to find an optimal solution of 2091 student hours in 4 seconds of run time.

Figure 6: Optimization log

Visualizing the Output

To begin analyzing solutions and options for a full school district, we wanted to compare the relevant KPIs such as average weekly in-person instructional hours per student, average room utilization, and teacher workload. We ran the optimization engine described above for the three rotation scenarios (monthly, weekly, and daily) and for different capacity settings (measured as minimum square footage per student). Figure 7 displays the results of these optimization runs. We used SAS Visual Analytics to build user-friendly visual representations. Note: These results have been fully anonymized and do not represent any specific school or school district.

Figure 7: Minimum square footage per student vs. student hours per week

In most capacity settings, the weekly rotation outperforms the other scenarios, meaning we are able to deliver more in-person instructional hours while satisfying capacity constraints and better utilizing the limited room availability. Figure 8 shows the scenario for a weekly rotation and 60 square feet per student. Most schools can only accommodate an average of 12 weekly in-person instructional hours while a few schools can accommodate as high as 30 hours. The number of required teachers to generate this schedule has a wider distribution across all schools, varying from 17 to 40. The numbers of hours a teacher works per week varies between 21 and 33.

Figure 8: KPIs for weekly rotation scenario and 60 square feet per student

Once the school administrators choose a time-horizon plan and the social distancing guideline, we can then drill down into recommend detailed student schedule and classroom assignments. Figure 9 shows the grade and time block schedule for one elementary school under the monthly rotation scenario. In this example, K and 5^th grade students attend in-person instruction in weeks 1 and 2; 1^st, 2^nd, 3^rd, and 4^th grade students come to school in weeks 3 and 4.

Figure 9: Grade, time block schedule

Figure 10 shows the grade, room, and time block assignment schedule for students. The blue dots in the chart represent the room number where the students are assigned in the corresponding time block.

Figure 10: Grade, room, time block schedule

Figure 11 shows the number of teachers required at the school by time block. In this example, the school needs 20 teachers in weeks 1 and 2, and 37 teachers in weeks 3 and 4.

Figure 11: Number of teachers by time block

Summary

This article addresses an important and relevant problem of reopening schools and safely bringing students back to in-class instruction. Given a finite number of rooms and maximum occupancy restrictions, this study proposes an optimal schedule to get students back to school in order to maximize the in-person instructional hours of students. The optimization problem is modeled as a mixed integer linear programming model and solved using the runOptmodel action in SAS Optimization. We use realistic data from a school district and execute the model for different time horizon scenarios and capacity settings. We also developed visualizations using SAS Visual Analytics to aggregate and analyze the results from the runs. For a desired scenario, SAS Visual Analytics reports can be used to drill down into detailed schedules for students, classrooms, and teachers. This model supports school administrators in making analytics-based decisions for safely bringing students back to classrooms. We are eager to partner with school organizations to share these insights and potentially aid their decision planning during these challenging times.

Acknowledgments

The author would like to thank Michelle Opp, Rob Pratt, Natalia Summerville, and Matt Fletcher for their contributions to the model formulation and the blog content.

Additional reading

A high level description of the problem can be found in the Mathematical Optimization to Support Safe Back-to-School LinkedIn article.

Code

The code and data for this problem are available in GitHub at
https://github.com/sascommunities/sas-optimization-blog/tree/master/back_to_school_optimization.

The post Back to School Optimization appeared first on Operations Research with SAS.

Workforce Scheduling at an Animal Shelter

Michelle Opp — Mon, 14 Sep 2020 13:00:07 +0000

[Nabaruna Karmakar was coauthor of this post]

A study was conducted at the University of Denver on The Economic Impacts of the Austin, Texas "No Kill" Resolution. The study found great value in creating an animal welfare-focused community. It highlighted the benefits of economic growth due to an increased need in veterinary care, and the potential for attracting a like-minded, pet-loving workforce for companies located in Austin. The study also spoke about the costs associated with developing such a community – part of which included expenses related to staffing the animal shelters.

Like many non-profits, a shelter’s necessity to keep staffing costs low while scheduling enough employees to cover the demand during peak hours can be a struggle. In many organizations, managers often rely on personal judgment to create staff schedules, leading to higher labor costs in the case of overstaffing, or unmet needs in the case of understaffing.

In this article, we present a mixed integer linear programming (MILP) model to help an animal shelter create an optimal workforce schedule that meets the staffing demands at minimum cost. With minor tweaks, the model could easily be adapted to many other non-profit or for-profit settings, such as scheduling bartenders, servers, and hosts in a restaurant, or scheduling nurses, nursing assistants, and administrative staff at a COVID-19 testing site.

Problem Description

We begin any optimization problem by identifying the three main components: the decisions we need to make, the goal we want to achieve by those decisions, and the constraints that prevent us from making certain decisions. We describe each of these next.

At this animal shelter, we have three types of employees: full-time employees, part-time employees, and volunteers. The animal shelter is open daily from 9 a.m. to 8 p.m. Therefore, our decisions are which employees should be scheduled during each hour of each day in a one-week time frame. Our goal (or objective) is to minimize the total labor cost, which includes regular wages and overtime wages for the full-time and part-time employees. The volunteers have no labor cost.

The constraints are what make the optimization problem interesting. Without constraints, our minimum labor cost would be zero, because we simply wouldn’t schedule any full-time or part-time employees. But we have a limited number of volunteers, and they are not qualified to perform every job at the shelter. Therefore, relying only on volunteers is not an acceptable solution because it leaves many needs unmet. In our problem, the schedule must satisfy the following constraints:

Full-time employees must be assigned either five or six shifts per week, with each shift being between eight and 10 consecutive hours.
Part-time employees must be assigned either five or six shifts per week, with each shift being between four and seven consecutive hours.
Volunteers can be scheduled for shifts of up to three consecutive hours, and they can specify the maximum number of days they are willing to volunteer in each week.
Full-time and part-time employees must receive overtime pay equal to 1.5 times their hourly wage for any hours worked beyond a specified threshold. For full-time employees, this threshold is 40 hours; for part-time employees, it is 20 hours.
Full-time and part-time employees should be given two consecutive days off unless they are assigned overtime, and volunteers should be given two consecutive days off. The employees and volunteers can specify which two days they would like to take off. If no preference is specified for some employees and volunteers, the model chooses two consecutive days off for these employees and volunteers.
The shelter has four different jobs that the employees and volunteers must perform: walking, feeding, administrative, and grooming. The schedule must satisfy the demand for each job in each hour of each day.
Employees and volunteers have different skill sets, which dictate the set of jobs each person can perform. We assume the jobs are nested, such that an employee who is qualified for one job is also qualified for all the jobs that require a lower skill level. Specifically:
- An employee or volunteer who is qualified for grooming can also perform administrative jobs, feeding, and walking.
- An employee or volunteer who is qualified for administrative jobs can also perform feeding and walking.
- An employee or volunteer who is qualified for feeding can also perform walking.
- An employee or volunteer who is qualified only for walking cannot perform any other jobs.

Input Data Overview

Before we describe the mathematical formulation for this optimization problem, let's take a look at the input data in order to better understand the decisions, objective, and constraints described in the previous section. The input data include an Employees table, a Jobs table, a Demand table, and a Demand Coefficients table.

Employees Table and Jobs Table

The Employees table contains the full employee and volunteer roster, consisting of 16 full-time employees, 15 part-time employees, and 10 volunteers. A portion of the table is shown in Figure 1. The skill column identifies the employee’s skill level; the numeric values {1, 2, 3, 4} are mapped to the jobs of {Walking, Feeding, Administrative, Grooming} in the Jobs table shown in Figure 2.

The cost column in the Employees table contains the hourly wage for each employee; note that the hourly wage for volunteers is always zero. In the off_days column, you can see that some employees and volunteers have specified which two consecutive days they would like to receive off, but other employees and volunteers have not indicated a preference and will let the scheduler decide. Finally, the max_volunteer_days column indicates the maximum number of days that a volunteer is willing to work each week. In our data, volunteers are willing to work either two or three days per week.

Figure 1: Employees Table

Figure 2: Jobs Table

Demand Table and Demand Coefficients Table

The Demand table shown in Figure 3 contains a prediction of the number of animals that will be in the animal shelter during each hour of each day of the week; col9 represents the predicted number of animals at 9 a.m., col10 represents the predicted number of animals at 10 a.m., and so on. We calculate the demand for the four jobs (Walking, Feeding, Administrative, and Grooming) as a proportion of the expected number of animals in the shelter, and the Demand Coefficients table shown in Figure 4 contains the multipliers that are used to calculate the demand for each job at each time. Note that the demand coefficients do not vary by day of the week, but they do vary by time. For example, feeding happens only between 9 a.m. and noon, and again between 5 p.m. and 8 p.m.

Figure 3: Demand Table

Figure 4: Demand Coefficients Table

The bar chart and accompanying detail table in Figure 5 show the number of employees required to cover each job throughout the week, calculated using both the predicted demand from the Demand table and the job demand coefficients from the Demand Coefficients table.

Figure 5: Demand for each job throughout the week

Model Formulation

We use the runOptmodel CAS action in SAS Optimization to model and solve this optimization problem, so we begin our code with a call to PROC CAS followed by a SOURCE statement that enables us to assign the OPTMODEL statements that follow to a variable called pgm. Then we declare the index sets and parameters that we use to store the data for the problem, and we use READ DATA statements to read the four input tables and populate the index sets and parameters.

proc cas; 
source pgm;
 
   num first_hour = 9;
   num last_hour = 19;
   set  EMPLOYEES;
   set  DAYS;
   set  JOBS;
   set TIMES={first_hour..last_hour};
 
   num demand{DAYS, TIMES};
   num demand_coef{JOBS, TIMES};
   num req_skill{JOBS};
   num base_cost init 1;
   num over_cost init 1.5;
 
   num skill{EMPLOYEES};
   num cost{EMPLOYEES};
   num min_shift{EMPLOYEES};
   num max_shift{EMPLOYEES};
   num min_hour{EMPLOYEES};
   num max_hour{EMPLOYEES};
   num max_volunteer_days{EMPLOYEES};
   str type{EMPLOYEES};
   str off_days{EMPLOYEES};
 
   read data casuser.ASO_employees 
      into EMPLOYEES=[id] skill type cost off_days max_volunteer_days;
 
   read data casuser.ASO_jobs 
      into JOBS=[job] req_skill;
 
   read data casuser.ASO_demand
      into DAYS=[day] {t in TIMES} [day,t]= col('col'||t)>;
 
   read data casuser.ASO_demand_coef
      into [job time] demand_coef;

Next we construct some additional derived sets that are helpful in modeling the problem. For each employee, we create the following derived sets:

QUALIFIED_JOBS: The set of jobs that the employee is qualified to perform.
EMPLOYEE_REGULAR_DAYS: The set of days that the employee has not specifically requested to take off.
EMPLOYEE_OT_DAYS: The set of days that the employee could potentially be assigned an overtime shift. For volunteers, this is an empty set because volunteers are not allowed to work overtime. For part-time and full-time employees, the set depends on whether the employee has requested specific days off. If an employee has requested specific days off, then EMPLOYEE_OT_DAYS contains the requested days off; if an employee is assigned to work on any of these days, it must be an overtime shift. If an employee has not requested specific days off, then EMPLOYEE_OT_DAYS contains the set of all days because any day is a potential overtime day if the employee exceeds the number of regular time shifts.
EMPLOYEE_DAYS: The set of days that an employee is eligible to work. This is simply the union of EMPLOYEE_REGULAR_DAYS and EMPLOYEE_OT_DAYS. For part-time and full-time employees, this is the same as the set DAYS, but for volunteers it excludes any days that the volunteer has specifically requested to take off.

   set QUALIFIED_JOBS{i in EMPLOYEES} = {j in JOBS: skill[i] >= req_skill[j]};
 
   set EMPLOYEE_REGULAR_DAYS{i in EMPLOYEES} = DAYS diff {scan(off_days[i],1,'-')} diff {scan(off_days[i],2,'-')};
 
   set EMPLOYEE_OT_DAYS{i in EMPLOYEES} = if type[i]='Volunteer' then {} 
                                          else (if missing(off_days[i]) then DAYS 
                                                else DAYS diff EMPLOYEE_REGULAR_DAYS[i]);
 
   set EMPLOYEE_DAYS{i in EMPLOYEES} = EMPLOYEE_REGULAR_DAYS[i] union EMPLOYEE_OT_DAYS[i];

The next section of the code populates some parameters with the hard-coded values we are using for the minimum and maximum number of shifts and length of shifts for each type of employee. We could have included additional columns in the Employees table and added these columns to the READ DATA statement to populate the parameters, and the employees could have had different values for the minimum and maximum number of shifts and number of hours per shift. However, since we are assuming constant values for all employees within each employee type, we are populating the values directly within the OPTMODEL code. The COALESCE function means that if a Volunteer has a missing value for max_volunteer_days, we are assigning a default value of 3.

We also create a parameter called next_day that stores the string corresponding to the next day of the week for each day. We use this parameter to ensure that employees get two consecutive days off even when they haven't requested specific days off.

   for {i in EMPLOYEES} do;
      if type[i] = 'Full-Time' then do;
         min_shift[i] = 5;
         max_shift[i] = 6;
         min_hour[i] = 8;
         max_hour[i] = 10;
      end;
      else if type[i] = 'Part-Time' then do;
         min_shift[i] = 5;
         max_shift[i] = 6;
         min_hour[i] = 4;
         max_hour[i] = 7;
      end;
      else if type[i] = 'Volunteer' then do;
         min_shift[i] = 0;
         max_shift[i] = coalesce(max_volunteer_days[i],3);
         min_hour[i] = 0;
         max_hour[i] = 3;
      end;
   end;
 
   str next_day{DAYS};
   for {d in DAYS} do;
      if      d = 'Mon' then next_day[d] = 'Tue';
      else if d = 'Tue' then next_day[d] = 'Wed';
      else if d = 'Wed' then next_day[d] = 'Thu';
      else if d = 'Thu' then next_day[d] = 'Fri';
      else if d = 'Fri' then next_day[d] = 'Sat';
      else if d = 'Sat' then next_day[d] = 'Sun';
      else if d = 'Sun' then next_day[d] = 'Mon';
   end;

The main decision variables in this problem are Assign_To_Job[i,j,t,d], and they are binary variables. Assign_To_Job[i,j,t,d] has a value of 1 if employee i is assigned to job j at time t on day d, and it has a value of 0 otherwise. If we know the values of these decision variables, we have everything we need to construct the employee schedule and determine the associated cost. However, the model needs some additional auxiliary variables to force relationships in the constraints; for example, to make sure that each employee satisfies the minimum and maximum number of shifts.

The additional variables are also binary. For example, Assign_To_Regular_Day[i,d] has a value of 1 if employee i is assigned to work a regular shift on day d, and it has a value of 0 otherwise. Likewise, Assign_To_Overtime_Day[i,d] has a value of 1 if employee i is assigned to work an overtime shift on day d, and it has a value of 0 otherwise.

Note that the variables Is_Working_Hour and Is_Working_Day could have been modeled as implicit variables by using the IMPVAR statement instead of the VAR statement. Is_Working_Hour[i,t,d] is simply the sum of Assign_To_Job[i,j,t,d] over all qualified jobs j, and Is_Working_Day[i,d] is simply the sum of Assign_To_Regular_Day[i,d] and Assign_To_Overtime_Day[i,d]. We tested both formulations, and for this particular problem, modeling them as explicit variables resulted in a slight improvement in the run time. In general, the explicit variable formulation can reduce the number of constraint coefficients when the variable appears in several places, but it will not always result in faster performance. You should experiment with both formulations in your models if run-time performance is a primary concern.

We use the variables called Two_Days_Off_Start_Day to indicate the first day off in a consecutive two-day period. For example, if Two_Days_Off_Start_Day[i,'Mon'] = 1, it means that employee i is not scheduled on Monday or Tuesday as a regular day. These variables are defined only for employees who did not request specific days off and whose maximum number of shifts per week is greater than three. If an employee works three or fewer days in the week, we know that the employee must already receive at least two consecutive days off. In our problem instance, this is the case only for volunteers.

The Switch variables are needed to make sure that employees work consecutive hours during a shift. For example, if a full-time employee is scheduled for eight hours on Monday, we would not want those hours to be 8 a.m. to 9 a.m., then 11 a.m. to 1 p.m., then 2 p.m. to 4 p.m., then 5 p.m. to 8 p.m. The Switch variables will be explained in more detail when we get to the corresponding constraints.

Finally, we can fix the values of some variables for decisions that we already know must be true. Full-time and part-time employees must work a minimum of five days. If an employee has requested two specific days off, then we know which five days must be worked as regular shifts before any overtime shifts can be assigned to the employee on one of the requested days off. In general, if the number of regular days available for an employee is equal to the minimum number of days the employee must work, then we know the employee must be assigned to work on each available regular day. Therefore, we can fix both Assign_To_Regular_Day[i,d] and Is_Working_Day[i,d] to 1 for these days.

   /******************************** Variables *******************************/
 
   var Assign_To_Job {i in EMPLOYEES, QUALIFIED_JOBS[i], TIMES, EMPLOYEE_DAYS[i]} binary;
 
   var Assign_To_Regular_Day {i in EMPLOYEES, EMPLOYEE_REGULAR_DAYS[i]} binary;
   var Assign_To_Overtime_Day {i in EMPLOYEES, EMPLOYEE_OT_DAYS[i]} binary;
 
   var Is_Working_Hour {i in EMPLOYEES, TIMES, EMPLOYEE_DAYS[i]} binary;
   var Is_Working_Day {i in EMPLOYEES, EMPLOYEE_DAYS[i]} binary;
 
   var Two_Days_Off_Start_Day {i in EMPLOYEES, DAYS: missing(off_days[i]) and max_shift[i] > 3} binary;
 
   var Switch {i in EMPLOYEES, TIMES, EMPLOYEE_DAYS[i]} binary;
 
   for {i in EMPLOYEES: card(EMPLOYEE_REGULAR_DAYS[i])=min_shift[i]} do;
      for {d in EMPLOYEE_REGULAR_DAYS[i]} do;
         fix Assign_To_Regular_Day[i,d] = 1;
         fix Is_Working_Day[i,d] = 1;
      end;
   end;

The objective is to minimize the total cost, which consists of a base cost and an overtime cost. Notice that we are calculating the overtime cost as only the additional premium beyond an employee's regular hourly wage. For example, suppose a full-time employee with an hourly wage of $20 works 48 hours in the week. We are considering the base cost to be $20 * 48, and the overtime cost to be $10 * 8. In some situations, you might want to instead consider the base cost to be $20 * 40 and the overtime cost to be $30 * 8. Since we are minimizing the total cost, both interpretations of overtime cost are equivalent in terms of the objective, and we have chosen the former interpretation because it simplifies the model. If you want to report base and overtime costs differently in the solution, a simple postprocessing step after the SOLVE statement could be used to calculate different costs per employee.

   /******************************** Objective *******************************/
 
   min BaseCost
      = base_cost * sum{i in EMPLOYEES, t in TIMES, d in EMPLOYEE_DAYS[i]} cost[i] * Is_Working_Hour[i,t,d];
 
   min OverTimeCost
      = (over_cost - base_cost) * sum {i in EMPLOYEES} 
                                       cost[i] * (sum {t in TIMES, d in EMPLOYEE_DAYS[i]} Is_Working_Hour[i,t,d] 
                                                  - min_shift[i] * min_hour[i]);
 
   min TotalCost
      = BaseCost + OverTimeCost;

Before we move on to the constraints, we first need to say a word about DECOMP. Most of our constraints are applied individually to each employee, and the only constraints that span multiple employees are the demand satisfaction constraints. Therefore, this problem's structure is a natural fit for using a decomposition algorithm. When we define the constraints, we use a constraint suffix to identify each constraint with a particular decomposition block. First, however, we need to define the blocks. We have chosen to define the blocks according to each employee/day pair. We tested blocks at the employee level and at employee/day level and we saw slightly better performance from the employee/day level. In your applications, you might want to experiment with different block assignments to find the one that works best. If you don't know what your blocks should be, you can still use the DECOMP algorithm, which first attempts to identify suitable blocks. But because we already know something about the structure of the constraints, it is helpful to identify them to the DECOMP algorithm. Here we are simply creating a unique numeric index called block_id[i,d] for each [i,d] pair.

   /****************************** Blocks Setup ******************************/
   num block_id{EMPLOYEES, DAYS};
   num id init 0;
   for {i in EMPLOYEES, d in DAYS} do;
      block_id[i,d] = id;
      id = id + 1;
   end;

Next we describe the constraints. The first few constraints are fairly intuitive. Is_Working_Hour and Is_Working_Day are the variables that we chose to model as explicit variables rather than implicit variables for a slight boost in performance, and the corresponding constraints simply define the relationship between these variables and the other binary variables.

The Satisfy_Demand constraint ensures that we schedule enough employees to cover the requirements for each job during each time period on each day. Assign_Day_If_Assign_Job is a linking constraint that ensures that if an employee is working during any time period of a day, the employee is considered to be working on that day.

Note the suffixes=(block=block_id[i,d]) option at the end of each of these constraints except the Satisfy_Demand constraint. This is how we identify a constraint with a particular block to be used in the decomposition algorithm—we assign to the block keyword the unique numeric index block_id[i,d] that we created in a previous step. We cannot include this suffix on the Satisfy_Demand constraint because it spans multiple employees, but the other constraints are applied individually to each employee/day pair and therefore can be assigned to the DECOMP blocks.

   /******************************* Constraints ******************************/
 
   con Define_Is_Working_Hour {i in EMPLOYEES, t in TIMES, d in EMPLOYEE_DAYS[i]}:
      Is_Working_Hour[i,t,d] = sum{j in QUALIFIED_JOBS[i]} Assign_To_Job[i,j,t,d]
         suffixes=(block=block_id[i,d]);
 
   con Define_Is_Working_Day {i in EMPLOYEES, d in EMPLOYEE_DAYS[i]}:
      Is_Working_Day[i,d] = (if d in EMPLOYEE_REGULAR_DAYS[i] then Assign_To_Regular_Day[i,d]) 
                          + (if d in EMPLOYEE_OT_DAYS[i] then Assign_To_Overtime_Day[i,d])
         suffixes=(block=block_id[i,d]);
 
   con Satisfy_Demand {j in JOBS, t in TIMES, d in DAYS}:
      sum {i in EMPLOYEES: j in QUALIFIED_JOBS[i] and d in EMPLOYEE_DAYS[i]} Assign_To_Job[i,j,t,d] 
      >= demand_coef[j,t] * demand[d,t];
 
   con Assign_Day_If_Assign_Job {i in EMPLOYEES, j in QUALIFIED_JOBS[i], t in TIMES, d in EMPLOYEE_DAYS[i]}:
      Is_Working_Day[i,d] >= Assign_To_Job[i,j,t,d]
         suffixes=(block=block_id[i,d]);

The next constraints enforce the minimum and maximum number of shifts (that is, days) per employee, and the minimum and maximum number of hours per shift. Note that Min_Num_Shifts uses the Assign_To_Regular_Day variables, while Max_Num_Shifts uses Is_Working_Day. This is because we want to include the possibility of overtime shifts in the maximum, but we only want to assign an overtime day if the employee has already been assigned the minimum number of shifts for regular time. Also note that we cannot use the suffixes=(block=block_id[i,d]) option on the Min_Num_Shifts and Max_Num_Shifts constraints. Even though these constraints are applied individually for each employee, they span multiple days, and we have defined our decomposition blocks according to employee/day pairs.

   con Min_Num_Shifts {i in EMPLOYEES: min_shift[i] > 0}:
      sum {d in EMPLOYEE_REGULAR_DAYS[i]} Assign_To_Regular_Day[i,d] >= min_shift[i];
 
   con Max_Num_Shifts {i in EMPLOYEES}:
      sum{d in EMPLOYEE_DAYS[i]} Is_Working_Day[i,d] <= max_shift[i]; 
 
   con Min_Hours_Per_Shift {i in EMPLOYEES, d in EMPLOYEE_DAYS[i]: min_hour[i] > 0}:
      sum {t in TIMES} Is_Working_Hour[i,t,d] >= min_hour[i] * Is_Working_Day[i,d]
         suffixes=(block=block_id[i,d]);
 
   con Max_Hours_Per_Shift {i in EMPLOYEES, d in EMPLOYEE_DAYS[i]}:
      sum {t in TIMES} Is_Working_Hour[i,t,d] <= max_hour[i]
         suffixes=(block=block_id[i,d]);

The next three constraints are a bit more complicated than the previous constraints. We need to force the hours that an employee works in a day to be consecutive, and for this we use the Switch variables. We are defining Switch[i,t,d] to have a value of 1 if employee i is not working in time t and begins working in time t+1 on day d, and 0 otherwise. We need to create constraints that force Switch to follow this definition, and then we can force Switch[i,t,d] to take a value of 1 for at most one time period t. In other words, we have at most one "switch" from non-working to working on a given day. The comments in the following section of code describe the conditions that we are forcing with each constraint; you can substitute different combinations of values of 0 and 1 for the other variables to understand how Switch[i,t,d] must behave in each case.

   /* If employee is not working in time t but working in time t+1, force Switch[i,t,d] to be 1. */
   con Force_Switch_to_1 {i in EMPLOYEES, t in TIMES, d in EMPLOYEE_DAYS[i]: t+1 <= last_hour}: 
      Is_Working_Hour[i,t,d] + Switch[i,t,d] >= Is_Working_Hour[i,t+1,d]
         suffixes=(block=block_id[i,d]);
 
   /* If employee is working in time t and also working in time t+1, force Switch[i,t,d] to be 0. 
      If employee is not working in time t and also not working in time t+1, force Switch[i,t,d] to be 0.
      If employee is working in time t and not working in time t+1, force Switch[i,t,d] to be 0. */
   con Force_Switch_to_0 {i in EMPLOYEES, t in TIMES, d in EMPLOYEE_DAYS[i]: t+1 <= last_hour}:
      Is_Working_Hour[i,t,d] + 2 * Switch[i,t,d] <= 1 + Is_Working_Hour[i,t+1,d]
         suffixes=(block=block_id[i,d]);
 
   /* Allow at most one switch per day. But if the employee begins work in the first hour, don't allow 
      any switches. */
   con Max_One_Switch_Per_Day {i in EMPLOYEES, d in EMPLOYEE_DAYS[i]}:
      sum{t in TIMES: t+1 <= last_hour} Switch[i,t,d] <= 1 - Is_Working_Hour[i,first_hour,d] 
         suffixes=(block=block_id[i,d]);

Finally, we need to add constraints to make sure that the employees who did request specific days off do get two consecutive days off unless they are assigned overtime. Recall that the binary variable called Two_Days_Off_Start_Day[i,d] has a value of 1 if day d is the first of two consecutive days off for employee d, and 0 otherwise. The first constraint below forces Two_Days_Off_Start_Day to have a value of 0 if either of the two days is assigned as a regular day, and the second constraint forces us to choose at least one day d with Two_Days_Off_Start_day[i,d] = 1. We only need to apply these constraints to the set of employees who did not request specific days off and who can work more than three days per week. If an employee can only work three or fewer days per week, then we know the employee is already receiving two consecutive days off.

   con Two_Days_Off_Zero_if_Working {i in EMPLOYEES, d in DAYS: missing(off_days[i]) and max_shift[i] > 3}:
      Assign_To_Regular_Day[i,d] + Assign_To_Regular_Day[i,next_day[d]] + 2*Two_Days_Off_Start_Day[i,d] <= 2; 
 
   con Force_Two_Days_Off {i in EMPLOYEES: missing(off_days[i]) and max_shift[i] > 3}:
      sum{d in DAYS} Two_Days_Off_Start_Day[i,d] >= 1;

Solving the Problem

After doing the work of modeling this optimization problem, solving it is the easy part! We are using the MILP solver to minimize the TotalCost objective. We have specified a maximum time of 300 seconds. Although this problem solves very quickly in under 30 seconds, it is a good practice to include a time bound in case we decide to modify the problem formulation in the future to accommodate additional constraints or to try different solver options. We are also specifying a relative objective gap of 0.1%.

We are using the DECOMP algorithm, and we specify HYBRID=FALSE so the root node processing is handled entirely by the decomposition algorithm, instead of first processing the root node using standard MILP techniques. Recall that we added the suffixes=(block=block_id[i,d]) option to the decomposable constraints in order to identify the block for each constraint. When blocks are defined, the default DECOMP option is METHOD=USER, so we do not need to specify it as an option in the SOLVE statement.

   /********************************** Solve *********************************/
 
   solve obj TotalCost with milp / maxtime=300 relobjgap=0.001 decomp=(hybrid=false);

After we solve the problem, we calculate the weekly base cost and overtime cost for each employee, and we use CREATE DATA statements to create output tables containing the job assignments and the employee costs.

   /****************************** Create Output *****************************/
 
   num regCost{EMPLOYEES}, OTCost{EMPLOYEES} init 0;
   for {i in EMPLOYEES} do;
      regCost[i] = base_cost * cost[i] * sum {t in TIMES, d in EMPLOYEE_DAYS[i]} 
                                               round(Is_Working_Hour[i,t,d].sol,1);
      OTcost[i] = (over_cost - base_cost) * cost[i] 
                   * (sum {t in TIMES, d in EMPLOYEE_DAYS[i]} round(Is_Working_Hour[i,t,d].sol,1) 
                      - min_shift[i] * min_hour[i]);
   end;
 
   create data casuser.job_assignments
      from [employee job time day] = {i in EMPLOYEES, j in QUALIFIED_JOBS[i], t in TIMES, d in EMPLOYEE_DAYS[i]:
                                      Assign_To_Job[i,j,t,d].sol > 0.9};
 
   create data casuser.employee_costs
      from [employee] = {i in EMPLOYEES}
         base_cost = regCost[i]
         overtime_cost = OTcost[i];

As the last step in our code, we need to use an ENDSOURCE statement to terminate the block of code that is stored in the pgm variable. Then we use the ACTION statement to call the runOptmodel CAS action from the optimization action set, and we use code=pgm to pass the OPTMODEL statements to the runOptmodel action.

endsource;
 
    action optimization.runOptmodel / code=pgm printlevel=2; run; 
 
quit;

Optimization Log and Output Tables

The optimization log is shown below. It first describes the problem structure, including the numbers of variables and constraints both before and after the presolver step. Next you can see information about the DECOMP algorithm, such as the number of blocks (275) and the decomposition subproblem coverage (approximately 97% of variables and 95% of constraints). Finally, the iteration log shows information about each iteration until we reach the convergence criteria. The optimal total cost for our problem is $22,010.

NOTE: Active Session now MYCASSESSION.
NOTE: There were 41 rows read from table 'ASO_EMPLOYEES' in caslib 'CASUSER'.
NOTE: There were 4 rows read from table 'ASO_JOBS' in caslib 'CASUSER'.
NOTE: There were 7 rows read from table 'ASO_DEMAND' in caslib 'CASUSER'.
NOTE: There were 44 rows read from table 'ASO_DEMAND_COEF' in caslib 'CASUSER'.
NOTE: Problem generation will use 16 threads.
NOTE: The problem has 14957 variables (0 free, 210 fixed).
NOTE: The problem uses 2 implicit variables.
NOTE: The problem has 14957 binary and 0 integer variables.
NOTE: The problem has 18244 linear constraints (3411 LE, 3300 EQ, 11533 GE, 0 range).
NOTE: The problem has 62397 linear constraint coefficients.
NOTE: The problem has 0 nonlinear constraints (0 LE, 0 EQ, 0 GE, 0 range).
NOTE: The remaining solution time after problem generation and solver initialization is 299.65 seconds.
NOTE: The initial MILP heuristics are applied.
NOTE: The MILP presolver value AUTOMATIC is applied.
NOTE: The MILP presolver removed 5994 variables and 11044 constraints.
NOTE: The MILP presolver removed 31843 constraint coefficients.
NOTE: The MILP presolver added 2827 constraint coefficients.
NOTE: The MILP presolver modified 723 constraint coefficients.
NOTE: The presolved problem has 8963 variables, 7200 constraints, and 30554 constraint coefficients.
NOTE: The MILP solver is called.
NOTE: The Decomposition algorithm is used.
NOTE: The Decomposition algorithm is executing in single-machine mode.
NOTE: The DECOMP method value USER is applied.
NOTE: The problem has a decomposable structure with 275 blocks. The largest block covers 0.5556% of the constraints in the problem.
NOTE: The decomposition subproblems cover 8718 (97.27%) variables and 6817 (94.68%) constraints.
NOTE: The deterministic parallel mode is enabled.
NOTE: The Decomposition algorithm is using up to 16 threads.
      Iter         Best       Master         Best       LP       IP  CPU Real
                  Bound    Objective      Integer      Gap      Gap Time Time
NOTE: Starting phase 1.
         1       0.0000     589.0000            . 5.89e+02        .    4    3
        10       0.0000      14.5333            . 1.45e+01        .    5    4
        17       0.0000       0.0000            .    0.01%        .    6    4
        18       0.0000       0.0000            .    0.00%        .    6    4
NOTE: Starting phase 2.
        19   14510.0000   30203.5804            .  108.16%        .    7    5
        20   14510.0000   26027.3000   24617.0000   79.37%   69.66%   11    8
        26   14510.0000   23147.0000   23546.0000   59.52%   62.27%   14   10
        28   15701.9000   22998.5000   23457.5000   46.47%   49.39%   14   10
        30   15701.9000   22900.2500   23214.5000   45.84%   47.85%   14   11
        33   15701.9000   22697.5000   23133.5000   44.55%   47.33%   15   11
        35   15701.9000   22593.5000   23091.5000   43.89%   47.06%   15   11
        38   15701.9000   22541.5000   23046.5000   43.56%   46.78%   16   11
        40   15701.9000   22457.0000   22479.5000   43.02%   43.16%   18   14
        43   15701.9000   22422.5000   22472.0000   42.80%   43.12%   19   14
        46   21627.5000   22422.5000   22442.0000    3.68%    3.77%   19   14
        50   21627.5000   22415.0000   22359.5000    3.64%    3.38%   19   15
        52   21627.5000   22229.5000   22287.5000    2.78%    3.05%   20   15
        54   21627.5000   22182.5000   22251.5000    2.57%    2.89%   20   15
        59   21627.5000   22010.0000   22118.0000    1.77%    2.27%   20   15
        60   22010.0000   22010.0000   22118.0000    0.00%    0.49%   20   15
         .   22010.0000   22010.0000   22118.0000    0.00%    0.49%   21   16
NOTE: Starting branch and bound.
         Node  Active   Sols         Best         Best      Gap    CPU   Real
                                  Integer        Bound            Time   Time
            0       1     26   22118.0000   22010.0000    0.49%     21     16
            7       9     27   22091.0000   22010.0000    0.37%     44     27
            8       0     29   22010.0000   22010.0000    0.00%     44     27
NOTE: The Decomposition algorithm used 16 threads.
NOTE: The Decomposition algorithm time is 27.87 seconds.
NOTE: Optimal.
NOTE: Objective = 22010.
NOTE: The output table 'JOB_ASSIGNMENTS' in caslib 'CASUSER' has 1043 rows and 4 columns.
NOTE: The output table 'EMPLOYEE_COSTS' in caslib 'CASUSER' has 41 rows and 3 columns.
{,,,,,,,status=OK,algorithm=DECOMP,solutionStatus=OPTIMAL,objective=22010,numIterations=223,
presolveTime=1.9837138653,solutionTime=27.870479107,numSolutions=29,primalInf=4.440892E-16,
boundInf=4.440892E-16,bestBound=22010,numNodes=9,relObjGap=0,absObjGap=0,integerInf=4.440892E-16}

Figure 6 shows a portion of the output table called JOB_ASSIGNMENTS, and Figure 7 shows a portion of the output table called EMPLOYEE_COSTS. In these tables, we have everything we need to create an employee schedule for the week and to calculate the resulting wage costs. However, it is often useful to analyze the solution visually, which we do next.

Figure 6: JOB_ASSIGNMENTS output table

Figure 7: EMPLOYEE_COSTS output table

Visualizing the Output

Now let's take a look at some helpful reports to visualize the output from the optimization model. The runOptmodel action produced the tabular output tables that were shown in Figure 6 and Figure 7. We used SAS Visual Analytics to analyze these tables with user-friendly visual representations.

Employee Schedule

Figure 8 shows a heat map displaying the assignment of the full-time employees on each day of the week marked by the green boxes. We can see that all full-time employees except Morgan Sound have been given two consecutive days off. The assigned days off are a mix of those provided as input by the user for the full-time employees that requested specific days off, or a decision by the model for the full-time employees who did not request specific days off. Morgan Sound requested Wednesday and Thursday off, but the optimization model has assigned him to work overtime on Thursday, so he receives only one day off.

Figure 8: Days worked for each full-time employee

Figure 9 shows a similar heat map for the part-time employees. We can see that five part-time employees work an additional overtime day and receive only one day off, while the remaining part-time employees receive two consecutive days off. Again, the assigned days off are a mix of those provided as input by the user for the part-time employees that requested specific days off, or a decision by the model for the part-time employees who did not request specific days off.

Figure 9: Days worked for each part-time employee

The weekly schedule of the volunteers is shown in Figure 10. The number of shifts assigned varies according to the max_volunteer_days input provided for each volunteer.

Figure 10: Days worked for each volunteer

Overtime Analysis

The bar chart in Figure 11 shows the total cost for each full-time employee. The green portion of the bars represents regular wages, while the red portion represents overtime pay. We can see that Morgan Sound is receiving overtime pay for the extra shift that was assigned to him during the week. None of the other full-time employees receive overtime pay, which means that in addition to not working extra shifts, they are not assigned to work any additional hours for their regular shifts.

Figure 11: Full-time employee costs

Figure 12 shows a similar chart for the part-time employees. The five part-time employees who were assigned an additional shift are all receiving overtime pay. Similar to the full-time employees, none of the other part-time employees receive overtime pay, which means that in addition to not working extra shifts, they are not assigned to work any additional hours for their regular shifts.

Figure 12: Part-time employee costs

Summary

In this article, we showed a useful application of using the runOptmodel action in SAS Optimization to determine an optimal employee schedule at an animal shelter, and we used SAS Visual Analytics to create user-friendly reports to analyze the optimal solution. OPTMODEL separates the model from the data, and this enables the same model to be used for different problem instances simply by changing the input data tables. For example, if an employee decides to switch from full-time to part-time, or if some employees want to change their days off, or if any employees receive a pay raise, the model can easily handle the changes in the input data and provide a new optimal schedule.

Acknowledgments

The authors would like to thank Rob Pratt and Hossein Tohidi for their contributions to the model formulation.

Code

The code and data for this problem are available in GitHub at
https://github.com/sascommunities/sas-optimization-blog/tree/master/animal_shelter_workforce_optimization.

The post Workforce Scheduling at an Animal Shelter appeared first on Operations Research with SAS.

Improving Hidden Markov Models with Black-Box Optimization

Suyun Liu — Mon, 10 Aug 2020 14:20:49 +0000

Hidden Markov Models

Introduction

Statistical models of hidden Markov modeling (HMM) have become increasingly popular in the last several years. The models are very rich in mathematical structures and can form the theoretical basis of many real applications. In the classical continuous/discrete Markov process, each state corresponds to an observed (physical) event. The hidden Markov modeling focuses on the case where the observation is a probabilistic function of the hidden state. In other words, the hidden Markov model is a doubly embedded stochastic process with an underlying stochastic process hidden (unobservable) and another observable stochastic process producing the sequence of observations.

Elements of an HMM

An HMM is characterized by the following five key elements:

$K$ , the number of hidden states. Denote the set of states by $\{1, \ldots, K\}$ .
Initial state probability vector (ISPV), $\pi = \{\pi_i, i = 1, ...., K\}$ .
Transition Probability Matrix (TPM), $A =\{a_{ij}\}$ where $a_{ij} = \mathbb{P}(S_t = j| S_{t - 1} = i)$ .
$M$ , the number of distinct observation symbols per state.
$B = \{b_j(k)\}$ , The probability distribution of observation symbols for each state $j$ .

The set of model parameters is denoted by $\lambda = (A, B, \pi)$ .

The HMM Procedure/Action set

Given the sequence of observations $O$ , a basic problem of interest to be addressed for the HMM in real-world applications is: How do we adjust model parameters $(A, B, \pi)$ to maximize the probability of observing such a sequence of observations, that is, $\max~\mathbb{P}(O | \lambda)$ ? The HMM procedure in SAS Econometrics 8.5 (its corresponding CASL version is the HMM action) was developed to solve the problem, that is, to adjust model parameters to maximize the probability of the observation sequence given a model.

Regime-Switching Autoregression Model

We attempt to fit the observation sequence to the Regime-Switching model when the parameters of the data generating process (DGP) vary over a set of different unobserved states. The Regime-Switching Autoregression (RS-AR) model allows states to switch according to a Markov process and is often applied to lower frequency data (quarterly, yearly, and so on). A typical application of such a model is stock returns. Generally, the RS-AR model can be formulated as follows:

$y_t = \mu_{S_t} + x_t\alpha + z_t\beta_{S_t} + \displaystyle\sum_{i = 1}^P \phi_{i, S_t} (y_{t - i} - \mu_{S_{t - i}} - x_{t - i}\alpha + z_{t - i}\beta_{S_{t - i}}) + \epsilon_{S_t}$

where

$y_t$ : Dependent variables.
$\mu_{S_t}$ : State-dependent intercept/mean.
$z_t$ : Vector of exogenous variables with state-dependent coefficients $\beta_{S_t}$ .
$\phi_{i, S_t}$ : $i$ -th AR term in state $S_t$ .
$\epsilon_{S_t}$ : i.i.d. normal distribution $N(0, \sigma^2_{S_t})$ .
$P$ : order of regression.

Two key methods for parameter estimation include the maximum a posteriori method (MAP) and the maximum likelihood method (ML). Three optimization algorithms supported by the HMM action are the active-set algorithm, the interior-point algorithm, and the stochastic gradient descent algorithm. As the HMM solver iteratively updates and improves model parameters, the choice of initial parameters is crucial in searching for a good local optimizer.

An example: Discovering the Hidden Market States

Each RS-AR model is characterized by the number of states $k$ and the order of regression $p$ . Given a fixed pair $(k, p)$ , we can use the HMM action set to estimate the model parameters. The sequence of observations used in the numerical experiments is called vwmi and is extracted from the CRSP data. In the example, we tested 48 models with $k = 2, \ldots, 9$ and $p = 0,\ldots,5$ . In order to select a good initial point for the iterative procedure, the multistart mode is enabled. The following CASL script shows how to call the HMM solver and get estimation results for all 48 AR models.

 
%macro estimateRSAR(myds, inEstDs, kStart, kEnd, pStart, pEnd, method, maxiter, qMultiStart);
   proc cas;
      hiddenMarkovModel.hmm result=r/
         data = {caslib='casuser', name="&myds."},
         id={time='date'},
         outstat={name="&myds.&method.Stat_k&kStart.To&kEnd._p&pStart.To&pEnd.", caslib="casuser", replace=true},
         model={depvars={'returnw'}, method="&method.", nState=&kStart., nStateTo=&kEnd., ylag=&pStart., yLagTo=&pEnd., type = 'AR'},
         optimize = {algorithm='interiorpoint', printLevel=3, printIterFreq=1, maxiter=&maxiter., Multistart = &qMultiStart.},
         score = {outmodel={name = "&myds.&method.Model_k&kStart.To&kEnd._p&pStart.To&pEnd.", caslib="casuser", replace=true}},
         learn = {out={name = "&myds.&method.Learn_k&kStart.To&kEnd._p&pStart.To&pEnd.", caslib="casuser", replace=true} 
                  %if %length(&inEstDs.)>0 %then %do; , in={name = "&inEstDs.", caslib="casuser"} %end;},
         labelSwitch={sort="NONE"},
         display = {names = {"Optimization.Algorithm", "ModelInfo", "ParameterMatrices.TPM", "Optimization.FinalParameterEstimates", "Optimization.IterHistory", 
                             "FitStatistics", "Optimization.InitialObjectiveFunction", "Optimization.FinalObjectiveFunction"}};
        print r;
   run; quit;
   CAS mysess listhistory;
%mend;

/*use MAP and enable multistart for p = 0 models to obtain initial parameter estimations*/
%estimateRSAR(myds=vwmi, inEstDs=, kStart=2, kEnd=9, pStart=0, pEnd=0, method=MAP, maxiter=128, qMultiStart=1);

/*use ML and disable multistart for p = 0 to 5 models to get final paramter estimations*/
%estimateRSAR(myds=vwmi, inEstDs=vwmiMAPLEARN_K2TO9_P0TO0, kStart=2, kEnd=9, pStart=0, pEnd=5, method=ML, maxiter=128, qMultiStart=0);

For AR models with $p = 0$ , the common practice is first to take MAP as the estimation method and enable multistart mode to get an initial parameter estimate. Then we take the output of the MAP method to hot-start the ML method, using output parameter estimates as the initial values and calling the HMM procedure the second time using the ML method. For the other models of $p \ge 1$ , the HMM solver automatically takes outputs from the corresponding AR ( $p - 1$ ) model as initial values.

To compare the quality of different statistical models, we could measure the Akaike information criterion (AIC) value, which is basically an estimator of out-of-sample prediction error. Therefore, the minimal AIC value in the following table identifies the best model ( $k = 7, p = 2$ ) among the 48 AR models. Note that the multistart mode of the HMM action takes around 17 hours to get the full AIC table as it currently supports only single-machine mode. Refer to Example 13.1 in the HMM documentation for a more detailed interpretation.

Black-Box optimization solver

The black-box optimization solver has rich applications in hyperparameter tuning whose objective functions are usually nonsmooth, discontinuous, and unpredictably varying in computational expense. The hyperparameters we want to tune might be mixtures of continuous, categorical, and integer parameters. The solveBlackbox action was developed based on an automated parallel derivative-free optimization framework called Autotune. Autotune combines a number of specialized sampling and search methods that are very efficient in tuning complex models like in machine learning. The Autotune framework is illustrated in Fig.2. Autotune has the ability to simultaneously apply multiple instances of global and local search algorithms in parallel, among which a global algorithm is first applied to determine a good starting point to initialize a local algorithm. An extended suite of search methods is driven by the Hybrid Solver Manager that controls concurrent execution of different search methods. One can easily add new search methods to the framework. Overall, search methods can learn from each other, discover new opportunities, and increase the robustness of the system.

The solveBlackbox action based on the Autotune framework takes the Genetic Algorithm (GA) as the global solver and Generating Set Search (GSS) to perform local search in a neighborhood of selected members from the current GA population. It is notable that the solveBlackbox action can handle more than one objective and support both linear and nonlinear constraints. Furthermore, it can work well in either single-machine mode or distributed mode.

Due to the existence of many local optima, the choice of starting points (called initial values) in HMM has a significant impact on the quality of the solution. The usage of multistart mode helps identify a better initial point and hence better local optimum at the expense of time. The black-box optimization solver is a good alternative to the multistart approach due to the following key properties:

The HMM objective function is a black-box function of the initial values/parameters. Most importantly, we do not need to have an explicit mathematical form of such an objective function. It might be smooth or nonsmooth, continuous or discontinuous, very expensive in function value evaluations, and so on.
The hyperparameters considered in the black-box optimization solver could be continuous (for example, the initial values), integer (for example, the number of states and the order of regression), as well as categorical (for example, type of methods and algorithms).
The black-box optimization solver supports parallel function evaluations automatically, which parallelizes multiple HMM procedures starting from different initial points.

Numerical Experiments

We then take initial values as the tuning hyperparameters. Specifically, the set of hyperparameters to be tuned includes three sets of variables: 1) Transition probability matrix of hidden states, where each row sums up to one and all elements should be nonnegative; 2) State-dependent intercepts, which can be any real numbers; 3) Covariance matrix or variance, which should be positive semidefinite or nonnegative.

Three configurations in solveBlackbox are tested for tuning hyperparameters for two data sets (vwmi and jpusstock).

Blackbox-MAP: Tuning models with $p = 0$ using the MAP estimation method.
Blackbox-ML: Tuning models with $p = 0$ using the ML estimation method.
Blackbox-ML-All: Tuning each model (any $k$ , $p$ ) using the ML estimation method.

The following example code demonstrates how we use the solveBlackbox action to tune HMM initial values.

%macro TuneHmm(ns, nvars, pStart, maxgeneration, method, maxiter);
proc cas noqueue;
   /*use an Initial function to load data to servers*/
   source casInit;
      loadTable / caslib='casuser', path='vwmi.sashdat', casout='vwmi';
   endsource;

   /*specify blackbox objective function*/
   source caslEval;
      hiddenMarkovModel.hmm result=r /
         data = "vwmi",
         id={time='date'},
         model={depvars={'returnw'}, method="&method.", nState=&ns., ylag=&pStart., type = 'AR'},
         initial={'TPM={' || (String) x1 || ' ' || (String) x2 || ', ' || (String) x3 || ' ' || (String) x4 || '}',
                  'MU={'|| (String) x5 || ', ' || (String) x6 || '}',
                  'SIGMA={' || (String) x7 || ', ' || (String) x8 || '}'}, /*the initial value list is different for different nState*/
         optimize = {algorithm='interiorpoint', printLevel=3, printIterFreq=1, maxiter=&maxiter., Multistart = 0},
         score= {outmodel={name = "&ds.&method.Model_k&ns.p&pStart._" || (String) _bbEvalTag_ || "_&nworkers.", caslib="casuser", replace=true}},
         learn= {out={name = "&ds.&method.Learn_k&ns.p&pStart._" || (String) _bbEvalTag_ || "_&nworkers.", caslib="casuser", promote=true}},
         labelSwitch={sort="NONE"};
      f['obj1'] = r.FitStatistics[1, 2];  /*read log likelihood value from FitStatistics table*/
      send_response(f);
   endsource;

   /* Invoke the solveBlackbox action */
   optimization.solveBlackbox result=blackr/
      decVars = myvars, /*define tuning variables as the initial values of hmm*/
      obj = {{name='obj1', type='max'}},
      maxGen= &maxgeneration., /* by default 10 */
      popSize=15, /* by default 40 */
      maxTime = 3600,
      linCon={name = "lindata", caslib = "casuser"}, /*read coefficients of linear constraints from lindata*/
      func = {init=casInit, eval=caslEval},
      nParallel=30; /* specify the number of parallel sessions*/
run; quit;
%mend;

/* call the tuning process for any k, p */
%TuneHmm(ns = 2, nvars = 8, pStart = 0, maxgeneration = 5, method = MAP, maxiter = 128); /*Blackbox-MAP*/
%TuneHmm(ns = 2, nvars = 8, pStart = 0, maxgeneration = 1, method = ML, maxiter = 128); /*Blackbox-ML and Blackbox-ML-All*/

Tuning RS-AR models with $p = 0$

Instead of using multistart for $p = 0$ models, we use the solveBlackbox action to tune $p = 0$ models using MAP and ML, denoted Blackbox-MAP and Blackbox-ML, respectively. Then we run the second-stage optimization using ML as the multistart did to obtain the full AIC table.

First data set vwmi: For this vwmi data set, we considered values of $k$ ranging from 2 to 9, and values of $p$ from 0 to 5. The following histogram shows the comparison of AIC values for Multistart, Blackbox-MAP, and Blackbox-ML. The 48 models are ordered by $p$ and $k$ in ascending order. The model with $k = 2$ and $p = 0$ is indexed by 1, the model with $k = 2$ and $p = 1$ is indexed by 2, and so on.The results tell us the black-box optimization solver does help identify a better local optimum among half of the 48 models, especially when the model size is large. Compared with multistart mode, Blackbox-MAP is 76% faster and Blackbox-ML is 91% faster. Overall, Blackbox-MAP performs slightly better than Blackbox-ML in terms of solution quality, while Blackbox-ML is more computationally efficient than Blackbox-MAP.

Second data set jpusstock: This data set contains two sets of historical weekly returns and results in RS-AR models of two dependent variables. Considering AR models with $k = 2, \ldots, 10$ and $p = 0 ,\ldots,6$ , some of the resulting AIC values of Multistart, Blackbox-MAP, and Blackbox-ML are presented below. Except for two models ( $k = 9$ , $p = 3$ and $4$ ), Blackbox-MAP and Blackbox-ML lead to a solution that is at least as good as Multistart, with significant improvements in large models. Furthermore, both tuning configurations are 90% faster than the Multistart method.

Note that among 63 models, the Multistart method is able to help find a reasonable parameter estimation for 39 models, while Blackbox-MAP and Blackbox-ML work for 48 and 50 models, respectively. These results indicate that using the output for $p = 0$ from these methods does not necessarily work well in the second-stage optimization. A possible good solution for this case is explored in the next experiments, that is, tuning each model directly using ML.

Tuning RS-AR models with any $k$ , $p$

Due to the ill-posed objective functions of HMM, we observed that the two-stage strategy based on results from Multistart, Blackbox-MAP, and Blackbox-ML sometimes results in unrealistic parameter estimations, especially for $p > 0$ models. For instance, the eigenvalues of the covariance matrix in one of the hidden states are very close to zero when $k = 8$ . This motivates us to try to find a good initial point for any model directly. Therefore, we further consider tuning each model directly using ML, denoted Blackbox-ML-All. The resulting AIC values from this strategy are compared with Multistart and Blackbox-MAP for both data sets.

We can see from the above results that Blackbox-ML-All further improves the solution quality for most of the models, though the time of Blackbox-ML-All is around six times slower than Blackbox-ML. Despite Blackbox-ML-All being slower than Blackbox-ML, it provides higher quality solutions and is still 50% faster than Multistart in finding the best model and the corresponding parameter estimation. More importantly, for the second data set, Blackbox-ML-All is able to produce good final parameter estimations for all 63 models. Therefore, we believe that Blackbox-ML-All is a good alternative to Multistart mode whenever Multistart results in bad estimations of parameters.

Summary

The multistart mode supported by the HMM procedure and the HMM action can only run in a single machine, leading to high time cost. The numerical test for the data set jpusstock revealed that multistart is not reliable to find a good estimation of parameters. Given that the solveBlackbox action supports parallel function evaluations and applies to any type of black-box functions and hyperparameters, we frame the solveBlackbox action to tune the initial values for the HMM. Exploring three different black-box tuning configurations, we are able to improve the solutions of the HMM models in terms of both solution quality and computational cost. In conclusion, the black-box optimization solver is a good alternative to identify good initial parameter estimates in the HMMs.

References

[1] Patrick Koch, Oleg Golovidov, Steven Gardner, Brett Wujek, Joshua Griffin, and Yan Xu. 2018. Autotune: A Derivative-free Optimization Framework for Hyperparameter Tuning. In KDD. 443--452.

[2] S. Gardner et al., "Constrained Multi-Objective Optimization for Automated Machine Learning," 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Washington, DC, USA, 2019, pp. 364-373, doi: 10.1109/DSAA.2019.00051.

[3] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proc. IEEE, vol. 77, pp. 257-286, Feb. 1989.

[4] Kim, C.-J., C. R. Nelson, and R. Startz. 1998. Testing for mean reversion in heteroskedastic data based on Gibbs-sampling-augmented randomization. Journal of Empirical Finance 5: 115-143.

[5] https://cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_hmm.pdf

The post Improving Hidden Markov Models with Black-Box Optimization appeared first on Operations Research with SAS.

SAS® Optimization Web App Using REST APIs with Embedded SAS® Visual Analytics Reports

Nabaruna Karmakar — Wed, 11 Mar 2020 14:26:59 +0000

Mathematical optimization can help business leaders make better decisions in every aspect of their business. After a model has been built, end users are usually interested in doing some sort of scenario analysis to test its robustness and visualizing key performance metrics. SAS has various products that can work with each other to provide a holistic view of the analytical problem. Using SAS Optimization, we can solve myriads of complicated real-world problems, while SAS Visual Analytics allows us to interactively visualize the optimal solution.

In this article, I will provide a demo of a web app that connects SAS Visual Analytics and other SAS® Viya services through REST APIs. The web interface provides an interactive platform to demonstrate small-scale MILP (Mixed Integer Linear Programming) models to non-technical audiences and the ability to perform some scenario analysis by changing certain parameters.The purpose of this demo tool is to enable users with little to no knowledge of SAS programming to invoke the power of SAS Optimization by simply clicking a few buttons.

Since this is a web application project, instead of describing all components of the project I will provide only the sample code of the optimization model, but you may clone my GitHub repository to recreate this project.

High-level Architecture

Server-side library - restaf-server

Everyone with moderate to good front-end coding experience knows the necessity of finding good JavaScript libraries for developing dynamic web applications. The server-side library is restaf-server, which authenticates the connection to a SAS Viya server. It serves up the entry point to the app or the first page and provides the session context to the restaf library.

Front-end library - ReactJS

The restaf library is UI framework agnostic, which means it can be seamlessly integrated into any front-end framework like Angular, React, etc. It provides a user-friendly way to make REST API calls to SAS Viya. A very simple web application showing how to run a small optimization problem and fetch a SAS Visual Analytics report to visualize output using restaf can be found here. In this blog we use the concepts from the simple example to build a much richer application. The UI framework used in this project is ReactJS due to its component-based rendering. ReactJS is easily regarded as the UI framework of choice for most front-end developers and is reportedly being used by all the big companies like Facebook, Uber, Netflix, etc. In fact, it was started by a software engineer at Facebook and has steadily gained popularity. The styling of the web page is done using MaterialUI, which includes styling of the sidebar and the fonts. Some other libraries are also used in this web application, notably, react-bootstrap-table. This library helped me in creating the editable table which ultimately lets the user change input parameters and perform scenario analysis. When you look at the video below, it may seem like a multi-page app. However, it is a misnomer. This app is in fact a single-page app and I am rendering different React components onto the same HTML element. Pretty cool, huh? I am using React Router, another JavaScript library, to enable the routing between the different components.

See below a flowchart of the underlying technologies:

Demo workflow

In this section, I will describe in words the workflow of the Facility Location demo, walk you through each of the interactive components, and demonstrate how the user can do a scenario analysis as well.

But first, check out the video on our YouTube channel for this demo!

Problem Description

Now, let me give you a quick overview of the optimization problem I am trying to demonstrate using the web app. It is a facility location problem for a car manufacturing company. The company has 25 existing facilities, each of which is eligible to produce up to 5 types of cars. The problem is to optimally assign cars to be manufactured at the eligible facilities. A variable cost is incurred per car manufactured at a facility, while a fixed cost is a one-time operating cost incurred to keep a facility open or closed. Each facility can manufacture at most one car. Capacity and demand are also provided in terms of numbers of cars.

Analyze Input Data

On mounting this component, a SAS Visual Analytics report created by the user on the server to analyze the input data is rendered onto the page. In this report, we can interactively see which facilities are eligible to produce a particular type of car. The bubbles on the map show the different locations of the existing facilities. By clicking on each of the bars, we can filter out the facilities for that type of car. The color of each bubble represents the variable cost for producing a car at that facility and the size of the bubble shows the fixed cost incurred for keeping the facility open. The goal is to select the smallest green bubble for each of the cars.

Input Optimization Parameters

In this component, the user is able to set a few parameters and run the optimization code based on these parameters. They can select the objective function, which can be to minimize fixed costs, variable costs, or total costs.They may also change the demand for each type of car and/or assign a fixed facility to manufacture it using the editable table. Hitting the 'Optimize' button, a series of promise-based API calls is submitted to the Viya server. First, the updated table from the client-side is passed to the back-end as a JSON object and a table.upload action is used to upload the table to the user's own caslib. Then a string of CASL statements are submitted using the runCASL action, which includes code for the optimization model. Finally, the results are retrieved from the Viya server using REST APIs and the user is redirected to the output report.

Optimization code

The MILP model is coded as a JavaScript string and passed later on into some restAF functions. Anyone with some OPTMODEL experience will easily be able to recognize the syntax in the following code block. It is written as a JavaScript function, which takes the objective type and scenario name as input from the form on the web page. This piece of code, along with a few data pre-processing and post-processing steps, is submitted purely as a string of CASL statements and the 'RunCASL' action is executed using REST APIs.

  function optCode(ObjType, appEnv, scenario) {
  let pgm = `
  set  PRODUCTS;
  set  FACILITIES init {};
  set  FIXED;
 
  num demand {PRODUCTS};
  num fixed_cost {FACILITIES};
  num close_cost {FACILITIES};
  num x{FACILITIES};
  num y{FACILITIES};
  num var_cost {PRODUCTS, FACILITIES};
  num viable_flg {PRODUCTS, FACILITIES};
  num Obj_Type=${ObjType};
  num capacity {FACILITIES};
  str soln_type init '${scenario}';
 
  read data demandTable into PRODUCTS=[Product_Name] demand;
  read data demandTable  (where = (facility_name ne 'None')) into FIXED=[Product_Name facility_name];
  read data siteTable into FACILITIES=[facility_name] close_cost fixed_cost x y capacity;
  read data costTable into [Product_Name facility_name] var_cost viable_flg;
 
  var Assign {PRODUCTS, FACILITIES} binary;
  var Units {PRODUCTS, FACILITIES} >=0;
  var Close {FACILITIES} binary;
 
  min VarCost
    = sum {i in PRODUCTS, j in FACILITIES} var_cost[i,j]*Units[i,j];
 
  min FixedCost
    = sum {j in FACILITIES} (fixed_cost[j]*(1-Close[j]) + close_cost[j]*Close[j]);
 
  min TotalCost
    = VarCost + FixedCost;
 
  /* Constraints */
 
  /* PRODUCTS demand needs to be satisfied */
  con Demand_Con {i in PRODUCTS}:
    sum {j in FACILITIES} Units[i,j] >= demand[i];
 
  /* each facility have capacity constraints */
  con Capacity_Con2 {j in FACILITIES}:
    sum {i in PRODUCTS} Units[i,j] <= capacity[j]*(1-Close[j]);
 
  /* if operation i assigned to site j, then facility must not be closed at j */
  con If_Used_Then_Not_Closed {j in FACILITIES}:
    sum {i in PRODUCTS} Assign[i,j] = (1-Close[j]);
 
  /*not viable assignemnts*/
  con Viable_Assignemnts {i in PRODUCTS, j in FACILITIES}:
    Assign[i,j] <= viable_flg[i,j];
 
  con Viable_Num_Assignemnts {i in PRODUCTS, j in FACILITIES}:
    Units[i,j] <= capacity[j]*Assign[i,j];
 
  /*Fixing facility - what-if analysis*/
  for{ in FIXED} fix Units[i,j]=min(demand[i], capacity[j]);
 
  /* solve the MILP */
  if Obj_Type=1 then do;
    solve obj VarCost with milp;
  end;
  else if Obj_Type=2 then do;
    solve obj FixedCost with milp;
  end;
  else if Obj_Type=3 then do;
    solve obj TotalCost with milp;
  end;
 
  /* clean up the solution */
  for {i in PRODUCTS, j in FACILITIES} Assign[i,j] = round(Assign[i,j]);
  for {j in FACILITIES} Close[j] = round(Close[j]);
 
  num siteTotalCost {j in FACILITIES}=
    sum{i in PRODUCTS} var_cost[i,j] * Units[i,j].sol + (fixed_cost[j]*(1-Close[j].sol) + close_cost[j]*Close[j].sol);
 
  num variableCosts{i in PRODUCTS, j in FACILITIES}=
    var_cost[i,j]*Units[i,j].sol;
 
  /* create output */
  create data optimal_cost from
    [product site]={i in PRODUCTS, j in FACILITIES: Assign[i,j] = 1}
    var_cost variableCosts Assign.sol Units.sol;
 
  create data optimal_facilities from
    [site]={j in FACILITIES}
    x y Close.sol siteTotalCost soln_type capacity;
  `;
  return pgm;
}
 
export default optCode;

Analyze Output Report

Finally, using React Router I am directly routing the user to the output SAS Visual Analytics report. This report helps visualize the output of the optimization model for each scenario, by showing which facilities to close and which ones to keep open. It also shows the assignment of the products to the facilities, along with the total cost and the variable cost for each scenario. The tree map on the left shows the different products, and the size of each rectangle represents the number of units of each product that is produced. It is connected to the geographic map on the right and enables the user to filter out the facilities for each product. In the next section, I will describe how we can create new scenarios and compare them side by side.

Scenario Analysis

To create a new scenario, the user needs to go back to the 'Input Optimization Parameters' component using the burger menu. A new scenario may be created and the user can either change the number of units required for a product or fix a facility for a product. For the example in the video, I increased the number of units required for Sedan from 200 to 400 and fixed its assignment to Site2. When we analyze the output solutions in the SAS Visual Analytics report, we can see that the total and variable costs for the new scenario increase. The 'Solutions Comparison' tab in the report enables the user to visualize the decisions made by each solution side-by-side. We can see that the variable cost for Sedan has increased due to the increase in number of units. The new solution also requires Site2 to be open, while the optimal solution recommended that it be closed. Similarly, the user can create many more scenarios by giving each scenario a unique name and analyze them in the output report.

What's Next?

Now, that I've shown you how easy it is to combine various SAS products using REST APIs, you can clone my repository and try building one for yourself. The code is sufficiently commented to enable users with some front-end experience to customize it for a different optimization problem. Happy Coding!

The post SAS® Optimization Web App Using REST APIs with Embedded SAS® Visual Analytics Reports appeared first on Operations Research with SAS.

Batter-pitcher matchup analysis in the 2019 Major League Baseball playoffs

Brandon Reese — Thu, 24 Oct 2019 18:48:28 +0000

After a marathon of a season, 162 games in each team's schedule to be precise, the stakes for Major League Baseball are higher in October, and postseason play is underway. Whether it's the renewal of an old rivalry, redemption for last year's runners up, or rooting for this season's breakout rookie sensation, baseball fans have plenty to cheer for each fall. In addition to being a celebrated American pastime, baseball is a sport whose structure lends itself to many flavors of statistical analysis.

This article demonstrates how you can use SAS to quantify past years' batter versus pitcher statistics. You can use these historical trends to highlight the matchups to watch out for this postseason. The article also lays the groundwork for solving a lineup optimization problem. This problem is of great interest to fantasy buffs, hobbyists, as well as actual MLB managers!

The data source

One abundant source of Major League Baseball data is the event files archived by season at Retrosheet.org. The code below assumes that the files for each year of interest YYYY have been downloaded and extracted into a subdirectory called YYYY. Each event file (one per team, per year) contains game event logs for all home games of a particular team in a particular season. Events recorded per game include information about starting rosters, substitutions, umpires, and game conditions. In addition, they keep a record of every play-by-play event in the game.

Parsing complex event files

Because of the intricacies of baseball score keeping, a detailed notation is used to describe what happened in a baseball play. For the purpose of this blog, however, you can greatly simplify the game data by simply determining whether a particular at bat ended up with a hit or with no hit. Using a regular expression, you can parse this result string to determine which at bats resulted in a hit, and which did not.

Note that many events are neither positively nor negatively considered when computing batting average -- walks and hit-by-pitch, for example -- these events are ignored for this analysis.

To do this, you can parse the raw data as follows:

As you read through the file

keep track of the current pitcher for the home and away teams (see /* 1 */ and /* 2 */ below)

do this by checking for a "start" or "sub" event where the position is listed as a pitcher (position code=1)
store the pitcher id in the appropriate variable

identify which plays result in a hit or no hit (see /* 3 */ through /* 6 */ below)

do this by parsing the result string that corresponds to each play

record the batter, pitcher, and outcome of each applicable play as an observation in the output data set (see /* 7 */ below)

%macro loadYears(YEAR_START, YEAR_END, DATASET, LEAGUE=MLB, DIR=C:/data/baseball);
/** Specify N or NL to import only National League home team data files **/
%if &LEAGUE = NL %then %let LEAGUE=N;
/** Specify A or AL to import only American League home team data files **/
%else %if &LEAGUE = AL %then %let LEAGUE=A;
%else %let LEAGUE=*;
 
data fileList;
   length filename $ 100;
   do YR=&YEAR_START to &YEAR_END;
      filename=CATS("&DIR/",YR,"/",YR,"*.EV&LEAGUE");
      output;
   end;
   drop YR;
run;
 
data &DATASET;
   length pitcher $8 batter $8 outcome 4 event $30;
   if _N_ = 1 then do;
      /** Prefixes that indicate a hit **/
      hitPrefix0 = "S\d"; /* a single */
      hitPrefix1 = "D\d"; /* a double */
      hitPrefix2 = "DGR"; /* a ground rule double */
      hitPrefix3 = "T\d"; /* a triple */
      hitPrefix4 = "HR";  /* a home run */
 
      /** Prefixes that indicate no hit **/
      noHitPrefix0 = "[1-9]"; /* an out in the field */
      noHitPrefix1 = "FC";    /* a fielder's choice */
      noHitPrefix2 = "E";     /* a fielding error */
      noHitPrefix3 = "K";     /* a strikeout */
 
      /** Build a regular expressions that check the prefixes **/
      expHit   =  "/(" || hitPrefix0 || ")|(" || hitPrefix1 || ")|(" || hitPrefix2 || ")|(" || hitPrefix3 || ")|(" || hitPrefix4 || ")/";
      expNoHit =  "/(" || noHitPrefix0 || ")|(" || noHitPrefix1 || ")|(" || noHitPrefix2 || ")|(" || noHitPrefix3 || ")/";
      retain reHit;   /* regular expression matching a hit event */
      retain reNoHit; /* regular expression matching a no-hit event */
      reHit = prxparse(expHit);
      reNoHit = prxparse(expNoHit);
 
      /** Error checking will tell if the regular expression is invalid **/
      if missing(reHit) then do;
         putlog "ERROR: Invalid expression " expHit;
         stop;
      end;  
      if missing(reNoHit) then do;
         putlog "ERROR: Invalid expression " expNoHit;
         stop;
      end;  
   end;
 
   set fileList;
   infile dummy fileVar=filename end=done DLM=',' DSD MISSOVER;
   do while(^done);
      format pitcherHome $10. pitcherAway $10. pitcher $10.;
 
      /** need the current home and away pitchers to carry over between observations **/
      retain pitcherHome pitcherAway;
      input eventType $ col1 $ col2 $ col3 $ col4 $ col5 $ col6 $;
 
      /** 
       Events of interest are given in the formats:
       start, playerID, playerName,isHomeTeam,battingOrder,positionCode
       sub, playerID, playerName,isHomeTeam,battingOrder,positionCode
       play, inningNumber, batterIsHomeTeam, batterPlayerID,ballStrikeCount,pitchResultSequence,playResult
      **/
      if eventType in ("start", "sub") and col3 = "0" and col5 = "1" then do;                                     /* 1 */
         /** pitcher entering for away team **/
         pitcherAway = col1;
      end;
      else if eventType in ("start", "sub") and col3 = "1" and col5 = "1" then do;                                /* 2 */
         /** pitcher entering for home team **/
         pitcherHome = col1;
      end;
      else if eventType="play" then do;                                                                           /* 3 */
         batter=col3;
 
         if col2="0" then pitcher=pitcherHome;                                                                    /* 4 */
         else pitcher=pitcherAway;
 
         event = col6;
         if prxmatch(reHit, event) = 1 then do                                                                    /* 5 */
            outcome = 1;
            output;
         end;
         else if prxmatch(reNoHit, event) = 1 then do;                                                            /* 6 */
            outcome = 0;
            output;
         end;
      end;
   end;
   keep batter pitcher outcome event;                                                                             /* 7 */
run;
 
%mend loadYears;

Aggregating the data

After producing a data set of hit/no-hit outcomes, you can aggregate these data to produce one observation per batter-pitcher pair. For example, you can aggregate the number of at bats "ab" and the batting average "avg". To limit the analysis to players in the 2019 playoffs, here is a roster data file. You can use this file to create a data set containing the 25-player active roster of each playoff team:

data playoffRosters;
   infile "rosters.txt";
   input playerId $ team $;
run;

Next, the following macro aggregates batter-pitcher matchup data for a particular list of MLB players.

%macro dataPrep(outcomes, bvp, playerList);
%global nBatters nPitchers;
 
/* Aggregate per batter-pitcher pair */
proc sql;
   create table &bvp.Tmp as
   select batter, pitcher, sum(outcome)/count(outcome) as avg, count(outcome) as ab
   from &outcomes
   group by batter, pitcher;
quit;
 
/* Aggregate stats per batter and per pitcher. Keep only batters and pitchers in playerList. */
proc sql;
   create table battersTmp as
   select a.batter as batter,
          SUM(a.ab) as ab,
          SUM(a.avg*a.ab)/SUM(a.ab) as avg
   from &bvp.Tmp as a
   join &playerList as b
   on a.batter EQ b.playerid
   group by batter
   order by ab desc
   ;
quit;
proc sql;
   create table pitchersTmp as
   select a.pitcher,
          SUM(a.ab) as ab,
          SUM(a.avg*a.ab)/SUM(a.ab) as avg
   from &bvp.Tmp as a
   join &playerList as b
   on a.pitcher EQ b.playerid
   group by pitcher
   order by ab desc
   ;
quit;
 
/* Join together */
proc sql;
   create table &bvp as
   select a.*,
          b.ab as batter_ab,
          b.avg as batter_avg,
          c.ab as pitcher_ab,
          c.avg as pitcher_avg
   from &bvp.Tmp a
   join battersTmp b
   on a.batter = b.batter
   join pitchersTmp c
   on a.pitcher = c.pitcher
   ;
   select count (distinct batter) into :nBatters
   from &bvp;
   select count (distinct pitcher) into :nPitchers
   from &bvp;
quit;
 
proc sql;
   drop table battersTmp;
   drop table pitchersTmp;
   drop table &bvp.Tmp;
quit;
%mend dataPrep;

Taking a quick look into the data

Now, for the analysis: you can invoke the macros

%let LEAGUE=MLB;
%loadYears(2010, 2018, outcomes, LEAGUE=&LEAGUE);
%dataPrep(outcomes, bvp, playoffRosters);

and inspect the data

proc sort data=bvp;
   by descending ab;
run;
proc print data=bvp (obs=3); run;

This indicates that the batter-pitcher pairs with the most at bats from 2010 to 2018 are Michael Brantley (branm003) against Justin Verlander (verlj001), Asdrubal Cabrera (cabra002) against Max Scherzer (schem001), and Nick Markakis (markn001) against Max Scherzer (schem001).

Next, you can use the following statements to see all the historical batting averages of the key batter-pitcher matchups. To avoid noise from tiny sample sets, you can filter out matchups with fewer than 10 at bats. In the generated heat map visualization, red corresponds to higher batting average and blue corresponds to lower batting average. Since the National League and American League teams play each other infrequently, it is helpful to show them one at a time. Here are the National League matchups:

%let LEAGUE=NL;
%loadYears(2010, 2018, outcomes, LEAGUE=&LEAGUE);
%dataPrep(outcomes, bvp, playoffRosters);
proc sgplot data=bvp(where=(ab GE 10)) noautolegend;
   title "&LEAGUE Matchup Batting Averages 2010-2018";
   heatmap y=batter x=pitcher / weight=avg
      colormodel=(blue lightBlue pink red) outline x2axis;
   xaxis display=none;
   x2axis display=ALL;
   yaxis display=ALL reverse;
run;

and the American League matchups:

%let LEAGUE=AL;
%loadYears(2010, 2018, outcomes, LEAGUE=&LEAGUE);
%dataPrep(outcomes, bvp, playoffRosters);
proc sgplot data=bvp(where=(ab GE 10)) noautolegend;
   title "&LEAGUE Matchup Batting Averages 2010-2018";
   heatmap y=batter x=pitcher / weight=avg
      colormodel=(blue lightBlue pink red) outline x2axis;
   xaxis display=none;
   x2axis display=ALL;
   yaxis display=ALL reverse;
run;

Noteworthy matchups from the heat maps include:

Anthony Rendon (renda001) has a .583 batting average in 12 at bats against Wade Miley (milew001)
Paul Goldschmidt (goldp001) has a .048 batting average in 21 at bats against Pedro Baez (baezp001)
Alex Bregman (brega001) has a .500 batting average in 10 at bats against Blake Snell (snelb001)
Yuli Gurriel (gurry001) has a .136 batting average in 10 at bats against James Paxton (paxtj001)

So, how have these actually panned out? Alex Bregman faced Blake Snell in game 2 of the ALDS. Bregman had 1 hit in 2 at bats including a home run off Snell! Goldschmidt never got to face Baez, as the Dodgers didn't make it out of the NLDS. Yuli Gurriel was hitless in 4 at bats against James Paxton in the ALCS. Anthony Rendon won't get to face Wade Miley since the Astros have not included Miley on the World Series roster.

Coming up next

Baseball managers have often criticized batter-pitcher matchup data because of their poor predictive power. This is typically due to very small sample sizes. In a future post, I will analyze this data using collaborative filtering, a machine learning technique that will help address the problem of small sample sizes. From there, I will explore ways to optimally choose a lineup of batters to face a particular pitcher. On an introductory level, this series of posts will demonstrate an end-to-end approach to lineup optimization.

The post Batter-pitcher matchup analysis in the 2019 Major League Baseball playoffs appeared first on Operations Research with SAS.

Bringing Analytics to the Soccer Transfer Season

Sertalp B. Cay — Tue, 18 Jun 2019 14:58:58 +0000

Major European football (soccer) leagues came to an end after an intensive year. Manchester City claimed the Premier League title after a high-intensity title race against Liverpool. It is unbelievable that Liverpool would have won 25 out of the last 27 PL titles with 97 points. Luckily, the Reds deserved and won the Champions League, and redeemed themselves after losing last year's finale.

Football is over on the pitch for the season, but the battle for the transfer season is beginning now. Money shapes major football leagues all over the world despite rare successes by lower-budget teams such as Leicester City in 2016. Teams across Europe are changing the outcome of their domestic leagues with massive transfer budgets. Transfer season is the time when teams are shaping their potential for the next season for sure. Sports analytics is often used to analyze teams on the pitch, but it is possible to bring it to the transfer season also. We have a chance to analyze the upcoming transfer season using mathematical optimization and the capabilities of SAS Viya.

Analytics for Transfer Season

In football, players can move between clubs during the transfer season. If they are out of contract, clubs can acquire them and sign a contract. Otherwise, the current contract needs to be terminated before any transfer. In this case, the purchasing team pays an amount called the transfer fee.

"How should we allocate our transfer budget to maximize the benefit we gain?" This is the ultimate question that every team needs to answer. (Teams often try to answer "Which player should we get to make our fans happy?", but no one truly knows what could make fans happy.) Maximizing benefit under a limited resource is known as the Knapsack problem in combinatorial optimization. Given a set of items and their values, the Knapsack problem is to find the optimal selection of items to pack within a weight limit to maximize the total value. We can ask a similar question here: given a set of players, their values and ratings, how to choose which players to transfer to maximize total team rating within a budget limit.

Even though writing a detailed mathematical model of the problem is challenging, I will show how a simple model can be written to benefit from the capabilities of optimization. Before we dive any further, note that we are solving a simplified problem under the following assumptions to make things easier:

We consider only the starting lineup to measure team ratings
Teams can transfer any player as long as their current value is paid
We only focus on acquiring players, not selling them
Teams use the same formation for the next year
Players can be played only at the positions they are listed in the data set

Data

One of the most challenging stages of any analytical problem is to obtain clean data. At this point, we are lucky to have a great web resource: sofifa.com. SoFIFA has more data than we need for this problem. By using parallel web requests, we managed to create a database of 12,000 players sorted by their overall rating. The web scraper is available on GitHub and the data are available as a CSV file.

As an important side note, since these models are being run on data based on the football game FIFA, not on real player metrics, they are a better reflection of the players in the computer game, not the players in real life. However, these same concepts can be applied to real player data if you have access to it.

Model

Our aim is to maximize the sum of player ratings in the starting lineup of teams. We will solve the problem separately for each team. For each position, we filter the list of players who have a better rating than what the team currently has. Then, the increase in the total rating is used to measure the performance of the transfer for the team.

Let us define $\textrm{P}$ as the set of all players, $\textrm{S}$ as the set of team positions, and $\textrm{E}$ as the set of player-position pairs. The following parameters are used to define the problem:

$\bar{R}_j$ : Current rating of the player at position $j$
$R_i$ : Overall rating of player $i$
$B$ : Team budget
$V_i$ : Transfer value of player $i$

The main decision variable $t_{ij}$ represents a binary variable, whether player $i$ is transferred for position $j$ . We also have an auxiliary variable $r_j$ to define the final rating for position $j$ in the formation.

The objective function can be written as the summation of the final ratings:

$\textrm{maximize} \sum\limits_{j \in \textrm{S}} r_j$

Our first constraint is the budget for the transfer. The total value of players transferred cannot exceed the team budget:

$\sum\limits_{(i,j) \in \textrm{E}} V_i \cdot t_{ij} \leq B$

The next constraint defines the final rating for each position. This constraint accounts for transfer player $i$ replacing the current player at position $j$ :

$r_j = \bar{R}_j + \sum\limits_{i \in \textrm{P}: (i,j) \in \textrm{E}} (R_{i} - \bar{R}_{j}) \cdot t_{ij} \qquad \forall j \in \textrm{S}$

The following two constraints satisfy conditions that at most one player is transferred for a given position, and the one player cannot be transferred for two different positions:

$\sum\limits_{j \in \textrm{S}: (i,j) \in \textrm{E}} t_{ij} \leq 1 \qquad \forall i \in \textrm{P}$

$\sum\limits_{i \in \textrm{P}: (i,j) \in \textrm{E}} t_{ij} \leq 1 \qquad \forall j \in \textrm{S}$

Python Model

We model this problem using sasoptpy, an open-source Python interface of SAS Optimization.

m = so.Model(name='optimal_team', session=session)
 
rating = m.add_variables(POSITIONS, name='rating')
transfer = m.add_variables(ELIG, name='transfer', vartype=so.BIN)
 
m.set_objective(
  so.quick_sum(rating[j] for j in POSITIONS), name='total_rating', sense=so.MAX)
 
m.add_constraint(
  so.quick_sum(transfer[i, j] * value[i] for (i, j) in ELIG) <= budget, name='budget_con')
 
m.add_constraints((
  rating[j] == overall[member[j]] + so.quick_sum(
      transfer[i, j] * (overall[i] - overall[member[j]]) for (i, j2) in ELIG if j==j2) for j in POSITIONS), name='transfer_con')
 
m.add_constraints((
  so.quick_sum(transfer[i, j] for (i2, j) in ELIG if i==i2) <= 1 for i in PLAYERS), name='only_one_position')
 
m.add_constraints((
  so.quick_sum(transfer[i, j] for (i, j2) in ELIG if j==j2) <= 1 for j in POSITIONS), name='only_one_transfer')
 
m.solve()

Notice that it is very easy to model this problem using the Python interface. Our open-source optimization modeling package sasoptpy uses the runOptmodel action under the hood, as shown in examples in the documentation. If you are familiar with PROC OPTMODEL, you can write the SAS code and run it on SAS Viya directly.

Results

We have run the optimal transfer problem for the top six teams in Premier League standings: Manchester City, Liverpool, Chelsea, Tottenham, Arsenal, Manchester United. The current team and budget information are obtained from SoFIFA at the time of execution. We filtered out all the players older than 33 years old since a majority of players reach their peak before 33 and steadily lose performance.

See the table below for a comparison between optimal transfers for each team. The positions of the transfers are given in the following figures below the table.

Team	Old Rating	Avg	New Rating	Avg	Budget	Money Spent	Efficiency	Transfers
Manchester City	944	85.818	972	88.364	€170.0M	€170.0M	0.164706	Giorgio Chiellini, Thiago Emiliano da Silva, Jordi Alba Ramos, C. Ronaldo dos Santos Aveiro
Liverpool	932	84.727	949	86.273	€90.0M	€89.5M	0.189944	Łukasz Piszczek, Giorgio Chiellini, Sergio Busquets Burgos
Chelsea	925	84.091	948	86.182	€95.0M	€94.0M	0.244681	Samir Handanovič, Giorgio Chiellini, Thiago Emiliano da Silva, Marco Parolo
Tottenham Hotspur	933	84.818	949	86.273	€85.0M	€82.0M	0.195122	Filipe Luís Kasmirski, Marco Parolo, Sergio Busquets Burgos
Arsenal	905	82.273	933	84.818	€92.5M	€90.0M	0.311111	Lars Bender, Giorgio Chiellini, Filipe Luís Kasmirski, Fernando Luiz Rosa
Manchester United	915	83.182	951	86.455	€175.0M	€174.0M	0.206897	César Azpilicueta Tanco, Giorgio Chiellini, Thiago Emiliano da Silva, Filipe Luís Kasmirski, Luka Modrić

As mentioned above, we do not consider the likelihood of the transfer itself. We consider what money could buy if teams are able to get players at their current valuation.

Manchester City increases its total team rating from 944 to 972 by 28 points if they spend all of their current transfer budget of €170M. It is not surprising to see that with a rather limited budget of €90M, Liverpool can increase its total rating by 17 points, whereas Manchester United's total team rating can increase 36 points with their massive budget of €175M.

The efficiency column is calculated by dividing the change in total rating by total money spent in million euros. We expect the efficiency of the transfer to be larger when a few players have significantly lower ratings compared to the rest of the team and can be replaced with rather cheap alternatives. Arsenal has the highest efficiency and can increase its total rating 0.31 per million euros by purchasing 4 players.

The reason why the total rating of Liverpool does not increase as much as Arsenal's despite having close transfer budgets can be explained by the variation of the player ratings. The rating of the right back (RB) is increased 9 points (from 73 to 82) with a transfer worth of €17M for Arsenal. Liverpool's lowest rating in the current team is 80. Player values tend to increase sharply as we increase the rating:

Therefore, it is clear why some teams have an advantage in the transfer season. For these teams, it is easy to improve the team by replacing the weakest player. Consider these two extremes: Manchester City has to spend €170M to improve its total rating by 28 points, whereas Arsenal increases its total rating the same amount by spending €90M only.

Here's how the old and new lineups look for each team. New transfers are colored red while existing players are in blue:

Budget Limitations

In the last problem, we will have a look at how the budget is affecting the decisions. We will be varying the transfer budget of Liverpool from €0 to €200M in increments of €10M to see how it affects the outcome.

New Rating	Avg.New	Budget	Money Spent	Efficiency	Transfers
932	84.727	€0M	€0M	0
933	84.818	€10M	€7M	0.142857	Łukasz Piszczek
936	85.091	€20M	€20M	0.205128	Łukasz Piszczek, João Miranda de Souza Filho
939	85.364	€30M	€24M	0.291667	Thiago Emiliano da Silva
942	85.636	€40M	€38M	0.263158	Łukasz Piszczek, Giorgio Chiellini
943	85.727	€50M	€48M	0.229167	Lars Bender, Giorgio Chiellini
945	85.909	€60M	€56M	0.234234	Kyle Walker, Giorgio Chiellini
946	86.000	€70M	€62M	0.227642	César Azpilicueta Tanco, Giorgio Chiellini
947	86.091	€80M	€76M	0.197368	Joshua Kimmich, Giorgio Chiellini
949	86.273	€90M	€90M	0.189944	Łukasz Piszczek, Giorgio Chiellini, Sergio Busquets Burgos
950	86.364	€100M	€98M	0.183673	Giorgio Chiellini, Luka Modrić
952	86.545	€110M	€107M	0.186916	Kyle Walker, Giorgio Chiellini, Sergio Busquets Burgos
953	86.636	€120M	€116M	0.181818	Kyle Walker, Giorgio Chiellini, David Josué Jiménez Silva
955	86.818	€130M	€129M	0.178988	César Azpilicueta Tanco, Giorgio Chiellini, Luka Modrić
955	86.818	€140M	€129M	0.178988	César Azpilicueta Tanco, Giorgio Chiellini, Luka Modrić
957	87.000	€150M	€149M	0.167785	César Azpilicueta Tanco, Giorgio Chiellini, Fernando Luiz Rosa, Luka Modrić
958	87.091	€160M	€160M	0.163009	César Azpilicueta Tanco, Giorgio Chiellini, Jordi Alba Ramos, David Josué Jiménez Silva
959	87.182	€170M	€167M	0.162162	César Azpilicueta Tanco, Giorgio Chiellini, Jordi Alba Ramos, Luka Modrić
961	87.364	€180M	€180M	0.161111	César Azpilicueta Tanco, Giorgio Chiellini, Luka Modrić, Sergio Busquets Burgos
962	87.455	€190M	€189M	0.159151	César Azpilicueta Tanco, Giorgio Chiellini, Luka Modrić, David Josué Jiménez Silva
962	87.455	€200M	€189M	0.159151	César Azpilicueta Tanco, Giorgio Chiellini, Luka Modrić, David Josué Jiménez Silva

As seen below in detail, efficiency (total rating increase per million euros) decreases as we pay more money for a relatively lower change, as expected.

It seems Liverpool gets the best worth of its money if the Reds transfer Thiago Emiliano da Silva for CB position. Notice that efficiency converges to 0.16 total rating increase per million euros spent as we keep increasing the budget.

Increasing the potential

We have looked only at the current ratings of the players up to this point. The next problem we solve includes "potential" ratings of the new transfers. Naturally, young players have a significantly higher potential value compared to the old players. We need to replace the rating constraint as follows:

$r_j = \bar{P}_j + \sum\limits_{i \in \textrm{P}: (i,j) \in \textrm{E}} (P_{i} - \bar{P}_{j}) \cdot t_{ij} \qquad \forall j \in \textrm{S}$

where $P_i$ is the potential rating of a player, and $\bar P_j$ is the potential of the current player at position $j$ in the team.

For players under 25 years old, the optimal solution is to replace Henderson and Matip with Melo and de Ligt for €36M and €44M, respectively. These changes increase the potential rating by 18 points:

Edit: An earlier version of the blog post compared potential ratings of new transfers to current ratings of the current team. After fixing the problem, results have changed slightly.

Edit #2: We have updated results after fixing a filtering issue with the CSV database.

Dream Team under 23

Based on reader suggestions, we had a look at the optimal squad under €150M budget. Our objective is to maximize the potential rating and create a full team. I chose 4-4-2 formation for illustration purposes. The optimal squad cost €148.3M and the potential rating is 982:

Pos	Player	Rating	Potential	Paid
GK	Gianluigi Donnarumma	83	94	33.5M
LB	Thilo Kehrer	79	87	16.0M
LCB	William Saliba	71	88	4.2M
RCB	Boubacar Kamara	75	88	10.5M
RB	Trent Alexander-Arnold	80	89	19.0M
LCM	Rodrigo Bentancur	78	90	18.5M
CM	Ricard Puig Martí	69	89	2.1M
RCM	Sandro Tonali	73	90	7.5M
CAM	Phil Foden	75	90	13.5M
LS	Christian Kouamé	75	89	15.0M
RS	Ezequiel Barco	73	88	8.5M
Total		831	982	148.3M

I hope you enjoyed this brief analysis of potential transfers for top Premier League teams using Python and SAS Viya. I would be happy to answer if you have any questions, especially about how the model is constructed and how it is easily implemented in Python using sasoptpy. As usual, all the code for the problem is available at GitHub.

The post Bringing Analytics to the Soccer Transfer Season appeared first on Operations Research with SAS.

"Are You The One?" Finding the matching through SAS Constraint Programming

Natalia Summerville — Tue, 14 Aug 2018 10:00:19 +0000

Are You The One? Season 7 CR: Brian Bielmann/MTV

I bet you were not expecting a photo of 22 youngsters in swimsuits in a math blog, were you? The first time I heard about this reality TV show called "Are You The One?", my nerdiness and love for OR yelled at me: "There is so much math in this show!!!"

I'm sure producers did not mean it that way, but I saw mathematical optimization in it and just HAD to code it. Season 7 is premiering August 15th, and if you are as nerdy as I am, you should use my code to figure out the solution way before the all the participants do.

The show

According to Wikipedia, "Are You The One?" is a reality TV show produced by MTV, in which a group of men and women are secretly paired into male-female couples via a matchmaking algorithm. Then, while living together, the contestants try to identify all of these "perfect matches." If they succeed, the entire group shares a prize of up to $1 million.

In each episode, the participants get two new pieces of information about the coupling:

They get to choose one couple to go to the "truth booth" and ask if that couple is a perfect match or not.
They all pair up in a "matching ceremony" and get to know how many couples they got right (not which ones though).

This way, after each episode they have an increasing amount of information to find their matches. If they don't find it by episode 10, they lose.

The math

This problem can be solved through constraint programming because we're attempting to find all possible solutions (couples matches) that satisfy certain constraints (such as "Pete and Lola are NOT a couple" or "out of these given 10 couples only 2 are correct"). The more information you get as the weeks go by, the more constrained your problem is and the fewer feasible solutions you have - until you are left with only one (hopefully at most by week 10).

With SAS/OR Constraint Programming (PROC OPTMODEL or PROC CLP), we are able to discover this perfect matching much earlier than the participants.

The formulation shown in the next section uses a set of binary decision variables called Assign[b,g], with the interpretation that Assign[b,g] = 1 if boy b is assigned to girl g in the matching. The first two sets of constraints enforce a one-to-one matching between boys and girls. The Ceremony_Matches constraint makes sure that each solution respects the matching ceremony results. The Truth_Booth constraint forces Assign[b,g] = 1 or 0 (depending on the truth booth outcome) for the couple (b,g) that entered the truth booth.

The code

Here is the SAS code I used to solve this problem. To run it for the next season, just copy all the code and update the "STEP 1" portion each week.

/***********************************************/
/******* STEP 1 - UPDATE YOUR INPUTS ***********/
/***********************************************/
 
/* Input number of weeks of data */
%let num_weeks = 5;
 
/* Input weekly matches (from matching ceremony) up to current week */
/* For example, in Season 6, as of week 5, these were the couples in each ceremony */
 
data input_ceremony_pairs(keep=week boys_name girls_name);
   array girl_week{&num_weeks.} $;
   input boys_name $ girl_week{*};
   do week = 1 to &num_weeks.;
      girls_name = girl_week[week];
      output;
   end;
   datalines;
Anthony  Geles    Diandra  Jada     Keyana   Nicole
Clinton  Uche     Uche     Uche     Uche     Jada
Dimitri  Diandra  Nicole   Nurys    Alexis   Uche
Ethan    Jada     Jada     Alexis   Nicole   Geles
Joe      Zoe      Audrey   Zoe      Zoe      Zoe
Kareem   Alivia   Alivia   Alivia   Diandra  Alivia
Keith    Alexis   Alexis   Diandra  Nurys    Alexis
Malcolm  Nurys    Nurys    Geles    Alivia   Diandra
Michael  Keyana   Keyana   Audrey   Geles    Nurys
Shad     Audrey   Geles    Keyana   Audrey   Audrey
Tyler    Nicole   Zoe      Nicole   Jada     Keyana
;
 
/* Input number of correct matches (from matching ceremony) up to current week */
/* For example, in Season 6, as of week 5, these were the correct guesses */
 
data input_ceremony_correct_matches;
   input week correct_guesses;
   datalines;
1 3
2 1
3 2
4 3
5 1
;
 
/* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */
/* For example, in Season 6, as of week 5, these were the truth booths */
 
data input_truth_booth;
   input week boys_name $ girls_name $ outcome;
   datalines;
1  Ethan    Keyana   0
2  Anthony  Geles    0
3  Malcolm  Nurys    0
4  Dimitri  Nicole   0
5  Clinton  Uche     0
;
 
/***********************************************/
/****** STEP 2 - RUN WITHOUT MODIFYING CODE ****/
/***********************************************/
 
/* Constraint Programming Formulation */
 
proc optmodel;
 
   /* SETS */
 
   set  CEREMONY_PAIRS;
   set BOYS = setof { in CEREMONY_PAIRS} b;
   set GIRLS = setof { in CEREMONY_PAIRS} g;
   set WEEKS = 1..&num_weeks.;
 
   set  TRUTH_BOOTH_PAIRS;
 
   set SOLS init 1.._NSOL_;
 
   num correctMatches{WEEKS};
   num truthBoothOutcome{TRUTH_BOOTH_PAIRS};
 
   /* READ DATA */
 
   read data input_ceremony_pairs
      into CEREMONY_PAIRS=[week boys_name girls_name];
 
   read data input_truth_booth
      into TRUTH_BOOTH_PAIRS=[week boys_name girls_name] truthBoothOutcome=outcome;
 
   read data input_ceremony_correct_matches
      into [week] correctMatches=correct_guesses;
 
   /* DECISION VAR */
   /* Assign[b,g] = 1 if boy b is assigned to girl g in the matching; 0 otherwise */
   var Assign{BOYS,GIRLS} BINARY;
 
   /* CONSTRAINTS */
   con One_Girl_Per_Boy{b in BOYS}: sum{g in GIRLS} Assign[b,g] = 1;
   con One_Boy_Per_Girl{g in GIRLS}: sum{b in BOYS} Assign[b,g] = 1;
 
   /* WEEKLY CONSTRAINTS */
   con Ceremony_Matches{w in WEEKS}:
      sum{<(w),b,g> in CEREMONY_PAIRS} Assign[b,g] = correctMatches[w];
   con Truth_Booth{ in TRUTH_BOOTH_PAIRS}:
      Assign[b,g] = truthBoothOutcome[w,b,g];
 
   /* SOLVE */
   solve with clp / FINDALLSOLNS;
 
   /* CREATE OUTPUT */
   num couplesCount{BOYS, GIRLS} init 0;
   for{s in SOLS} do;
      for{b in BOYS} do;
         for{g in GIRLS: Assign[b,g].sol[s] > 0.5} do;
            couplesCount[b,g] = couplesCount[b,g] + 1;
            leave;
         end;
      end;
   end;
   create data output_couples 
      from [BOYS GIRLS] couplesCount;
 
   call symputx('Number_Solutions',_NSOL_);
 
quit;

The solution

To visualize the solution we can use the SAS SGPLOT procedure to create a heat map that illustrates the number of feasible solutions that contain a certain couple matched.

proc sort data=output_couples;
   by boys girls;
run;
 
proc sgplot data=output_couples noautolegend;
   title "Couples Occurrence Count in &Number_Solutions. Solutions";
   heatmap y=boys x=girls / weight=couplesCount 
      colormodel=(white pink red) outline x2axis;
   text y=Boys x=Girls text=couplesCount / textattrs=(size=10pt);
   xaxis display=none;
   x2axis display=(nolabel);
   yaxis display=(nolabel) reverse;
run;

For example, if we run the code above with the given data (Season 6, Week 5) we can see this diagram:

In this output we can see, for example, that Ethan and Geles are for sure not a couple (bummer) and that Tyler and Nicole seem to be the most promising couple (so far).

Your turn

Now use the following data (Season 6, Week 8) and my optimization and heat map code. Be ready for a surprise!

/***********************************************/
/******* STEP 1 - UPDATE YOUR INPUTS ***********/
/***********************************************/
 
/* Input number of weeks of data */
%let num_weeks = 8;
 
/* Input weekly matches (from matching ceremony) up to current week */
/* For example, in Season 6, as of week 8, these were the couples in each ceremony */
 
data input_ceremony_pairs(keep=week boys_name girls_name);
   array girl_week{&num_weeks.} $;
   input boys_name $ girl_week{*};
   do week = 1 to &num_weeks.;
      girls_name = girl_week[week];
      output;
   end;
   datalines;
Anthony  Geles    Diandra  Jada     Keyana   Nicole   Keyana   Keyana   Alivia
Clinton  Uche     Uche     Uche     Uche     Jada     Geles    Geles    Geles
Dimitri  Diandra  Nicole   Nurys    Alexis   Uche     Diandra  Diandra  Diandra
Ethan    Jada     Jada     Alexis   Nicole   Geles    Jada     Zoe      Alexis
Joe      Zoe      Audrey   Zoe      Zoe      Zoe      Alexis   Uche     Jada
Kareem   Alivia   Alivia   Alivia   Diandra  Alivia   Nurys    Nurys    Nurys
Keith    Alexis   Alexis   Diandra  Nurys    Alexis   Zoe      Jada     Audrey
Malcolm  Nurys    Nurys    Geles    Alivia   Diandra  Alivia   Alexis   Uche
Michael  Keyana   Keyana   Audrey   Geles    Nurys    Uche     Audrey   Keyana
Shad     Audrey   Geles    Keyana   Audrey   Audrey   Audrey   Alivia   Zoe
Tyler    Nicole   Zoe      Nicole   Jada     Keyana   Nicole   Nicole   Nicole
;
 
/* Input number of correct matches (from matching ceremony) up to current week */
/* For example, in Season 6, as of week 8, these were the correct guesses */
 
data input_ceremony_correct_matches;
   input week correct_guesses;
   datalines;
1 3
2 1
3 2
4 3
5 1
6 4
7 5
8 3
;
 
/* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */
/* For example, in Season 6, as of week 8, these were the truth booths */
 
data input_truth_booth;
   input week boys_name $ girls_name $ outcome;
   datalines;
1  Ethan    Keyana   0
2  Anthony  Geles    0
3  Malcolm  Nurys    0
4  Dimitri  Nicole   0
5  Clinton  Uche     0
6  Keith    Alexis   0
7  Keith    Alivia   0
8  Michael  Audrey   0
;

FYI - The participants in Season 6 did not find the perfect matching till week 10. I bet they would have loved to have SAS Constraint Programming handy.

Full season summary

To understand better how the number of possible matches (feasible solutions) decreases with each additional data piece (constraints), you can see the following super cool animation (thanks, Rob Pratt!):

Notice that the truth booths in weeks 4 and 6 were wasted because the status of the selected couple was already able to be deduced, so the number of possible solutions did not change. In a later update, I'll show how to optimally select which couple should go to the truth booth.

Going forward

So, now that you know how to use SAS Constraint Programming, let's help these poor guys find their perfect matches in Season 7. Follow this blog and see how much faster than the contestants we'll figure out the matching!

Season 7 Week 1 update

And so it begins. After the first "truth booth" and "matching ceremony" (and lots of drama), here is the first heat map.

Teaser from Rob:
If the contestants have a choice, they should choose either Shamoy-Maria or Tomas-Morgan to go to the next truth booth, but NOT because those couples appear in the most remaining solutions.

Season 7 Week 2 update

Unfortunately the contestants did not follow Rob's suggestion. After the second truth booth and matching ceremony, we now have 287,580 solutions and actually Shamoy-Maria is now the most promising couple! Here is the heat map.

Update from Rob:
The fate button required the contestants to pair Andrew or Cam with Asia or Nutsa for the truth booth. Because the four pairs (Andrew-Asia, Andrew-Nutsa, Cam-Asia, Cam-Nutsa) all appeared in the same number of solutions (165,326) after week 1, these four choices were equally desirable to send to the truth booth, with respect to reducing the number of possible solutions. After week 2, Shamoy-Maria would be the best choice for the truth booth, but again NOT because they have the highest solution count (136,848). More details next week...

Season 7 Week 3 update

The third truth booth revealed that Shamoy and Maria are in fact a perfect match. After the third matching ceremony, there are 37,681 solutions.

Optimal truth booth recommendation from Rob:
Choosing either Cam-Kayla or Tevin-Kenya would yield the smallest number of resulting solutions in the worst case.

Optimal truth booth strategy

The "Couples Occurrence Count" table provides a way to determine which couple should go to the truth booth if you want to minimize the worst-case (maximum) resulting number of solutions. If couple (b,g) goes to the truth booth, there are two possibilities:

If that couple is a perfect match, then the resulting number of solutions is couplesCount[b,g].
If that couple is not a perfect match, then the resulting number of solutions is _NSOL_ - couplesCount[b,g].

So in the worst (worse?) case, the resulting number of solutions is max(couplesCount[b,g], _NSOL_ - couplesCount[b,g]), that is, the larger of these two counts. Because we want to reduce the number of solutions, the best strategy is to choose a couple that minimizes this quantity. Equivalently, we want to choose a couple that comes the closest to appearing in half the solutions. The following statements implement this idea:

   /* find best pair for truth booth by minimizing the maximum number of resulting solutions */
   num maxSolsTruth {b in BOYS, g in GIRLS} = max(couplesCount[b,g], _NSOL_ - couplesCount[b,g]);
   num min_maxSolsTruth init constant('BIG');
   set  BEST_PAIR;
   for {b in BOYS, g in GIRLS} do;
      if min_maxSolsTruth > maxSolsTruth[b,g] then do;
         min_maxSolsTruth = maxSolsTruth[b,g];
         BEST_PAIR = {};
      end;
      else if min_maxSolsTruth = maxSolsTruth[b,g] then do;
         BEST_PAIR = BEST_PAIR union {};
      end;
   end;
   put BEST_PAIR= min_maxSolsTruth;
   create data output_truth 
      from [BOYS GIRLS] maxSolsTruth;

After week 3, the PUT statement reports in the log:

BEST_PAIR={<'Cam','Kayla'>,<'Tevin','Kenya'>} 21889

The interpretation is that by choosing either of these two couples to go to the truth booth, the resulting number of solutions will be at most 21,889, and this number is best possible in the sense that choosing any other couple could yield a larger number of solutions.

As before, it is useful to display the results in a heat map:

proc sort data=output_truth;
   by boys girls;
run;
 
proc sgplot data=output_truth noautolegend;
   title "Maximum Number of Solutions if Couple goes to Truth Booth";
   heatmap y=boys x=girls / weight=maxSolsTruth 
      colormodel=(white pink red) outline x2axis;
   text y=Boys x=Girls text=maxSolsTruth / textattrs=(size=10pt);
   xaxis display=none;
   x2axis display=(nolabel);
   yaxis display=(nolabel) reverse;
run;

The lowest numbers appear in the lightest cells.

Season 7 Week 4 update

Here's the update after last night's episode. We are down to 12,860 solutions.

Using Rob's optimal strategy, the contestants should send Tevin and Kenya to the booth next episode such that in the worst-case scenario, the number of solutions will be down to 7,081.

Avoiding a blackout

If the matching ceremony yields no matches (a "blackout") beyond those confirmed by the truth booth, the prize pool gets cuts in half. So it would be good to figure out a way to avoid a blackout if possible. After the week 4 truth booth (before the matching ceremony), here was the heat map:

To avoid a blackout, you can solve an auxiliary optimization problem that maximizes the minimum number of matches. The formulation shown in the next code snippet uses a set of binary decision variables called IsCeremonyPair[b,g], with the interpretation that IsCeremonyPair[b,g] = 1 if boy b is paired with girl g in the matching ceremony. The first two sets of constraints enforce a one-to-one matching between boys and girls. The PerfectMatch constraint makes sure that any couple that is confirmed a "perfect match" by the truth booth is also paired together in the matching ceremony. The MaxMinCon constraint linearizes the maximin objective, as in the Fun with Flags post from July 2017. The following statements build and solve this auxiliary problem:

   /* MATCHES_sol[s] = the set of pairs  for which Assign[b,g] = 1 in solution s */
   set  MATCHES_sol {SOLS} init {};
   for{s in SOLS} do;
      for{b in BOYS} do;
         for{g in GIRLS: Assign[b,g].sol[s] > 0.5} do;
            MATCHES_sol[s] = MATCHES_sol[s] union {};
            leave;
         end;
      end;
   end;
 
   /* find best ceremony pairs by maximizing the minimum number of matches (to avoid blackout) */
   var IsCeremonyPair{b in BOYS, g in GIRLS} binary;
   con One_Girl_Per_Boy_Ceremony{b in BOYS}: sum{g in GIRLS} IsCeremonyPair[b,g] = 1;
   con One_Boy_Per_Girl_Ceremony{g in GIRLS}: sum{b in BOYS} IsCeremonyPair[b,g] = 1;
   con PerfectMatch{ in TRUTH_BOOTH_PAIRS: truthBoothOutcome[w,b,g] = 1}:
      IsCeremonyPair[b,g] = 1;
   var MaxMin >= 0 integer;
   max MinNumMatches = MaxMin;
   con MaxMinCon{s in SOLS}:
      MaxMin <= sum{ in MATCHES_sol[s]} IsCeremonyPair[b,g];
   problem CeremonyProblem include
      IsCeremonyPair MaxMin
      MinNumMatches
      One_Girl_Per_Boy_Ceremony One_Boy_Per_Girl_Ceremony PerfectMatch MaxMinCon;
   use problem CeremonyProblem;
   solve;
   put (min{s in SOLS} sum{ in MATCHES_sol[s]} IsCeremonyPair[b,g])=;
   num couplesCountCeremony {b in BOYS, g in GIRLS} = (if IsCeremonyPair[b,g].sol > 0.5 then couplesCount[b,g] else .);
   create data output_couples 
      from [BOYS GIRLS] couplesCount couplesCountCeremony;

For week 4, the solver recommends an optimal matching ceremony that guarantees at least three matches and hence avoids a blackout. You can use the following PROC SGPLOT code to display the resulting solution:

proc sort data=output_couples;
   by boys girls;
run;
 
proc sgplot data=output_couples noautolegend;
   title "Couples Occurrence Count in &Number_Solutions. Solutions";
   heatmapparm y=boys x=girls colorResponse=couplesCountCeremony / 
      colormodel=(white pink red) outline x2axis;
   text y=Boys x=Girls text=couplesCount / textattrs=(size=10pt);
   xaxis display=none;
   x2axis display=(nolabel);
   yaxis display=(nolabel) reverse;
run;

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

It turns out that the actual matching ceremony chosen by the contestants in week 4 could have yielded a blackout. That is, there was at least one remaining solution that contained no matches from the ceremony.

Season 7 Week 5 update

Now we're down to 1,696 solutions:

Optimal truth booth recommendation from Rob:
Cam-Kayla would yield at most $\max(884, 1696-884) = 884$ solutions, and they are the best choice. Note that Tevin-Kenya appear in more solutions together but have a higher worst-case count of $\max(1005, 1696-1005) = 1005$ solutions.

Season 7 Week 6 update

We're getting there: 574 possible solutions left.

Optimal truth booth recommendation from Rob:
Cam-Kayla would yield at most $\max(285, 574-285) = 289$ solutions, and they are still the best choice. Also, we can now conclude that Brett-Kayla are not a match, even though they never went to the truth booth.

Season 7 Week 7 update

The contestants voted to send Zak and Nutsa to the truth booth, and they were not a match, so $574-36=538$ possible solutions remain.

Optimal matching ceremony strategy

We already saw one way to avoid a blackout by solving an auxiliary optimization problem to maximize the minimum number of matches. But there might be many such possible matching ceremonies that yield the same number of matches, and (as with the truth booth strategy) we would like to minimize the number of resulting solutions in the worst case. We can solve a different auxiliary problem to both avoid a blackout and minimize the number of solutions. This problem involves the IsCeremonyPair[b,g] variables and related constraints from before, as well as additional binary variables IsBeamCount[s,c] with the interpretation that IsBeamCount[s,c] = 1 if and only if the beam count for solution s is c. Because Shamoy and Maria are the only perfect match so far, we need at least two beams to avoid a blackout. The following statements build and solve this auxiliary problem:

   /* find best ceremony pairs by minimizing the maximum number of resulting solutions */
   /* minBeamCount is minimum number of beams to avoid blackout */
   num minBeamCount = 1 + sum{ in TRUTH_BOOTH_PAIRS} truthBoothOutcome[w,b,g];
   set BEAM_COUNTS = minBeamCount..card(BOYS) diff {card(BOYS) - 1};
   /* IsBeamCount[s,c] = 1 iff sum{ in MATCHES_sol[s]} IsCeremonyPair[b,g] = c */
   var IsBeamCount{SOLS, BEAM_COUNTS} binary;
   impvar NumSolsPerBeamCount{c in BEAM_COUNTS} = sum{s in SOLS} IsBeamCount[s,c];
   var MinMax >= 0 integer;
   min MaxNumSolutions = MinMax;
   con MinMaxCon{c in BEAM_COUNTS}:
      MinMax >= NumSolsPerBeamCount[c];
   con OneBeamCount{s in SOLS}:
      sum{c in BEAM_COUNTS} IsBeamCount[s,c] = 1;
   con Link{s in SOLS}:
      sum{ in MATCHES_sol[s]} IsCeremonyPair[b,g] = sum{c in BEAM_COUNTS} c * IsBeamCount[s,c];
   problem CeremonyProblem2 include
      IsCeremonyPair IsBeamCount MinMax
      MaxNumSolutions
      One_Girl_Per_Boy_Ceremony One_Boy_Per_Girl_Ceremony PerfectMatch MinMaxCon
      OneBeamCount Link;
   use problem CeremonyProblem2;
   solve;
   print NumSolsPerBeamCount;
   create data output_couples 
      from [BOYS GIRLS] couplesCount couplesCountCeremony;

For week 7, the solver recommends an optimal matching ceremony that guarantees at least two matches (and hence avoids a blackout) and yields at most 142 solutions in the worst case (three beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

It turns out that the actual matching ceremony chosen by the contestants in week 7 (the results of which were left as a cliffhanger to be revealed in next week's episode) could yield up to 296 solutions, so the matching ceremony proposed here is better in the sense that it guarantees a much smaller worst-case number of solutions.

Update from Rob: The matching ceremony yielded four beams, and this implies several no-matches, four of which the contestants realized, and an inferred match (Tevin-Kenya), which the contestants did not realize and instead sent them to the truth booth. Although the truth booth result was treated as a cliffhanger to be revealed in the next episode, we already know the result. (In fairness to the contestants, there does not seem to be a simple explanation that Tevin and Kenya are a perfect match, without enumeration. For example, if you fix Assign['Tevin','Kenya'] = 0, the resulting problem is MILP infeasible but LP feasible.) There are now 128 remaining solutions, and that number will not change when the truth booth confirms the Tevin-Kenya match.

For next week, the solver recommends an optimal matching ceremony that guarantees at least three matches (and hence avoids a blackout) and yields at most 35 solutions in the worst case (five beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

Season 7 Week 8 update

As we already knew would happen, the truth booth confirmed that Tevin and Kenya are a perfect match. The matching ceremony, which could have yielded a blackout, leaves 34 solutions possible:

Season 7 Week 9 update

The contestants sent Cam and Samantha to the truth booth, but the results above already showed they were not a match, so that is two wasted truth booths in a row. The solver recommends an optimal matching ceremony that guarantees at least three matches (and hence avoids a blackout) and yields at most 8 solutions in the worst case (four or five beams):

Here is the resulting heat map, where the couples who are not in the matching ceremony are grayed out:

The contestants chose a different matching ceremony, which could have yielded a blackout. They got four beams, and now only five solutions remain:

By adding a PUT statement, you can write these solutions to the log:

for {s in SOLS} put s= MATCHES_SOL[s];

Here they are:

s=1 {<'Andrew','Cali'>,<'Brett','Nutsa'>,<'Cam','Bria'>,<'Daniel','Jasmine'>,<'Kwasi','Lauren'>,<'Lewis','Samantha'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Asia'>,<'Zak','Morgan'>}
s=2 {<'Andrew','Cali'>,<'Brett','Samantha'>,<'Cam','Lauren'>,<'Daniel','Bria'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Nutsa'>,<'Zak','Morgan'>}
s=3 {<'Andrew','Cali'>,<'Brett','Samantha'>,<'Cam','Lauren'>,<'Daniel','Nutsa'>,<'Kwasi','Kayla'>,<'Lewis','Bria'>,<'Moe','Asia'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Jasmine'>,<'Zak','Morgan'>}
s=4 {<'Andrew','Samantha'>,<'Brett','Bria'>,<'Cam','Nutsa'>,<'Daniel','Cali'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Kayla'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Lauren'>,<'Zak','Morgan'>}
s=5 {<'Andrew','Samantha'>,<'Brett','Nutsa'>,<'Cam','Morgan'>,<'Daniel','Lauren'>,<'Kwasi','Asia'>,<'Lewis','Jasmine'>,<'Moe','Cali'>,<'Shamoy','Maria'>,<'Tevin','Kenya'>,<'Tomas','Bria'>,<'Zak','Kayla'>}

Season 7 Week 10 update

Brett and Nutsa are headed to the final truth booth. Out of the four choices allowed by the fate button, they are the only couple whose status is unknown, so that was a good decision. From the latest heat map, we can see that there will be either two solutions (if Brett-Nutsa is a match) or three solutions (if not) left after the truth booth results. So it is not possible for the contestants to have logically deduced the correct matching in time for the final ceremony. It will take some guesswork for them to win the $1 million prize.

The final truth booth revealed that Brett-Nutsa is a perfect match, so we are down to two solutions (s=1 and s=5 above):

The contestants used solution s=1 for the final matching ceremony, and...

...they won the $1 million! Thanks for following along with us this season.

Season 7 data

You can reproduce these results by using the following data.

/* Input number of weeks of data */
%let num_weeks = 10;
 
/* Input weekly matches (from matching ceremony) up to current week */
data input_ceremony_pairs(keep=week boys_name girls_name);
   array girl_week{&num_weeks.} $;
   input boys_name $ girl_week{*};
   do week = 1 to &num_weeks.;
      girls_name = girl_week[week];
      output;
   end;
   datalines;
Andrew Lauren   Morgan   Lauren   Nutsa    Samantha Lauren   Lauren   Samantha Cali     Cali
Brett  Cali     Asia     Cali     Kayla    Nutsa    Nutsa    Nutsa    Nutsa    Bria     Nutsa
Cam    Kayla    Kayla    Kayla    Asia     Kayla    Kayla    Cali     Lauren   Morgan   Bria
Daniel Nutsa    Nutsa    Samantha Lauren   Bria     Samantha Samantha Asia     Lauren   Jasmine
Kwasi  Asia     Lauren   Jasmine  Bria     Jasmine  Asia     Asia     Jasmine  Nutsa    Lauren
Lewis  Samantha Jasmine  Asia     Kenya    Lauren   Bria     Bria     Bria     Asia     Samantha
Moe    Jasmine  Bria     Nutsa    Samantha Asia     Jasmine  Jasmine  Kayla    Kayla    Kayla
Shamoy Maria    Maria    Maria    Maria    Maria    Maria    Maria    Maria    Maria    Maria
Tevin  Kenya    Kenya    Kenya    Jasmine  Kenya    Kenya    Kenya    Kenya    Kenya    Kenya
Tomas  Morgan   Cali     Bria     Cali     Cali     Cali     Kayla    Morgan   Jasmine  Asia
Zak    Bria     Samantha Morgan   Morgan   Morgan   Morgan   Morgan   Cali     Samantha Morgan
;
 
/* Input number of correct matches (from matching ceremony) up to current week */
data input_ceremony_correct_matches;
   input week correct_guesses;
   datalines;
1 3
2 3
3 3
4 2
5 4
6 4
7 4
8 4
9 4
10 11
;
 
/* Input couples that went to truth booth and their outcome (1 if they are a match, 0 otherwise) up to current week */
data input_truth_booth;
   input week boys_name $ girls_name $ outcome;
   datalines;
1  Tomas    Maria    0
2  Andrew   Asia     0
3  Shamoy   Maria    1
4  Brett    Kenya    0
5  Zak      Bria     0
6  Brett    Cali     0
7  Zak      Nutsa    0
8  Tevin    Kenya    1
9  Cam      Samantha 0
10 Brett    Nutsa    1
;

The post "Are You The One?" Finding the matching through SAS Constraint Programming appeared first on Operations Research with SAS.

Visiting all 30 Major League Baseball Stadiums - with Python and SAS® Viya®

Sertalp B. Cay — Wed, 13 Jun 2018 15:33:07 +0000

Ballpark Chasers

Oriole Park at Camden Yards by Ed Hughes

A cross-country trip is pretty much an all-American experience, and so is baseball. Traveling around the country to see all 30 Major League Baseball (MLB) stadiums is not a new idea; there's even a social network between so-called "Ballpark Chasers" where people communicate and share their journeys. Even though I'm not a baseball fan myself, I find the idea of traveling around the country to visit all 30 MLB stadiums pretty interesting.

Since we all lack time, the natural question that might pop into your mind is, "How fast can I visit all 30 stadiums and see a MLB game in each?" This question was first asked by Cleary et al. (2000) in a mathematical context. This is where the math and baseball intersect. Finding the optimal trip is an expensive calculation. Discarding the schedule for a second and focusing only on ordering stadiums to visit results in more than $2.65 \times 10^{32}$ different permutations. When you add the game schedule and distances between stadiums, the problem gets much bigger and more difficult quickly. See the Traveling Salesman Problem if you are interested in difficult scheduling problems, which is the main source of the "Traveling Baseball Fan Problem."

The Optimal Trip

Before starting to talk about "The Optimal Trip" I should make some assumptions. The Optimal Trip is quite a subjective term, so I need to choose a measurable objective. My focus is to complete the schedule within the shortest time possible. Further, I assume that the superfan only uses land transportation (a car) between stadiums. Unlike some variations, I don't require the fan to return back to the origin, so the start and end cities will be different. Each stadium will be visited only once.

The Traveling Baseball Fan Problem (TBFP) has gained quite a bit of attention. There is a book by Ben Blatt and Eric Brewster about their 30 games in 30 days. They created an online visualization of such a tour for those interested; unfortunately the tool only shows schedules for the 2017 season. Their approach here is a heuristic, so the resulting solution is not guaranteed to be the "shortest possible tour." Since the problem is huge, one can only expect the true optimal solution to be obtained after optimization. Ben Blatt also wrote a mathematical optimization formulation for the shortest possible baseball road trip.

There are different ways to model this problem. The model I am going to use to optimize the TBFP is the network-based formulation presented in a SAS Global Forum 2014 paper by Chapman, Galati, and Pratt. Rob Pratt, one of the authors, wrote about the TBFP in this blog before.

Ground Rules and Challenge

Just to reiterate, ground rules for the optimal schedule:

Use ground transportation only.
Use driving distances between stadiums. The driving distances are obtained via OSRM.
Stay until each game ends (assume each game lasts 3 hours).

The main challenge is to gather data, model the problem, and visualize results---all within the Python environment. Moreover, my aim here is to show you that the mathematical formulation for the TBFP can easily be written with our new open-source Python package sasoptpy and can be solved using SAS Viya on the cloud. If you are interested, check our Github repository for the package and our SAS Global Forum 2018 paper (Erickson and Cay) to learn more about it!

To provide variety, let's solve the problem for different time periods (2, 3, and 6 months) with two different objectives. The first objective is to finish the schedule in the shortest time possible. The second objective is to finish the schedule with spending the least amount of money. For this, I will assume $130 accommodation rate per day and $0.25 travel cost per mile. This objective was the main motivation of Rick and Mike's 1980 tour, as mentioned in the paper by Cleary et al. (2000).

TBFP Model

I will use the network formulation from the aforementioned SAS Global Forum 2014 paper. For this model, I define directed arcs between pairs of games, eliminate the arcs that cannot be part of a feasible solution, and optimize the given objective.

The decision variable in this formulation is $u$ , which is defined for each arc $(g_1, g_2)$ as

$u[g_1, g_2] = \begin{cases}1 & \text{if the fan attends games $g_1$ and $g_2$ and no game in between} \\0 & \text{otherwise}\end{cases}$

Denote $c[g_1,g_2]$ as the time between games $g_1$ and $g_2$ in days, including game duration, as follows:

$c[g_1,g_2] = \begin{cases} \textrm{end}[g_2] - \textrm{end}[g_1] & \textrm{if } g_1 \not = \textrm{source and } g_2 \not = \textrm{sink} \\ \textrm{end} [g_2] - \textrm{start}[g_2] & \textrm{if } g_1 = \textrm{source and } g_2 \not = \textrm{sink} \\ 0 & \textrm{otherwise} \end{cases}$

Also denote $l[g]$ as the location of the game $g$ , $\text{NODES}$ as the list of games including dummy nodes 'source' and 'sink', $\text{ARCS}$ as the connections between games, and finally $\text{STADIUMS}$ as the list of all stadiums. Now I can write the Network Formulation as follows:

$\begin{array}{rlcll} \textrm{minimize:} & \displaystyle \sum_{(g_1, g_2) \in \text{ARCS}} c[g_1,g_2] \cdot u[g_1,g_2] \\ \textrm{subject to:} & \displaystyle \sum_{(g,g_2) \in \text{ARCS}} u[g,g_2] - \sum_{(g_1,g) \in \text{ARCS}} u[g_1,g] & = & \begin{cases} 1 & \text{if } g = \text{source,} \\ -1 & \text{if } g = \text{sink,} \\ 0 & \text{otherwise}\end{cases} & & \forall g \in \text{NODES} \\ & \displaystyle \sum_{(g_1,g_2) \in \text{ARCS}: g_2 \not = \text{sink and } l[g_2] = s} u[g_1, g_2] & = & 1 & & \forall s \in \text{STADIUMS} \end{array}$

The solution of this optimization problem should produce a route starting at the source, finishing at the sink, and passing through all 30 ballparks. The objective here is to minimize the total schedule time. The first set of constraints ensures that inflow and outflow are equal for regular nodes. The second set of constraints ensures that the fan visits every stadium once.

For the second objective, I need to replace the objective function with the following:

$\textrm{minimize:} \; \displaystyle 130 \cdot \sum_{(g_1, g_2) \in \text{ARCS}} c[g_1, g_2] \cdot u[g_1, g_2] + 0.25 \cdot \sum_{(g_1, g_2) \in \text{ARCS}: g_1 \not = \text{source} \text{ and } g_2 \not = \text{sink}} d[g_1, g_2] \cdot u[g_1, g_2]$

where $d$ is the distance between games in miles.

Modeling with sasoptpy

Now that I have my formulation ready, it is a breeze to write this problem using sasoptpy. Only part of the code is shown here for illustration purposes. See the Github repository for all of the code, including the code used for grabbing the season schedule from the MLB website, driving distances from OpenStreetMap, and exporting results.

I can write the Network Formulation to solve TBFP in Python as follows:

'''
Defines the optimization problem and solves it.
 
Parameters
----------
distance_data : pandas.DataFrame
    Distances between stadiums in miles.
driving_data : pandas.DataFrame
    The driving times between stadiums in minutes.
game_data : pandas.DataFrame
    The game schedule information for the current season.
venue_data : pandas.DataFrame
    The information regarding each 30 MLB venues.
start_date : datetime.date, optional
    The earliest start date for the schedule.
end_date : datetime.date, optional
    The latest end date for the schedule.
obj_type : integer, optional
    Objective type for the optimization problem,
    0: Minimize total schedule time, 1: Minimize total cost
'''
def tbfp(distance_data, driving_data, game_data, venue_data,
         start_date=datetime.date(2018, 3, 29),
         end_date=datetime.date(2018, 10, 31),
         obj_type=0):
 
    # Define a CAS session
    cas_session = CAS(your_cas_server, port=your_cas_port)
    m = so.Model(name='tbfp', session=cas_session)
 
    # Define sets, parameters and pre-process data (omitted)
 
    # Add variables
    use_arc = m.add_variables(ARCS, vartype=so.BIN, name='use_arc')
 
    # Define expressions for the objectives
    total_time = so.quick_sum(
        cost[g1,g2] * use_arc[g1,g2] for (g1, g2) in ARCS)
    total_distance = so.quick_sum(
        distance[location[g1], location[g2]] * use_arc[g1, g2]
        for (g1, g2) in ARCS if g1 != 'source' and g2 != 'sink')
    total_cost = total_time * 130 + total_distance * 0.25
 
    # Set objectives
    if obj_type == 0:
        m.set_objective(total_time, sense=so.MIN)
    elif obj_type == 1:
        m.set_objective(total_cost, sense=so.MIN)
 
    # Balance constraint
    m.add_constraints((
        so.quick_sum(use_arc[g, g2] for (gx,g2) in ARCS if gx==g) -\
        so.quick_sum(use_arc[g1, g] for (g1,gx) in ARCS if gx==g)\
        == (1 if g == 'source' else (-1 if g == 'sink' else 0) )
        for g in NODES),
        name='balance')
 
    # Visit once constraint
    visit_once = so.ConstraintGroup((
        so.quick_sum(
            use_arc[g1,g2]
            for (g1,g2) in ARCS if g2 != 'sink' and location[g2] == s) == 1
        for s in STADIUMS), name='visit_once')
    m.include(visit_once)
 
    # Send the problem to SAS Viya solvers and solve the problem
    m.solve(milp={'concurrent': True})
 
    # Post-process results (omitted)

Experiments

I ran this problem for the following twelve settings.

ID	Period	Objective (Min)
1	03/29 - 06/01	Time
2	03/29 - 06/01	Cost
3	06/01 - 08/01	Time
4	06/01 - 08/01	Cost
5	08/01 - 10/01	Time
6	08/01 - 10/01	Cost
7	03/29 - 07/01	Time
8	03/29 - 07/01	Cost
9	07/01 - 10/01	Time
10	07/01 - 10/01	Cost
11	03/29 - 10/01	Time
12	03/29 - 10/01	Cost

The last two settings (11 and 12) cover the entire 2018 MLB season from March 29th to October 1st. Therefore, these problems should give the optimal solutions for the best time and best cost objectives, respectively. My aim is to show how the problem size and solution time grow when the problem period is larger.

Visualization

The ultimate benefit of working within the Python environment is ability to use open-source packages for many tasks. I have used the Bokeh package for plots and Folium for generating travel maps below. Bokeh is capable of web-ready interactive plots, which makes the visualization engaging for the user. Folium uses the Leaflet.js Javascript library to generate interactive maps based on OpenStreetMap maps. For interaction between the Bokeh scatter plots and the Leaflet maps, I have used a custom Javascript function provided by the Bokeh CustomJS class. You can see details of how the visualization part works in the Jupyter notebook.

Results

The optimal solution I obtained from the 11th experiment gives the best schedule time for the 2018 season, which is just over 24 days (24 days and 3 hours). The solution starts with Diamondbacks @ Giants in AT&T Park, San Francisco on the 5th of June and ends with Royals @ Mariners in Safeco Field, Seattle on the 29th of June. This is the global best solution among the scheduled games this season. Maps and itineraries of selected solutions are shown below. Click on any of the plots and tables to see the Jupyter notebook. All times in these schedules are in EDT.

As a 22,528-mile trip, this schedule costs roughly $8,767. By changing the objective, it is possible to obtain a better cost, however the schedule takes longer significantly longer. Solution 12 is only 11,914 miles. It's 10,614 miles shorter and $2,149 cheaper compared to Solution 11 but takes 4 days longer.

Among the schedules you can still try at the writing of this post, Solution 9 gives the shortest schedule time (24 days 3 hours) and Solution 10 gives the best cost ($6,899). The latter schedule starts with Tigers @ Angels in Angel Stadium, Anaheim on the 6th of August, and ends with Orioles @ Mariners in Safeco Field, Seattle on the 4th of September. Note that, this solution is the longest schedule among all solutions I have with a little over 29 days.

Here's a list of all solutions:

The objective and the time period of the formulation heavily affect the solution time. Minimizing the cost takes longer due to unique optimal solutions. Moreover, increasing the time period from 3 months to 6 months nearly quadruples the solution time.

Ultimately, the best trip is up to you. You can define another objective of your choice, whether it be minimizing the cost, minimizing the schedule time, avoiding the risky connections (minimum time between games minus the driving time), minimizing the total driving time, or even maximizing the landmarks you have visited along the way. Whatever objective you choose, you can use sasoptpy and use the powerful SAS Viya mixed integer linear optimization solver to generate a trip that is perfect for you! Working in Python allows you to integrate packages you are familiar with and makes everything smoother. Do not forget to check my Jupyter notebook and see the Python files.

Moving Further

You can further improve this model based on your desire. Here I list a few ideas:

If you would like to see an away and a home game for every team exactly once, then you can add the following constraint

$\displaystyle \sum_{(g, g_1) \in \text{ARCS}: \text{away}[g]= t \text{ and } \text{away}[g_1] \not = t} u[g,g_1]+\sum_{(g_1, g) \in \text{ARCS}: \text{away}[g]= t \text{ and } \text{away}[g_1] \not = t} u[g_1,g] = 1 \qquad \forall t \in \text{TEAMS}$

You can try to avoid risky connections you have in the schedule by replacing the objective with

$\displaystyle \text{maximize: } \sum_{(g_1,g_2) \in \text{ARCS}} ( \text{start}[g_2] - \text{end}[g_1] - c[g_1,g_2])\cdot u[g_1, g_2]$

To prevent schedules that are too long, you should add a limit to the total schedule length, for example, 25 days:

$\displaystyle \sum_{(g, \text{sink}) \in \text{ARCS}} \text{end}[g]\cdot u[g, \text{sink}] - \sum_{(\text{source}, g) \in \text{ARCS}} \text{start}[g]\cdot u[\text{source}, g] \leq 25$

Share how you define the perfect schedule below if you have more ideas!

The post Visiting all 30 Major League Baseball Stadiums - with Python and SAS® Viya® appeared first on Operations Research with SAS.

Operations Research

We are joining forces with the SAS Data Science blog!

It's been a great ride

Life moves on and so are we

Come visit our new home

Why venue optimization is critical and how it works

Group Selections

Venue-wide Restrictions

Performance

SAS Venue Optimization

Flow

Seating App

References

Back to School Optimization

Problem Description

Input Data

Model Formulation

Decision Variables

Objective Function

Constraints

Constraints specific to the daily rotation scenario

Special case

Solving the Model

Visualizing the Output

Summary

Acknowledgments

Additional reading

Code

Workforce Scheduling at an Animal Shelter

Problem Description

Input Data Overview

Employees Table and Jobs Table

Demand Table and Demand Coefficients Table

Model Formulation

Solving the Problem

Optimization Log and Output Tables

Visualizing the Output

Employee Schedule

Overtime Analysis

Summary

Acknowledgments

Code

Improving Hidden Markov Models with Black-Box Optimization

Hidden Markov Models

Introduction

Elements of an HMM

The HMM Procedure/Action set

Regime-Switching Autoregression Model

An example: Discovering the Hidden Market States

Black-Box optimization solver

Numerical Experiments

Tuning RS-AR models with p = 0

Tuning RS-AR models with any k, p

Summary

References

SAS® Optimization Web App Using REST APIs with Embedded SAS® Visual Analytics Reports

High-level Architecture

Server-side library - restaf-server

Front-end library - ReactJS

Demo workflow

Problem Description

Analyze Input Data

Input Optimization Parameters

Optimization code

Analyze Output Report

Scenario Analysis

What's Next?

Batter-pitcher matchup analysis in the 2019 Major League Baseball playoffs

The data source

Parsing complex event files

Aggregating the data

Taking a quick look into the data

Coming up next

Bringing Analytics to the Soccer Transfer Season

Analytics for Transfer Season

Data

Model

Python Model

Results

Budget Limitations

Tuning RS-AR models with $p = 0$

Tuning RS-AR models with any $k$ , $p$