- Generate some random parameters and evaluate the function there.
- Use the best values of the parameter to generate new candidates.

One simple way of generating such random parameters is fitting a normal distribution at every iteration: we choose a subset of “elite parameters” (a fraction of those we tried), calculate their mean and covariance and use it to generate a new population of parameters.

This kind of optimization methods is quite useful for instance in reinforcement learning.

For instance, suppose we want to maximize the function:

# the function we want to optimize f <- function(theta){ reward = -sum((solution - theta)**2) return(reward) } solution <- c(0.5, 0.1, -0.3)

Then the cross-entropy algorithm in this case is:

library(MASS) cem <- function(f, n_iter, theta_mean, theta_std, batch_size=25, elite_frac=0.2){ # Now, for the algorithms for(it in 1:n_iter){ # Sample parameter vectors thetas <- matrix(mvrnorm(n=batch_size*dim_theta, mu= theta_mean, Sigma=theta_std) , ncol = dim_theta) rewards <- apply(thetas,1,f) # Get elite parameters n_elite <- as.integer(batch_size * elite_frac) elite_inds <- sort(rewards, decreasing = T, index.return=T)$ix[1:n_elite] elite_thetas <- thetas[elite_inds,] # Update theta_mean, theta_std theta_mean <- apply(elite_thetas, 2,mean) theta_std <- 0.01*diag(dim_theta)+0.99*cov(elite_thetas) } return(theta_mean) } cem(f,300)

and we call this like:

dim_theta <- length(solution) theta_mean <- matrix(0,dim_theta,1) theta_std <- diag(dim_theta) batch_size <- 25 # number of samples per batch elite_frac <- 0.2 # fraction of samples used as elite set cem(f,300, theta_mean=theta_mean, theta_std=theta_std , batch_size=batch_size, elite_frac=elite_frac)]]>

Suppose we want to play rock, paper, scissors for money. Each of us puts 100 EUR on the table, we play and the winner takes the money. Let’s further assume that you play rock and I play paper (so I win) and thus take the 100 EUR from the table. In this case, your **regret **for not playing paper is 100 (because we would have tied), but your regret for not paying scissors is 200, because you did not win.

Formally, if , denote the strategy sets of the players (here and both are equal to {rock, paper, scissors}), then the regret for player 1’s action , given the last-round moves is:

How do we get a strategy out of this? We initialize a counter at the beginning of the game. We play randomly in the beginning, and from the second round on, we compute the regrets for each action and divide each of these numbers by the sum of regrets of our actions. We will choose the action with a random strategy, playing with probability proportional to the normalized cumulative regret.

Let’s take a look at the code in Python:

import random import numpy as np # Base player that chooses a constant mixed strategy every time class Player: def __init__(self): self.my_moves = [] self.other_moves = [] def move(self, strategy): # Input: a vector of probability distributions for actions # Output: a pure action r = random.uniform(0,1) n_actions = len(strategy) a = 0 cumulative_proba = 0.0 while a<n_actions-1: cumulative_proba += strategy[a] if r < cumulative_proba: return a a +=1 return a class RegretPlayer(Player): def __init__(self): super(RegretPlayer, self).__init__() self.regret_sum = np.zeros(3) def regret(self): if len(self.my_moves)>0: my_action = self.my_moves[-1] his_action = self.other_moves[-1] else: return np.zeros(3) # Payoffs from my perspective my_payoff = np.zeros(3) # If we play the same, I don't get any payoff my_payoff[his_action] = 0. # I win when he plays scissors and I pay rock, # or when I play the "next" (Rock = 0, Paper = 1, Scissors = 2) my_payoff[0 if his_action == 2 else his_action + 1] = 1 # I lose when I play scissors and he plays rock, # or when I play the "previous" action my_payoff[2 if his_action == 0 else his_action -1] = -1 regrets = [my_payoff[a]-my_payoff[my_action] for a in range(3)] regrets = np.array(regrets) return regrets def get_regret_mixed_strategy(self): normalize_sum = 0.0 strategy = np.zeros(3) regret = self.regret() for a in range(3): strategy[a] = max(self.regret_sum[a],0) normalize_sum += strategy[a] # If all regrets are positive, play randomly if normalize_sum > 0: strategy = strategy / normalize_sum else: strategy = np.ones(3)/3 self.regret_sum += regret return strategy

We can simulate the game of the regret player:

import matplotlib.pyplot as plt def payoff(my_action, his_action): if my_action == his_action: return 0 if his_action == 0 and my_action==2 or my_action == his_action-1: return -1 return 1 def run(n_rounds = 10): p1 = RegretPlayer() p2 = Player() total_p1 = 0.0 total_p2 = 0.0 rounds_won_p1 = 0. plt.ion() plt.axis([-0.1,n_rounds+0.1,-0.1,1.1]) print("*"*100) print("The match begins") print("*"*100) for n in range(1,n_rounds): regret_strategy_p1 = p1.get_regret_mixed_strategy() m1 = p1.move(regret_strategy_p1) m2 = p2.move([0.4,0.3,0.3]) # Players update the info of the moves p1.my_moves.append(m1) p1.other_moves.append(m2) total_p1 += payoff(m1,m2) #### Show results moves_map = {0:"Rock", 1:"Paper", 2:"Scissors"} print('-'*50) print("Regret:") print(p1.regret_sum) print("Strategy:", regret_strategy_p1) print("My move: %s" % moves_map[m1]) print("His move: %s" % moves_map[m2]) print("Payoffs:") print(total_p1) print(total_p2) rounds_won_p1 += 1 if payoff(m1,m2) >0 else 0 # Plot the moves plt.title("Percentage of rounds won using a regret strategy") plt.scatter(n,rounds_won_p1/n, color = "red") plt.show() plt.pause(0.1) if __name__ == "__main__": run(n_rounds = 100)]]>

- There are possible actions, Actions are sometimes called arms (hence armed).
- At time an action is chosen and a reward

is allocated. - Each action gives an
*unknown*reward, sampled from an*unknown*

probability distribution. - The reward depends only on the action.
**Goal**: Maximize total reward.

These are a simpler class of reinforcement learning problems on which there is no influence from the state. Our actions have a noisy effect in the reward and our goal is to learn from these noisy signals. However, the effect of our actions on the reward is not influenced by any other externalities.

Bandit algorithms are useful in a number of applications:

- Personalized content/news.
- Online advertising (A/B testing and similar). You have probably heard of A/B testing. A/B testing is a bandit algorithm with a pure exploration phase, followed by pure exploitation. Don’t worry, this will be clear soon.

Clinical trials (experimental treatments for patients while

minimizing losses). - Portfolio Optimization.

Let’s figure out how to deal with bandits before moving on to harder reinforcement learning problems.

As we said for every action , the reward might be still random, but it only

depends on the action.

To maximize rewards, we create some estimates of the function

above, and use those estimates to choose the best actions. The function is the best possible return we could get from each action, whereas represents our current estimate of the function based on what we just learned from the interaction.

One way to learn about is to keep track of the average rewards.

By the law of large numbers, if the action is taken infinitely many times,

A dilemma that arises in many decision problems is the trade-off between exploration and exploitation. This simply means that if we spend a lot of time trying things randomly, we are risking our reward in the long term, out of low commitment to well-performing actions. Conversely, sticking to something too soon might fire back: imagine we try 2 out of 10 different arms of our bandit, and we just stick to the best of those two. We might be ignoring a much better arm if we don’t get adventurous and try it! When we just commit to such action (i.e. the best action seen so far) we say we use a **greedy** policy.

- The
**greedy policy**at time is choosing actions such

that:

- If at time we choose action such that is
**exploiting**the action, whereas

means**exploring**.

One way to deal with the exploration-exploitation trade-off is to be -greedy

- Create an
**action-value**table (a Python dictionary) storing the

history of the payoffs obtained by each action. - Calculate the average payoff of each action.
- Choose with probability a non-optimal action.
- Choose with probability one of the greedy actions

(those with highest average payoff). - Break ties randomly.

The choice of is important, and we can observe different types of behaviour.

The figures below illustrate the rewards an agent gets for choosing different arms with different reward probabilities (shown in the third image from the right). The reward is 1 when you get it. So we see that the very conservative player (with small get stuck with a single action from the beginning.

On the opposite behaviour, with a value of , we see a player that does a bit better and explores a wider range of actions (in the second picture). However, the mean reward is barely around the average. This is because he is just playing randomly.

Finally, we have a more balanced agent, which gets a much higher reward in average because is able to identify the correct actions to use with time.

]]>There are two important parts for the implementation: on one hand, we have to implement an environment that simulates the reward of the arms. The skeleton of this class is given below:

```
class Arm(object):
def __init__(self, params):
## passes the required parameters
## this could be the reward probability, or other parameter (mean, sigma) of a lottery
## that gives the rewards.
...
def step(self):
## gives a reward
...
```

From the point of view of the agent, we need to implement the decision making mechanism. The skeleton of this class would be:

```
class Bandit(object):
def __init__(self, *args, **kwargs):
...
def reset(self):
...
def choose_action(self):
## Chooses a pure action, i.e. an arm
...
def update(self, action, reward):
#### Updates the values of N and Q
...
```

As for the greedy part:

```
import random
import numpy as np
def argmax(array):
return array.index(max(array))
class EpsilonGreedy:
def __init__(self, epsilon, n_actions):
self.Q = np.zeros(n_actions)
self.N = np.zeros(n_actions)
self.epsilon = epsilon
self.n_actions = n_actions
def choose_action(self):
if random.random()>self.epsilon:
a = argmax(self.Q)
else:
a = random.randint(0,self.n_actions-1)
return a
def update(self, action, reward):
n = self.N[action]
self.N[action]+=1
self.Q[action] = self.Q[action]+1/(self.n_actions)*(reward-self.Q[action])
```

In reality, we can also change the value of at every iteration, that is , to reduce the exploration rate. As the value of goes to zero, we would approach the greedy policy. Under a good reduction schedule, we would stop the exploration after identifying the good actions. As examples of cooling schedules we have or .

Besides -greedy, other action selection methods can be used. We will briefly describe some of them.

It turns out that -greedy has the (sometimes unpleasant) property of choosing a really bad action with the same probability as choosing some not-too-bad action. To avoid this, we can choose our action via a mixed strategy (that is, randomizing), using the softmax function:

Here is a parameter, called the **temperature**. In the limit, when , softmax action selection behaves like -greedy.

For both softmax and -greedy action selection methods, we can decrease the parameters to zero as the number of steps of the episode decreases. This process is called **annealing** and it causes the bandit

algorithm to explore less over time.

In this method, we want to be sure of the performance of an action before choosing it. Formally, regret-based estimates are used to give us upper bounds on the true action values. The action selection mechanism is:

where is an upper bound for the reward. This gives us a way to control randomness, but it comes at a price. UCB suffers from *back-pedaling*, which roughly means “I know this is bad, but I’m not completely sure about how bad it is”. So you end up choosing bad actions because you are not completely sure of their true value. It also happens that UCB algorithms may not become strictly greedy.

Last but not least, a method closely related to actor-critic algorithms in reinforcement learning. The motivation behind is simple: How much reward is a good reward? Once we decide on this, we can choose our actions accordingly. The algorithm works as follows:

- Set a learned
**preference**for taking action . - The policy

is a

mixed strategy. - The preference is updated as:

where

Note that this reduces to softmax selection for a particular choice of the function.

There are different metrics to see how a bandit algorithm is performing.

**Method 1:**Measure how often is the best arm chosen- That sounds ok, but what if the arms have similar rewards? This metric does not really make sense.

**Method 2:**Average reward at each point in time (is it really

getting better?)- However, this would yield bad average reward for algorithms that explore a lot. So we should be aware of this when we tweak the hyperparameters of our algorith ( in this case).

**Method 3:**Cumulative rewards.- This is a “Big-picture” metric: we can see if the cost of exploration was worth.

Let’s illustrate a couple of application domains.

In clinical trials, the goal is to evaluate possible treatments for a disease. Patients are allocated into groups and the reward is 0 or 1, depending on success of the treatment. This is a typical example where one would rather use softmax than -greedy, because it’s not the same to choose a random treatment than a second most likely to work. Also, annealing should come in, because in later stages of the trial, a greater fraction of the subjects should be assigned to better treatments.

Each time you visit a website the publisher must choose to display one of ads, and the reward for the publisher is 0/1 (no click/click). However, not all users are alike, and they are not alike to themselves in time. Ads should be in **context** of users’ activity. For this, a different class of algorithms, called **contextual bandits** is used. A contextual bandit receives an external signal to know in which context you are, and find, for each context, the best actions for that context. This is almost a reinforcement learning problem, except that the context has no influence in the dynamics to the next state.

A major issue in real life deployments is that reward functions are messy and ill-specified. An amusing example of this can be seen in https://blog.openai.com/faulty-reward-functions/, where an agent learns to maximize its reward function as the score on a videogame. However, the agent learns the periodicity in which some bonus-giving tokens are regenerated on screen, and turns over and over in circles instead of finishing the episode.

]]>

What concrete advantages does open-source data science bring to your organisation?

**1. Develop your own use cases.**

As the dust settles down and clear winners emerge (R & Python), organizations should get ready to jump into data science, identifying relevant use cases for their businesses. The focus is less and less on technologies and is finally moving onto bottom-line value. There is no one better prepared than your own, trusted and experienced employees to come up with the right use cases for data science in your business. You do not need an expensive consultancy, you have already the experts!

**2. In-house customer support.**

Although “open-source software” is understood as free, like in “free-beer”, it comes to a cost. Yes, you don’t have to pay the hefty yearly license, but you have to customize your software to your infrastructure and needs. A large chunk of the revenue for open source vendors comes from the ongoing support. The way out? Train your own technology experts.

**3. Increased productivity and reduced employee turnover.**

A happy employee is an employee that stays with you. Open source technologies are here to stay, and smart, valuable data scientists and data analysts want to work with those. Proprietary software limits their creativity and that would eventually make them leave. It is not pleasant to be locked down with the limited tooling that vendors offer, no matter how “easy” they make it sound.

**4. Trained employees can be promoted.**

Although there are signs that data science starts to be commoditized, with an data science and machine learning bootcamps and even master and PhD degrees everywhere, you can not really expect that any of such “instant experts” will be able to take a senior role. Investing in your employees now will save you money in the future when you are fully committed to embrace advanced analytics in your business, as the new joiners will find strong mentors and leaders within your team.

*Interested in upgrading your analytics / open-source software in-house capabilities? Reach out! *

So, is it really time to jump in and drop one’s career and everything to switch to AI? Is it finally time to join Coursera/Udacity/Edx/your favorite MOOC/university?

Well, no.

The thing is, AI and ML, hype or not, are tools. They are hammers waiting for nails.

It doesn’t make much sense to invest in machine learning training by itself. Successful applications of machine learning and, in general, artificial intelligence, often require solid domain knowledge and proprietary data, one way or another.

Knowledge is certainly transferable among machine learning applications, and solving one type of problems (yes, even mildly artificial, like Kaggle) does help to solve unseen problems, and you can get lucky and apply out of the box algorithms with default parameters and match the state-of-the-art in a particular domain, or exceed it. But going beyond the easy-hanging fruit takes tons of effort and a combination of the above.

A better career strategy might be choosing a domain one is interested in, and then see how AI can be applied there. And for this, you just need to skim through some of the excellent courses around, no need to get a PhD in machine learning.

]]>

We are not teaching students the skills they really need. We send them unprepared to the workplace, much to the dismay of the employer, which needs to spend valuable time and resources in training them. Why are we teaching them Lisp, when the probability of using it is close to zero? I can not believe that there are that many self-branded machine learning / artificial intelligence programs without a single course in R or Python, for instance.

I’m obviously speaking from my trench (data science), but I see that happening in other fields too. For many specializations, including humanities, a decent course in spreadsheet software is long overdue. Instead, focus is lost in highly theoretical subjects, memorizing tons of stuff that are now easily available online.

We should focus on problem solving skills. Why is it taking so long for universities to catch on?

]]>

Unlike academia, where one is free of cherry-picking problems, in the private sector we often find ourselves doing “menial” work for which we feel overqualified. The truth is that one needs to learn by doing, and if you haven’t done your fair amount of, say, SQL queries, you would not appreciate the job necessary, nor would be able to troubleshoot your algorithms yourself.

In my first job as a data scientist, I was super disappointed because I needed to write long queries and I was expecting to just build some beautiful models on top of the cleansed and sanitized data. I clearly had the wrong attitude, but it took me some time to realize.

Later while interviewing candidates for our team, I met a few PhDs that wanted to climb into manager immediately (or so their salary requirements suggested) on the sole merit that they, well, had a PhD. This sense of detachment of reality is what needs to get off from you first.

Another source of problems is the relationship with your advisor and close professors. This is a nasty one, and a very case-by-case topic. I had to disclose my leaving from academia at the time I was finishing the thesis, and tradition dictated that it was time for me to find a postdoc. In my case, my advisor was not against my moving out of academia, and he somewhat encouraged it by introducing me to other mathematicians working on close fields who have gone to the private sector. Unfortunately, another professor took it very differently. She was extremely upset about me leaving academia, and she thought I was a traitor and a mercenary that seeked nothing but profit out of life. Interestingly, she often spent lots of grant’s money on taxis with no reason (public transport in Europe is generally fine), but well. This kind of speech affects you personally: you had grown up to become a researcher, and probably it was a long cherished dream to be a professor. But circumstances change, and you can not let yourself be held by your 15-year-old self. Move on.

]]>At least on the beginning, you need to be aware of what the market is looking for, and where you can fit. Coding in some way or another is essential for this. Depends on the level of specialization you want to achieve, this would be C++ (development of high-performance algorithms, for instance in finance), Python, R or SQL (data science and business analytics). Some data science jobs may ask you for Java and so on: do it at your own risk, since that might be a position a bit further from modeling.

Another important skill is communication: you need to be able to understand other people’s needs and communicate your needs and the solution to their problems. If you are coming straight from academia, chances are hight that you had to give some lectures to experts and teach some courses to non-experts. In many countries, PhD students have to teach to students beyond their field. This is particularly common in maths, where pure math PhDs are usually teaching business students. It’s useful to find out this kind of opportunities during the PhD, so don’t hesitate.

The message that needs to be delivered to the decision makers of the place you want to work is this: you have to show them that you can get work done, and that your work is bringing in revenue or reducing costs. Anything else is utterly irrelevant to them.

**Finding a job**

There are lots of websites out there, which I am sure you are familiar with. I have had great success with indeed.com and their localized versions. Be sure that you understand the job requirements, and read between the lines. Sometimes (rather often) recruiters do not understand a lot about the requirements, so they just pull in a long list of requisites. A good recruiter will look at your CV and maybe even call you if they see that you could fit in their team, whether you fit on their shopping list or not.

The estimates vary, but it is generally agreed that most of the jobs never make it to the job boards. I know, that’s disappointing, but you should be aware that the best way to get a job you will enjoy is through networking. Go out and talk to people, nowadays every major city has meetups or other interest groups where you can find people with common interests. Don’t hesitate to pull in your network: for sure by the time you end your PhD you have a solid network of people to rely on.

It is always useful to build a name for yourself, so don’t hesitate on participating in as many public speakings opportunities you get, specially if it is for a wider audience. Who knows, maybe someone will offer you a job after your talk…

The interview process varies wildly: some places may ask you to prepare for a coding interview, while others may interview you and then do an offer. Usually there are at least three rounds: a first round to do a quick assessment of how suitable you are for the job, a second round to see that you know your stuff and a third round to see the final decision maker.

If you are invited to a coding interview, it is highly advisable to get some practice in websites like Codility, Hackerrank, Codewars or Project Euler.

]]>