
Today, my guest blogger is Rick Gentile, an engineer in our Signal Processing Product Marketing group. He will introduce a new app which enables you to gain quick insights into your data.

Signal Processing Toolbox has helped MATLAB users generate and work with signals for many years. In our recent releases, we have expanded the ability to analyze and compare signals in the time, frequency, and time-frequency domains. You can now use these capabilities to gain insights into your data, which can help you to develop and validate your algorithms.

Signal Processing Toolbox provides functions and apps to preprocess, explore, and extract features from signals. We recently added the **Signal Analyzer** app to the toolbox to make it simple for you to visualize and compare multiple time-based signals that live in the MATLAB Workspace. With the app, you can gain insight into the nature of your signals in the time and frequency domains simultaneously. You will find this very helpful in data analytics and machine learning applications, where identifying patterns and trends, extracting features, and developing custom algorithms are key aspects of the workflow.

To demonstrate some of the new capabilities of the app, I have selected data that is easy to visualize and understand. The short example below is based on a data set that may be familiar to you: a blue whale call. The file `bluewhale.au` contains audio data from a Pacific blue whale, taken from the library of animal vocalizations maintained by the Cornell University Bioacoustics Research Program. The time scale in the data is compressed by a factor of 10 to raise the pitch and make the calls more audible.

To start, I read the data into the MATLAB Workspace. Note that you can use the command `sound(x,fs)` to listen to the audio. I then start the **Signal Analyzer** app from the **APPS** tab at the top of the MATLAB environment toolstrip. I could also have done this by typing `signalAnalyzer` at the MATLAB prompt, as in the code below.

```
whaleFile = fullfile(matlabroot,'examples','matlab','bluewhale.au');
[x,fs] = audioread(whaleFile);
signalAnalyzer
```

In the picture below, after the app opens, I drag the signal `x` directly from the Workspace Browser (lower left section of the app) into the display region.

The recorded data includes four signal components of interest. The first is commonly referred to as a "trill" and comprises roughly the first 10,000 samples. The next three sounds are referred to as "moans".
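If you would like to hear the trill on its own, a quick sketch (my own, not from the original post; the 10,000-sample extent is only approximate) is to extract those samples and play them back:

```
trill = x(1:10000);   % roughly the extent of the trill
sound(trill, fs)      % listen to just the trill
```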

You see the **Signal Analyzer** app in the picture below with time, spectrum and panner axes enabled. The middle axes show the spectrum of the signal. The bottom axes contain a panner, where you can zoom and navigate into the whale call of interest.

I then turn on Data Cursors to find where each of the sounds is located in the data. Next, I use this information to extract vectors with the signal segments of interest. The code below extracts these vectors based on the beginning and end of the specific moans. In a data analytics application, this is analogous to exploring a data set and finding a condition of interest in the data stream.

```
moan1 = x(24396:31087);   % Extract Signal for Moan 1
moan2 = x(45499:52550);   % Extract Signal for Moan 2
moan3 = x(65571:72571);   % Extract Signal for Moan 3
```

You can use Signal Processing Toolbox functions to do a range of operations on the data, including finding changes and outliers, filtering and smoothing, or filling gaps where data is missing. Signal Processing Toolbox also has functions to find signal segments within a larger signal based on similarity measures. Let's look at how you can use this to automate the work that was manually performed above.
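As a small illustrative sketch of those operations (my own example, not from the original post; exact function availability depends on your MATLAB release, and the gap below is artificially introduced):

```
y = medfilt1(x, 5);                          % smooth with a 5-point median filter
ipt = findchangepts(x, 'MaxNumChanges', 4);  % locate abrupt changes in the signal
xgap = x;
xgap(100:120) = NaN;                         % pretend a few samples are missing
xfill = fillmissing(xgap, 'linear');         % fill the gap by linear interpolation
```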

The `findsignal` function returns the start and stop indices of the segment of the data array, `data`, that best matches the search array, `signal`. The best-matching segment is the one that minimizes `dist`, the Euclidean distance between the segment and the search array. Note that many functions in Signal Processing Toolbox conveniently generate a plot when no output arguments are specified. You will see this in the code below, where I repeat some instructions to take advantage of it; you don't need to repeat the lines if you need only the plot or only the outputs.

I start by taking an additional look at the signals of interest in the app. As shown in the picture below, I compare the three whale moans by simply dragging `moan1`, `moan2`, and `moan3` directly into the display region. The three moans are overlaid in the time domain in the top axes. The bottom axes provide the overlaid spectrum of the three occurrences. From this view, the moan spectra of all three occurrences look more similar than their time-domain counterparts. This is an indication that it may be best to find the signals using the frequency content rather than looking through their time-domain representations.

With this intuition, I now try to use `findsignal` with the spectrogram as an input to see if it matches the signal of interest with the spectra of other occurrences. First I look at the spectrogram for the entire signal. The trill and moans are visible in the image.

```
spectrogram(x,kaiser(64,3), 60, 256, fs,'yaxis') % View spectrogram for original signal
```

I now compute the spectrogram of the entire signal and a reference spectrogram based on the first moan and feed these to `findsignal` in order to find all call signal occurrences.

```
[~,F,T,PxxSignal] = spectrogram(x,kaiser(64,3), 60, 256, fs);
[~,F,T,PxxMoan]   = spectrogram(moan1,kaiser(64,3), 60, 256, fs);
findsignal(PxxSignal, PxxMoan, 'Normalization','power','TimeAlignment','dtw', ...
    'Metric','symmkl','MaxNumSegments',3);
```

You can see that `findsignal` did return three matches. The start and stop indices for each match are highlighted in the plot. This confirms the intuition we gained by viewing the data in **Signal Analyzer**. For completeness, I make the same call using output arguments to capture the start and stop index of each match. With this information now in the MATLAB Workspace, it is much easier to automate the signal extraction for larger signals, especially if there are many more signal segments to find. Having these locations generated automatically will save you time as you navigate through your own signals.

```
[istart,istop,d] = findsignal(PxxSignal, PxxMoan, 'Normalization','power', ...
    'TimeAlignment','dtw','Metric','symmkl','MaxNumSegments',3); % Obtain start and stop index for each match
```

There are many other functions you can use to improve the quality of the data. These functions can all be used in a similar manner, in conjunction with the app, which can provide you with more insights into your data and signal sets.

You can also find more info on the app here.

I hope this example has helped you understand how to visualize and compare multiple time-based signals and find interesting features. Can this help in your area of work? Let us know here.

Get the MATLAB code

Published with MATLAB® R2016b


- Introduction
- Gambler's Ruin
- Calculating the probability that position `k` is last
- What routes do we need to consider to compute the expected path length?
- Probability that the port takes route L or R
- Calculating the expected path length when `k` is last
- What is the mean number of passes needed for a table of size N?
- And finally

Hopefully you recall the recent post where we used a Monte Carlo technique to simulate an infinitely-stocked port decanter passing around a table of `N` people (always starting at position 0), moving with probability *p* to the left and *q* to the right, only stopping when everyone at the table had received some port. Those simulations seemed to show that simple analytic solutions to expected path lengths might be possible, and today I'll show you what I managed to prove.

In the last post we decided that the functions we wanted to find are

- $p_k(N)$ the probability that position k is the last to receive the port
- $E_k(N)$ the expected number of passes needed when k is the last
- $E_N$ the expected number of passes needed for a table of size N

Thanks to my colleague Rick Amos, we have a starting point for thinking about an analytical solution. He pointed out that our port passing problem was very similar to the random walk process Gambler's Ruin, in which the gambler starts out with some stake (`k`) and wins another unit of that stake with probability `p` (or loses with probability `1-p`). The game continues until the gambler has either lost all the money, or has a total of `N`. The major difference is that in Gambler's Ruin there are two different end conditions (winning with `N` or losing with `0`).

To enable us to attack the port passing problem, I'll need to quote two results from Gambler's Ruin. You can find derivations of these results in the associated files on File Exchange (see the files `BiasedGamblersRuin` and `UnbiasedGamblersRuin`). The first result allows us to answer the question: "what is the probability, given that we are currently at position S, of arriving at position A before arriving at position B?". In terms of Gambler's Ruin this is the same as starting with stake $k=S-B$ and finishing with $N=A-B$. To make some of the equations simpler I'll use $q=1-p$, and note we assume $A<S<B \mbox{ or } B<S<A$.

$$pSAB=\left\lbrace \begin{array}{cc}\frac{k}{N} & p=q=\frac{1}{2}\\\frac{1-r^k}{1-r^N} & p\ne q\mbox{ and } r=\frac{q}{p}\end{array}\right.$$

Similarly, we also need to answer: "what is the expected number of passes (starting at S) to arrive at position A before arriving at position B?".

$$eSAB=\left\lbrace \begin{array}{cc}\frac{N^2-k^2}{3} & p=q=\frac{1}{2}\\\frac{1+r}{1-r}\left(N\left(\frac{2}{1-r^N}-1\right)-k\left(\frac{2}{1-r^k}-1\right)\right) & p\ne q\end{array}\right.$$

Note how each result has a different expression for the biased and unbiased cases. Below we will run through the biased case (which has the greater complexity). Exactly the same code could be run for the unbiased case and the correct result obtained - if you are interested, simply change `pToUse` below back to `0.5` and execute.

```
syms pSAB(S,A,B) eSAB(S,A,B) p q k N
pToUse = 0.55;
assume(p == pToUse);
% Define the functions for k and N in terms of S, A, and B.
k(S,A,B) = S-B;
N(S,A,B) = A-B;
if pToUse == 0.5
    pSAB(S,A,B) = k/N;
    eSAB(S,A,B) = (N^2 - k^2)/3;
else
    % In the biased case we will use r rather than p in the equations
    syms r
    assume(r ~= 1);
    pSAB(S,A,B) = (1-r^k)/(1-r^N);
    eSAB(S,A,B) = (1+r)/(1-r) * (N*(2/(1-r^N)-1) - k*(2/(1-r^k)-1));
end
```

To allow us to talk about the general case where position `k` is the last to receive the port around a table of size `N` we will use position `L` to refer to a position one to the right of `k` and `R` to one to the left of `k` (note the use of modulo arithmetic to ensure $k-N<R \le 0 \le L < k$ and that the direction left is defined as an increase in position number, and right is a decrease in position number).

```
syms L R N k
L = k-1;
R = k+1-N;
```

For position `k` to be last we either need the port to move from the starting position 0 to L and then to R, or move from 0 to R and then L. This probability is given by

$$p_k = pSAB(0,L,R) * pSAB(L,R,k) + pSAB(0,R,L) * pSAB(R,L,k)$$

```
p0L = pSAB(0, L, R);
p0R = pSAB(0, R, L);
pLR = pSAB(L, R, k);
pRL = pSAB(R, L, k-N);
p0LR = p0L*pLR;
p0RL = p0R*pRL;
pk = simplify(p0LR + p0RL)
```

```
pk = -(r^N*(r - 1))/(r^k*(r - r^N))
```

In the previous blog post we showed that for the unbiased case this result was independent of position `k`, but in the biased case the position around the table affects the last probability. Let's look at the residuals from some simulations to check if this is correct (the `portPassingSimulation` code is available here).

```
p = pToUse;
N = 17;
k = 1:N-1;
nSims = 1e5;
residuals = NaN(10,N-1);
expected = double(subs(pk, {'r' 'N' 'k'}, {(1-p)/p, N, k}));
% Simulate 10 times passing around a table of 17
parfor ii = 1:10
    last = portPassingSimulation(nSims, N, p);
    % Group simulations into k (1:N-1) bins and normalize by number of simulations
    lastProb = hist(last, k)/nSims;
    residuals(ii, :) = lastProb - expected;
end
% Are the residuals normally distributed?
probplot('normal', residuals(:))
grid on
```

We can see that the residuals from the simulation are roughly normally distributed (some tail points are not quite normal) but there isn't any bias from the expected results (because the mean of the residuals is close to zero).

```
meanR = mean(residuals(:))
```

```
meanR = 7.5894e-20
```

To compute the expected path length, we are going to have to consider all possible paths that finally end with position `k`, work out the probability with which each path occurs and multiply by the expected length of that path. Fortunately, whilst there are infinitely many possible paths we can decompose the problem into just 2 that relate directly to Gambler's Ruin. These are the L route and R route as specified by

1. Start at 0 and move to L (but **not R**) `E0L`
2. Move from L to R (but **not k**) `ELR`
3. Then either **(a)** move from R to k-N (but **not k**) `ERk`
4. Or **(b)** move from R to L to k (but **not k-N**) `ERLk`

The alternative route R (which is basically the inverse of route L) is

1. Start at 0 and move to R (but **not L**) `E0R`
2. Move from R to L (but **not k-N**) `ERL`
3. Then either **(a)** move from L to k (but **not k-N**) `ELk`
4. Or **(b)** move from L to R to k-N (but **not k**) `ELRk`

The expected lengths of each of these parts of the path are

```
syms k N
E0L = eSAB(0, L, R);
E0R = eSAB(0, R, L);
ELR = eSAB(L, R, k);
ERL = eSAB(R, L, k-N);
ERk = eSAB(R, k-N, k);
ELk = eSAB(L, k, k-N);
ERLk = eSAB(R, k, k-N);
ELRk = eSAB(L, k-N, k);
```

It is interesting to note that the expected path length `eSAB` is symmetric with respect to swapping the variables `p` and `q`. We can see this in the above equations as `ELR==ERL`, `ERk==ELk` and `ERLk==ELRk`

```
simplify(ELR == ERL)
simplify(ERk == ELk)
simplify(ELRk == ERLk)
```

```
ans = TRUE
ans = TRUE
ans = TRUE
```

To combine the parts of route L and route R where there is a choice we need to include the probabilities of undertaking part 3 and 4 of L and part 3 and 4 of R. These are simply the probabilities of those paths in Gambler's Ruin.

```
pRk = pSAB(R, k-N, k);
pLk = pSAB(L, k, k-N);
pRLk = pSAB(R, k, k-N);
pLRk = pSAB(L, k-N, k);
```

So finally we can combine these parts for our two routes. `EL` is the expected path for route L and `ER` the expected path length for route R.

```
EL = E0L + ELR + pRk.*ERk + pRLk.*ERLk;
ER = E0R + ERL + pLk.*ELk + pLRk.*ELRk
```

```
ER = ((1/r - 1)*(r + 1)*(2/(1/r - 1) - N*(2/(1/r^N - 1) + 1) + 1))/((r - 1)*(1/r^N - 1)) - (((2/(r^(2 - N) - 1) + 1)*(N - 2) - (2/(r^(1 - k) - 1) + 1)*(k - 1))*(r + 1))/(r - 1) - ((r + 1)*(2/(r - 1) - (N - 1)*(2/(r^(N - 1) - 1) + 1) + 1))/(r - 1) - (((N - 1)*(2/(r^(N - 1) - 1) + 1) - N*(2/(r^N - 1) + 1))*(r + 1)*(r^(N - 1) - 1))/((r^N - 1)*(r - 1))
```

I've only printed one result out simply to show that at the moment this looks pretty complicated! Hopefully this is going to get simpler some way down the line!

To be able to compute the overall expected path we also need to know the probability that the port takes route L or R, since

$$E_k = p_L E_L + p_R E_R$$

Fortunately, we already have the components to compute these probabilities from the analysis of the finishing-position probabilities. The probability that we take route L is just the probability of going from 0 to L to R, divided by the probability that k is last. Similarly for route R.

```
pL = simplify(p0L*pLR/pk)
pR = simplify(p0R*pRL/pk)
```

```
pL = (r^N - r^(k + 1))/(r^N - r^2)
pR = -(r*(r - r^k))/(r^N - r^2)
```

With all these results in place we can compute the expected length by adding the two possible path lengths times their probabilities.

```
Ek = simplify(pL.*EL + pR.*ER, 400)
```

```
Ek = (2*(2*N - k - 3*r^N + N*r^N - N*r^k + k*r^N + 1))/((r^N - 1)*(r - 1)) - (k - 2*N + N*r^N + N*r^k - k*r^N)/(r^N - 1) - 4/(r - 1)^2 - (2*r^N*(r^N + 1)*(N - 1))/((r - r^N)*(r^N - 1))
```

**Does this agree with simulation?**

We can look to see how well `Ek` agrees with the simulated data by running the simulation and computing the mean number of passes for each finishing position. As we saw in the previous blog post, a simple way to look at the mean number of passes for each last position is by using `varfun` for a table, with a grouping variable.

```
p = pToUse;
N = 17;
k = 1:N-1;
[last,nPass] = portPassingSimulation(1e6, N, p);
t = varfun(@mean, table(last, nPass), 'GroupingVariable', 'last')
```

```
t =

    last    GroupCount    mean_nPass
    ____    __________    __________
     1            9291        95.262
     2           11379        111.75
     3           13888        123.74
     4           17016        130.06
     5           20845        135.69
     6           25541        136.47
     7           31030        136.05
     8           38271        133.8
     9           46659        130.33
    10           56701        125.85
    11           69383        119.94
    12           84602        112.84
    13        1.04e+05        106.39
    14      1.2629e+05        98.765
    15      1.5536e+05        90.384
    16      1.8975e+05        82.086
```

Plotting these against the analytical results gives

```
cMean = double(subs(Ek, {'r' 'N' 'k'}, {(1-p)/p N k}));
plot(t.last, t.mean_nPass, 'ro');
hold on
plot(k, cMean, '.-')
grid on
hold off
title 'Simulated vs expected mean number of passes'
xlabel 'Position port received last at the table'
ylabel 'Mean number of passes to reach last position at table'
legend('Simulated results', 'Expected result')
```

And lastly, the difference between the simulated mean and the expected mean is almost zero

```
mean(cMean(:) - t.mean_nPass(:))
```

```
ans = -0.041108
```

The average of the residuals is close to zero showing the simulated data are in agreement with the expression.

Having deduced the expected number of passes needed when the last position is `k`, we can now, finally, compute the expected path length for the table of size `N` because we can simply sum the expected path for each of the possible k positions times the probability that position k is the last.

$$E_N =\sum_{k=1}^{N-1} p_k E_k$$

```
syms k N r
EN = simplify(symsum(pk*Ek, k, 1, N-1), 1000)
```

```
EN = (N^2*(r + 1))/((r^N - 1)*(r - 1)) - ((r + 1)*(N + r - N*r))/(r - 1)^2 + (r*(N - 1)^2*(r + 1))/((r - r^N)*(r - 1))
```

Thankfully this looks a lot simpler than some of the other results earlier.

**Check against a known analytical solution**

Before continuing we can check this result in a couple of ways. Firstly, let's check specifically against the `N==3` case where we already know the analytic solution (see `SymbolicAnalysisFor3People.mlx`)

$$E_3 =\frac{3}{q^2 -q+1}-1$$

Substituting in the value for `N` and the expression for `r` in terms of `q`

```
E3 = simplify(subs(EN, {N, r}, {3 q/(1-q)}), 400)
```

```
E3 = 3/(q^2 - q + 1) - 1
```

Alternatively, we can check the results against some simulations for different values of N

```
p = pToUse;
tableSize = 5:3:37;
sim = NaN(size(tableSize));
parfor ii = 1:numel(tableSize)
    N = tableSize(ii);
    [~,nPass] = portPassingSimulation(1e5, N, p);
    sim(ii) = mean(nPass);
end
plot(tableSize, sim, 'ro')
grid on
hold on
N = min(tableSize):max(tableSize);
expected = double(subs(EN, {'r' 'N'}, {(1-p)/p N}));
plot(N, expected);
hold off
title 'Expected passes to finish random walk'
xlabel 'Number of people at the table'
ylabel 'Number of passes'
legend('Simulated results', 'Expected result', 'Location', 'N')
```

Writing the biased and unbiased results out in full, we get

$$E_N =\left\lbrace \begin{array}{cc}\frac{N(N-1)}{2} & p=q=\frac{1}{2}\\\frac{r+1}{r-1}(\frac{N^2}{r^N -1}-\frac{{(N-1)}^2 }{r^{N-1} -1}+\frac{Nr-N-r}{r-1})& p\ne q\end{array}\right.$$
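As a quick consistency check of the two branches (my own check, not part of the original derivation), the unbiased formula for a table of three gives $E_3 = \frac{3(3-1)}{2} = 3$, which matches the known analytic solution evaluated at $q=\tfrac{1}{2}$:

$$\frac{3}{q^2-q+1}-1\;\bigg|_{q=\frac{1}{2}}=\frac{3}{3/4}-1=4-1=3.$$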

Written this way, it is quite complicated, but there is a symmetry to it: a term in N, the same term in N-1, and one extra term. All in all it is distinctly simpler than the `ER` I printed out further up the page. And with that we have produced analytic solutions for all the values we wanted to investigate - perhaps it is now time to pass the port!

Manipulation of the fairly complicated algebraic expression that resulted from joining all the expected paths with probabilities together was something I thought would be really hard when done by hand. Yet I think you will agree using the Symbolic Math Toolbox has made it tractable. I'm also impressed with its ability to solve recurrence relationships (see the associated files that prove the initial definitions of `pSAB` and `eSAB`).

From the article it might look like I've done all the work here myself, so I'd like to thank both Rick Amos (MathWorks) and Barnaby Martin (Durham University) for helping out with many of the details.

Having seen these related blog posts, what problems might you now attack using simulation to guide your mathematical approach? Can you see opportunities to bring statistics and symbolic manipulation together? Let me know here.

Get the MATLAB code

Published with MATLAB® R2016b

- Introduction
- A simulation approach
- Investigating the final position of the port
- Analytic probability for last position when passing is unbiased
- What happens if there is a biased probability of passing left or right?
- Does this biased simulation agree with theory?
- Who has the most to drink?
- Does the number of guests affect how much each drinks?
- Will the pipe of port run out?
- Predicting the amount of port consumed
- An analytic conjecture
- How does the amount of port consumed depend on the final position?
- And finally

```
nSims = 100;
N = 11;
p = 0.5;
```

To simulate the passes, we generate a matrix of random ±1 moves, one row per simulation:

```
M = 25;
singleMove = 2*(rand(nSims, M) >= p) - 1;
% The first 6 pass directions of 6 simulations
disp(singleMove(1:6, 1:6));
```

```
    -1    -1     1     1     1    -1
    -1    -1     1     1    -1     1
    -1     1    -1     1    -1    -1
    -1     1     1     1    -1    -1
    -1     1    -1     1     1    -1
    -1    -1    -1    -1     1     1
```

By knowing the direction of each pass we can compute the location of the port compared to its starting point by taking the cumulative sum of the directions for each pass. To make this an absolute position we add to it the starting position (in this case position 0). Because we need to account for the decanter moving all the way around the table, we are going to use modulo arithmetic

```
start = 0;
locations = mod(cumsum(singleMove, 2) + start, N);
% The first 6 locations of 6 simulations
disp(locations(1:6, 1:6))
```

```
    10     9    10     0     1     0
    10     9    10     0    10     0
    10     0    10     0    10     9
    10     0     1     2     1     0
    10     0    10     0     1     0
    10     9     8     7     8     9
```

We now have a matrix, `locations`, giving the position of the decanter after each pass in each simulation, so we can check which simulations have finished (i.e. every position has been visited):

```
finished = false(nSims, 1);
for ii = 1:nSims
    finished(ii) = numel(unique(locations(ii, :))) == N;
end
nFinished = sum(finished)
```

```
nFinished = 10
```

For those simulations that have not finished, we know their current location (given by the last column of `locations`), so the walk could be continued until everyone has received the port. That logic is wrapped up in the function `portPassingSimulation`:

```
[last, nPasses] = portPassingSimulation(nSims, N, p);
results = table(last, nPasses);
disp(results(1:10,:))
```

```
    last    nPasses
    ____    _______
      7       215
      7       129
      8       127
      8       127
      3       124
     10       108
      4       102
      8       102
      5       101
      3        98
```

```
last = portPassingSimulation(1e5, N, p);
hist(last, 1:N-1);
grid on
title '10^5 simulations of 11 people around the table'
xlabel 'Last person to receive the port'
ylabel 'Number of times that person is last'
```

From this histogram we feel greatly relieved! It shows that no matter where we sit at the table, the number of times we are the last person to receive the port is the same as everyone else. It does not matter where we sit at the table! This result seems somewhat counterintuitive; surely how far you are from the starting position should affect the likelihood you are last? First let's see if this result is the same for a different table size. We can check by running the simulator with a larger number of people.

```
last = portPassingSimulation(1e5, 41, p);
hist(last, 1:max(last));
grid on
title '10^5 simulations of 41 people around the table'
xlabel 'Last person to receive the port'
ylabel 'Number of times that person is last'
```

There is an intuitive way to describe why the probability of being last is independent of where you sit at the table. To be last, the decanter must have arrived at your left or right, then traversed the whole table without passing via you. No matter where you sit at the table, the probability of that set of actions is the same - and thus the probability of you being last is the same.

For position `k` to be last, the decanter must take one of two routes:

- Starting at position 0 and going to position L (numbered k-1) whilst **not** going to position R (numbered k+1), followed by going from position L to position R whilst **not** going to position k
- Starting at position 0 and going to position R whilst **not** going to position L, followed by going from position R to position L whilst **not** going to position k

```
syms pSAB(S,A,B) k N
pSAB(S,A,B) = (S-B)/(A-B);
L = k-1;
R = k+1-N;
pLastUnbiased = simplify(pSAB(0,L,R)*pSAB(L,R,k) + pSAB(0,R,L)*pSAB(R,L,k-N))
```

```
pLastUnbiased = 1/(N - 1)
```

Note that to ensure that either $A<S<B \mbox{ or } B<S<A$ we sometimes need to undertake a modulo approach to A or B. That is why R is written as `k+1-N`.

```
N = 17;
p = 0.54;
k = 1:N-1;
nSims = 1e5;
last = portPassingSimulation(nSims, N, p);
hist(last, k);
grid on
title 'Biased passing simulation of 17 people around the table'
xlabel 'Last person to receive the port'
ylabel 'Number of times that person is last'
```

It is clear from this histogram that even a small bias in the probability hugely affects the likelihood we will be last. We could undertake studies to decide how many dinner parties we needed to attend before we could reasonably deduce that the decanter was biased, but that is another story!

```
r = (1-p)/p;
pLastBiased = r^N*(1-r)./(r.^k*(r-r^N));
hold on
plot(k, pLastBiased*nSims, 'ro', 'MarkerSize', 8, 'LineWidth', 2);
hold off
```

```
nSims = 3e4;
N = 25;
[~, ~, ~, visitsUnbiased] = portPassingSimulation(nSims, N, 0.5);
[~, ~, ~, visitsBiasedL] = portPassingSimulation(nSims, N, 0.53);
[~, ~, ~, visitsBiasedR] = portPassingSimulation(nSims, N, 0.47);
figure
hold on
plot(1:N-1, visitsUnbiased/nSims, 'o-')
plot(1:N-1, visitsBiasedL/nSims, 'o-')
plot(1:N-1, visitsBiasedR/nSims, 'o-')
hold off
grid on
legend('Unbiased pass', 'Biased left-pass', 'Biased right-pass', 'Location', 'N')
title 'Decanter visits to each guest'
xlabel 'Location at table'
ylabel 'Number of times decanter passes location'
```

From the graph we can see that guests who are close to the initial position of the decanter (location 1) receive and pass the port a little less than twice as often (for this size of table) as the guest opposite them, and that as the passing becomes biased, one side of the table is more likely to receive more wine than the other. We can also see that the unbiased case has the largest number of passes on average, which we can guess by considering that in the case p==1 (or p==0) the number of times each location receives the port is just once!

```
figure
hold on
grid on
tableSizes = [5 10 14 15 20 30 35];
for kk = tableSizes
    [~, ~, ~, visits] = portPassingSimulation(1e4, kk+1, 0.5);
    plot(linspace(-1, 1, kk), kk*visits/sum(visits), '.-')
end
hold off
legend(string(tableSizes), 'Location', 'N')
title 'Relative likelihood of holding the port for a given table size'
xlabel 'Normalized table size'
ylabel 'Relative likelihood of holding decanter'
```

There looks to be a quadratic relationship here; the interested reader is encouraged to explore this and produce a functional form for the relationship as table size changes and to look at how the average number of visits scales as the table size increases.
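As one possible starting point for that exploration (a sketch of my own, assuming the fourth output of `portPassingSimulation` is the per-location visit count, as used above), we could fit a quadratic to the normalized visit curve:

```
kk = 20;
[~, ~, ~, visits] = portPassingSimulation(1e4, kk+1, 0.5);
xNorm = linspace(-1, 1, kk);                        % normalized position around the table
c = polyfit(xNorm, kk*visits(:).'/sum(visits), 2)   % quadratic coefficients
```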

```
nSims = 1e5;
N = 31;
p = 0.5;
binW = 50;
[~, nPasses] = portPassingSimulation(nSims, N, p);
hist(nPasses, 30:binW:max(nPasses));
hold on
grid on
```

If you are interested, it is worth looking at this distribution using Distribution Fitter from the Statistics and Machine Learning Toolbox. It seems to be well fitted using either Lognormal or Birnbaum-Saunders distributions. We can fit either of these using the `fitdist` function:

```
pd = fitdist(nPasses, 'LogNormal')
x = linspace(1, max(nPasses), 200);
plot(x, nSims*binW*pdf(pd, x), 'r');
hold off
title 'Distribution of total passes'
xlabel 'Number of passes to reach all dinner guests'
legend('Simulated PDF', 'Expected PDF for LogNormal Distribution')
```

```
pd =

  LognormalDistribution

  Lognormal distribution
       mu = 5.98543    [5.98189, 5.98897]
    sigma = 0.570894   [0.568403, 0.573407]
```

We can use a probability plot to see how well the LogNormal distribution fits the data. You can see that in the tails the distribution is incorrect.

```
probplot('LogNormal', nPasses)
grid on
```

Both LogNormal and Birnbaum-Saunders over-estimate the number of passes around the mean and underestimate around 1.5 times the mean, so whilst fitting fairly well they don't actually model the distribution correctly. The mean of the fitted and actual distributions is shown below.

```
fittedMean = pd.mean
actualMean = mean(nPasses)
```

```
fittedMean = 467.96
actualMean = 465.29
```

To account for this inability to fit the simulation data we are going to use the empirical mean for the rest of this analysis. If we were more sure that we understood the expected distribution, we would use a fitted distribution.

```
p = 0.5;
tableSize = 2:30;
tToFinish = zeros(size(tableSize));
parfor ii = 1:numel(tableSize)
    [~, nPasses] = portPassingSimulation(1e5, tableSize(ii), p);
    tToFinish(ii) = mean(nPasses);
end
plot(tableSize, tToFinish, 'o')
title 'Mean Passes to Finish vs. Number of Guests'
xlabel 'Number of people at the table'
ylabel 'Mean number of passes to reach all people'
grid on
```

It looks like there is a very strong polynomial relationship between these quantities. What does a polynomial fit of degree 4 and degree 2 look like?

```
f4 = polyfit(tableSize, tToFinish, 4)
f2 = polyfit(tableSize, tToFinish, 2)
```

```
f4 =
   1.8679e-05   -0.0011103      0.52081     -0.63573      0.24387
f2 =
      0.49955     -0.49157    -0.015421
```

Notice how the 3rd and 4th order coefficients of `f4` are very close to zero, and how the coefficients of `f2` are close to `[0.5 -0.5 0]`, that is, to the quadratic $N(N-1)/2$.

```
hold on
x = linspace(min(tableSize), max(tableSize), 100);
plot(x, polyval(f2, x), 'r');
hold off
legend('Measured mean of distribution', 'Fitted quadratic', 'Location', 'NW')
```

```
residuals = zeros(60,1);
parfor ii = 1:60
    % Use table sizes 9:4:29
    N = 9 + 4*mod(ii, 6);
    [~, nPasses] = portPassingSimulation(1e5, N, 0.5);
    eMeanPasses = polyval([0.5 -0.5 0], N);
    residuals(ii) = mean(nPasses) - eMeanPasses;
end
% Are the residuals drawn from a normal distribution with zero mean? We can
% use a t-test to look at this
[H1, pVal1, CI, stats] = ttest(residuals)
```

```
H1 = 0
pVal1 = 0.32436
CI =
   -0.18322     0.061617
stats =
  struct with fields:
    tstat: -0.99384
       df: 59
       sd: 0.47389
```

The output (`H1 = 0`) indicates that the t-test does not reject the hypothesis that the residuals are drawn from a zero-mean normal distribution.

```
pd = fitdist(residuals, 'tLocationScale');
probplot(pd, residuals)
grid on
title 'Probability plot with t Location Scale distribution'
```

Finally, we could test the likelihood that the data was drawn from that fitted distribution, but with zero mean, using a chi-squared goodness of fit test

```
pd.mu = 0;
[H2, pVal2] = chi2gof(residuals, 'cdf', pd)
```

```
H2 = 0
pVal2 = 0.57262
```

That fits reasonably well, indicating that our conjecture is consistent with the data. See my next blog post for an analytic proof of this conjecture. So we can now finally answer the question about the pipe of port running out, since we can predict the mean number of passes for a given table size.

```
% 1008 Imperial Pints in a Pipe and 2 fl.oz. glass
% and 20 fl.oz. in Imperial Pint
pintsInPipeOfPort = 1008;
maxPasses = pintsInPipeOfPort*20/2;
N = sym('N');
assume(N > 0)
% Use the Symbolic Math Toolbox to solve the equation for N
double(solve( N*(N-1)/2 == maxPasses, N ))
```

```
ans = 142.49
```

This tells us that with a table size of 142, less than half of her dinners will run out of port!

```
N = 17;
p = 0.5;
[last, nPasses] = portPassingSimulation(5e5, N, p);
```

Next, look at the mean of the number of passes, but group those means by the possible last position. There is a really easy way to do this using the `varfun` function with a grouping variable:

```
G = varfun(@mean, table(last, nPasses), 'GroupingVariable', 'last')
```

```
G =

    last    GroupCount    mean_nPasses
    ____    __________    ____________
     1        31392          100.71
     2        31392          114.98
     3        31256          127.18
     4        31205          137.1
     5        31209          144.15
     6        31254          151.21
     7        31255          155.31
     8        31429          156.65
     9        30950          156.79
    10        31405          155.29
    11        31182          150.78
    12        31333          144.77
    13        31276          136.87
    14        31015          127.2
    15        31313          115.35
    16        31134          100.88
```

Plotting the mean passes we can see that positions closest to zero have the shortest paths and those furthest away the longest.

```
plot(G.last, G.mean_nPasses, 'o')
title 'Expected passes given position that is last'
xlabel 'Position that receives port last'
ylabel 'Expected number of passes'
grid on
```

Finally, looking at what happens if the situation becomes biased, we probably expect that the position with the longest expected path will shift to one side (i.e. closer to the start position). In fact, thinking about the limit where p approaches 1, we should expect that the path length to position N-1 simply becomes N (since the decanter will pass from k to k+1 with probability close to 1). The path length for N-2 will probably be very close to that for N-1, with a deviation to the right somewhere around the table (with close to zero probability). Thus for p > 0.5 we expect the peak in the path length to move towards position zero, and the expected path length for k = N-1 to be shorter than for k = 1.

```
N = 17;
p = 0.57;
[last, nPasses] = portPassingSimulation(5e5, N, p);
G = varfun(@mean, table(last, nPasses), 'GroupingVariable', 'last');
plot(G.last, G.mean_nPasses, 'o')
title 'Expected passes given position that is last (biased passing)'
xlabel 'Position that receives port last'
ylabel 'Expected number of passes'
grid on
```

Get the MATLAB code

Published with MATLAB® R2016b


Social media has become an important part of modern life, and Twitter is again a center of focus in recent events. Today's guest blogger, Toshi Takeuchi, gives us an update on how you can use MATLAB to analyze a Twitter feed.

- Twitter Revisited
- Load Tweets
- Short Urls
- Tokenize Tweets
- Sentiment Analysis
- What Words Appear Frequently in Tweets?
- What Hashtags Appear Frequently in Tweets?
- Who Got Frequent Mentions in Tweets?
- Frequently Cited Web Sites
- Frequently Cited Sources
- Generating a Social Graph
- Handling Mentions
- Creating the Edge List
- Creating the Graph
- Zooming into the Largest Subgraph
- Using Twitty
- Twitter Search API Example
- Twitter Trending Topic API Example
- Twitter Streaming API Example
- Summary - Visit Andy's Developer Zone for More

When I wrote about analyzing Twitter with MATLAB back in 2014, I didn't expect that 3 years later Twitter would come to play such a huge role in politics. There have been a lot of changes in MATLAB in those years as well, so perhaps it is time to revisit this topic. We have heard a lot about fake news since the US Presidential Election of 2016. Let's use Twitter to analyze this phenomenon. While fake news spreads mainly on Facebook, Twitter is the favorite social media platform for the journalists who discuss it.

I collected 1,000 tweets that contain the term 'fake news' using the Streaming API and saved them in `fake_news.mat`. Let's start processing tweets by looking at the top 10 users based on the followers count.

```
load fake_news                                            % load data
t = table;                                                % initialize a table
t.names = arrayfun(@(x) x.status.user.name, ...           % get user names
    fake_news.statuses, 'UniformOutput', false);
t.names = regexprep(t.names,'[^a-zA-Z .,'']','');         % remove non-ascii
t.screen_names = arrayfun(@(x) ...                        % get screen names
    x.status.user.screen_name, fake_news.statuses, 'UniformOutput', false);
t.followers_count = arrayfun(@(x) ...                     % get followers count
    x.status.user.followers_count, fake_news.statuses);
t = unique(t,'rows');                                     % remove duplicates
t = sortrows(t,'followers_count', 'descend');             % rank users
disp(t(1:10,:))                                           % show the table
```

```
names                   screen_names         followers_count
____________________    _________________    _______________
'Glenn Greenwald'       'ggreenwald'              7.9605e+05
'Soledad O'Brien'       'soledadobrien'           5.6769e+05
'Baratunde'             'baratunde'               2.0797e+05
'Kenneth Roth'          'KenRoth'                 1.9189e+05
'Stock Trade Alerts'    'AlertTrade'              1.1921e+05
'SokoAnalyst'           'SokoAnalyst'             1.1864e+05
'Tactical Investor'     'saul42'                       98656
'Vladimir Bajic'        'trend_auditor'                70502
'Marketing Gurus'       'MarketingGurus2'              68554
'Jillian C. York '      'jilliancyork'                 53744
```

Until recently, Twitter had a 140-character limit per tweet, including links. Therefore, when people embedded urls in their tweets, they typically used url shortening services. To identify the actual sources, we need to get the expanded urls that those short urls point to. To do this, I wrote a utility function `expandUrl` that takes advantage of the new HTTP interface introduced in R2016b. You can see that I create URI and RequestMessage objects and use the `send` method to get a ResponseMessage object.

```
dbtype expandUrl 25:32
```

```
25    import matlab.net.* matlab.net.http.*             % http interface libs
26    for ii = 1:length(urls)                           % for each url
27        if contains(urls(ii),shorteners)              % if shortened
28            uri = URI(urls(ii));                      % create URI obj
29            r = RequestMessage;                       % request object
30            options = HTTPOptions('MaxRedirects',0);  % prevent redirect
31            try                                       % try
32                response = r.send(uri,options);       % send http request
```

Let's give it a try.

```
expanded = char(expandUrl('http://trib.al/ZQuUDNx'));     % expand url
disp([expanded(1:70) '...'])
```

```
https://hbr.org/2017/01/the-u-s-medias-problems-are-much-bigger-than-f...
```
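The core of the expansion is a single request with redirects disabled, after which the redirect target can be read from the Location header. A minimal standalone sketch of that step (the real `expandUrl` adds looping over shorteners and error handling) might look like:

```matlab
% Sketch: expand one short URL by sending a request with redirects
% disabled and reading the Location header from the 301/302 response.
import matlab.net.* matlab.net.http.*
uri = URI('http://trib.al/ZQuUDNx');             % a short URL
r = RequestMessage;                              % default GET request
options = HTTPOptions('MaxRedirects',0);         % do not follow redirects
response = r.send(uri,options);                  % get the redirect response
loc = response.getFields('Location');            % redirect target, if any
if ~isempty(loc)
    disp(loc.Value)                              % the expanded URL
end
```

Disabling redirects is what makes this work: if the client followed the redirect automatically, we would only ever see the final page, not the intermediate Location header we want to record.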

To get a sense of what was being discussed in those tweets and what sentiments were represented there, we need to process the text.

- Our first step is to turn tweets into tokens.
- Once we have tokens, we can use them to compute sentiment scores based on lexicons like AFINN.
- You can also use them to visualize tweets as a word cloud.

We also want to collect embedded links along the way.
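To make the lexicon scoring step concrete, here is a toy example with a hand-made three-term lexicon (the terms and scores are illustrative, not the real AFINN-111 entries): look up each token in the lexicon and sum the scores of the matches.

```matlab
% Toy lexicon scoring: sum the scores of tokens found in the lexicon.
% The terms and scores below are made up for illustration.
Term  = {'good'; 'bad'; 'awful'};
Score = [3; -3; -4];
lex = table(Term, Score);                        % tiny stand-in lexicon
tokens = {'this','news','is','awful'};           % tokenized toy tweet
score = sum(lex.Score(ismember(lex.Term, tokens)));
disp(score)                                      % -4 for this toy lexicon
```

A negative total marks the tweet as negative; the same `ismember`-and-sum pattern appears in the real preprocessing loop below.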

```
delimiters = {' ','$','/','.','-',':','&','*', ...        % remove these
    '+','=','[',']','?','!','(',')','{','}',',', ...
    '"','>','_','<',';','%',char(10),char(13)};
AFINN = readtable('AFINN/AFINN-111.txt', ...              % load score file
    'Delimiter','\t','ReadVariableNames',0);
AFINN.Properties.VariableNames = {'Term','Score'};        % add var names
stopwordsURL = 'http://www.textfixer.com/resources/common-english-words.txt';
stopWords = webread(stopwordsURL);                        % read stop words
stopWords = split(string(stopWords),',');                 % split stop words
tokens = cell(fake_news.tweetscnt,1);                     % cell array as accumulator
expUrls = strings(fake_news.tweetscnt,1);                 % string array as accumulator
dispUrls = strings(fake_news.tweetscnt,1);                % string array as accumulator
scores = zeros(fake_news.tweetscnt,1);                    % initialize accumulator
for ii = 1:fake_news.tweetscnt                            % loop over tweets
    tweet = string(fake_news.statuses(ii).status.text);   % get tweet
    s = split(tweet, delimiters)';                        % split tweet by delimiters
    s = lower(s);                                         % use lowercase
    s = regexprep(s, '[0-9]+','');                        % remove numbers
    s = regexprep(s,'(http|https)://[^\s]*','');          % remove urls
    s = erase(s,'''s');                                   % remove possessive s
    s(s == '') = [];                                      % remove empty strings
    s(ismember(s, stopWords)) = [];                       % remove stop words
    tokens{ii} = s;                                       % add to the accumulator
    scores(ii) = sum(AFINN.Score(ismember(AFINN.Term,s)));% compute sentiment score
    if ~isempty( ...                                      % if display_url exists
            fake_news.statuses(ii).status.entities.urls) && ...
            isfield(fake_news.statuses(ii).status.entities.urls,'display_url')
        durl = fake_news.statuses(ii).status.entities.urls.display_url;
        durl = regexp(durl,'^(.*?)\/','match','once');    % get its domain name
        dispUrls(ii) = durl(1:end-1);                     % add to dispUrls
        furl = fake_news.statuses(ii).status.entities.urls.expanded_url;
        furl = expandUrl(furl,'RemoveParams',1);          % expand links
        expUrls(ii) = expandUrl(furl,'RemoveParams',1);   % one more time
    end
end
```

Now we can create the document term matrix. We will also do the same thing for embedded links.

```
dict = unique([tokens{:}]);                               % unique words
domains = unique(dispUrls);                               % unique domains
domains(domains == '') = [];                              % remove empty string
links = unique(expUrls);                                  % unique links
links(links == '') = [];                                  % remove empty string
DTM = zeros(fake_news.tweetscnt,length(dict));            % Doc Term Matrix
DDM = zeros(fake_news.tweetscnt,length(domains));         % Doc Domain Matrix
DLM = zeros(fake_news.tweetscnt,length(links));           % Doc Link Matrix
for ii = 1:fake_news.tweetscnt                            % loop over tokens
    [words,~,idx] = unique(tokens{ii});                   % get unique words
    wcounts = accumarray(idx, 1);                         % get word counts
    cols = ismember(dict, words);                         % find cols for words
    DTM(ii,cols) = wcounts;                               % update DTM with word counts
    cols = ismember(domains,dispUrls(ii));                % find col for domain
    DDM(ii,cols) = 1;                                     % increment DDM
    expanded = expandUrl(expUrls(ii));                    % expand links
    expanded = expandUrl(expanded);                       % one more time
    cols = ismember(links,expanded);                      % find col for link
    DLM(ii,cols) = 1;                                     % increment DLM
end
DTM(:,ismember(dict,{'#','@'})) = [];                     % remove # and @
dict(ismember(dict,{'#','@'})) = [];                      % remove # and @
```

One of the typical analyses you perform on a Twitter feed is sentiment analysis. The histogram shows, not surprisingly, that these tweets were mostly very negative. We can summarize this with the Net Sentiment Rate (NSR), which is based on the ratio of positive tweets to negative tweets.

```
NSR = (sum(scores >= 0) - sum(scores < 0)) / length(scores); % net sentiment rate
figure                                                    % new figure
histogram(scores,'Normalization','probability')           % score distribution
line([0 0], [0 .35],'Color','r');                         % reference line
title(['Sentiment Score Distribution of "Fake News" ' ... % add title
    sprintf('(NSR: %.2f)',NSR)])
xlabel('Sentiment Score')                                 % x-axis label
ylabel('% Tweets')                                        % y-axis label
yticklabels(string(0:5:35))                               % y-axis ticks
text(-10,.25,'Negative');text(3,.25,'Positive');          % annotate
```

Now let's plot the word frequency to visualize what was discussed in those tweets. They seem to be about dominant news headlines at the time the tweets were collected.

```
count = sum(DTM);                                         % get word count
labels = erase(dict(count >= 40),'@');                    % high freq words
pos = [find(count >= 40);count(count >= 40)] + 0.1;       % x y positions
figure                                                    % new figure
scatter(1:length(dict),count)                             % scatter plot
text(pos(1,1),pos(2,1)+3,cellstr(labels(1)),...           % place labels
    'HorizontalAlignment','center');
text(pos(1,2),pos(2,2)-2,cellstr(labels(2)),...
    'HorizontalAlignment','right');
text(pos(1,3),pos(2,3)-4,cellstr(labels(3)));
text(pos(1,3:end),pos(2,3:end),cellstr(labels(3:end)));
title('Frequent Words in Tweets Mentioning Fake News')    % add title
xlabel('Indices')                                         % x-axis label
ylabel('Count')                                           % y-axis label
ylim([0 150])                                             % y-axis range
```

Hashtags that start with "#" are often used to identify the main theme of tweets, and we see those related to the dominant news again as you would expect.

```
is_hash = startsWith(dict,'#') & dict ~= '#';             % get indices
hashes = erase(dict(is_hash),'#');                        % get hashtags
hash_count = count(is_hash);                              % get count
labels = hashes(hash_count >= 4);                         % high freq tags
pos = [find(hash_count >= 4) + 1; ...                     % x y positions
    hash_count(hash_count >= 4) + 0.1];
figure                                                    % new figure
scatter(1:length(hashes),hash_count)                      % scatter plot
text(pos(1,1),pos(2,1)-.5,cellstr(labels(1)),...          % place labels
    'HorizontalAlignment','center');
text(pos(1,2:end-1),pos(2,2:end-1),cellstr(labels(2:end-1)));
text(pos(1,end),pos(2,end)-.5,cellstr(labels(end)),...
    'HorizontalAlignment','right');
title('Frequently Used Hashtags')                         % add title
xlabel('Indices')                                         % x-axis label
ylabel('Count')                                           % y-axis label
ylim([0 15])                                              % y-axis range
```

Twitter is also a communication medium, and people can direct their tweets to specific users by including their screen names in the tweets, prefixed with "@". These are called "mentions". We can see there is one particular user who got a lot of mentions.

```
is_ment = startsWith(dict,'@') & dict ~= '@';             % get indices
mentions = erase(dict(is_ment),'@');                      % get mentions
ment_count = count(is_ment);                              % get count
labels = mentions(ment_count >= 10);                      % high freq mentions
pos = [find(ment_count >= 10) + 1; ...                    % x y positions
    ment_count(ment_count >= 10) + 0.1];
figure                                                    % new figure
scatter(1:length(mentions),ment_count)                    % scatter plot
text(pos(1,:),pos(2,:),cellstr(labels));                  % place labels
title('Frequent Mentions')                                % add title
xlabel('Indices')                                         % x-axis label
ylabel('Count')                                           % y-axis label
ylim([0 100])                                             % y-axis range
```

You can also embed a link in a tweet, usually for citing sources and directing people to get more details from those sources. This tends to show where the original information came from.

Twitter was the most frequently cited source. This was interesting to me. Usually, if you want to cite other tweets, you retweet them. When you retweet, the original user gets a credit. By embedding the link without retweeting it, people circumvent this mechanism. Very curious.

```
count = sum(DDM);                                         % get domain count
labels = domains(count > 5);                              % high freq citations
pos = [find(count > 5) + 1;count(count > 5) + 0.1];       % x y positions
figure                                                    % new figure
scatter(1:length(domains),count)                          % scatter plot
text(pos(1,:),pos(2,:),cellstr(labels));                  % place labels
title('Frequently Cited Web Sites')                       % add title
xlabel('Indices')                                         % x-axis label
ylabel('Count')                                           % y-axis label
```

You can also see that many of the web sites are for url shortening services. Let's find out the real urls linked from those short urls.

```
count = sum(DLM);                                         % get link count
labels = links(count >= 15);                              % high freq citations
pos = [find(count >= 15) + 1;count(count >= 15)];         % x y positions
figure                                                    % new figure
scatter(1:length(links),count)                            % scatter plot
text(ones(size(pos(1,:))),pos(2,:)-2,cellstr(labels));    % place labels
title('Frequently Cited Sources')                         % add title
xlabel('Indices')                                         % x-axis label
ylabel('Count')                                           % y-axis label
```

Now let's think of a way to see the association between users and the entities included in their tweets, to reveal their relationships. We have a matrix of words by tweets, and we can convert it into a matrix of users vs. entities, such as hashtags, mentions, and links.

```
users = arrayfun(@(x) x.status.user.screen_name, ...      % screen names
    fake_news.statuses, 'UniformOutput', false);
uniq = unique(users);                                     % remove duplicates
combo = [DTM DLM];                                        % combine matrices
UEM = zeros(length(uniq),size(combo,2));                  % User Entity Matrix
for ii = 1:length(uniq)                                   % for each unique user
    UEM(ii,:) = sum(combo(ismember(users,uniq(ii)),:),1); % sum cols
end
cols = is_hash | is_ment;                                 % hashtags, mentions
cols = [cols true(1,length(links))];                      % add links
UEM = UEM(:,cols);                                        % select those cols
ent = dict(is_hash | is_ment);                            % select entities
ent = [ent links'];                                       % add links
```

Some of the mentions are for users of those tweets, and others are not. When one user mentions another, that forms a user-user edge rather than a user-entity edge. To map such edges correctly, we want to treat mentioned users separately.

```
ment_users = uniq(ismember(uniq,mentions));               % mentioned users
is_ment = ismember(ent,'@' + string(ment_users));         % their mentions
ent(is_ment) = erase(ent(is_ment),'@');                   % remove @
UUM = zeros(length(uniq));                                % User User Matrix
for ii = 1:length(ment_users)                             % for each ment user
    row = string(uniq) == ment_users{ii};                 % get row
    col = ent == ment_users{ii};                          % get col
    UUM(row,ii) = UEM(row,col);                           % copy count
end
```

Now we can add the user to user matrix to the existing user to entity matrix, but we also need to remove the mentioned users from entities since they are already included in the user to user matrix.

All we need to do then is to turn that into a sparse matrix and find indices of non zero elements. We can then use those indices as the edge list.

```
UEM(:,is_ment) = [];                                      % remove mentioned users
UEM = [UUM, UEM];                                         % add UUM to adj
nodes = [uniq; cellstr(ent(~is_ment))'];                  % create node list
s = sparse(UEM);                                          % sparse matrix
[i,j,s] = find(s);                                        % find indices
```

Once you have the edge list, it is a piece of cake to make a social graph from it. Since our relationships have directions (user --> entity), we will create a directed graph with `digraph`. The nodes are sized and colored based on the number of incoming edges, called in-degrees. As you can see, most tweets are disjoint, but we see some large clusters of tweets.

```
G = digraph(i,j);                                         % directed graph
G.Nodes.Name = nodes;                                     % add node names
figure                                                    % new figure
colormap cool                                             % set color map
deg = indegree(G);                                        % get indegrees
markersize = log(deg + 2) * 2;                            % indeg for marker size
plot(G,'MarkerSize',markersize,'NodeCData',deg)           % plot graph
labels = colorbar; labels.Label.String = 'Indegrees';     % add colorbar
title('Graph of Tweets containing "Fake News"')           % add title
xticklabels(''); yticklabels('');                         % hide tick labels
```

Let's zoom into the largest subgraph to see the details. This gives a much clearer idea about what those tweets were about because you see who was mentioned and what sources were cited. You can see a New York Times opinion column and an article from Sweden generated a lot of tweets along with those who were mentioned in those tweets.

```
bins = conncomp(G,'OutputForm','cell','Type','weak');     % get connected comps
binsizes = cellfun(@length,bins);                         % get bin sizes
[~,idx] = max(binsizes);                                  % find biggest comp
subG = subgraph(G,bins{idx});                             % create sub graph
figure                                                    % new figure
colormap cool                                             % set color map
deg = indegree(subG);                                     % get indegrees
markersize = log(deg + 2) * 2;                            % indeg for marker size
h = plot(subG,'MarkerSize',markersize,'NodeCData',deg);   % plot graph
c = colorbar; c.Label.String = 'In-degrees';              % add colorbar
title('The Largest Subgraph (Close-up)')                  % add title
xticklabels(''); yticklabels('');                         % hide tick labels
[~,rank] = sort(deg,'descend');                           % get ranking
top15 = subG.Nodes.Name(rank(1:15));                      % get top 15
labelnode(h,top15,top15);                                 % label nodes
axis([-.5 2.5 -1.6 -0.7]);                                % define axis limits
```

If you want to analyze Twitter for different topics, you need to collect your own tweets. For this analysis I used Twitty by Vladimir Bondarenko. It hasn't been updated since July 2013, but it still works. Let's go over how you use Twitty. I am assuming that you already have your developer credentials and have downloaded Twitty into your current folder. The workspace variable `creds` should contain your credentials in a struct in the following format:

```
creds = struct;                                           % example
creds.ConsumerKey = 'your consumer key';
creds.ConsumerSecret = 'your consumer secret';
creds.AccessToken = 'your token';
creds.AccessTokenSecret = 'your token secret';
```

Twitty by default expects the JSON Parser by Joel Feenstra. However, I would like to use the new built-in functions `jsonencode` and `jsondecode`, introduced in R2016b, instead. To suppress the warning Twitty generates, I will use `warning`.

```
warning('off')                                            % turn off warning
addpath twitty_1.1.1;                                     % add Twitty folder to the path
load creds                                                % load my real credentials
tw = twitty(creds);                                       % instantiate a Twitty object
warning('on')                                             % turn on warning
```

Since Twitty returns JSON as plain text if you don't specify the parser, you can use `jsondecode` once you get the output from Twitty. The number of tweets you can get from the Search API is limited to 100 per request. If you need more, you usually use the Streaming API.

```
keyword = 'nfl';                                          % keyword to search
tweets = tw.search(keyword,'count',100,'include_entities','true','lang','en');
tweets = jsondecode(tweets);                              % parse JSON
tweet = tweets.statuses{1}.text;                          % index into text
disp([tweet(1:70) '...'])                                 % show 70 chars
```

```
RT @JBaezaTopDawg: .@NFL will be announcing a @Patriots v @RAIDERS mat...
```

If you want a high-volume topic with thousands of tweets, one way to find one is to check the trending topics. Those topics will give you plenty of tweets to work with.

```
us_woeid = 23424977;                                      % US as location
us_trends = tw.trendsPlace(us_woeid);                     % get trending topics
us_trends = jsondecode(us_trends);                        % parse JSON
trends = arrayfun(@(x) x.name, us_trends.trends, 'UniformOutput',false);
disp(trends(1:10))
```

```
'Beyoncé'
'Rex Tillerson'
'#NSD17'
'#PressOn'
'DeVos'
'Roger Goodell'
'#nationalsigningday'
'Skype'
'#wednesdaywisdom'
'#MyKindOfPartyIncludes'
```

Once you find a high-volume topic to work with, you can use the Streaming API to get tweets that contain it. Twitty stores the retrieved tweets in the `'data'` property. What you save is defined in an output function like `saveTweets.m`. `S` in this case will be a character array of JSON-formatted text, and we need to use `jsondecode` to convert it into a struct since we didn't specify the JSON parser.

```
dbtype twitty_1.1.1/saveTweets.m 17:24
```

```
17    % Parse input:
18    S = jsondecode(S);
19
20    if length(S) == 1 && isfield(S, 'statuses')
21        T = S{1}.statuses;
22    else
23        T = S;
24    end
```

Now let's give it a try. By default, Twitty will get 20 batches of 1000 tweets = 20,000 tweets, but that will take a long time. We will just get 10 tweets in this example.

```
keyword = 'nfl';                                          % specify keyword
tw.outFcn = @saveTweets;                                  % output function
tw.sampleSize = 10;                                       % default 1000
tw.batchSize = 1;                                         % default 20
tw.filterStatuses('track',keyword);                       % Streaming API call
result = tw.data;                                         % save the data
length(result.statuses)                                   % number of tweets
tweet = result.statuses(1).status.text;                   % get a tweet
disp([tweet(1:70) '...'])                                 % show 70 chars
```

```
Tweets processed: 1 (out of 10).
Tweets processed: 2 (out of 10).
Tweets processed: 3 (out of 10).
Tweets processed: 4 (out of 10).
Tweets processed: 5 (out of 10).
Tweets processed: 6 (out of 10).
Tweets processed: 7 (out of 10).
Tweets processed: 8 (out of 10).
Tweets processed: 9 (out of 10).
Tweets processed: 10 (out of 10).
ans = 10
RT @Russ_Mac876: Michael Jackson is still the greatest https://t.co/BE...
```

In this post you saw how you can analyze tweets using the more recent features in MATLAB, such as the HTTP interface to expand short urls. You also got a quick tutorial on how to use Twitty to collect tweets for your own purpose.

Twitty covers your basic needs. But you can go beyond Twitty and roll your own tool by taking advantage of the new HTTP interface. I show you how in a second blog post I wrote for Andy's Developer Zone.

Now that you understand how you can use Twitter to analyze social issues like fake news, tell us how you would put it to good use here.

Get the MATLAB code

Published with MATLAB® R2016b


Valentine's Day is fast approaching, and those who are in a relationship might start thinking about plans. For those who are not as lucky, read on! Today's guest blogger, Toshi Takeuchi, explores how you can be successful at speed dating events through data.

- Speed Dating Dataset
- What Participants Looked For In the Opposite Sex
- Were They Able to Find Matches?
- Do You Get More Matches If You Make More Requests?
- Decision to Say Yes for Second Date
- Factors Influencing the Decision
- Validating the Model with the Holdout Set
- Relative Attractiveness
- Are We Good at Assessing Our Own Attractiveness?
- Attractiveness is in the Eye of the Beholder
- Modesty is the Key to Success
- Summary

I recently came across an interesting Kaggle dataset Speed Dating Experiment - What attributes influence the selection of a romantic partner?. I never experienced speed dating, so I got curious.

The data comes from a series of heterosexual speed dating experiments at Columbia University from 2002-2004. In these experiments, you met each of your opposite-sex participants for four minutes. The number of first dates varied by event - on average there were 15, but there could be as few as 5 or as many as 22. Then you were asked if you would like to meet any of them again. You also provided ratings on six attributes about your dates:

- Attractiveness
- Sincerity
- Intelligence
- Fun
- Ambition
- Shared Interests

The dataset also includes participants' preferences on those attributes at various points in the process, along with other demographic information.

```
opts = detectImportOptions('Speed Dating Data.csv');      % set import options
opts.VariableTypes([9,38,50:end]) = {'double'};           % treat as double
opts.VariableTypes([35,49]) = {'categorical'};            % treat as categorical
csv = readtable('Speed Dating Data.csv', opts);           % import data
csv.tuition = str2double(csv.tuition);                    % convert to double
csv.zipcode = str2double(csv.zipcode);                    % convert to double
csv.income = str2double(csv.income);                      % convert to double
```

The participants filled out survey questions when they signed up, including what they were looking for in the opposite sex and what they thought others of their own gender looked for. If you take the mean ratings and then subtract the self-rating from the peer rating, you see that participants thought others were more into looks while they were also into sincerity and intelligence. Sounds a bit biased. We will see how they actually made decisions.

```
vars = csv.Properties.VariableNames;                      % get var names
[G, res] = findgroups(csv(:,{'iid','gender'}));           % group by id and gender
[~,idx,~] = unique(G);                                    % get unique indices
pref = table2array(csv(idx,contains(vars,'1_1')));        % subset pref as array
pref(isnan(pref)) = 0;                                    % replace NaN with 0
pref1_1 = pref ./ sum(pref,2) * 100;                      % convert to 100-pt alloc
pref = table2array(csv(idx,contains(vars,'4_1')));        % subset pref as array
pref(isnan(pref)) = 0;                                    % replace NaN with 0
pref4_1 = pref ./ sum(pref,2) * 100;                      % convert to 100-pt alloc
labels = {'attr','sinc','intel','fun','amb','shar'};      % attributes
figure                                                    % new figure
b = bar([mean(pref4_1(res.gender == 1,:),'omitnan') - ... % bar plot
    mean(pref1_1(res.gender == 1,:),'omitnan'); ...
    mean(pref4_1(res.gender == 0,:),'omitnan') - ...
    mean(pref1_1(res.gender == 0,:),'omitnan')]');
b(1).FaceColor = [0 .45 .74]; b(2).FaceColor = [.85 .33 .1]; % change face color
b(1).FaceAlpha = 0.6; b(2).FaceAlpha = 0.6;               % change face alpha
title('What Peers Look For More Than You Do in the Opposite Sex?') % add title
xticklabels(labels)                                       % label bars
ylabel('Differences in Mean Ratings - Peers vs Self')     % y axis label
legend('Men','Women')                                     % add legend
```

Let's first find out how successful those speed dating events were. If both you and your partner request another date after the first one, then you have a match. What percentage of initial dates resulted in matches?

- Most people found matches - the median match rate was around 13%
- A few people were extremely successful, getting more than an 80% match rate
- About 17-19% of men and women found no matches, unfortunately
- All in all, it looks like these speed dating events delivered the promised results

```
res.rounds = splitapply(@numel, G, G);                    % # initial dates
res.matched = splitapply(@sum, csv.match, G);             % # matches
res.matched_r = res.matched ./ res.rounds;                % match rate
edges = [0 0.4 1:9]./ 10;                                 % bin edges
figure                                                    % new figure
histogram(res.matched_r(res.gender == 1), edges,...       % histogram of
    'Normalization','probability')                        % male match rate
hold on                                                   % overlay another plot
histogram(res.matched_r(res.gender == 0), edges,...       % histogram of
    'Normalization','probability')                        % female match rate
hold off                                                  % stop overlay
title('What Percentage of the First Dates Resulted in Matches?') % add title
xlabel(sprintf('%% Matches (Median - Men %.1f%%, Women %.1f%%)', ... % x-axis label
    median(res.matched_r(res.gender == 1))*100, ...       % median men
    median(res.matched_r(res.gender == 0))*100))          % median women
xticklabels(string(0:10:90))                              % use percentage
xlim([-0.05 0.95])                                        % x-axis range
ylabel('% Participants')                                  % y-axis label
yticklabels(string(0:5:30))                               % use percentage
legend('Men','Women')                                     % add legend
text(-0.04,0.21,'0 matches')                              % annotate
```

In order to make a match, you need to request another date and have that request accepted. This means some people who got a very high match rate must have requested a second date with almost everyone they met and had the favor returned. Does that mean people who made fewer matches were more picky and didn't request another date as often as those who were more successful? Let's plot the request rate vs. match rate - if they correlate, we should see a diagonal line!

- You can see some correlation below a 50% request rate - particularly for women. The more requests they make, the more matches they seem to get, to a point
- There is a clear gender gap in request rate - women tend to make fewer requests - the median for men is 44% vs. 37% for women
- If you request everyone, your mileage varies - you may still get no matches. In the end, you only get matches if your requests are accepted

```
res.requests = splitapply(@sum, csv.dec, G);              % # requests
res.request_r = res.requests ./ res.rounds;               % request rate
figure                                                    % new figure
subplot(2,1,1)                                            % add subplot
scatter(res.request_r(res.gender == 1), ...               % scatter plot male
    res.matched_r(res.gender == 1),'filled','MarkerFaceAlpha', 0.6)
hold on                                                   % overlay another plot
scatter(res.request_r(res.gender == 0), ...               % scatter plot female
    res.matched_r(res.gender == 0),'filled','MarkerFaceAlpha', 0.6)
r = refline(1,0); r.Color = 'r'; r.LineStyle = ':';       % reference line
hold off                                                  % stop overlay
title('Do You Get More Matches If You Ask More?')         % add title
xlabel('% Second Date Requests')                          % x-axis label
xticklabels(string(0:10:100))                             % use percentage
ylabel('% Matches')                                       % y-axis label
yticklabels(string(0:50:100))                             % use percentage
legend('Men','Women','Location','NorthWest')              % add legend
subplot(2,1,2)                                            % add subplot
histogram(res.request_r(res.gender == 1),...              % histogram of
    'Normalization','probability')                        % male request rate
hold on                                                   % overlay another plot
histogram(res.request_r(res.gender == 0),...              % histogram of
    'Normalization','probability')                        % female request rate
hold off                                                  % stop overlay
title('Do Women Make Fewer Requests Than Men?')           % add title
xlabel(sprintf('%% Second Date Requests (Median - Men %.1f%%, Women %.1f%%)', ...
    median(res.request_r(res.gender == 1))*100, ...       % median men
    median(res.request_r(res.gender == 0))*100))          % median women
xticklabels(string(0:10:100))                             % use percentage
ylabel('% Participants')                                  % y-axis label
yticklabels(string(0:5:20))                               % use percentage
legend('Men','Women')                                     % add legend
```

Most people are more selective about who they go out with for a second date. The two most obvious factors in such a decision are how much they liked the person they met and how likely they think they are to get a "yes" from that person. There is no point in asking someone for a second date, no matter how much you like him or her, when you feel there is no chance of getting a yes. You can see a fairly good separation between Yes and No using those two factors.

Please note that I normalized the ratings for "like" and "probability" (of getting yes) because the actual rating scale differs from person to person. You subtract the mean from the data to center the data around zero and then divide by the standard deviation to normalize the value range.
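As a tiny numeric illustration of that normalization (toy numbers, not the survey data):

```matlab
x = [5 7 9];                    % toy ratings
z = (x - mean(x)) ./ std(x);    % center at zero, scale by standard deviation
disp(z)                         % [-1 0 1]
```

The mean of `x` is 7 and its (sample) standard deviation is 2, so the ratings map to -1, 0, and 1, putting every participant's ratings on a comparable scale.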

```
func = @(x) {(x - mean(x))./std(x)};                      % normalize func
f = csv.like;                                             % get like
f(isnan(f)) = 0;                                          % replace NaN with 0
f = splitapply(func, f, G);                               % normalize by group
like = vertcat(f{:});                                     % add normalized like
f = csv.prob;                                             % get prob
f(isnan(f)) = 0;                                          % replace NaN with 0
f = splitapply(func, f, G);                               % normalize by group
prob = vertcat(f{:});                                     % add normalized prob
figure                                                    % new figure
subplot(1,2,1)                                            % add subplot
gscatter(like(csv.gender == 1), ...                       % scatter plot male
    prob(csv.gender == 1),csv.dec(csv.gender == 1),'rb','o')
title('Second Date Decision - Men')                       % add title
xlabel('Like, normalized')                                % x-axis label
ylabel('Prob of Get "Yes", normalized')                   % y-axis label
ylim([-5 5])                                              % y-axis range
legend('No','Yes')                                        % add legend
subplot(1,2,2)                                            % add subplot
gscatter(like(csv.gender == 0), ...                       % scatter plot female
    prob(csv.gender == 0),csv.dec(csv.gender == 0),'rb','o')
title('Second Date Decision - Women')                     % add title
xlabel('Like, normalized')                                % x-axis label
ylabel('Prob of Get "Yes", normalized')                   % y-axis label
ylim([-5 5])                                              % y-axis range
legend('No','Yes')                                        % add legend
```

If you can correctly guess the probability of getting a yes, that should help a lot. Can we make such a prediction using observable and discoverable factors only?

Features were generated using normalization and other techniques - see `feature_eng.m` for more details. To determine which features matter more, the resulting feature set was split into training and holdout sets, and the training set was used to build a Random Forest model (`bt_all.mat`) with the Classification Learner app. What's nice about a Random Forest is that it can give you predictor importance estimates based on how much the error increases if you randomly permute the values of a particular predictor. If a predictor doesn't matter, permuting it shouldn't increase the error rate much, right?
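If you want to try the permutation-importance idea on something self-contained, here is a sketch using `TreeBagger` (a Random Forest implementation) on a built-in dataset - this is an illustration, not the speed dating model from this post:

```matlab
% Sketch: out-of-bag permuted predictor importance for a small
% Random Forest on the Fisher iris data.
load fisheriris                                  % built-in example data
rng(0)                                           % reproducible
bt = TreeBagger(50, meas, species, ...
    'OOBPredictorImportance','on');              % enable OOB importance
bar(bt.OOBPermutedPredictorDeltaError)           % one bar per predictor
xticklabels({'SepalL','SepalW','PetalL','PetalW'})
ylabel('Importance estimate')
```

Each bar is the increase in out-of-bag error when that predictor's values are permuted, which is the same notion of importance used for the speed dating model.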

Based on those scores, the most important features are:

- attrgap - difference between partners' attractiveness rated by participants and their own self rating
- attr - rating of attractiveness participants gave to their partners
- shar - rating of shared interests participants gave to their partners
- fun - rating of fun participants gave to their partners
- field_cd - participants' field of study

The least important features are:

- agegap - difference between the ages of participants and their partners
- order - in which part of the event they first met - earlier or later
- samerace - whether participants and their partners were the same race
- reading - participant's rating of interest in reading
- race_o - the race of the partners

```
feature_eng                                         % engineer features
load bt_all                                         % load trained model
imp = oobPermutedPredictorImportance(bt_all.ClassificationEnsemble); % get predictor importance
vars = bt_all.ClassificationEnsemble.PredictorNames;% predictor names
figure                                              % new figure
subplot(2,1,1)                                      % add subplot
[~,rank] = sort(imp,'descend');                     % get ranking
bar(imp(rank(1:10)));                               % plot top 10
title('Out-of-Bag Permuted Predictor Importance Estimates'); % add title
ylabel('Estimates');                                % y-axis label
xlabel('Top 10 Predictors');                        % x-axis label
xticks(1:10);                                       % set x-axis ticks
xticklabels(vars(rank(1:10)))                       % label bars
xtickangle(45)                                      % rotate labels
ax = gca;                                           % get current axes
ax.TickLabelInterpreter = 'none';                   % turn off latex
subplot(2,1,2)                                      % add subplot
[~,rank] = sort(imp);                               % get ranking
bar(imp(rank(1:10)));                               % plot bottom 10
title('Out-of-Bag Permuted Predictor Importance Estimates'); % add title
ylabel('Estimates');                                % y-axis label
xlabel('Bottom 10 Predictors');                     % x-axis label
xticks(1:10);                                       % set x-axis ticks
xticklabels(vars(rank(1:10)))                       % label bars
xtickangle(45)                                      % rotate labels
ax = gca;                                           % get current axes
ax.TickLabelInterpreter = 'none';                   % turn off latex
```

To be confident about the predictor rankings, let's check the model's predictive performance. The model was retrained without the bottom two predictors - see `bt_45.mat` - and the resulting model can predict the decision of participants with 79.6% accuracy. This looks a lot better than the human participants did.

```
load bt_45                                          % load trained model
Y = holdout.dec;                                    % ground truth
X = holdout(:,1:end-1);                             % predictors
Ypred = bt_45.predictFcn(X);                        % prediction
c = confusionmat(Y,Ypred);                          % get confusion matrix
disp(array2table(c, ...                             % show the matrix as table
    'VariableNames',{'Predicted_No','Predicted_Yes'}, ...
    'RowNames',{'Actual_No','Actual_Yes'}));
accuracy = sum(c(logical(eye(2))))/sum(sum(c))      % classification accuracy
```

```
                   Predicted_No    Predicted_Yes
                   ____________    _____________
    Actual_No      420              66
    Actual_Yes     105             246

accuracy =
    0.7957
```

We saw that relative attractiveness was a major factor in the decision to say yes. `attrgap` scores indicate how much more attractive the partners were relative to the participants. As you can see, people tend to say yes when their partners are more attractive than themselves regardless of gender.

- This is a dilemma, because if you say yes to people who are more attractive than you are, they are more likely to say no because you are less attractive.
- But if you have some redeeming quality, such as having more shared interests or being fun, then you may be able to get a yes from more attractive partners.
- This applies to both genders. Is it just me, or does it look like men may be more willing to say yes to less attractive partners, while women tend to be more receptive to fun partners? Loren isn't sure - she thinks it's just me.

```
figure                                              % new figure
subplot(1,2,1)                                      % add subplot
gscatter(train.fun(train.gender == '1'), ...        % scatter male
    train.attrgap(train.gender == '1'),train.dec(train.gender == '1'),'rb','o')
title('Second Date Decision - Men')                 % add title
xlabel('Partner''s Fun Rating')                     % x-axis label
ylabel('Partner''s Relative Attractiveness')        % y-axis label
ylim([-4 4])                                        % y-axis range
legend('No','Yes')                                  % add legend
subplot(1,2,2)                                      % add subplot
gscatter(train.fun(train.gender == '0'), ...        % scatter female
    train.attrgap(train.gender == '0'),train.dec(train.gender == '0'),'rb','o')
title('Second Date Decision - Women')               % add title
xlabel('Partner''s Fun Rating')                     % x-axis label
ylabel('Partner''s Relative Attractiveness')        % y-axis label
ylim([-4 4])                                        % y-axis range
legend('No','Yes')                                  % add legend
```

If relative attractiveness is one of the key factors in our decision to say "yes", how good are we at assessing our own attractiveness? Let's compare the self-rating for attractiveness to the average ratings participants received. If you subtract the average rating received from the self-rating, you can see how much people overestimate their attractiveness.

- We generally overestimate our own attractiveness - the median is almost a full point higher on the 1-10 scale
- Men tend to overestimate more than women
- If you overestimate, then you are more likely to be overconfident about the probability you will get "yes" answers

```
res.attr_mu = splitapply(@(x) mean(x,'omitnan'), csv.attr_o, G); % mean attr ratings
[~,idx,~] = unique(G);                              % get unique indices
res.attr3_1 = csv.attr3_1(idx);                     % remove duplicates
res.atgap = res.attr3_1 - res.attr_mu;              % add the rating gaps
figure                                              % new figure
histogram(res.atgap(res.gender == 1),'Normalization','probability') % histogram male
hold on                                             % overlay another plot
histogram(res.atgap(res.gender == 0),'Normalization','probability') % histogram female
hold off                                            % stop overlay
title('How Much People Overestimate Their Attractiveness') % add title
xlabel(['Rating Differences ' ...                   % x-axis label
    sprintf('(Median - Men %.2f, Women %.2f)', ...
    median(res.atgap(res.gender == 1),'omitnan'), ...
    median(res.atgap(res.gender == 0),'omitnan'))])
ylabel('% Participants')                            % y-axis label
legend('Men','Women')                               % add legend
```

One possible reason we are not so good at judging our own attractiveness is that, for the majority of people, it is in the eye of the beholder. If you plot the standard deviations of the ratings people received, the spread is pretty wide, especially for men.

```
figure
res.attr_sigma = splitapply(@(x) std(x,'omitnan'), csv.attr_o, G); % sigma of attr ratings
histogram(res.attr_sigma(res.gender == 1), ...      % histogram male
    'Normalization','probability')
hold on                                             % overlay another plot
histogram(res.attr_sigma(res.gender == 0), ...      % histogram female
    'Normalization','probability')
hold off                                            % stop overlay
title('Distribution of Attractiveness Ratings Received') % add title
xlabel('Standard Deviations of Attractiveness Ratings Received') % x-axis label
ylabel('% Participants')                            % y-axis label
yticklabels(string(0:5:30))                         % use percentage
legend('Men','Women')                               % add legend
```

Given that people are not always good at assessing their own attractiveness, how does it affect the ultimate goal - getting matches? Let's focus only on average-looking people (those who rate themselves 6 or 7) to level the playing field and see how self-assessment affects the outcome.

- People who estimated their attractiveness realistically, with the gap from the mean received rating less than 0.9, did reasonably well
- People who overestimated performed the worst
- People who underestimated performed the best

```
is_realistic = abs(res.atgap) < 0.9;                % realistic estimate
is_over = res.atgap >= 0.9;                         % overestimate
is_under = res.atgap <= -0.9;                       % underestimate
is_avg = ismember(res.attr3_1,6:7);                 % avg looking group
figure                                              % new figure
scatter(res.attr_mu(is_avg & is_over), ...          % plot overestimate
    res.matched_r(is_avg & is_over))
hold on                                             % overlay another plot
scatter(res.attr_mu(is_avg & is_under), ...         % plot underestimate
    res.matched_r(is_avg & is_under))
scatter(res.attr_mu(is_avg & is_realistic), ...     % plot realistic
    res.matched_r(is_avg & is_realistic))
hold off                                            % stop overlay
title('Match Rates among Self-Rated 6-7 in Attractiveness') % add title
xlabel('Mean Attractiveness Ratings Received')      % x-axis label
ylabel('% Matches')                                 % y-axis label
legend('Overestimators','Underestimators', ...      % add legend
    'Realistic Estimators','Location','NorthWest')
```

It looks like you can get matches in speed dating as long as you set your expectations appropriately. Here are some of my suggestions.

- Relative attractiveness is more important than people admit, because you are not going to learn a lot about your partners in four minutes.
- But who people find attractive varies a lot.
- You can still do well if you have more shared interests or more fun - so use your four minutes wisely.
- Be modest about your own looks and look for those who are also modest about their looks - you will be more likely to get a match.

We should also remember that the data comes from those who went to Columbia at that time - as indicated by such variables as the field of study - and therefore the findings may not generalize to other situations.

Of course, I totally lack practical experience in speed dating - if you have some, please let us know what you think of this analysis compared to your own experience here.

Get the MATLAB code

Published with MATLAB® R2016b


Let's talk about changing characteristics of elements in a plot today. I will use an example derived from one in Mapping Toolbox. My goal is to show you good ways to modify your graphics.

First I put the examples directory onto my path so I can get the shape file containing state borders as polygons.

addpath(fullfile(userpath, 'Examples','mapexgeo'))

states = shaperead('usastatehi','UseGeoCoords',true);

Create a SymbolSpec to display Alaska and Hawaii as red polygons.

```
symbols = makesymbolspec('Polygon', ...
    {'Name','Alaska','FaceColor','red'}, ...
    {'Name','Hawaii','FaceColor','red'});
```

Create a worldmap of North America with Alaska and Hawaii in red, all other states in blue.

```
figure
worldmap('north america')
statePatches = geoshow(states,'SymbolSpec',symbols, ...
    'DefaultFaceColor','blue','DefaultEdgeColor','black')
axis off
```

```
statePatches = 

  Group with properties:

    Children: [51×1 Patch]
     Visible: 'on'
     HitTest: 'on'

  Use GET to show all properties
```

We can change the color of one state using the easy dot notation. And indexing into `states` gives us a different way to find out which state we changed.

```
statePatches.Children(17).FaceColor = 'y';
states(17).Name
```

ans = Kentucky

And I can change a state color to a custom one instead.

```
statePatches.Children(42).FaceColor = [0.2, 0.8, 0.7];
states(42).Name
```

ans = Tennessee

Now change all the states' colors to magenta. In this case, I have to assemble all of the patches into a comma-separated list for the left-hand side. And I need to make sure I have as many outputs from the right-hand side, hence the use of `deal`.

```
[statePatches.Children.FaceColor] = deal('m');
```

Now change each state to different colors. Let's make an array of colors first.

```
stateColors1 = rand(length(states),3);
stateColors2 = fliplr(stateColors1);
```

We can do this with a loop, of course. Even though this is a relatively small example, let's time it for fun.

```
tic
for npatch = 1:length(states)
    statePatches.Children(npatch).FaceColor = stateColors1(npatch,:);
end
colorTimes = toc;
```

We can also do this in a vectorized way, which can provide a huge time savings if there are enough patches involved. In this case, with 51 patches, time is not really a big issue. We have seen customers improve the speed of their code tremendously with scenes that are much more complicated.

What we need to do is not set each color separately. Instead we use the "old-fashioned" function `set`, and take advantage of the vectorization it offers. What I have to do is pass in an array of handles. Also, I enclose the property (or properties) I want to adjust in a cell array. And I supply the inputs for each handle as elements of a matching cell array. This requires that I convert my triplets for RGB into cells. And I can do that readily with `num2cell`.

Here we go.

```
tic
set(statePatches.Children, {'FaceColor'}, num2cell(flipud(stateColors2),2));
colorTimes(2) = toc
```

```
colorTimes =
   0.052154   0.026216
```

Sometimes vectorization for graphics happens with no major effort, like changing every patch to a single color. The dot notation introduced with the updated graphics system is easy to use, and your code is highly readable.

Sometimes for performance, there may be instances where vectorizing is still appropriate, and you can't get it with the new notation. In that case, you will need the various techniques and functions I used in this post. Will this be helpful to you? Please share your thoughts on this here.

Get the MATLAB code

Published with MATLAB® R2016b


2017 is upon us and that means some of you may be going into your annual review or thinking about your career after graduation. Today's guest blogger, Toshi Takeuchi, used machine learning on a job-related dataset for predictive analytics. Let's see what he learned.

- Dataset
- Big Picture - How Bad Is Turnover at This Company?
- Defining Who Are the "Best"
- Defining Who Are the "Most Experienced"
- Job Satisfaction Among High Risk Group
- Was It for Money?
- Making Insights Actionable with Predictive Analytics
- Evaluating the Predictive Performance
- Using the Model to Take Action
- Explaining the Model
- Operationalizing Action
- Summary

Companies spend money and time recruiting talent and they lose all that investment when people leave. Therefore companies can save money if they can intervene before their employees leave. Perhaps this is a sign of a robust economy, that one of the datasets popular on Kaggle deals with this issue: Human Resources Analytics - Why are our best and most experienced employees leaving prematurely?

This is an example of **predictive analytics**, where you try to predict future events based on historical data using machine learning algorithms. When people talk about predictive analytics, you often hear that the key is to turn insights into action. What does that really mean? Let's also examine this question through exploration of this dataset.

The Kaggle page says "our example concerns a big company that wants to understand why some of their best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next."

The fields in the dataset include:

- Employee satisfaction level, scaling 0 to 1
- Last evaluation, scaling 0 to 1
- Number of projects
- Average monthly hours
- Time spent at the company in years
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Sales (which actually means job function)
- Salary - low, medium or high
- Whether the employee has left

Let's load it into MATLAB. The new `detectImportOptions` makes it easy to set up import options based on the file content.

```
opts = detectImportOptions('HR_comma_sep.csv');     % set import options
opts.VariableTypes(9:10) = {'categorical'};         % turn text to categorical
csv = readtable('HR_comma_sep.csv', opts);          % import data
fprintf('Number of rows: %d\n',height(csv))         % show number of rows
```

Number of rows: 14999

We will then hold out 10% of the dataset for model evaluation (`holdout`), and use the remaining 90% (`train`) to explore the data and train predictive models.

```
rng(1)                                              % for reproducibility
c = cvpartition(csv.left,'HoldOut',0.1);            % partition data
train = csv(training(c),:);                         % for training
holdout = csv(test(c),:);                           % for model evaluation
```

The first thing to understand is how bad a problem this company has. Assuming each row represents an employee, this company employed close to 15,000 people over some unstated period, and about 24% of them left during that same period. Is this bad? Turnover is usually calculated on an annual basis, and we don't know what period this dataset covers. Also, the turnover rate differs from industry to industry. That said, this seems pretty high for a company of this size with an internal R&D team.
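The 24% figure quoted above can be checked directly from the table. A quick sketch, assuming the `csv` table imported earlier is in scope (`left` is 1 for employees who left):

```matlab
n = height(csv);                                    % number of employees (rows)
turnover = sum(csv.left)/n;                         % fraction who left
fprintf('%d employees, %.1f%% turnover\n', n, turnover*100)
```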

When you break it down by job function, the turnover seems to be correlated with job satisfaction. For example, HR and accounting have low median satisfaction levels and high turnover ratios, whereas R&D and management have higher satisfaction levels and lower turnover ratios.

Please note the use of new functions in R2016b: `xticklabels` to set x-axis tick labels and `xtickangle` to rotate x-axis tick labels.

```
[g, job] = findgroups(train.sales);                 % group by sales
job = cellstr(job);                                 % convert to cell
job = replace(job,'and','&');                       % clean up
job = replace(job,'_',' ');                         % more clean up
job = regexprep(job,'(^| )\s*.','${upper($0)}');    % capitalize first letter
job = replace(job,'Hr','HR');                       % more capitalize
func = @(x) sum(x)/numel(x);                        % anonymous function
turnover = splitapply(func, train.left, g);         % get group stats
figure                                              % new figure
yyaxis left                                         % left y-axis
bar(turnover*100)                                   % turnover percent
xticklabels(job)                                    % label bars
xtickangle(45)                                      % rotate labels
ylabel('Employee Turnover %')                       % left y-axis label
title({'Turnover & Satisfaction by Job Function'; ...
    sprintf('Overall Turnover %.1f%%', sum(train.left)/height(train)*100)})
hold on                                             % overlay another plot
satisfaction = splitapply(@median, ...              % get group median
    train.satisfaction_level, g);
yyaxis right                                        % right y-axis
plot(1:length(job),satisfaction)                    % plot median line
ylabel('Median Employee Satisfaction')              % right y-axis label
ylim([.5 .67])                                      % scale right y-axis
hold off                                            % stop overlay
```

We are asked to analyze why the best and most experienced employees are leaving. How do we identify the "best"? For the purpose of this analysis, I will use the performance evaluation score to determine who the high performers are. As the following histogram shows, employees with lower scores as well as higher scores tend to leave, while people with average scores are less likely to leave. The median score is 0.72. Let's say anyone with a score of 0.8 or higher is a high performer.

```
figure                                              % new figure
histogram(train.last_evaluation(train.left == 0))   % histogram of those stayed
hold on                                             % overlay another plot
histogram(train.last_evaluation(train.left == 1))   % histogram of those left
hold off                                            % stop overlay
xlabel( ...                                         % x-axis label
    sprintf('Last Evaluation - Median = %.2f',median(train.last_evaluation)))
ylabel('# Employees')                               % y-axis label
legend('Stayed','Left')                             % add legend
title('Distribution of Last Evaluation')            % add title
```

Among high performers, the company is particularly interested in "most experienced" people - let's use Time Spent at Company to measure the experience level. The plot shows that high performers with 4 to 6 years of experience are at higher risk of turnover.

```
hp = train(train.last_evaluation >= 0.8,:);         % subset high performers
figure                                              % new figure
histogram(hp.time_spend_company(hp.left == 0))      % histogram of those stayed
hold on                                             % overlay another plot
histogram(hp.time_spend_company(hp.left == 1))      % histogram of those left
hold off                                            % stop overlay
xlabel( ...                                         % x-axis label
    sprintf('Time Spent @ Company - Median = %.2f',median(hp.time_spend_company)))
ylabel('# Employees')                               % y-axis label
legend('Stayed','Left')                             % add legend
title('Time Spent @ Company Among High Performers') % add title
```

Let's isolate the at-risk group and see how their job satisfaction stacks up. It is interesting to see that not only people with very low satisfaction levels (no surprise) but also the highly satisfied people left the company. It seems like people with a satisfaction level of 0.7 or higher are at an elevated risk.

```
at_risk = hp(hp.time_spend_company >= 4 & ...       % subset high performers
    hp.time_spend_company <= 6,:);                  % with 4-6 years experience
figure                                              % new figure
histogram(at_risk.satisfaction_level( ...           % histogram of those stayed
    at_risk.left == 0))
hold on                                             % overlay another plot
histogram(at_risk.satisfaction_level( ...           % histogram of those left
    at_risk.left == 1))
hold off                                            % stop overlay
xlabel( ...                                         % x-axis label
    sprintf('Satisfaction Level - Median = %.2f',median(at_risk.satisfaction_level)))
ylabel('# Employees')                               % y-axis label
legend('Stayed','Left')                             % add legend
title('Satisfaction Level of the Best and Most Experienced')
```

Let's isolate high performing seasoned employees and check their salaries. It is clear that people who get a higher salary are staying while people with medium or low salary leave. No big surprise here, either.

```
at_risk_sat = at_risk( ...                          % subset at_risk
    at_risk.satisfaction_level >= .7,:);
figure                                              % new figure
histogram(at_risk_sat.salary(at_risk_sat.left == 0))% histogram of those stayed
hold on                                             % overlay another plot
histogram(at_risk_sat.salary(at_risk_sat.left == 1))% histogram of those left
hold off                                            % stop overlay
xlabel('Salary')                                    % x-axis label
ylabel('# Employees')                               % y-axis label
legend('Stayed','Left')                             % add legend
title('Salary of the Best and Most Experienced with High Satisfaction')
```

At this point, you may say, "I knew all this stuff already. I didn't get any new insight from this analysis." It's true, but why are you complaining about that? **Predictability is a good thing for prediction!**

What we have seen so far all happened in the past, which you cannot undo. However, if you can predict the future, you can do something about it. Making data actionable - that's the true value of predictive analytics.

In order to make this insight actionable, we need to quantify the turnover risk as scores so that we can identify at-risk employees for intervention. Since we are trying to classify people into those who are likely to stay vs. leave, we need to build a classification model to predict such a binary outcome.

I have used the Classification Learner app to run multiple classifiers on our training data `train` to see which one provides the highest prediction accuracy.

The winner is the Bagged Trees classifier (also known as "Random Forest") with 99.0% accuracy. To evaluate a classifier, we typically use the ROC curve plot, which lets you see the trade-off between the true positive rate and the false positive rate. If you have a high AUC score like 0.99, the classifier is very good at identifying the true class without causing too many false positives.
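The Classification Learner draws the ROC curve for you, but you can also compute one at the command line with `perfcurve`. A sketch, assuming the exported model `btree` and the `holdout` table from this post are in scope (and using the second, score output of `predictFcn`, as the code later in this post does):

```matlab
[~,score] = btree.predictFcn(holdout(:,[1:6,8:10]));    % class scores
[fpr,tpr,~,auc] = perfcurve(holdout.left,score(:,2),1); % ROC points and AUC
figure
plot(fpr,tpr)                                           % ROC curve
xlabel('False Positive Rate')
ylabel('True Positive Rate')
title(sprintf('ROC Curve (AUC = %.3f)',auc))
```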

You can also export the trained predictive model from the Classification Learner. I saved the exported model as `btree.mat` that comes with some instructions.

```
load btree                                          % load trained model
how_to = btree.HowToPredict;                        % get how to use
disp([how_to(1:80) ' ...'])                         % show snippet
```

To make predictions on a new table, T, use: yfit = c.predictFcn(T) replacing ...

Let's pick 10 samples from the `holdout` data partition and see how the predictive model scores them. Here are the samples - 5 people who left and 5 who stayed. They are all high performers with 4 or more years of experience.

```
samples = holdout([111; 484; 652; 715; 737; 1135; 1293; 1443; 1480; 1485],:);
samples(:,[1,2,5,8,9,10,7])
```

```
ans = 

    satisfaction_level    last_evaluation    time_spend_company    promotion_last_5years    sales         salary    left
    __________________    _______________    __________________    _____________________    __________    ______    ____
     0.9                  0.92               4                     0                        sales         low       1
    0.31                  0.92               6                     0                        support       medium    0
    0.86                  0.87               4                     0                        sales         low       0
    0.62                  0.95               4                     0                        RandD         low       0
    0.23                  0.96               6                     0                        marketing     medium    0
    0.39                  0.89               5                     0                        support       low       0
    0.09                  0.92               4                     0                        sales         medium    1
    0.73                  0.87               5                     0                        IT            low       1
    0.75                  0.97               6                     0                        technical     medium    1
    0.84                  0.83               5                     0                        accounting    low       1
```

Now let's try the model to predict whether they left the company.

```
actual = samples.left;                              % actual outcome
predictors = samples(:,[1:6,8:10]);                 % predictors only
[predicted,score]= btree.predictFcn(predictors);    % get prediction
c = confusionmat(actual,predicted);                 % get confusion matrix
disp(array2table(c, ...                             % show the matrix as table
    'VariableNames',{'Predicted_Stay','Predicted_Leave'}, ...
    'RowNames',{'Actual_Stayed','Actual_Left'}));
```

```
                     Predicted_Stay    Predicted_Leave
                     ______________    _______________
    Actual_Stayed    5                 0
    Actual_Left      0                 5
```

The model was able to predict the outcome of those 10 samples accurately. In addition, it also returns the probability of each class as a score. You can also use the entire `holdout` data to check the model performance, but I will skip that step here.
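For completeness, the skipped step would look something like this (a sketch, again assuming `btree` and `holdout` are in scope):

```matlab
predicted = btree.predictFcn(holdout(:,[1:6,8:10])); % score all holdout rows
c = confusionmat(holdout.left,predicted);            % 2x2 confusion matrix
accuracy = sum(diag(c))/sum(c(:))                    % fraction correct
```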

We can use the probability of leaving as the risk score. We can select high performers based on the last evaluation and time spent at the company, and then score their risk level and intervene in high risk cases. In this example, the sales employee with the 0.92 evaluation score is 100% at risk of leaving. Since this employee has not been promoted in the last 5 years, perhaps it is time to do so.

```
samples.risk = score(:,2);                          % probability of leaving
[~,ranking] = sort(samples.risk,'descend');         % sort by risk score
samples = samples(ranking,:);                       % sort table by ranking
samples(samples.risk > .7,[1,2,5,8,9,10,11])        % intervention targets
```

```
ans = 

    satisfaction_level    last_evaluation    time_spend_company    promotion_last_5years    sales         salary     risk
    __________________    _______________    __________________    _____________________    __________    ______    _______
    0.09                  0.92               4                     0                        sales         medium          1
    0.73                  0.87               5                     0                        IT            low             1
    0.84                  0.83               5                     0                        accounting    low        0.9984
    0.75                  0.97               6                     0                        technical     medium    0.96154
     0.9                  0.92               4                     0                        sales         low       0.70144
```

Machine learning algorithms used in predictive analytics are often a black-box solution, so HR managers may need to provide an easy-to-understand explanation about how it works in order to get the buy-in from their superiors. We can use the predictor importance score to show which attributes are used to compute the prediction.

In this example, the model is using a mixture of attributes at different weights to compute the prediction, with emphasis on the satisfaction level, followed by the number of projects and time spent at the company. We also see that people who got a promotion in the last 5 years are less likely to leave. Those attributes make intuitive sense, and therefore we feel more confident about the predictions from this model.

```
imp = oobPermutedPredictorImportance( ...           % get predictor importance
    btree.ClassificationEnsemble);
vals = btree.ClassificationEnsemble.PredictorNames; % predictor names
figure                                              % new figure
bar(imp);                                           % plot importance
title('Out-of-Bag Permuted Predictor Importance Estimates');
ylabel('Estimates');                                % y-axis label
xlabel('Predictors');                               % x-axis label
xticklabels(vals)                                   % label bars
xtickangle(45)                                      % rotate labels
ax = gca;                                           % get current axes
ax.TickLabelInterpreter = 'none';                   % turn off latex
```

The predictor importance scores also give us a hint about when we should intervene. Given that satisfaction level, last evaluation, and time spent at the company are all important predictors, it is probably a good idea to update the predictive scores at the time of each annual evaluation and then decide who may need intervention.

It is feasible to implement a system to schedule such analysis when a performance review is conducted and automatically generate a report of high risk employees using the model derived from MATLAB. Here is a simple live demo of such a system running the following code using MATLAB Production Server. It uses the trained predictive model we just generated.

The example code just loads the data from MAT-file, but you would instead access data from an SQL database in a real world use scenario. Please also note the use of a new function `jsonencode` introduced in R2016b that converts structure arrays into JSON formatted text. Naturally, we also have `jsondecode` that converts JSON formatted text into appropriate MATLAB data types.
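Here is a minimal round trip with the two JSON functions, independent of the deployed code (the field names are just an illustration):

```matlab
risk = struct('Ypred',[0;1],'score',[0.12;0.93]);   % example struct
txt  = jsonencode(risk);                            % struct -> JSON text
back = jsondecode(txt);                             % JSON text -> struct
isequal(back.Ypred, risk.Ypred)                     % round trip preserves data
```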

```
dbtype scoreRisk.m
```

```
1     function risk = scoreRisk(employee_ids)
2     %SCORERISK scores the turnover risk of selected employees
3     
4     load data                                            % load holdout data
5     load btree                                           % load trained model
6     X_sub = X(employee_ids,:);                           % select employees
7     [predicted,score]= btree.predictFcn(X_sub);          % get prediction
8     risk = struct('Ypred',predicted,'score',score(:,2)); % create struct
9     risk = jsonencode(risk);                             % JSON encode it
10    
11    end
```

Depending on your needs, you can build it in a couple of ways.

Often, you would need to retrain the predictive model as human behavior changes over time. If you use code generated from MATLAB, it is very easy to retrain using the new dataset and redeploy the new model for production use, as compared to cases where you reimplement the model in some other languages.

In the beginning of this analysis, we explored the dataset which represents the past. You cannot undo the past, but you can do something about the future if you can find quantifiable patterns in the data for prediction.

People often think the novelty of insights is important. What really matters is what action you can take on them, whether novel or well-known. If you do nothing, no fancy analytics will deliver any value. Harvard Business Review recently published an article, Why You’re Not Getting Value from Your Data Science, but failed to mention this issue.

In my opinion, analysis paralysis is one of the biggest reasons companies are not getting value: they fail to take action and get stuck at the data analysis phase of the data science process.

Has this example helped you understand how you can use predictive analytics to solve practical problems? Can you think of a way to apply this in your area of work? Let us know what you think here.

Get the MATLAB code

Published with MATLAB® R2016b


There is a new way to work with textual data in MATLAB R2016b. The new `string` datatype hadn't gotten enough attention from me until recently. I have been chatting with colleagues Matt Tearle and Adam Sifounakis, and we have each discovered a similar beautiful code pattern in MATLAB for generating a sequence of strings.

Early on, MATLAB had character arrays. Let's create one.

myCharPets = ['dog ';'cat ';'fish']

```
myCharPets =

dog 
cat 
fish
```

Notice how I had to add trailing blanks for the first two pets because my final pet, a fish, required more memory (like Dory from Finding Nemo)?

I can find my second pet, but, to be fair, I also have to remove the trailing blank.

pet2 = deblank(myCharPets(2,:))

pet2 = cat

With MATLAB 5.0, we introduced `cell` arrays and then cell arrays of strings. Since each cell contains its own MATLAB array, there is no need for each array to contain the same number of elements. So we can do this, exploiting some "new" syntax.

myCellPets = {'dog';'cat';'fish'}

```
myCellPets = 

  3×1 cell array

    'dog'
    'cat'
    'fish'
```

I can find the second pet on the list, with some more, but similar, "new" syntax.

pet2 = myCellPets{2}

pet2 = cat

In MATLAB Release R2016b, we introduced the notion of a `string`. Now I can create an array of textual data another way.

myStringPets = string(myCellPets)

```
myStringPets = 

  3×1 string array

    "dog"
    "cat"
    "fish"
```

And I can find my second pet again.

pet2 = myStringPets(2)

pet2 = string "cat"

I think the notation feels much more natural. And I can add strings together.

allofmypets = myStringPets(1) + ' & ' + myStringPets(2) + ' & ' + myStringPets(3)

allofmypets = string "dog & cat & fish"

Ok, yes, I really should vectorize that. And I can do that with strings!
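One vectorized alternative is the `join` method of string arrays (a quick sketch):

```matlab
myStringPets = string({'dog';'cat';'fish'});  % same pets as before
allofmypets = join(myStringPets,' & ')        % joins along the column
```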

You may remember that Steve Eddins recently posted on my blog about implicit expansion. Well, we can take good advantage of that with strings.

Suppose I want to create an array of directory names that are embedded with a sequence of years.

```
dirnames = string('C:\work\data\yob\') + (2000:2010)'
```

```
dirnames = 

  11×1 string array

    "C:\work\data\yob\2000"
    "C:\work\data\yob\2001"
    "C:\work\data\yob\2002"
    "C:\work\data\yob\2003"
    "C:\work\data\yob\2004"
    "C:\work\data\yob\2005"
    "C:\work\data\yob\2006"
    "C:\work\data\yob\2007"
    "C:\work\data\yob\2008"
    "C:\work\data\yob\2009"
    "C:\work\data\yob\2010"
```

And if I want to add months, I can do that too.

    quarterlyMonths = string({'Jan','Apr','Jul','Oct'});
    dirname = string('C:\root\') + quarterlyMonths + (2000:2010)'

    dirname =
      11×4 string array
      Columns 1 through 3
        "C:\root\Jan2000"    "C:\root\Apr2000"    "C:\root\Jul2000"
        "C:\root\Jan2001"    "C:\root\Apr2001"    "C:\root\Jul2001"
        "C:\root\Jan2002"    "C:\root\Apr2002"    "C:\root\Jul2002"
        "C:\root\Jan2003"    "C:\root\Apr2003"    "C:\root\Jul2003"
        "C:\root\Jan2004"    "C:\root\Apr2004"    "C:\root\Jul2004"
        "C:\root\Jan2005"    "C:\root\Apr2005"    "C:\root\Jul2005"
        "C:\root\Jan2006"    "C:\root\Apr2006"    "C:\root\Jul2006"
        "C:\root\Jan2007"    "C:\root\Apr2007"    "C:\root\Jul2007"
        "C:\root\Jan2008"    "C:\root\Apr2008"    "C:\root\Jul2008"
        "C:\root\Jan2009"    "C:\root\Apr2009"    "C:\root\Jul2009"
        "C:\root\Jan2010"    "C:\root\Apr2010"    "C:\root\Jul2010"
      Column 4
        "C:\root\Oct2000"
        "C:\root\Oct2001"
        "C:\root\Oct2002"
        "C:\root\Oct2003"
        "C:\root\Oct2004"
        "C:\root\Oct2005"
        "C:\root\Oct2006"
        "C:\root\Oct2007"
        "C:\root\Oct2008"
        "C:\root\Oct2009"
        "C:\root\Oct2010"

How cool is that!

This is just the beginning for strings. You can find out what else is available now.

methods(string)

    Methods for class string:

    cellstr        extractAfter     le               split
    char           extractBefore    lower            splitlines
    compose        extractBetween   lt               startsWith
    contains       ge               ne               strip
    count          gt               pad              strlength
    double         insertAfter      plus             upper
    endsWith       insertBefore     replace
    eq             ismissing        replaceBetween
    erase          issorted         reverse
    eraseBetween   join             sort
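To give a flavor, here is a quick sketch applying a few of these methods to the pets array from above:

    strlength(myStringPets)                 % number of characters in each string: [3; 3; 4]
    upper(myStringPets)                     % "DOG" "CAT" "FISH"
    contains(myStringPets, 'a')             % true only for "cat"
    replace(myStringPets, 'cat', 'kitten')  % swap out one pet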

And you can bet we have plans to add more capabilities for strings over time. What features would you like to see us add? Let us know here.

Get the MATLAB code

Published with MATLAB® R2016b

Sensors are a big part of our lives, continuously generating signals that measure some real physical phenomenon. Engineers are tasked with processing these signals to identify unique features and reveal interesting patterns. The Fourier transform is a first step many use to analyze the frequency content of a signal, but it is not well suited to signals whose features occur at different scales or resolutions, i.e., a signal containing a slowly varying component punctuated by abrupt transients, as is the case with many naturally occurring signals and images. The reason is that Fourier transforms use sinusoids, which are not well localized in time because they oscillate forever.
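To make that limitation concrete, here is a minimal sketch (the signal and all its values are invented for illustration): a slowly varying 5 Hz tone with one abrupt transient. The magnitude spectrum shows the tone plus broadband energy from the transient, but gives no indication of when the transient occurs.

    % Sketch: slowly varying tone plus one abrupt transient (illustrative values)
    fs = 1000;                       % sample rate in Hz
    t = 0:1/fs:1-1/fs;               % one second of samples
    x = sin(2*pi*5*t);               % slowly varying 5 Hz component
    x(500:510) = x(500:510) + 2;     % abrupt transient near t = 0.5 s
    X = abs(fft(x));                 % magnitude spectrum
    f = (0:numel(x)-1)*fs/numel(x);  % frequency axis
    plot(f(1:numel(f)/2), X(1:numel(f)/2))
    xlabel('Frequency (Hz)'), ylabel('|X(f)|')
    % The peak at 5 Hz is obvious; the transient's energy is smeared
    % across frequencies with no time information.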

One way to address this challenge is to analyze signals using wavelets, which are well localized in both time and frequency. Kirthi K. Devleker, an engineer on our product marketing team, explains the basic concepts behind wavelets and how you can use them in a four-part series of short Tech Talk videos.

Whether you are a seasoned engineer who wants a refresher on wavelet transforms or a newcomer who wants to explore them, I am sure you will find these videos helpful. The first two videos (Part 1 and Part 2) cover background concepts: what wavelets are and the types of wavelet transforms commonly used. The next two videos (Part 3 and Part 4) cover two important real-world applications of wavelet transforms in MATLAB.

Watch the first of the four videos here.

Let us know if you like them, and if you want to see additional topics, leave a comment here.


We love MATLAB, and we have many other interests, too. Today's guest blogger, Toshi Takeuchi, found an interesting way to combine his passion for MATLAB with one of those interests, Argentine Tango!

I started taking Argentine Tango classes at MIT when we had record snowfall a few winters ago in Boston. I was originally planning to learn snowboarding (I even bought a used snowboard), but who would want to spend more time in the snow after shoveling it day after day?

I was drawn to Tango because I can do the figures once I understand the geometry and physics of the moves. I can't do other dances because I am dyslexic and can't follow steps. Perhaps that's why they say Tango tends to attract people with a technical background? Loren says the same goes for glass blowing.

Argentine Tango is an improvisational partner dance that incorporates a rich variety of elegant figures - watch this YouTube video for example. You have to practice those figures and you often watch videos like this - pausing, rolling back, playing back the same segment over and over - to study how those figures are executed. This is a tedious process often interrupted by a slow connection and annoying ads.

Naturally, I started wondering, "can I use MATLAB to watch videos at slower speed?" and of course you can. I am doing this for dancing, but I am sure you can also use it for other activities, such as checking your golf swing or learning how to play an instrument.

In this example, I will use my own video shot on an iPhone rather than professional videos you see on YouTube. I transferred the video from my phone to the current folder. If you want to follow this example, please use your own video. Now how do we load the video into MATLAB?

You can use the `VideoReader` object to read video files into MATLAB, and store each frame into a structure array. This will take a bit of time, but this is an important step in order to play back those frames in real time later. I show the progress bar while frames are extracted and then show the first frame when completed. If the orientation of your video is not correct, use `imrotate` to correct the orientation after extracting each frame.

    filename = 'video-1468594243.mp4';                       % video to load
    wb = waitbar(0,'Loading the file...');                   % show progress bar
    v = VideoReader(filename);                               % create VideoReader object
    w = v.Width;                                             % get width
    h = v.Height;                                            % get height
    d = v.Duration;                                          % get duration
    s = struct('cdata',zeros(h,w,3,'uint8'),'colormap',[]);  % struct to hold frames
    k = 1;                                                   % initialize counter
    while hasFrame(v)                                        % while object has frame
        f = readFrame(v);                                    % read the frame
        s(k).cdata = f; % = imrotate(f,90);                  % rotate frame as needed
        k = k + 1;                                           % increment counter
        waitbar(v.CurrentTime/d)                             % update progress bar
    end
    v.CurrentTime = 18;                                      % set current time
    f = readFrame(v);                                        % get the frame
    close(wb)                                                % close progress bar
    axes                                                     % create axes
    imshow(f)                                                % show the frame
    title('Final Tango Frame')                               % add title

You can now play back the video at the speed of your choice with `imshow`. To ensure real-time playback, I first create the image, and subsequently update its underlying data for each frame using its handle, because it is too slow to create a new image for each frame. You can only see the final frame here, so I added an animated GIF from the figure at the end of this post.

    fps = v.FrameRate;                                       % get frame rate
    startTime = 14;                                          % in seconds
    endTime = 17;                                            % in seconds
    speed = 1/2;                                             % playback speed
    startFrame = floor(startTime * fps);                     % starting frame
    endFrame = floor(endTime * fps);                         % ending frame
    curAxes = axes;                                          % create axes
    hImage = imshow(s(startFrame).cdata,'Parent',curAxes);   % create handle
    title('Tango Video')                                     % add title
    for k = startFrame + 1:endFrame                          % loop over others
        set(hImage,'CData',s(k).cdata);                      % update underlying data
        pause(1/(fps*speed));                                % pause for specified speed
    end

It is cumbersome to run the script over and over each time I change settings like the starting frame or speed. It would be much easier if I built a GUI for it. When it comes to building a GUI in MATLAB, you have two options.

- App Designer - to create an app that runs on MATLAB
- GUIDE - to create a GUI that can be compiled as a standalone app

Since this could be useful for other Tango friends of mine, I would like to be able to share it. I will assume they generally don't have MATLAB. I would therefore go with GUIDE in this case. For details of how you create a GUI with GUIDE, please refer to this tutorial video.

Here is the fig file I created in GUIDE. You can open the fig file in GUIDE by typing `guide` in the Command Window and selecting the "Open Existing GUI" tab in the Quick Start window. To open the working GUI, you can press the green play button in GUIDE or run the `tangoplayer` function in the Command Window.

tangoplayer

The file `tangoplayer.m` was generated by GUIDE; I then added my custom code to it. The code we saw earlier for loading video is reused in the callback function for the "Choose File" button. Here is the relevant section of the `chooseBtn_Callback` function.

dbtype 'tangoplayer.m' 408:431

    408  % *** Begin Reuse Example Code ***
    409  wb = waitbar(0,'Loading the file...');                   % show progress bar
    410  v = VideoReader(filename);                               % create VideoReader object
    411  w = v.Width;                                             % get width
    412  h = v.Height;                                            % get height
    413  d = v.Duration;                                          % get duration
    414  s = struct('cdata',zeros(h,w,3,'uint8'),'colormap',[]);  % struct to hold frames
    415  k = 1;                                                   % initialize counter
    416  while hasFrame(v)                                        % while object has frame
    417      if rotate == 0                                       % if no rotation
    418          s(k).cdata = readFrame(v);                       % read the frame
    419      else                                                 % otherwise
    420          f = readFrame(v);                                % get the frame
    421          s(k).cdata = imrotate(f,handles.rotate);         % rotate it
    422      end
    423      k = k + 1;                                           % increment counter
    424      waitbar(v.CurrentTime/d)                             % update progress bar
    425  end
    426  v.CurrentTime = 0;                                       % back to the beginning
    427  f = readFrame(v);                                        % get the frame
    428  close(wb)                                                % close progress bar
    429  axes(handles.viewer)                                     % get axes
    430  hImage = imshow(imrotate(f,rotate));                     % show the frame
    431  % *** End Reuse Example Code ***

We want to share the resulting VideoReader object and the structure array of extracted frames across the other callback functions. You can do this by adding those variables to the `handles` structure and storing it in the GUI's application data with `guidata(hObject,handles)`.

dbtype 'tangoplayer.m' 437:443

    437  % share the loaded data across the GUI app with handles
    438  handles.v = v;             % add v to handles
    439  handles.s = s;             % add s to handles
    440  handles.t = 0;             % add t to handles
    441  handles.k = 0;             % add k to handles
    442  handles.fin = false;       % add fin to handles
    443  guidata(hObject,handles)   % update handles

To use the stored application data, you just have to access the relevant fields in the `handles` structure.

dbtype 'tangoplayer.m' 191:193

    191  if ismember('v',fieldnames(handles))  % if VideoReader object exists
    192      v = handles.v;                    % retrieve it
    193      s = handles.s;                    % get frames

You can also retrieve, in the same way, the values for the start time and the speed set by the sliders. I invite you to look at the source code to see how the callback functions for those sliders work.

dbtype 'tangoplayer.m' 201:204

    201  fps = v.FrameRate;                                % get frame rate
    202  speed = str2num(get(handles.speedBox,'String'));  % get speed
    203  startTime = handles.t;                            % get start time
    204  startFrame = handles.k;                           % get start frame

The actual code that plays back the video is similar to the earlier example, except that it has additional lines to take care of GUI-related details such as updating the current time display, adding a context menu to the image, and checking the button state. The context menu adds the ability to draw lines, rectangles, and circles on the image, as defined in another section of the file.

dbtype 'tangoplayer.m' 214:225

    214  % *** Begin Reuse Example Code ***
    215  set(handles.timeBox,'String',sprintf('%.4f',startTime))  % update box text
    216  hImage = imshow(s(startFrame).cdata, ...                 % create handle and
    217      'Parent',handles.viewer);                            % show the first frame
    218  ctm = handles.ctm;                                       % get context menu
    219  set(hImage,'uicontextmenu',ctm)                          % add context menu
    220
    221  for k = startFrame + 1:length(s)                         % for each remaining frame
    222      if get(hObject,'Value')                              % if button is down
    223          set(hImage,'CData',s(k).cdata);                  % update underlying data
    224          pause(1/(fps*speed));                            % pause for specified speed
    225  % *** End Reuse Example Code ***

Here is the app in action - the lines were added using the context menu. It shows that I didn't keep my back straight in that particular frame. Ouch - it is painful to watch your own video, but it is more important to know the issues you need to fix.

Now that the app is working in MATLAB, I would like to share that with other Tango friends of mine who may or may not have MATLAB. If you have the MATLAB Compiler, you will find the Application Compiler under the Apps tab.

Open the Application Compiler and add `tangoplayer.m` as the main file; the Application Compiler will automatically add `tangoplayer.fig` and other dependencies to "Files required for your application to run". You can also make other customizations, such as adding icons and a splash screen. Then all you need to do is click "Package" to compile an executable. That's it! Since I am running MATLAB on Windows, this produces a Windows app.
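If you prefer working from the command line, the packaging step can also be sketched with `mcc` (this assumes you have MATLAB Compiler installed; the exact flags shown are illustrative):

    % Compile tangoplayer.m into a standalone application,
    % bundling the GUIDE figure file alongside it
    mcc -m tangoplayer.m -a tangoplayer.fig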

Once the build completes, you will find the installer in the "for_redistribution" folder, and that's what I would give to my friends. The app requires the MATLAB Runtime, which the installer automatically downloads from the web during installation.

Learning Tango takes many hours, but you can create a custom app in MATLAB and share it as a standalone application in a tiny fraction of that time!

Have you used MATLAB for your hobbies before? What did you create? Did you know that you can buy MATLAB Home to support your MATLAB addiction? Let us know what you've created here.
