Sometimes it's nice to have a best friend. Sometimes I have more than one, context-dependent. I want to share my best friend today for dealing with `struct` arrays.

Structure arrays are a useful way to collect similar information for a collection of items, even if what gets collected is not always the same size. However, struct arrays can also be cumbersome to work with. Thanks to the `table` datatype, I can now avoid some of the intricacies of indexing into struct arrays by simply performing a conversion to `table` first, and then using good, old-fashioned, stands-the-test-of-time MATLAB indexing.

Some examples for working with struct arrays after converting to a table include:

- output of `dir` (I can then remove dirs, etc. very easily)
- output of `jsondecode`
- output of `regionprops` from Image Processing Toolbox (which now includes an option to output a `table`)

I'll now get the listing for the blog post directory for one of my guest authors, Alan Weiss.

```
dirstruct = dir('AlanW')
```

```
dirstruct = 
  5×1 struct array with fields:
    name
    folder
    date
    bytes
    isdir
    datenum
```

And here's a picture from the Windows Explorer

You can readily see that I really have only 3 items, and not the 5 suggested in `dirstruct`. I can, of course, now look at the names. I extract them via a comma-separated list.

dirstruct.name

```
ans = '.'
ans = '..'
ans = 'Clumping'
ans = 'FinSymbolic'
ans = 'Sudoku'
```

and now you can see that I really don't care about the first two. But I'd have to delete these from the name field, and the folder field, and ... So I definitely can do so, but what a hassle.

Instead let me convert to a `table`.

dirtable = struct2table(dirstruct)

```
dirtable =
  5×6 table

        name                  folder                      date              bytes    isdir     datenum  
    _____________    ___________________________    ______________________    _____    _____    __________

    '.'              'C:\Work\ArtofMATLAB\AlanW'    '22-Sep-2017 02:50:46'      0      true     7.3696e+05
    '..'             'C:\Work\ArtofMATLAB\AlanW'    '22-Sep-2017 02:50:46'      0      true     7.3696e+05
    'Clumping'       'C:\Work\ArtofMATLAB\AlanW'    '22-Sep-2017 02:50:46'      0      true     7.3696e+05
    'FinSymbolic'    'C:\Work\ArtofMATLAB\AlanW'    '22-Sep-2017 02:50:46'      0      true     7.3696e+05
    'Sudoku'         'C:\Work\ArtofMATLAB\AlanW'    '22-Sep-2017 02:50:46'      0      true     7.3696e+05
```

And now I can eliminate the first 2 rows in the usual way in MATLAB:

dirtable(1:2,:) = []

```
dirtable =
  3×6 table

        name                  folder                      date              bytes    isdir     datenum  
    _____________    ___________________________    ______________________    _____    _____    __________

    'Clumping'       'C:\Work\ArtofMATLAB\AlanW'    '22-Sep-2017 02:50:46'      0      true     7.3696e+05
    'FinSymbolic'    'C:\Work\ArtofMATLAB\AlanW'    '22-Sep-2017 02:50:46'      0      true     7.3696e+05
    'Sudoku'         'C:\Work\ArtofMATLAB\AlanW'    '22-Sep-2017 02:50:46'      0      true     7.3696e+05
```
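Deleting by position works here, but a sketch of a more robust variant (my addition, not from the original post) filters the dot entries by name, so it doesn't matter where they land in the listing:

```
% Remove the '.' and '..' entries by name rather than by position
dirtable = struct2table(dir('AlanW'));
dirtable(ismember(dirtable.name,{'.','..'}),:) = [];
```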

I'm wondering if you've used struct arrays before and are now replacing them with tables. And if you have an application where that does not make sense, I'd love to hear about that as well. Let me know here.


Published with MATLAB® R2017b

```
A = [1 2; 3 4]
M = cat(3, A, A+1, A+2, A+3);
```

```
A =
     1     2
     3     4
```

So the function should work like this:

IDX = [NaN 4; 3 2]

```
IDX =
   NaN     4
     3     2
```

```
catdim = 3;
szA = size(A);
[mv,midx] = max(M >= 5,[],catdim);
IDX = NaN(szA);
IDX(mv) = midx(mv)
```

```
IDX =
   NaN     4
     3     2
```

Why/how did this work? Well, I pass the condition I want met into `max`: applied to a logical array, `max` returns 1 wherever the condition holds somewhere along the chosen dimension, together with the index of the first element that satisfies it.
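To see the mechanism on a small example, here's a quick sketch (my own illustration, not part of the original post):

```
v = [3 10 8 1];
[mv,midx] = max(v >= 8)
% mv = 1 because at least one element satisfies the condition;
% midx = 2 because v(2) is the first element >= 8
```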

```
A = randi(17, [3 2 1 3]);
catdim = ndims(A)+1
M = cat(catdim, A, A+1, A+2, A+3, A+4);
```

```
catdim =
     5
```

See a slice of M:

sliceOfM = M(:,:,:,1)

```
sliceOfM =
    12     2
    13     4
     8    16
```

Apply the same approach:

```
szA = size(A)
[mv,midx] = max(M >= 13,[],catdim);
IDX = NaN(szA);
IDX(mv) = midx(mv)
```

```
szA =
     3     2     1     3

IDX(:,:,1,1) =
     2   NaN
     1   NaN
   NaN     1

IDX(:,:,1,2) =
   NaN     1
     1   NaN
     4   NaN

IDX(:,:,1,3) =
   NaN     1
     1     1
   NaN     1
```


I recently attended the annual AGU (American Geophysical Union) meeting and used the time at our booth to talk to many geoscientists about many different topics. One that emerged as a big interest stemming from unfamiliarity is deep learning.

At MathWorks we have been adding to our machine learning and deep learning repertoire ardently. One reflection of this is the newest blog, on deep learning, authored by Steve Eddins, also the author of the Steve on Image Processing blog.

Most applications of deep learning these days are focused on images, with signals coming along strongly now as well. Here's a link to some MATLAB examples/applications in more traditional areas like object detection.

In the sciences, successes have been reported in biological and medical studies.

In domains such as geosciences, data can often be viewed as an image even if it doesn't represent an actual picture. We had several discussions at AGU about possible applications of deep learning to seismic, atmospheric, and oceanographic problems. And, judging from a quick web search, deep learning is certainly having some success in remote sensing applications.

Do you know of any people or applications currently using deep learning in their work? In what domain? And looking forward, where do you see deep learning being applied in the domains you know best? Let us know here.



This is one time of the year when there is often an abundance of baked goods showing up in my office, and many others no doubt. And you hear people say things like, "That's the best cookie I ever had!". And sometimes a debate ensues.

One problem, of course, is that not all cookies are good. Another is not all baked goods are cookies. Hmmm, there must be some math that could help us out here.

A colleague just pointed out a thread on Reddit where there is a discussion about cookies, and which ones are best. In an effort to resolve the intense debate, one employee analyzed recipes to determine which ones qualify as a cookie.

To do so, the author

- scraped recipes from the net, including ones containing the term "cookie" and some other chosen terms,
- used the ingredient lists as input to principal component analysis to reduce the dimensionality of the problem,
- applied clustering algorithms and support vector machines to distinguish between pastries and cookies

The conclusion reached - some very tasty tarts did not qualify for best cookie! Do you use MATLAB recreationally? To learn new concepts that might not yet be relevant for your work, but that you are curious about? We'd love to hear your ideas here.



There are a lot of ways of sharing your ideas with MATLAB these days. Including

- MATLAB scripts and functions
- Live scripts
- MAT-files
- Apps
- Code generation of various kinds, e.g., C/C++ and GPU via Coder products
- MATLAB Compiler and SDK (stand-alone apps and code suitable for integrating into other environments)
- MATLAB Production Server

Today I want to remind you of some tools in MATLAB that can facilitate the transition to code you want to deploy via the MATLAB Compiler and SDK.

Here's a list of the functions I find most useful to aid in deployment.
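The function list from the original post didn't survive in this copy. As examples of the kind of functions meant here (my own picks, an assumption rather than the author's actual list), `isdeployed` and `ctfroot` let code adapt to running as a compiled application:

```
% Locate a data folder whether running deployed or in a MATLAB session
if isdeployed
    % ctfroot points at the root of the deployed archive's extracted files
    datadir = fullfile(ctfroot,'data');
else
    datadir = fullfile(pwd,'data');
end
```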

Do you have other functions that ease your transition to deploying your app? Let me know here.


Today, guest blogger Matt Tearle continues his investigation of the new `polyshape` type (added in R2017b) as a mapping tool.

In a previous post, I used the new `polyshape` variable type in MATLAB to determine connections between states. You can see the code in that post. I'll just load in the polygons and matrix of connections that came from that:

```
load stateborders
% Make a graph of state connections
G = graph(border,stnames,'lower');
% Obtain centroid locations for states
[x,y] = centroid(p);
% Plot map and connections
plot(p)
hold on
plot(G,'XData',x,'YData',y)
hold off
```

Having made this nice visualization, it struck me that the colors from the default polygon `plot` weren't making a good map because neighboring states were often given the same color. The `plot` function, of course, doesn't know about the connections - it is just using a set of colors in the order of the polygons in the array. Using knowledge of the connections between states (stored in the matrix `border` and the graph `G`), I should be able to recolor the states so that no neighboring states have the same color.

Note that the actual color is arbitrary. The goal is actually to assign a number to each state. That number can then be used to index into a list of colors (or a colormap).

A simple ("greedy") algorithm to assign colors would be:

- state 1 is color 1
- go through the states in order
- for each state, find all the neighboring states
- get the color values assigned to the neighboring states
- assign the current state the smallest unused value

In the last step, if colors 1 through *n* are used by neighbors, then the number of colors used in the map will grow to *n*+1. This means that the number of colors used for the whole map will depend on the connections and the order of the states. Let's see what happens if I apply this algorithm to the states in default (alphabetical) order.

```
n = length(p);
color = zeros(n,1);   % Initialize array to hold color numbering
numcolors = 1;        % Number of colors needed (so far)
for k = 1:n
    idx = neighbors(G,k);   % Get neighbors of the kth node
    idx = idx(idx < k);     % But just those that have an assigned color
    neighborcolors = unique(color(idx));   % Get colors used by neighbors
    % Assign the smallest color value not used by the neighboring nodes
    thiscolor = min(setdiff(1:numcolors,neighborcolors));
    % If there isn't one, add another color to the map
    if isempty(thiscolor)
        numcolors = numcolors + 1;
        thiscolor = numcolors;
    end
    color(k) = thiscolor;
end
disp([num2str(numcolors),' colors needed'])
```

5 colors needed

Five colors is actually common for many maps.

polygonplot(p,color)

I'll be plotting the states with a custom color ordering a few times, so I figured it would make sense to make a short function to do that:

```
type('polygonplot')
```

```
function polygonplot(p,color)
% Get a list of colors (using the default plot colors)
ax = gca;
colors = ax.ColorOrder;
% Plot the polygons and change their colors
map = plot(p);
for k = 1:length(p)
    map(k).FaceColor = colors(color(k),:);
    map(k).FaceAlpha = 0.8;   % Make the shading darker
end
```

Five colors is pretty good, but the four-color theorem states that you need only four colors to color any map so that no neighboring regions have the same color. However, in true pure math style, the theorem only says it *can* be done, not *how* to do it.

Is there an algorithm I can use to assign colors to the states to achieve a four-color map coloring? It turns out that this is a difficult problem in general. There are algorithms, but they are far more complicated than I want to deal with!

So instead, what about just shuffling the order of the states, then applying the greedy algorithm above? I took that code and turned it into a function that takes a graph as input and returns the color numbering and number of colors needed (equivalent to the max of the color numbering) as output.
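The listing of `greedycolor` isn't shown in the post; here is a sketch of what a function consistent with the earlier loop might look like (my reconstruction, not the author's actual file):

```
function [color,numcolors] = greedycolor(G)
% Greedy coloring: give each node the smallest color value
% not already used by a previously colored neighbor.
n = numnodes(G);
color = zeros(n,1);
numcolors = 1;
for k = 1:n
    idx = neighbors(G,k);
    idx = idx(idx < k);                    % neighbors already colored
    neighborcolors = unique(color(idx));
    thiscolor = min(setdiff(1:numcolors,neighborcolors));
    if isempty(thiscolor)                  % all current colors in use
        numcolors = numcolors + 1;
        thiscolor = numcolors;
    end
    color(k) = thiscolor;
end
end
```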

Now I just need to shuffle the states and apply the function. But first, the matrix of connections is currently just the lower-triangular part. That structure will be ruined by the shuffling, so I'll make the full matrix version:

border = border + border';

Now let's try shuffling and see how many colors we need:

```
rng(123) % for reproducibility
idx = randperm(48);
Gperm = graph(border(idx,idx));
[~,nc] = greedycolor(Gperm)
```

nc = 6

OK, it's working, but that particular shuffle didn't help. It actually was worse. Well, that's random numbers for you. Maybe it's a good idea to see how the number of colors needed are distributed. Let's run the algorithm a bunch of times and see:

```
rng(123)
nexp = 1000;
numcolors = zeros(nexp,1);
for k = 1:nexp
    idx = randperm(n);
    Gperm = graph(border(idx,idx));
    [~,numcolors(k)] = greedycolor(Gperm);
end
histogram(numcolors)
```

Interesting. Most of the time I get five, sometimes I need six, and about 5% of the time I do indeed get a valid 4-color mapping. So a simple way to get a 4-color mapping would be to try random permutations until only four colors were needed. A terrible algorithm? Perhaps. But I bet it works:

```
numcolors = Inf;
while numcolors > 4
    idx = randperm(n);
    Gperm = graph(border(idx,idx));
    [color,numcolors] = greedycolor(Gperm);
end
polygonplot(p(idx),color)
```

Surely there's a better way, though. Right...? Well, maybe. I did encounter an algorithm for generating 5-color mappings. Basically, remove nodes from the graph one by one, until the nodes of the remaining graph all have degree greater than five. Then... do more stuff, which will color the remaining graph. Then you add the nodes back, coloring them as you go. I didn't really pay attention to the details because it seemed like there was a very good chance that you'd never get nodes with degree greater than five. Every time you remove a node, the degree of every node it's attached to goes down by one. So it seems like you could just try to repeatedly remove the lowest-degree node. Keep the list of nodes in order that you removed them, then use that (or, more precisely, the reverse of it, from last to first) as the ordering in which to color the nodes. It's not guaranteed to work, but it's probably not a bad option to try (and surely more efficient than random permutations). Also, this was for 5-color mappings, not 4, but let's just see what we get...

```
G = graph(border,stnames);
idx = zeros(n,1);
for j = 1:n
    [~,k] = min(degree(G));
    idx(j) = find(strcmp(G.Nodes.Name{k},stnames));
    G = rmnode(G,k);
end
idx = flipud(idx);
Gperm = graph(border(idx,idx));
[color,numcolors] = greedycolor(Gperm);
polygonplot(p(idx),color)
```

And just like that, I have a 4-color mapping for the US! It's not far from being a 3-color mapping: only three states needed the fourth color. In fact, notice that this approach tries to use the lowest numbers as much as possible:

histogram(color)

But maybe I just got lucky. So let's try this on another map. Feeling brave, I found a world map shape file online. I applied the exact same approach: imported it with `shaperead`, converted the resulting structure array to a polygon array, used the "pad-and-intersect" method to build the graph of shared borders, sorted by iteratively removing the lowest-degree node, then applied the greedy coloring algorithm. The result...?

A 4-color world map! Just like that.

As I understand it, real cartographers wouldn't have much use for my algorithm. There's a definite skew in the colors (e.g., all islands are color #1 because they have no borders), so the result isn't particularly aesthetically pleasing. Are there any amateur (or professional!) cartographers out there who want to try applying a "real" map-coloring algorithm using polygons and graphs? I'd love to see your results. I also know there must be maps that break my algorithm, but how well will the algorithm hold up with actual maps in practice, rather than deliberately pathological examples? If anyone has other polygonal maps they want to try coloring with this approach, I'd love to hear how it works.

And, of course, maps and graphs don't have to be just about physical geography - they can be abstractions of any number of things. If you have an application that could use map coloring for visualization, share your ideas in the comments.



R2017b was released recently. Today's guest blogger, Matt Tearle, discovered a fun application of one of the many new features - solving a map-based puzzle.

"Car Talk" recently featured a brain teaser submitted by a former MathWorker. The short, slightly modified version is: Loren has to drive from Delaware to every state in the contiguous US, visiting each state exactly once; Loren's boss - let's call him Jack - offers to meet her in the last state, which turns out to be his birthplace; but how can Jack know what the last state will be, without waiting to hear Loren's itinerary?

I'm about to give away the answer, so if you want to figure it out yourself, do so now. It's OK, I'll wait.

Done? Great. Hopefully you realized that there's one and only one state that borders only one other state. That state has to be the beginning or end of any path. (You can't visit any state more than once so once you enter from the single neighboring state, you're stuck.) The beginning is Delaware (which borders several other states), so this unique "degree-one" state must be the end of Loren's epic journey.

As usual for me, the really interesting aspect of this problem is: can I solve it in MATLAB? This is a classic "traveling salesman" problem, so the graph capabilities recently added to MATLAB should be of use. But the first problem is how to build the graph that represents the connections between the states. It turns out that another new feature will help.

A new variable type for representing 2-D polygons was introduced in R2017b. The `polyshape` function takes vectors representing *x* and *y* coordinates and returns a polygon with those points as its vertices. It so happens that Mapping Toolbox comes with a data file containing the latitudes and longitudes of the borders of the US states - just what I need. With the `shaperead` function I can import this data in the form of a structure array, with fields for the name of the state and the latitude/longitude of the border:

```
states = shaperead('usastatehi.shp');
% Remove Alaska & Hawaii, and Washington DC
states([2,11,51]) = [];
```

For a single state, it's straightforward to extract the arrays in the `X` (longitude) and `Y` (latitude) fields and use these as the inputs to `polyshape`:

```
x = states(1).X;
y = states(1).Y;
p = polyshape(x,y)
```

p = polyshape with properties: Vertices: [518×2 double] NumRegions: 2 NumHoles: 0

Now that the state is represented as a polygon, I can use the functions that work with polygons, such as plotting:

plot(p)

But to really get the benefit of working with polygons, I'd like to get an array of them, where each element is a polygon representing one state. I could extract the *x* and *y* components for each state in a loop, but any time I see myself about to apply an operation to each element of an array, I think "time for `arrayfun`!".

```
struct2poly = @(s) polyshape(s.X,s.Y);   % define my struct -> polyshape operation
p = arrayfun(struct2poly,states)         % apply to all elements of the struct array states
```

p = 48×1 polyshape array with properties: Vertices NumRegions NumHoles

Because this is MATLAB, functions like `plot` should work on arrays of polygons, just as well as scalar polygons, right?

plot(p)

Isn't that nice? Each polygon is plotted as its own shaded shape. If we zoom in, we can see the solution to the puzzle.

axis([-75 -66 42 48])

By eye we can see how states share borders, but how can we get MATLAB to determine this in an automated way? Let's start with the simplest problem: given two states (polygons), can we determine if they share a border or not?

```
me = p(17);
ma = p(19);
nh = p(27);
subplot(1,2,1)
plot([ma nh])
title('Shared border')
subplot(1,2,2)
plot([ma me])
title('No shared border')
```

The polygons have a `Vertices` property that contains the border locations. So a simplistic approach would be to see if the two polygons have any vertices in common. However, this has a lot of potential problems. Firstly, two shapes can share a boundary without sharing vertices:

```
a = polyshape([0 1 1 0],[0 0 3 3]);
b = polyshape([1 2 2 1],[1 1 2 2]);
subplot(1,1,1)
plot([a b])
intersect(a.Vertices,b.Vertices,'rows')
```

ans = 0×2 empty double matrix

Secondly, even if the survey points all align, there's always the potential for floating-point precision issues: how is 31°41'59" represented?

```
lat1 = 31.699722222222222
lat2 = 31 + 41/60 + 59/3600
lat1 == lat2
lat1 - lat2
```

```
lat1 = 31.7
lat2 = 31.7
ans = logical 0
ans = -3.5527e-15
```

So a better idea might be to think about it geometrically. Two areas in the plane share a border if their intersection is a line or curve. And the new polygons in MATLAB have `intersect` functionality! Unfortunately, from a programming perspective, the intersection of two polygons is another polygon. And a polygon with zero area (which is what you get when the intersection is a line) is empty, so there's no obvious way to distinguish between states that share a boundary and states that don't.

intersect(ma,nh) intersect(ma,me)

ans = polyshape with properties: Vertices: [0×2 double] NumRegions: 0 NumHoles: 0 ans = polyshape with properties: Vertices: [0×2 double] NumRegions: 0 NumHoles: 0

Furthermore, this approach would still have been vulnerable to the finite precision problem.

The new MATLAB polygons have a `polybuffer` function that gives you a way to "fatten up" a polygon by pushing the border out a given amount. Here's Massachusetts with a 0.1-degree (approx. 7-mile) border:

```
mafat = polybuffer(ma,0.1);
plot([ma mafat])
```

This gives a fairly simple, but robust, way to determine whether states are neighbors: put a small buffer around both and then see if there's any overlap.

```
pfat = polybuffer(p,1e-4); % add ~35 feet around every state
me = pfat(17);
ma = pfat(19);
nh = pfat(27);
intersect(ma,nh)
intersect(ma,me)
```

ans = polyshape with properties: Vertices: [541×2 double] NumRegions: 1 NumHoles: 0 ans = polyshape with properties: Vertices: [0×2 double] NumRegions: 0 NumHoles: 0

As you'd expect for a data type that represents polygons, there's an `area` function that does exactly what its name suggests:

```
area(intersect(ma,nh))
area(intersect(ma,me))
```

```
ans = 0.00035462
ans = 0
```

So determining whether the padded states overlap should be simple: does the intersection have nonzero area? But the programmer in me was a bit worried: should I use a function, like `area`, that will come with some overhead? Would it be more efficient to directly query the properties of the polygon that results from the intersection?

However, there's another slight wrinkle that needs to be considered before worrying about implementation details. The "Four Corners" area of the American Southwest is the one place where four states touch - Colorado, New Mexico, Arizona, and Utah. Does Colorado border Arizona? Does Utah border New Mexico? We can let philosophers and mathematicians argue over that. But unless an infinitely thin Loren is driving an infinitely thin car, the answer, for our purposes, is no. There's no way in reality to drive from Colorado to Arizona without going through Utah or New Mexico.

```
plot(pfat,'LineStyle','none')
axis([-109.055 -109.035 36.99 37.01])
```

But if we look at the intersection between the "fattened" states, there will be a small overlap between Colorado and Arizona. Note that it will be *very* small - only slightly bigger than my apartment in graduate school, which should probably be considered a quantum amount of geographic area. Any real border between states will be at least an order of magnitude larger than this. So now we have a simple and robust way to determine whether or not states share a border:

- Add a small border to all states
- Compute the area of the intersection of states
- States share a border if that intersection is greater than a threshold

```
area(intersect(ma,nh)) > 1e-6
area(intersect(ma,me)) > 1e-6
```

```
ans = logical 1
ans = logical 0
```

Now I just need to apply that to every pair of states. I could probably try to get fancy with `arrayfun`, but this is a case where the simple `for`-loop approach might be a better idea. In fact, loops can help me avoid redundant calculations. If State A borders State B, then State B borders State A and I don't have to check that.

```
border = zeros(48);
for k = 1:48
    for j = (k+1):48   % Only need j for ks we haven't checked yet
        border(j,k) = area(intersect(pfat(j),pfat(k))) > 1e-6;
    end
end
```

The result of this is a lower-triangular matrix of connections. Now I can build a graph from the connectivity matrix and solve the problem. One of the neat graph functions is `degree`, which gives the number of connections to each node.

```
G = graph(border,'lower');
histogram(degree(G))
```

So there's one degree-one node in the graph. That's the one we want. But which state does that correspond to? The names are stored in the `Name` field of the original structure array. I'll extract them into a cell array by using one of my favorite MATLAB tricks. When indexing produces multiple results (not just an array with multiple elements), those results come back in the form of a comma-separated list. That is, if `s` is a structure array (for example), `s.Thing` is equivalent to `s(1).Thing, s(2).Thing, ..., s(end).Thing`. What's great about having a comma-separated list is that that's the form MATLAB uses for concatenating elements into an array (i.e., `[a,b,c]` or `{a,b,c}`) and for passing inputs into a function (i.e., `myfun(a,b,c)`). Hence, I can extract all the `Name` field elements of the structure array `states` and concatenate them into a cell array with one simple line of MATLAB:

stnames = {states.Name};

And now, finally, I can find out which state borders only one other state:

stnames{degree(G) == 1}

ans = 'Maine'

Having the state names, I can go a little further and visualize both the map and the connectivity graph. To start, it's possible to name the nodes in a graph:

```
G = graph(border,stnames,'lower');
```

This is useful when you visualize a graph because the nodes are automatically labeled:

plot(G)

But this graph doesn't show the states in a physical layout. It's also possible to specify the locations of the nodes when plotting a graph, but for that I need a point in each state to use as the node location. Right in the middle of the state would be good... Guess what? There's a polygon function for that: `centroid`.

```
[x,y] = centroid(p);
plot(G,'XData',x,'YData',y)
```

That's looking pretty good. It would be nice to add the map for reference.

```
plot(p)
hold on
plot(G,'XData',x,'YData',y)
hold off
```

Nice! Something cool to notice about that code: two uses of `plot`, both of which were on "exotic", non-numeric data types (one polygon array and one graph). This is something that our management team works hard on - making sure functionality is consistent, so that, for example, `plot` works naturally on different kinds of data. It's a feature of MATLAB I greatly appreciate. It means that I, as a MATLAB user, don't have to expend brain energy on which plotting function, or even which variant of `plot`, I have to use. I can just use `plot` and get on with what I'm actually trying to do.

I solved the brain teaser, and even went a bit further to make a nice visualization of the states and their connections. But there's one thing about my visualization that's still bugging me: the states are colored with seven unique colors in the order that the states appear in the polygon array (which happens to be alphabetical by name). This means that neighboring states are sometimes given the same color. For example, note that Nebraska and the Dakotas are elements 25, 32, and 39, which means they all end up the same color. I happen to recall that there's a mathematical theorem that says that any map can be colored with only four unique colors, such that no neighboring regions have the same color.

To me, that sounds like a second interesting task for polygons and graphs,... but a story for another time. In the meantime, let us know in the comments if you can think of somewhere in your work where the new `polyshape` type might be useful.



Text data has become an important part of data analytics, thanks to advances in natural language processing that transform unstructured text into meaningful data. The new Text Analytics Toolbox provides tools to process and analyze text data in MATLAB.

Today's guest blogger, Toshi Takeuchi introduces some cool features available in the new toolbox, starting with word embeddings. Check out how he uses sentiment analysis to find good AirBnB locations to stay in Boston!

- What is a Word Embedding?
- Ingredients
- Loading a Pre-Trained Word Embedding from GloVe
- Vector Math Example
- Visualizing the Word Embedding
- Using Word Embeddings for Sentiment Analysis
- Word Embeddings Meet Machine Learning
- Prepare Data for Machine Learning
- Training and Evaluating the Sentiment Classifier
- Boston Airbnb Open Data
- Airbnb Review Ratings
- Computing Sentiment Scores
- Sentiment by Location
- Summary

Have you heard about word2vec or GloVe? These are part of a very powerful natural language processing technique called word embeddings, and you can now take advantage of it in MATLAB via Text Analytics Toolbox.

Why am I excited about it? It "embeds" words into a vector space model based on how often a word appears close to other words. Done at an internet scale, you can attempt to capture the semantics of the words in the vectors, so that similar words have similar vectors.

One very famous example of how word embeddings can represent such relationships is that you can do a vector computation like this:

$$king - man + woman \approx queen$$

Yes, "queen" is like "king" except that it is a woman, rather than a man! How cool is that? This kind of magic has become possible thanks to vast availability of raw text data on the internet, greater computing capability that can process it, and advances in artificial neural networks, such as deep learning.
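Using the toolbox functions discussed below, that computation can be sketched like this (assuming a pre-trained embedding has already been loaded into `emb`, as shown later in this post):

```
v = word2vec(emb,'king') - word2vec(emb,'man') + word2vec(emb,'woman');
vec2word(emb,v)   % the word nearest to the resulting vector
```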

Even more exciting is the fact that you don't have to be a natural language processing expert to harness the power of word embeddings if you use pre-trained models! Let me show you how you can use it for your own text analytics purposes, such as document classification, information retrieval and sentiment analysis.

In this example, I will use a pre-trained word embedding from GloVe. To follow along, please

- Get the source code of this post by clicking on "Get the MATLAB code" at the bottom of this page
- Download a free trial version of Text Analytics Toolbox (MATLAB and Statistics and Machine Learning Toolbox R2017b or later are also required).
- Download the pre-trained model glove.6B.300d.txt (6 billion tokens, 400K vocabulary, 300 dimensions) from GloVe.
- Download the sentiment lexicon from University of Illinois at Chicago
- Download the data from the Boston Airbnb Open Data page on Kaggle
- Download my custom function load_lexicon.m and class sentiment.m as well as the raster map of Boston

Please extract the content from the archive files into your current folder.

You can use the function `readWordEmbedding` in Text Analytics Toolbox to read pre-trained word embeddings. To see a word vector, use `word2vec` to get the vector representation of a given word. Because the dimension for this embedding is 300, we get a vector of 300 elements for each word.

filename = "glove.6B.300d"; if exist(filename + '.mat', 'file') ~= 2 emb = readWordEmbedding(filename + '.txt'); save(filename + '.mat', 'emb', '-v7.3'); else load(filename + '.mat') end v_king = word2vec(emb,'king')'; whos v_king

```
  Name        Size      Bytes  Class     Attributes

  v_king      300x1      1200  single
```
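With the embedding loaded, the king/queen analogy from the beginning of this post can be checked directly. This is a quick sketch; note that, depending on the embedding, the single closest word to the result may be "king" itself, so it helps to look at the few nearest words:

```
v_king  = word2vec(emb,'king');
v_man   = word2vec(emb,'man');
v_woman = word2vec(emb,'woman');
% Look at the nearest few words rather than only the top match
vec2word(emb, v_king - v_man + v_woman, 3)
```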

Let's try the vector math! Here is another famous example:

$$paris - france + poland \approx warsaw$$

Apparently, the vector subtraction "paris - france" encodes the concept of "capital" and if you add "poland", you get "warsaw".

Let's try it with MATLAB. `word2vec` returns vectors for given words in the word embedding, and `vec2word` finds the closest words to the vectors in the word embedding.

```
v_paris = word2vec(emb,'paris');
v_france = word2vec(emb,'france');
v_poland = word2vec(emb,'poland');
vec2word(emb, v_paris - v_france + v_poland)
```

ans = "warsaw"

We would like to visualize this word embedding with a `textscatter` plot, but it is hard to see anything if all 400,000 words from the word embedding are included. I found a list of 4,000 English nouns. Let's use only those words, and reduce the dimensions from 300 to 2 using `tsne` (t-Distributed Stochastic Neighbor Embedding) for dimensionality reduction. To make the words easier to see, I zoomed into a specific area of the plot that contains food-related words. You can see that related words are placed close together.

```
if exist('nouns.mat','file') ~= 2
    url = 'http://www.desiquintans.com/downloads/nounlist/nounlist.txt';
    nouns = webread(url);
    nouns = split(nouns);
    save('nouns.mat','nouns');
else
    load('nouns.mat')
end
nouns(~ismember(nouns,emb.Vocabulary)) = [];
vec = word2vec(emb,nouns);
rng('default'); % for reproducibility
xy = tsne(vec);
figure
textscatter(xy,nouns)
title('GloVe Word Embedding (6B.300d) - Food Related Area')
axis([-35 -10 -36 -14]);
set(gca,'clipping','off')
axis off
```

For a practical application of word embeddings, let's consider sentiment analysis. We would typically take advantage of pre-existing sentiment lexicons such as this one from the University of Illinois at Chicago. It comes with 2,006 positive words and 4,783 negative words. Let's load the lexicon using the custom function `load_lexicon`.

If we just rely on the available words in the lexicon, we can only score sentiment for 6,789 words. One idea to expand on this is to use the word embedding to find words that are close to these sentiment words.

```
pos = load_lexicon('positive-words.txt');
neg = load_lexicon('negative-words.txt');
[length(pos) length(neg)]
```

ans = 2006 4783
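As a quick sketch of that "nearby words" idea (the word 'excellent' is just an illustrative choice here; this assumes `emb` is loaded as above), `vec2word` can return the k nearest words to a given vector:

```
% Words near a known positive word in the embedding space
neighbors = vec2word(emb, word2vec(emb,'excellent'), 7)
```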

What if we use word vectors as the training data to develop a classifier that can score all words in the 400,000-word embedding? We can take advantage of the fact that related words are close together in word embeddings to do this. Let's make a sentiment classifier that takes advantage of the vectors from the word embedding.

As the first step, we will get vectors from the word embedding for the words in the lexicon to create a matrix of predictors with 300 columns, and then use positive or negative sentiment labels as the response variable. Here is a preview of the word, the response variable, and the first 7 of the 300 predictor variables.

```
% Drop words not in the embedding
pos = pos(ismember(pos,emb.Vocabulary));
neg = neg(ismember(neg,emb.Vocabulary));
% Get corresponding word vectors
v_pos = word2vec(emb,pos);
v_neg = word2vec(emb,neg);
% Initialize the table and add the data
data = table;
data.word = [pos;neg];
pred = [v_pos;v_neg];
data = [data array2table(pred)];
data.resp = zeros(height(data),1);
data.resp(1:length(pos)) = 1;
% Preview the table
head(data(:,[1,end,2:8]))
```

```
ans =
  8×9 table
        word         resp      pred1        pred2        pred3       pred4        pred5        pred6        pred7
    _____________    ____    _________    _________    _________    ________    __________    _________    __________
    "abound"          1       0.081981     -0.27295      0.32238     0.19932      0.099266      0.60253       0.18819
    "abounds"         1      -0.037126     0.085212      0.26952     0.20927     -0.014547      0.52336       0.11287
    "abundance"       1      -0.038408     0.076613    -0.094277    -0.10652      -0.43257      0.74405       0.41298
    "abundant"        1       -0.29317    -0.068101     -0.44659    -0.31563      -0.13791      0.44888       0.31894
    "accessible"      1       -0.45096     -0.46794      0.11761    -0.70256       0.19879      0.44775       0.26262
    "acclaim"         1        0.07426     -0.11164       0.3615     -0.4499    -0.0061991      0.44146    -0.0067972
    "acclaimed"       1        0.69129      0.04812      0.29267      0.1242      0.083869      0.25791       -0.5444
    "acclamation"     1      -0.026593     -0.60759     -0.15785     0.36048      -0.45289    0.0092178      0.074671
```

Let's partition the data into a training set and holdout set for performance evaluation. The holdout set contains 30% of the available data.

```
rng('default') % for reproducibility
c = cvpartition(data.resp,'Holdout',0.3);
train = data(training(c),2:end);
Xtest = data(test(c),2:end-1);
Ytest = data.resp(test(c));
Ltest = data(test(c),1);
Ltest.label = Ytest;
```

We want to build a classifier that can separate positive words and negative words in the vector space defined by the word embedding. For a quick performance evaluation, I chose the fast and easy linear discriminant among possible machine learning algorithms.

Here is the confusion matrix of this model. The result was 91.1% classification accuracy. Not bad.

```
% Train model
mdl = fitcdiscr(train,'resp');
% Predict on test data
Ypred = predict(mdl,Xtest);
cf = confusionmat(Ytest,Ypred);
% Display results
figure
vals = {'Negative','Positive'};
heatmap(vals,vals,cf);
xlabel('Predicted Label')
ylabel('True Label')
title({'Confusion Matrix of Linear Discriminant'; ...
    sprintf('Classification Accuracy %.1f%%', ...
    sum(cf(logical(eye(2))))/sum(sum(cf))*100)})
```

Let's check the predicted sentiment score against the actual label. The custom class `sentiment` uses the linear discriminant model to score sentiment.

The `scoreWords` method of the class scores words. A positive score represents positive sentiment, and a negative score is negative. Now we can use 400,000 words to score sentiment.

dbtype sentiment.m 18:26

```
18        function scores = scoreWords(obj,words)
19            %SCOREWORDS scores sentiment of words
20            vec = word2vec(obj.emb,words);           % word vectors
21            if size(vec,2) ~= obj.emb.Dimension      % check num cols
22                vec = vec';                          % transpose as needed
23            end
24            [~,scores,~] = predict(obj.mdl,vec);     % get class probabilities
25            scores = scores(:,2) - scores(:,1);      % positive scores - negative scores
26        end
```

Let's test this custom class. If the label is 0 and score is negative or the label is 1 and score is positive, then the model classified the word correctly. Otherwise, the word was misclassified.

Here is the table that shows 10 examples from the test set:

- the word
- its sentiment label (0 = negative, 1 = positive)
- its sentiment score (negative = negative, positive = positive)
- evaluation (true = correct, false = incorrect)

```
sent = sentiment(emb,mdl);
Ltest.score = sent.scoreWords(Ltest.word);
Ltest.eval = Ltest.score > 0 == Ltest.label;
disp(Ltest(randsample(height(Ltest),10),:))
```

```
        word         label     score      eval
    _____________    _____    ________    _____
    "fugitive"         0      -0.90731    true
    "misfortune"       0      -0.98667    true
    "outstanding"      1       0.99999    true
    "reluctant"        0      -0.99694    true
    "botch"            0      -0.99957    true
    "carefree"         1       0.97568    true
    "mesmerize"        1        0.4801    true
    "slug"             0      -0.88944    true
    "angel"            1       0.43419    true
    "wheedle"          0      -0.98412    true
```

Now we need a way to score the sentiment of human-language text, rather than a single word. The `scoreText` method of the sentiment class averages the sentiment scores of each word in the text. This may not be the best way to do it, but it's a simple place to start.

dbtype sentiment.m 28:33

```
28        function score = scoreText(obj,text)
29            %SCORETEXT scores sentiment of text
30            tokens = split(lower(text));        % split text into tokens
31            scores = obj.scoreWords(tokens);    % get score for each token
32            score = mean(scores,'omitnan');     % average scores
33        end
```

Here are the sentiment scores on sentences given by the `scoreText` method - very positive, somewhat positive, and negative.

```
[sent.scoreText('this is fantastic') ...
 sent.scoreText('this is okay') ...
 sent.scoreText('this sucks')]
```

ans = 0.91458 0.80663 -0.073585

Let's try this on review data from the Boston Airbnb Open Data page on Kaggle. First, we would like to see what people say in their reviews as a word cloud. Text Analytics Toolbox provides functionality to simplify text preprocessing workflows, such as `tokenizedDocument`, which parses documents into arrays of tokens, and `bagOfWords`, which generates a term-frequency count model (this can be used to build a machine learning model).

The commented-out code will generate the word cloud shown at the top of this post. However, you can also generate word clouds using two-word phrases known as bigrams. You can generate bigrams with `docfun`, which operates on the array of tokens. You can also see that it is possible to generate trigrams and other n-grams by modifying the function handle.

It seems a lot of comments were about locations!

```
opts = detectImportOptions('listings.csv');
l = readtable('listings.csv',opts);
reviews = readtable('reviews.csv');
comments = tokenizedDocument(reviews.comments);
comments = lower(comments);
comments = removeWords(comments,stopWords);
comments = removeShortWords(comments,2);
comments = erasePunctuation(comments);
% == uncomment to generate a word cloud ==
% bag = bagOfWords(comments);
% figure
% wordcloud(bag);
% title('AirBnB Review Word Cloud')
% Generate a Bigram Word Cloud
f = @(s)s(1:end-1) + " " + s(2:end);
bigrams = docfun(f,comments);
bag2 = bagOfWords(bigrams);
figure
wordcloud(bag2);
title('AirBnB Review Bigram Cloud')
```
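For reference, the trigram variant mentioned above just extends the same sliding-window function handle. This is a sketch, not run in this post; documents shorter than three tokens would need to be filtered out first:

```
% Join each token with the next two to form trigrams
f3 = @(s) s(1:end-2) + " " + s(2:end-1) + " " + s(3:end);
trigrams = docfun(f3,comments);
bag3 = bagOfWords(trigrams);
```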

Review ratings are also available, but they are heavily skewed towards 100, meaning the vast majority of listings are just perfectly wonderful (really?). As this XKCD comic shows, review ratings suffer from the usual problem with online ratings. This is not very useful.

```
figure
histogram(l.review_scores_rating)
title('Distribution of AirBnB Review Ratings')
xlabel('Review Ratings')
ylabel('# Listings')
```

Now let's score the sentiment of Airbnb listing reviews instead. Since a listing can have a number of reviews, I will use the median sentiment score per listing. The median sentiment scores in Boston are generally positive, and they follow a roughly normal distribution. This looks more realistic.

```
% Score the reviews
f = @(str) sent.scoreText(str);
reviews.sentiment = cellfun(f,reviews.comments);
% Calculate the median review score by listing
[G,listings] = findgroups(reviews(:,'listing_id'));
listings.sentiment = splitapply(@median, ...
    reviews.sentiment,G);
% Visualize the results
figure
histogram(listings.sentiment)
title('Sentiment by Boston AirBnB Listing')
xlabel('Median Sentiment Score')
ylabel('Number of Listings')
```

The bigram cloud showed that reviewers often commented on location and distance. You can use the latitude and longitude of the listings to see where listings with very high or low sentiment scores are located. If you see clusters of high scores, they may indicate good places to stay.

```
% Join sentiment scores and listing info
joined = innerjoin( ...
    listings,l(:,{'id','latitude','longitude', ...
    'neighbourhood_cleansed'}), ...
    'LeftKeys','listing_id','RightKeys','id');
joined.Properties.VariableNames{end} = 'ngh';
% Discard listings with a NaN sentiment score
joined(isnan(joined.sentiment),:) = [];
% Discretize the sentiment scores into buckets
joined.cat = discretize(joined.sentiment,0:0.25:1, ...
    'categorical',{'< 0.25','< 0.50','< 0.75','<=1.00'});
% Remove undefined categories
cats = categories(joined.cat);
joined(isundefined(joined.cat),:) = [];
% Variable for color
colorlist = winter(length(cats));
% Generate the plot
latlim = [42.300 42.386];
lonlim = [-71.1270 -71.0174];
load boston_map.mat
figure
imagesc(lonlim,latlim,map)
hold on
gscatter(joined.longitude,joined.latitude,joined.cat,colorlist,'o')
hold off
dar = [1, cosd(mean(latlim)), 1];
daspect(dar)
set(gca,'ydir','normal');
axis([lonlim,latlim])
title('Sentiment Scores by Boston Airbnb Listing')
[g,ngh] = findgroups(joined(:,'ngh'));
ngh.Properties.VariableNames{end} = 'name';
ngh.lat = splitapply(@mean,joined.latitude,g);
ngh.lon = splitapply(@mean,joined.longitude,g);
% Annotations
text(ngh.lon(2),ngh.lat(2),ngh.name(2),'Color','w')
text(ngh.lon(4),ngh.lat(4),ngh.name(4),'Color','w')
text(ngh.lon(6),ngh.lat(6),ngh.name(6),'Color','w')
text(ngh.lon(11),ngh.lat(11),ngh.name(11),'Color','w')
text(ngh.lon(13),ngh.lat(13),ngh.name(13),'Color','w')
text(ngh.lon(17),ngh.lat(17),ngh.name(17),'Color','w')
text(ngh.lon(18),ngh.lat(18),ngh.name(18),'Color','w')
text(ngh.lon(22),ngh.lat(22),ngh.name(22),'Color','w')
```

In this post, I focused on word embeddings and sentiment analysis as an example of new features available in Text Analytics Toolbox. Hopefully you saw that the toolbox makes advanced text processing techniques very accessible. You can do more with word embeddings besides sentiment analysis, and the toolbox offers many more features besides word embeddings, such as Latent Semantic Analysis or Latent Dirichlet Allocation.

Hopefully I have more opportunities to discuss those other interesting features in Text Analytics Toolbox in the future.

Get a free trial version to play with it and let us know what you think here!

Get the MATLAB code

Published with MATLAB® R2017b


Today our guest blogger, David Garrison, will continue his series on MathWorks involvement in the 2017 solar eclipse.

- Part 1: The Citizen CATE Experiment
- Part 2: Training the Volunteers
- Part 3: Rehearsing for the Eclipse
- **Part 4: Imaging the Eclipse**

Here is Part 4 of the Series.

In Part 1 of this series, I discussed MathWorks participation in the Citizen CATE Experiment - a citizen science project to image the 2017 solar eclipse. In Part 2 of this series, I described the volunteers, the equipment they will be using, and how they are being trained. In Part 3 of this series, I talked about how the volunteers were rehearsing for the eclipse and the MATLAB software being used to capture the images from the eclipse.

In this final post, I will describe what happened on the day of the eclipse and show some images of the corona during totality.

Weather was always a concern for viewing the eclipse. Fortunately, eclipse day dawned bright and clear. We set up in a public park in the town of Hartsville, Tennessee near the mid-line of the band of totality. The town had planned events for Eclipse day so we were joined in the park by a few hundred townspeople.

We started by setting up the telescope, camera, and software using the procedures we had perfected through all the previous practice sessions. The pre-eclipse steps included recording the GPS data, aligning the telescope, checking the telescope tracking, and adjusting the focus to get the most precise images possible. Here is a picture of my son, William, using the MATLAB software to monitor live images of the sun before the eclipse began.

Our goal was to be completely finished with the preliminary steps before the partial phase of the eclipse began. Here is an example image taken during those preparations. This image shows a number of big sunspots visible on the surface of the sun.

The partial phase began with *first contact* around 12:00 pm CDT. First contact is when the moon begins to obscure the sun. The partial phase was expected to last about one and a half hours. Here is an image of the partial eclipse about 25 minutes after first contact.

A number of interesting phenomena happened during the partial phase. First, the temperature began to drop. From first contact to totality, the temperature dropped about 7 degrees F. As the eclipse approached totality, the light around us began to dim. About halfway through the partial phase, we observed crescent-shaped images of the sun from light passing through the trees. This occurs because the trees act like thousands of little pin-hole cameras.

Totality began about 1:30 pm CDT and lasted for about 2 minutes and 30 seconds. As the last of the sun's disk disappeared behind the moon, there was an audible gasp from the crowd as the corona appeared as a white halo around the sun. I heard people around us use words like "beautiful", "other-worldly", and even "spiritual" to describe the appearance of the sun during totality. As a first-timer, I didn't know what to expect. Now that I have experienced my first total solar eclipse, I can honestly say that it is the most amazing natural phenomenon that I have ever experienced.

While we were mesmerized by totality, the MATLAB program was acquiring the images just as it was designed to do. Here are two images taken during totality. The first is from my site in Tennessee. The second is from Andrei's site in South Carolina. Notice the apparent difference in the shape of the prominences at 3 and 4 o'clock.

Of the 68 Citizen CATE teams, 58 reported clear weather in their locations at the time of the eclipse. Most of the teams have uploaded their data to the National Solar Observatory site. Dr. Penn and his team have already begun to analyze the data. A first look movie of the combined data from the Citizen CATE sites has been posted to the Citizen CATE website.

With this final post on the 2017 solar eclipse, I'd like to thank Loren for letting me hijack her blog for this series. I'd also like to thank Harrison Roberts and Andrei Ursache, my fellow MATLAB Citizen CATE volunteers, for their help during the creation of these posts. Finally, I'd like to thank Dr. Matthew Penn and the National Solar Observatory for allowing us to be a part of this exciting project.

Remember, there is another total solar eclipse that will pass through the United States on April 8th, 2024. Can't wait...

If you have an interest in amateur Astronomy or have any questions or comments about the Citizen CATE experiment, please let us know here.

*Copyright 2017 The MathWorks, Inc.*

Get the MATLAB code

Published with MATLAB® R2017b


Today I'd like to introduce a guest blogger, Stephen Doe, who works for the MATLAB Documentation team here at MathWorks. Stephen will discuss different methods for grouping data, and performing calculations on those groups, using tables.

Tables are convenient containers for column-oriented data. The variables in a table can have different data types, but must have the same number of rows. You can access table data by row, by variable, or by variable name. And you can specify groups within table variables, to perform calculations on those groups. Today I will show several different ways to apply functions to groups in tables.

First, let's create a table. You can create a table using the `table` function, the Import Tool, or the `readtable` function. I'm going to use `readtable` to read data from a sample file that ships with MATLAB. The file `outages.csv` contains simulated data for electric power outages over a period of 12 years in the United States. Read the table. To display the first five rows, use table indexing.

```
T = readtable('outages.csv','Format','%C%D%f%f%D%C');
T(1:5,:)
```

```
ans =
  5×6 table
     Region         OutageTime       Loss     Customers     RestorationTime          Cause
    _________    ________________    ______    __________    ________________    _______________
    SouthWest    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    winter storm
    SouthEast    2003-01-23 00:49    530.14    2.1204e+05    NaT                 winter storm
    SouthEast    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    winter storm
    West         2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    equipment fault
    MidWest      2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    severe storm
```

There are six columns in the file. Since I want to specify the data types of the table variables, I use the `Format` name-value pair argument. `T` has two categorical variables (specified with `%C`), two datetime variables (`%D`) and two numeric variables (`%f`).

Once data is in a table, you can perform calculations on different table variables. One way is to use dot notation to refer to table variables by name and perform calculations. For example, calculate the total number of customers affected by power outages using the `sum` function, omitting any `NaN` values that might be in `T.Customers`.

```
totalCustomers = sum(T.Customers,'omitnan')
```

totalCustomers = 1.903e+08

You also can apply a function to every table variable, using the `varfun` function. `varfun` is particularly useful for applying a function and returning the results in another table.

However, you might want to apply a function only to variables of a particular type. You can specify the type of variable with the `vartype` function. Here I'm going to subscript into `T` for the numeric variables and calculate the mean of each one. And since the variables have some `NaN` values, I'm going to wrap `mean` in an anonymous function so I can use the `'omitnan'` flag.

```
T2 = T(:,vartype('numeric'));
omean = @(x) mean(x,'omitnan');
meanT = varfun(omean,T2)
```

```
meanT =
  1×2 table
    Fun_Loss    Fun_Customers
    ________    _____________
     563.89       1.6693e+05
```

A more interesting view of the table is to group the data, and start looking for differences between the groups. For example, what is the difference in the mean power loss between the Northeast and Southwest regions?

To answer this question, you can start by specifying one or more table variables as *grouping variables*. The grouping variables define groups into which you split the other table variables. Then you can apply a function to each group within each table variable and collect the results. For more on grouping variables, see Grouping Variables to Split Data.

You can use grouping variables with the `varfun` function I described above. Specify `Region` as the grouping variable in this table. It's a categorical variable, making it especially convenient to use as a grouping variable. Specify `Loss` and `Customers` as the input variables, and calculate the mean values by region for these two variables. I'm going to use the same `omean` function I specified above.

```
meanByRegion = varfun(omean,T,'GroupingVariables','Region',...
    'InputVariables',{'Loss','Customers'})
```

```
meanByRegion =
  5×4 table
     Region      GroupCount    Fun_Loss    Fun_Customers
    _________    __________    ________    _____________
    MidWest         142         1137.7       2.4015e+05
    NorthEast       557         551.65       1.4917e+05
    SouthEast       389         495.35       1.6776e+05
    SouthWest        26         493.88       2.6975e+05
    West            354         433.37       1.5201e+05
```

Another way to apply a grouping variable to the data in a table variable is to use the `accumarray` function. If I convert `T.Region` to a numeric array, and extract `T.Loss` from `T`, then I can calculate mean power loss per region using `accumarray`, just as I did with `varfun`.

```
regions = double(T.Region);
A = accumarray(regions,T.Loss,[],omean)
```

```
A =
       1137.7
       551.65
       495.35
       493.88
       433.37
```

`accumarray` operates on arrays. So to use `accumarray` on a table variable, you must extract it from the table. Also, the grouping variable must be numeric, and the output argument is an array, not a table. While you can use `accumarray` on table variables, it lacks many of the conveniences that `varfun` provides.

You can specify more than one grouping variable using `varfun`. For example, I'll specify `Region` and `Cause` as grouping variables. The groups are defined as combinations of regions and causes.

```
meanByRegionAndCause = varfun(omean,T,'GroupingVariables',{'Region','Cause'},...
    'InputVariables',{'Loss','Customers'});
meanByRegionAndCause(1:5,:)
```

```
ans =
  5×5 table
    Region          Cause           GroupCount    Fun_Loss    Fun_Customers
    _______    ________________    __________    ________    _____________
    MidWest    attack                  12              0              0
    MidWest    energy emergency        19         536.09          57603
    MidWest    equipment fault          9         343.37          29704
    MidWest    severe storm            31         1055.7     4.3584e+05
    MidWest    thunder storm           32         941.07     1.3301e+05
```

Another simple calculation is the mean duration of the power outages by region. The basis of this calculation is the difference between the `OutageTime` and `RestorationTime` table variables.

You might think to use `varfun` for this calculation too. The difficulty is that `varfun` applies a function to each table variable separately. But this calculation requires both variables. In other words, we want to apply a function that requires two input arguments.

The solution is to use the `rowfun` function. With `rowfun`, all the table variables are taken as input arguments. If the table has `N` variables, then the function you apply must accept `N` input arguments.

To perform this calculation, I'll start by defining an anonymous function that subtracts one input argument from the other, and then calculates the mean. Then I will calculate the mean duration of power outages by region using `rowfun`.

```
meanDiff = @(a,b) mean(abs(a-b),'omitnan');
meanOutageTimes = rowfun(meanDiff,T,'GroupingVariables','Region',...
    'InputVariables',{'OutageTime','RestorationTime'})
```

```
meanOutageTimes =
  5×3 table
     Region      GroupCount      Var3
    _________    __________    _________
    MidWest         142        819:14:44
    NorthEast       557        581:02:18
    SouthEast       389         40:49:49
    SouthWest        26         59:31:07
    West            354        673:27:12
```

Note that `OutageTime` and `RestorationTime` are datetime variables. You can perform arithmetic with datetime variables, just as you can with numeric variables.
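For example, subtracting one datetime from another yields a `duration`. Here is a small illustration using the times from the first outage in the table above:

```
t1 = datetime(2002,2,1,12,18,0);   % OutageTime
t2 = datetime(2002,2,7,16,50,0);   % RestorationTime
d = t2 - t1                        % duration of 148:32:00
```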

To rename a variable, you can access the `VariableNames` property of the table and change its name there. Tables contain metadata, such as variable names and descriptions, in a property called `Properties`.

```
meanOutageTimes.Properties.VariableNames{'Var3'} = 'mean_OutageTime';
meanOutageTimes(1:5,:)
```

```
ans =
  5×3 table
     Region      GroupCount    mean_OutageTime
    _________    __________    _______________
    MidWest         142           819:14:44
    NorthEast       557           581:02:18
    SouthEast       389            40:49:49
    SouthWest        26            59:31:07
    West            354           673:27:12
```

`varfun` and `rowfun` are suitable when you want to perform the same calculation with the table variables. But what if you want to calculate different quantities? For example, you might want to calculate the mean power loss, but also the minimum duration of power outages, by region. In that case, you can perform the calculations and tabulate the results using the `findgroups` and `splitapply` functions.

These functions work a little differently from `varfun` and `rowfun`. The `findgroups` function returns numeric indices that correspond to the grouping variables you specify. As a second output, it also returns a table of the groups. Then you can call `splitapply` to apply a function to each group within the variables.

For example, I'll calculate the mean power loss, the minimum outage time, and the maximum number of customers affected, by region. I will specify the groups once using `findgroups`. Then I can use the groups in multiple calls to `splitapply`.

I'm also going to switch gears a little bit, by removing all rows with `NaN` or missing values. I'll use the `rmmissing` function to remove these rows, so that I don't have to specify the `'omitnan'` flag in every function call.

```
T = rmmissing(T);
[G,results] = findgroups(T(:,'Region'));
meanLoss = splitapply(@mean,T.Loss,G);
minOutageTime = splitapply(@min,T.RestorationTime - T.OutageTime,G);
maxCustomers = splitapply(@max,T.Customers,G)
```

```
maxCustomers =
    3.295e+06
   5.9689e+06
   2.2249e+06
   1.8202e+06
     4.26e+06
```

The output of `splitapply` is an array. `maxCustomers` is a 5-by-1 numeric array. However, you can tabulate the results by adding them as variables to the table `results`, the second output from `findgroups`.

```
results.meanLoss = meanLoss;
results.minOutageTime = minOutageTime;
results.maxCustomers = maxCustomers
```

```
results =
  5×4 table
     Region      meanLoss    minOutageTime    maxCustomers
    _________    ________    _____________    ____________
    MidWest       907.19       00:00:00        3.295e+06
    NorthEast     383.86       00:00:00       5.9689e+06
    SouthEast     508.34       00:00:00       2.2249e+06
    SouthWest     541.66       00:28:00       1.8202e+06
    West          429.73       00:00:00         4.26e+06
```

So far, I've worked with a table that is small enough to fit into memory. But what if there is so much data that it won't all fit into memory? I might still want to treat it as a table and use the same functions.

That can easily be done by creating a *tall table*. A tall table is a kind of tall array in MATLAB. Tall arrays and tall tables know how to read the data in one chunk at a time, perform the calculations you want, and then gather up the output at the end. The syntax for working with a tall table is very similar to that for working with a table.

For more information on tall arrays and tall tables, see Tall Arrays.

To show how this works, I'll create a tall table out of `outages.csv`. Of course it's a bit silly to create a tall table out of a file that fits in memory, but this illustrates how to work with a much larger table.

Instead of calling `readtable`, I'll start by calling `datastore` to read `outages.csv` into a datastore.

Then I'll call `tall` on `ds`, to create a tall table out of the datastore. Now I have a tall table that looks and acts very much like the table I worked with in the previous sections. One difference you see is that the tall table is displayed as an M-by-6 table, showing that the number of rows is not yet known.

```
ds = datastore('outages.csv');
T = tall(ds)
```

```
Starting parallel pool (parpool) using the 'local' profile ... connected to 2 workers.

T =
  M×6 tall table
      Region         OutageTime       Loss     Customers     RestorationTime           Cause
    ___________    ________________    ______    __________    ________________    _________________
    'SouthWest'    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'winter storm'
    'SouthEast'    2003-01-23 00:49    530.14    2.1204e+05    NaT                 'winter storm'
    'SouthEast'    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'winter storm'
    'West'         2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    'equipment fault'
    'MidWest'      2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    'severe storm'
    'West'         2003-06-18 02:49         0             0    2003-06-18 10:54    'attack'
    'West'         2004-06-20 14:39    231.29           NaN    2004-06-20 19:16    'equipment fault'
    'West'         2002-06-06 19:28    311.86           NaN    2002-06-07 00:51    'equipment fault'
        :                  :             :           :                :                    :
        :                  :             :           :                :                    :
```

Now I can use `findgroups` and `splitapply` to perform some of the same calculations as before. Again I will wrap `mean` in an anonymous function so I can ignore `NaN` values, as I did above.

```
omean = @(x) mean(x,'omitnan');
T.Region = categorical(T.Region);
T.Cause = categorical(T.Cause);
G = findgroups(T.Region);
meanLoss = splitapply(omean,T.Loss,G);
meanCustomers = splitapply(omean,T.Customers,G)
```

```
meanCustomers =
  M×1 tall double column vector
    ?
    ?
    ?
    :
    :
```

Instead of seeing the output, we see a trail of question marks. What does this mean? Simply that I have not performed any calculations yet. At this point, I have only defined the calculations I want to perform on the tall table. No calculations are performed until I *gather* the results, by calling the `gather` function. The output of `gather` is not a tall array and must fit into memory.

```
meanLoss = gather(meanLoss);
meanCustomers = gather(meanCustomers)
```

```
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 4: Completed in 4 sec
- Pass 2 of 4: Completed in 2 sec
- Pass 3 of 4: Completed in 2 sec
- Pass 4 of 4: Completed in 2 sec
Evaluation completed in 25 sec
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 2 sec
Evaluation completed in 5 sec

meanCustomers =
   2.4015e+05
   1.4917e+05
   1.6776e+05
   2.6975e+05
   1.5201e+05
```

Now let's tabulate the results in a table. Since `meanLoss` and `meanCustomers` are not tall arrays, `results` is not a tall table. Add the regions as another table variable.

```
results = table(meanLoss,meanCustomers)
region = unique(T.Region);
results.region = gather(region)
```

```
results =
  5×2 table
    meanLoss    meanCustomers
    ________    _____________
     1137.7       2.4015e+05
     551.65       1.4917e+05
     495.35       1.6776e+05
     493.88       2.6975e+05
     433.37       1.5201e+05

Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 1 sec
Evaluation completed in 2 sec

results =
  5×3 table
    meanLoss    meanCustomers     region
    ________    _____________    _________
     1137.7       2.4015e+05     MidWest
     551.65       1.4917e+05     NorthEast
     495.35       1.6776e+05     SouthEast
     493.88       2.6975e+05     SouthWest
     433.37       1.5201e+05     West
```

Here I have shown you several ways you can group data in tables for calculations. To summarize, you can use one or more grouping variables to specify groups within the other table variables. Then you can apply functions to the groups in the table variables.

To apply:

- A function to a single table variable, returning an array: use `accumarray`.
- The same function to each table variable, returning a table: use `varfun`.
- A function that requires all the table variables as input arguments, returning a table: use `rowfun`.
- Different functions to different table variables, building a table of results: use `findgroups` and `splitapply`.

How about you? Have you done analysis with tables where `accumarray`, `varfun`, `rowfun`, `findgroups`, and `splitapply` might have helped? Do you have cases where you need tall tables? Let us know here.

Get the MATLAB code

Published with MATLAB® R2017a