
There are a lot of ways of sharing your ideas with MATLAB these days, including:

- MATLAB scripts and functions
- Live scripts
- MAT-files
- Apps
- Code generation of various kinds, e.g., C/C++ and GPU via Coder products
- MATLAB Compiler and SDK (stand-alone apps and code suitable for integrating into other environments)
- MATLAB Production Server

Today I want to remind you of some tools in MATLAB that can facilitate the transition to code you want to deploy via the MATLAB Compiler and SDK.

Here's a list of the functions I find most useful to aid in deployment.

Do you have other functions that ease your transition to deploying your app? Let me know here.

Get the MATLAB code

Published with MATLAB® R2017b


Today, guest blogger Matt Tearle continues his investigation of the new `polyshape` type (added in R2017b) as a mapping tool.

In a previous post, I used the new `polyshape` variable type in MATLAB to determine connections between states. You can see the code in that post. I'll just load in the polygons and matrix of connections that came from that:

```
load stateborders
% Make a graph of state connections
G = graph(border,stnames,'lower');
% Obtain centroid locations for states
[x,y] = centroid(p);
% Plot map and connections
plot(p)
hold on
plot(G,'XData',x,'YData',y)
hold off
```

Having made this nice visualization, it struck me that the colors from the default polygon `plot` weren't making a good map because neighboring states were often given the same color. The `plot` function, of course, doesn't know about the connections - it is just using a set of colors in the order of the polygons in the array. Using knowledge of the connections between states (stored in the matrix `border` and the graph `G`), I should be able to recolor the states so that no neighboring states have the same color.

Note that the actual color is arbitrary. The goal is actually to assign a number to each state. That number can then be used to index into a list of colors (or a colormap).

A simple ("greedy") algorithm to assign colors would be:

- state 1 is color 1
- go through the states in order
- for each state, find all the neighboring states
- get the color values assigned to the neighboring states
- assign the current state the smallest unused value

In the last step, if colors 1 through *n* are used by neighbors, then the number of colors used in the map will grow to *n*+1. This means that the number of colors used for the whole map will depend on the connections and the order of the states. Let's see what happens if I apply this algorithm to the states in default (alphabetical) order.

```
n = length(p);
color = zeros(n,1);   % Initialize array to hold color numbering
numcolors = 1;        % Number of colors needed (so far)
for k = 1:n
    idx = neighbors(G,k);                  % Get neighbors of the kth node
    idx = idx(idx < k);                    % But just those that have an assigned color
    neighborcolors = unique(color(idx));   % Get colors used by neighbors
    % Assign the smallest color value not used by the neighboring nodes
    thiscolor = min(setdiff(1:numcolors,neighborcolors));
    % If there isn't one, add another color to the map
    if isempty(thiscolor)
        numcolors = numcolors + 1;
        thiscolor = numcolors;
    end
    color(k) = thiscolor;
end
disp([num2str(numcolors),' colors needed'])
```

5 colors needed

Five colors is actually common for many maps.

polygonplot(p,color)

I'll be plotting the states with a custom color ordering a few times, so I figured it would make sense to make a short function to do that:

```
type('polygonplot')
```

```
function polygonplot(p,color)
% Get a list of colors (using the default plot colors)
ax = gca;
colors = ax.ColorOrder;
% Plot the polygons and change their colors
map = plot(p);
for k = 1:length(p)
    map(k).FaceColor = colors(color(k),:);
    map(k).FaceAlpha = 0.8;   % Make the shading darker
end
```

Five colors is pretty good, but the four color theorem states that you need only four colors to color any map so that no neighboring regions have the same color. However, in true pure math style, the theorem only says it *can* be done, not *how* to do it.

Is there an algorithm I can use to assign colors to the states to achieve a four-color map coloring? It turns out that this is a difficult problem in general. There are algorithms, but they are far more complicated than I want to deal with!

So instead, what about just shuffling the order of the states, then applying the greedy algorithm above? I took that code and turned it into a function that takes a graph as input and returns the color numbering and number of colors needed (equivalent to the max of the color numbering) as output.
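
The post never lists `greedycolor` itself, so here is one plausible reconstruction based on the greedy loop shown earlier. The name and signature match how the function is called in this post, but treat the body as a sketch rather than the author's exact code:

```
function [color,numcolors] = greedycolor(G)
% GREEDYCOLOR Greedy coloring of the nodes of graph G (reconstruction
% of the loop shown earlier, wrapped as a function).
n = numnodes(G);
color = zeros(n,1);   % color number assigned to each node
numcolors = 1;        % number of colors used so far
for k = 1:n
    idx = neighbors(G,k);
    idx = idx(idx < k);                    % only nodes already colored
    neighborcolors = unique(color(idx));   % colors used by those neighbors
    % Smallest color value not used by any neighbor
    thiscolor = min(setdiff(1:numcolors,neighborcolors));
    if isempty(thiscolor)                  % all current colors taken
        numcolors = numcolors + 1;
        thiscolor = numcolors;
    end
    color(k) = thiscolor;
end
end
```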

Now I just need to shuffle the states and apply the function. But first, the matrix of connections is currently just the lower-triangular part. That structure will be ruined by the shuffling, so I'll make the full matrix version:

border = border + border';

Now let's try shuffling and see how many colors we need:

```
rng(123) % for reproducibility
idx = randperm(48);
Gperm = graph(border(idx,idx));
[~,nc] = greedycolor(Gperm)
```

nc = 6

OK, it's working, but that particular shuffle didn't help. It actually made things worse. Well, that's random numbers for you. Maybe it's a good idea to see how the number of colors needed is distributed. Let's run the algorithm a bunch of times and see:

```
rng(123)
nexp = 1000;
numcolors = zeros(nexp,1);
for k = 1:nexp
    idx = randperm(n);
    Gperm = graph(border(idx,idx));
    [~,numcolors(k)] = greedycolor(Gperm);
end
histogram(numcolors)
```

Interesting. Most of the time I get five, sometimes I need six, and about 5% of the time I do indeed get a valid 4-color mapping. So a simple way to get a 4-color mapping would be to try random permutations until only four colors were needed. A terrible algorithm? Perhaps. But I bet it works:

```
numcolors = Inf;
while numcolors > 4
    idx = randperm(n);
    Gperm = graph(border(idx,idx));
    [color,numcolors] = greedycolor(Gperm);
end
polygonplot(p(idx),color)
```

Surely there's a better way, though. Right...? Well, maybe. I did encounter an algorithm for generating 5-color mappings. Basically, remove nodes from the graph one by one, until the nodes of the remaining graph all have degree greater than five. Then... do more stuff, which will color the remaining graph. Then you add the nodes back, coloring them as you go. I didn't really pay attention to the details because it seemed like there was a very good chance that you'd never get nodes with degree greater than five. Every time you remove a node, the degree of every node it's attached to goes down by one. So it seems like you could just try to repeatedly remove the lowest-degree node. Keep the list of nodes in order that you removed them, then use that (or, more precisely, the reverse of it, from last to first) as the ordering in which to color the nodes. It's not guaranteed to work, but it's probably not a bad option to try (and surely more efficient than random permutations). Also, this was for 5-color mappings, not 4, but let's just see what we get...

```
G = graph(border,stnames);
idx = zeros(n,1);
for j = 1:n
    [~,k] = min(degree(G));
    idx(j) = find(strcmp(G.Nodes.Name{k},stnames));
    G = rmnode(G,k);
end
idx = flipud(idx);
Gperm = graph(border(idx,idx));
[color,numcolors] = greedycolor(Gperm);
polygonplot(p(idx),color)
```

And just like that, I have a 4-color mapping for the US! In fact, it's not far from being a 3-color mapping: only three states needed the fourth color. Notice that this approach tries to use the lowest numbers as much as possible:

histogram(color)

But maybe I just got lucky. So let's try this on another map. Feeling brave, I found a world map shape file online. I applied the exact same approach: imported it with `shaperead`, converted the resulting structure array to a polygon array, used the "pad-and-intersect" method to build the graph of shared borders, sorted by iteratively removing the lowest-degree node, then applied the greedy coloring algorithm. The result...?

A 4-color world map! Just like that.

As I understand it, real cartographers wouldn't have much use for my algorithm. There's a definite skew in the colors (e.g., all islands are color #1 because they have no borders), so the result isn't particularly aesthetically pleasing. Are there any amateur (or professional!) cartographers out there who want to try applying a "real" map-coloring algorithm using polygons and graphs? I'd love to see your results. I also know there must be maps that break my algorithm, but how well will the algorithm hold up with actual maps in practice, rather than deliberately pathological examples? If anyone has other polygonal maps they want to try coloring with this approach, I'd love to hear how it works.

And, of course, maps and graphs don't have to be just about physical geography - they can be abstractions of any number of things. If you have an application that could use map coloring for visualization, share your ideas in the comments.

Get the MATLAB code

Published with MATLAB® R2017b


R2017b was released recently. Today's guest blogger, Matt Tearle, discovered a fun application of one of the many new features - solving a map-based puzzle.

"Car Talk" recently featured a brain teaser submitted by a former MathWorker. The short, slightly modified version is: Loren has to drive from Delaware to every state in the contiguous US, visiting each state exactly once; Loren's boss - let's call him Jack - offers to meet her in the last state, which turns out to be his birthplace; but how can Jack know what the last state will be, without waiting to hear Loren's itinerary?

I'm about to give away the answer, so if you want to figure it out yourself, do so now. It's OK, I'll wait.

Done? Great. Hopefully you realized that there's one and only one state that borders only one other state. That state has to be the beginning or end of any path. (You can't visit any state more than once so once you enter from the single neighboring state, you're stuck.) The beginning is Delaware (which borders several other states), so this unique "degree-one" state must be the end of Loren's epic journey.

As usual for me, the really interesting aspect of this problem is: can I solve it in MATLAB? This is a classic "traveling salesman" problem, so the graph capabilities recently added to MATLAB should be of use. But the first problem is how to build the graph that represents the connections between the states. It turns out that another new feature will help.

A new variable type for representing 2-D polygons was introduced in R2017b. The `polyshape` function takes vectors representing *x* and *y* coordinates and returns a polygon with those points as its vertices. It so happens that Mapping Toolbox comes with a data file containing the latitudes and longitudes of the borders of the US states - just what I need. With the `shaperead` function I can import this data in the form of a structure array, with fields for the name of the state and the latitude/longitude of the border:

```
states = shaperead('usastatehi.shp');
% Remove Alaska & Hawaii, and Washington DC
states([2,11,51]) = [];
```

For a single state, it's straightforward to extract the arrays in the `X` (longitude) and `Y` (latitude) fields and use these as the inputs to `polyshape`:

```
x = states(1).X;
y = states(1).Y;
p = polyshape(x,y)
```

```
p =
  polyshape with properties:
      Vertices: [518×2 double]
    NumRegions: 2
      NumHoles: 0
```

Now that the state is represented as a polygon, I can use the functions that work with polygons, such as plotting:

plot(p)

But to really get the benefit of working with polygons, I'd like to get an array of them, where each element is a polygon representing one state. I could extract the *x* and *y* components for each state in a loop, but any time I see myself about to apply an operation to each element of an array, I think "time for `arrayfun`!".

```
struct2poly = @(s) polyshape(s.X,s.Y);   % define my struct -> polyshape operation
p = arrayfun(struct2poly,states)         % apply to all elements of the struct states
```

```
p =
  48×1 polyshape array with properties:
    Vertices
    NumRegions
    NumHoles
```

Because this is MATLAB, functions like `plot` should work on arrays of polygons, just as well as scalar polygons, right?

plot(p)

Isn't that nice? Each polygon is plotted as its own shaded shape. If we zoom in, we can see the solution to the puzzle.

axis([-75 -66 42 48])

By eye we can see how states share borders, but how can we get MATLAB to determine this in an automated way? Let's start with the simplest problem: given two states (polygons), can we determine if they share a border or not?

```
me = p(17);
ma = p(19);
nh = p(27);
subplot(1,2,1)
plot([ma nh])
title('Shared border')
subplot(1,2,2)
plot([ma me])
title('No shared border')
```

The polygons have a `Vertices` property that contains the border locations. So a simplistic approach would be to see if the two polygons have any vertices in common. However, this has a lot of potential problems. Firstly, two shapes can share a boundary without sharing vertices:

```
a = polyshape([0 1 1 0],[0 0 3 3]);
b = polyshape([1 2 2 1],[1 1 2 2]);
subplot(1,1,1)
plot([a b])
intersect(a.Vertices,b.Vertices,'rows')
```

ans = 0×2 empty double matrix

Secondly, even if the survey points all align, there's always the potential for floating-point precision issues: how is 31°41'59" represented?

```
lat1 = 31.699722222222222
lat2 = 31 + 41/60 + 59/3600
lat1 == lat2
lat1 - lat2
```

```
lat1 = 31.7
lat2 = 31.7
ans = logical 0
ans = -3.5527e-15
```

So a better idea might be to think about it geometrically. Two areas in the plane share a border if their intersection is a line or curve. And the new polygons in MATLAB have an `intersect` function! Unfortunately, from a programming perspective, the intersection of two polygons is another polygon. And a polygon with zero area (which is what you get when the intersection is a line) is empty, so there's no obvious way to distinguish between states that share a boundary and those that don't.

```
intersect(ma,nh)
intersect(ma,me)
```

```
ans =
  polyshape with properties:
      Vertices: [0×2 double]
    NumRegions: 0
      NumHoles: 0
ans =
  polyshape with properties:
      Vertices: [0×2 double]
    NumRegions: 0
      NumHoles: 0
```

Furthermore, this approach would still have been vulnerable to the finite precision problem.

The new MATLAB polygons have a `polybuffer` function that gives you a way to "fatten up" a polygon by pushing the border out a given amount. Here's Massachusetts with a 0.1-degree (approx. 7-mile) border:

```
mafat = polybuffer(ma,0.1);
plot([ma mafat])
```

This gives a fairly simple, but robust, way to determine whether states are neighbors: put a small buffer around both and then see if there's any overlap.

```
pfat = polybuffer(p,1e-4); % add ~35 feet around every state
me = pfat(17);
ma = pfat(19);
nh = pfat(27);
intersect(ma,nh)
intersect(ma,me)
```

```
ans =
  polyshape with properties:
      Vertices: [541×2 double]
    NumRegions: 1
      NumHoles: 0
ans =
  polyshape with properties:
      Vertices: [0×2 double]
    NumRegions: 0
      NumHoles: 0
```

As you'd expect for a data type that represents polygons, there's an `area` function that does exactly what its name suggests:

```
area(intersect(ma,nh))
area(intersect(ma,me))
```

```
ans = 0.00035462
ans = 0
```

So determining whether the padded states overlap should be simple: does the intersection have nonzero area? But the programmer in me was a bit worried: should I use a function, like `area`, that will come with some overhead? Would it be more efficient to directly query the properties of the polygon that results from the intersection?

However, there's another slight wrinkle that needs to be considered before worrying about implementation details. The "Four Corners" area of the American Southwest is the one place where four states touch - Colorado, New Mexico, Arizona, and Utah. Does Colorado border Arizona? Does Utah border New Mexico? We can let philosophers and mathematicians argue over that. But unless an infinitely thin Loren is driving an infinitely thin car, the answer, for our purposes, is no. There's no way in reality to drive from Colorado to Arizona without going through Utah or New Mexico.

```
plot(pfat,'LineStyle','none')
axis([-109.055 -109.035 36.99 37.01])
```

But if we look at the intersection between the "fattened" states, there will be a small overlap between Colorado and Arizona. Note that it will be *very* small - only slightly bigger than my apartment in graduate school, which should probably be considered a quantum amount of geographic area. Any real border between states will be at least an order of magnitude larger than this. So now we have a simple and robust way to determine whether or not states share a border:

- Add a small border to all states
- Compute the area of the intersection of states
- States share a border if that intersection is greater than a threshold

```
area(intersect(ma,nh)) > 1e-6
area(intersect(ma,me)) > 1e-6
```

```
ans = logical 1
ans = logical 0
```

Now I just need to apply that to every pair of states. I could probably try to get fancy with `arrayfun`, but this is a case where the simple `for`-loop approach might be a better idea. In fact, loops can help me avoid redundant calculations. If State A borders State B, then State B borders State A and I don't have to check that.

```
border = zeros(48);
for k = 1:48
    for j = (k+1):48   % Only need the pairs we haven't checked yet
        border(j,k) = area(intersect(pfat(j),pfat(k))) > 1e-6;
    end
end
```

The result of this is a lower-triangular matrix of connections. Now I can build a graph from the connectivity matrix and solve the problem. One of the neat graph functions is `degree`, which gives the number of connections to each node.

```
G = graph(border,'lower');
histogram(degree(G))
```

So there's one degree-one node in the graph. That's the one we want. But which state does that correspond to? The names are stored in the `Name` field of the original structure array. I'll extract them into a cell array by using one of my favorite MATLAB tricks. When indexing produces multiple results (not just an array with multiple elements), those results come back in the form of a comma-separated list. That is, if `s` is a structure array (for example), `s.Thing` is equivalent to `s(1).Thing, s(2).Thing, ..., s(end).Thing`. What's great about having a comma-separated list is that that's the form MATLAB uses for concatenating elements into an array (i.e., `[a,b,c]` or `{a,b,c}`) and for passing inputs into a function (i.e., `myfun(a,b,c)`). Hence, I can extract all the `Name` field elements of the structure array `states` and concatenate them into a cell array with one simple line of MATLAB:

stnames = {states.Name};

And now, finally, I can find out which state borders only one other state:

stnames{degree(G) == 1}

ans = 'Maine'

Having the state names, I can go a little further and visualize both the map and the connectivity graph. To start, it's possible to name the nodes in a graph:

```
G = graph(border,stnames,'lower');
```

This is useful when you visualize a graph because the nodes are automatically labeled:

plot(G)

But this graph doesn't show the states in a physical layout. It's also possible to specify the locations of the nodes when plotting a graph, but for that I need a point in each state to use as the node location. Right in the middle of the state would be good... Guess what? There's a polygon function for that: `centroid`.

```
[x,y] = centroid(p);
plot(G,'XData',x,'YData',y)
```

That's looking pretty good. It would be nice to add the map for reference.

```
plot(p)
hold on
plot(G,'XData',x,'YData',y)
hold off
```

Nice! Something cool to notice about that code: two uses of `plot`, both of which were on "exotic", non-numeric data types (one polygon array and one graph). This is something that our development team works hard on - making sure functionality is consistent, so that, for example, `plot` works naturally on different kinds of data. It's a feature of MATLAB I greatly appreciate. It means that I, as a MATLAB user, don't have to expend brain energy on which plotting function, or even which variant of `plot`, I have to use. I can just use `plot` and get on with what I'm actually trying to do.

I solved the brain teaser, and even went a bit further to make a nice visualization of the states and their connections. But there's one thing about my visualization that's still bugging me: the states are colored with seven unique colors in the order that the states appear in the polygon array (which happens to be alphabetical by name). This means that neighboring states are sometimes given the same color. For example, note that Nebraska and the Dakotas are elements 25, 32, and 39, which means they all end up the same color. I happen to recall that there's a mathematical theorem that says that any map can be colored with only four unique colors, such that no neighboring regions have the same color.

To me, that sounds like a second interesting task for polygons and graphs,... but a story for another time. In the meantime, let us know in the comments if you can think of somewhere in your work where the new `polyshape` type might be useful.

Get the MATLAB code

Published with MATLAB® R2017b


Text data has become an important part of data analytics, thanks to advances in natural language processing that transform unstructured text into meaningful data. The new Text Analytics Toolbox provides tools to process and analyze text data in MATLAB.

Today's guest blogger, Toshi Takeuchi, introduces some cool features available in the new toolbox, starting with word embeddings. Check out how he uses sentiment analysis to find good Airbnb locations to stay in Boston!

- What is a Word Embedding?
- Ingredients
- Loading a Pre-Trained Word Embedding from GloVe
- Vector Math Example
- Visualizing the Word Embedding
- Using Word Embeddings for Sentiment Analysis
- Word Embeddings Meet Machine Learning
- Prepare Data for Machine Learning
- Training and Evaluating the Sentiment Classifier
- Boston Airbnb Open Data
- Airbnb Review Ratings
- Computing Sentiment Scores
- Sentiment by Location
- Summary

Have you heard about word2vec or GloVe? These are part of a very powerful natural language processing technique called word embeddings, and you can now take advantage of it in MATLAB via Text Analytics Toolbox.

Why am I excited about it? A word embedding "embeds" words into a vector space model based on how often a word appears close to other words. Done at internet scale, this can capture the semantics of the words in the vectors, so that similar words have similar vectors.

One very famous example of how word embeddings can represent such relationships is that you can do a vector computation like this:

$$king - man + woman \approx queen$$

Yes, "queen" is like "king" except that it is a woman, rather than a man! How cool is that? This kind of magic has become possible thanks to vast availability of raw text data on the internet, greater computing capability that can process it, and advances in artificial neural networks, such as deep learning.

Even more exciting is the fact that you don't have to be a natural language processing expert to harness the power of word embeddings if you use pre-trained models! Let me show you how you can use it for your own text analytics purposes, such as document classification, information retrieval and sentiment analysis.

In this example, I will use a pre-trained word embedding from GloVe. To follow along, please

- Get the source code of this post by clicking on "Get the MATLAB code" at the bottom of this page
- Download a free trial version of Text Analytics Toolbox (MATLAB and Statistics and Machine Learning Toolbox R2017b or later are also required).
- Download the pre-trained model glove.6B.300d.txt (6 billion tokens, 400K vocabulary, 300 dimensions) from GloVe.
- Download the sentiment lexicon from University of Illinois at Chicago
- Download the data from the Boston Airbnb Open Data page on Kaggle
- Download my custom function load_lexicon.m and class sentiment.m as well as the raster map of Boston

Please extract the content from the archive files into your current folder.

You can use the function `readWordEmbedding` in Text Analytics Toolbox to read pre-trained word embeddings. To see a word vector, use `word2vec` to get the vector representation of a given word. Because the dimension for this embedding is 300, we get a vector of 300 elements for each word.

```
filename = "glove.6B.300d";
if exist(filename + '.mat', 'file') ~= 2
    emb = readWordEmbedding(filename + '.txt');
    save(filename + '.mat', 'emb', '-v7.3');
else
    load(filename + '.mat')
end
v_king = word2vec(emb,'king')';
whos v_king
```

```
  Name        Size      Bytes  Class     Attributes
  v_king      300x1      1200  single
```

Let's try the vector math! Here is another famous example:

$$paris - france + poland \approx warsaw$$

Apparently, the vector subtraction "paris - france" encodes the concept of "capital" and if you add "poland", you get "warsaw".

Let's try it with MATLAB. `word2vec` returns vectors for given words in the word embedding, and `vec2word` finds the closest words to the vectors in the word embedding.

```
v_paris = word2vec(emb,'paris');
v_france = word2vec(emb,'france');
v_poland = word2vec(emb,'poland');
vec2word(emb, v_paris - v_france + v_poland)
```

ans = "warsaw"
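
The king/queen example from the start of the post can be checked the same way, using the same `word2vec`/`vec2word` pattern. (I haven't run this against this particular embedding, so the nearest word it returns depends on the pre-trained vectors you load.)

```
% Does king - man + woman land near "queen" in this embedding?
v_king  = word2vec(emb,'king');
v_man   = word2vec(emb,'man');
v_woman = word2vec(emb,'woman');
vec2word(emb, v_king - v_man + v_woman)
```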

We would like to visualize this word embedding using a `textscatter` plot, but it is hard to see anything if all 400,000 words from the word embedding are included. I found a list of 4,000 English nouns. Let's use only those words and reduce the dimensions from 300 to 2 using `tsne` (t-Distributed Stochastic Neighbor Embedding) for dimensionality reduction. To make the words easier to see, I zoomed into a specific area of the plot that contains food-related words. You can see that related words are placed close together.

```
if exist('nouns.mat','file') ~= 2
    url = 'http://www.desiquintans.com/downloads/nounlist/nounlist.txt';
    nouns = webread(url);
    nouns = split(nouns);
    save('nouns.mat','nouns');
else
    load('nouns.mat')
end
nouns(~ismember(nouns,emb.Vocabulary)) = [];
vec = word2vec(emb,nouns);
rng('default');   % for reproducibility
xy = tsne(vec);
figure
textscatter(xy,nouns)
title('GloVe Word Embedding (6B.300d) - Food Related Area')
axis([-35 -10 -36 -14]);
set(gca,'clipping','off')
axis off
```

For a practical application of word embeddings, let's consider sentiment analysis. We would typically take advantage of pre-existing sentiment lexicons such as this one from the University of Illinois at Chicago. It comes with 2,006 positive words and 4,783 negative words. Let's load the lexicon using the custom function `load_lexicon`.

If we just rely on the available words in the lexicon, we can only score sentiment for 6,789 words. One idea to expand on this is to use the word embedding to find words that are close to these sentiment words.

```
pos = load_lexicon('positive-words.txt');
neg = load_lexicon('negative-words.txt');
[length(pos) length(neg)]
```

ans = 2006 4783

What if we use word vectors as the training data to develop a classifier that can score all words in the 400,000-word embedding? We can take advantage of the fact that related words are close together in word embeddings to do this. Let's make a sentiment classifier that takes advantage of the vectors from the word embedding.

As the first step, we will get vectors from the word embedding for the words in the lexicon to create a matrix of predictors with 300 columns, and then use positive or negative sentiment labels as the response variable. Here is a preview of the word, the response variable, and the first 7 of the 300 predictor variables.

```
% Drop words not in the embedding
pos = pos(ismember(pos,emb.Vocabulary));
neg = neg(ismember(neg,emb.Vocabulary));
% Get corresponding word vectors
v_pos = word2vec(emb,pos);
v_neg = word2vec(emb,neg);
% Initialize the table and add the data
data = table;
data.word = [pos;neg];
pred = [v_pos;v_neg];
data = [data array2table(pred)];
data.resp = zeros(height(data),1);
data.resp(1:length(pos)) = 1;
% Preview the table
head(data(:,[1,end,2:8]))
```

```
ans =
  8×9 table
        word         resp      pred1        pred2        pred3       pred4       pred5         pred6        pred7
    _____________    ____    _________    _________    _________    ________    __________    _________    __________
    "abound"          1       0.081981     -0.27295      0.32238     0.19932      0.099266     0.60253        0.18819
    "abounds"         1      -0.037126     0.085212      0.26952     0.20927     -0.014547     0.52336        0.11287
    "abundance"       1      -0.038408     0.076613    -0.094277    -0.10652      -0.43257     0.74405        0.41298
    "abundant"        1       -0.29317    -0.068101     -0.44659    -0.31563      -0.13791     0.44888        0.31894
    "accessible"      1       -0.45096     -0.46794      0.11761    -0.70256       0.19879     0.44775        0.26262
    "acclaim"         1        0.07426     -0.11164       0.3615     -0.4499    -0.0061991     0.44146     -0.0067972
    "acclaimed"       1        0.69129      0.04812      0.29267      0.1242      0.083869     0.25791        -0.5444
    "acclamation"     1      -0.026593     -0.60759     -0.15785     0.36048      -0.45289     0.0092178      0.074671
```

Let's partition the data into a training set and holdout set for performance evaluation. The holdout set contains 30% of the available data.

```
rng('default')   % for reproducibility
c = cvpartition(data.resp,'Holdout',0.3);
train = data(training(c),2:end);
Xtest = data(test(c),2:end-1);
Ytest = data.resp(test(c));
Ltest = data(test(c),1);
Ltest.label = Ytest;
```

We want to build a classifier that can separate positive words and negative words in the vector space defined by the word embedding. For a quick performance evaluation, I chose the fast and easy linear discriminant among possible machine learning algorithms.

Here is the confusion matrix of this model. The result was 91.1% classification accuracy. Not bad.

```
% Train model
mdl = fitcdiscr(train,'resp');
% Predict on test data
Ypred = predict(mdl,Xtest);
cf = confusionmat(Ytest,Ypred);
% Display results
figure
vals = {'Negative','Positive'};
heatmap(vals,vals,cf);
xlabel('Predicted Label')
ylabel('True Label')
title({'Confusion Matrix of Linear Discriminant'; ...
    sprintf('Classification Accuracy %.1f%%', ...
    sum(cf(logical(eye(2))))/sum(sum(cf))*100)})
```

Let's check the predicted sentiment score against the actual label. The custom class `sentiment` uses the linear discriminant model to score sentiment.

The `scoreWords` method of the class scores words. A positive score represents positive sentiment, and a negative score is negative. Now we can use 400,000 words to score sentiment.

dbtype sentiment.m 18:26

```
18    function scores = scoreWords(obj,words)
19        %SCOREWORDS scores sentiment of words
20        vec = word2vec(obj.emb,words);         % word vectors
21        if size(vec,2) ~= obj.emb.Dimension    % check num cols
22            vec = vec';                        % transpose as needed
23        end
24        [~,scores,~] = predict(obj.mdl,vec);   % get class probabilities
25        scores = scores(:,2) - scores(:,1);    % positive scores - negative scores
26    end
```

Let's test this custom class. If the label is 0 and score is negative or the label is 1 and score is positive, then the model classified the word correctly. Otherwise, the word was misclassified.

Here is the table that shows 10 examples from the test set:

- the word
- its sentiment label (0 = negative, 1 = positive)
- its sentiment score (negative = negative, positive = positive)
- evaluation (true = correct, false = incorrect)

```
sent = sentiment(emb,mdl);
Ltest.score = sent.scoreWords(Ltest.word);
Ltest.eval = Ltest.score > 0 == Ltest.label;
disp(Ltest(randsample(height(Ltest),10),:))
```

```
        word         label     score       eval
    _____________    _____    ________    _____
    "fugitive"         0      -0.90731    true
    "misfortune"       0      -0.98667    true
    "outstanding"      1       0.99999    true
    "reluctant"        0      -0.99694    true
    "botch"            0      -0.99957    true
    "carefree"         1       0.97568    true
    "mesmerize"        1       0.4801     true
    "slug"             0      -0.88944    true
    "angel"            1       0.43419    true
    "wheedle"          0      -0.98412    true
```

Now we need a way to score the sentiment of human-language text, rather than a single word. The `scoreText` method of the sentiment class averages the sentiment scores of each word in the text. This may not be the best way to do it, but it's a simple place to start.

```
dbtype sentiment.m 28:33
```

```
28        function score = scoreText(obj,text)
29            %SCORETEXT scores sentiment of text
30            tokens = split(lower(text));      % split text into tokens
31            scores = obj.scoreWords(tokens);  % get score for each token
32            score = mean(scores,'omitnan');   % average scores
33        end
```

Here are the sentiment scores given by the `scoreText` method for three sentences - very positive, somewhat positive, and negative.

```
[sent.scoreText('this is fantastic') ...
 sent.scoreText('this is okay') ...
 sent.scoreText('this sucks')]
```

```
ans =
    0.91458    0.80663   -0.073585
```

Let's try this on review data from the Boston Airbnb Open Data page on Kaggle. First, we would like to see what people say in their reviews as a word cloud. Text Analytics Toolbox provides functionality to simplify text preprocessing workflows, such as `tokenizedDocument`, which parses documents into arrays of tokens, and `bagOfWords`, which generates a term-frequency count model (this can be used to build a machine learning model).

The commented-out code will generate the word cloud shown at the top of this post. However, you can also generate word clouds using two-word phrases known as bigrams. You can generate bigrams with `docfun`, which operates on the array of tokens. You can also see that it is possible to generate trigrams and other n-grams by modifying the function handle.

It seems a lot of comments were about locations!

```
opts = detectImportOptions('listings.csv');
l = readtable('listings.csv',opts);
reviews = readtable('reviews.csv');
comments = tokenizedDocument(reviews.comments);
comments = lower(comments);
comments = removeWords(comments,stopWords);
comments = removeShortWords(comments,2);
comments = erasePunctuation(comments);
% == uncomment to generate a word cloud ==
% bag = bagOfWords(comments);
% figure
% wordcloud(bag);
% title('AirBnB Review Word Cloud')

% Generate a Bigram Word Cloud
f = @(s)s(1:end-1) + " " + s(2:end);
bigrams = docfun(f,comments);
bag2 = bagOfWords(bigrams);
figure
wordcloud(bag2);
title('AirBnB Review Bigram Cloud')
```

Review ratings are also available, but they are heavily skewed towards 100, meaning the vast majority of listings are just perfectly wonderful (really?). As this XKCD comic shows, online review ratings tend to pile up at the top of the scale. This is not very useful.

```
figure
histogram(l.review_scores_rating)
title('Distribution of AirBnB Review Ratings')
xlabel('Review Ratings')
ylabel('# Listings')
```

Now let's score the sentiment of Airbnb listing reviews instead. Since a listing can have a number of reviews, I will use the median sentiment score per listing. The median sentiment scores in Boston are generally in the positive range, but they follow a roughly normal distribution. This looks more realistic.

```
% Score the reviews
f = @(str) sent.scoreText(str);
reviews.sentiment = cellfun(f,reviews.comments);
% Calculate the median review score by listing
[G,listings] = findgroups(reviews(:,'listing_id'));
listings.sentiment = splitapply(@median, ...
    reviews.sentiment,G);
% Visualize the results
figure
histogram(listings.sentiment)
title('Sentiment by Boston AirBnB Listing')
xlabel('Median Sentiment Score')
ylabel('Number of Listings')
```

The bigram cloud showed reviewers often commented on location and distance. You can use the latitude and longitude of the listings to see where listings with very high or low sentiment scores are located. If you see clusters of high scores, they may indicate good locations to stay in.

```
% Join sentiment scores and listing info
joined = innerjoin( ...
    listings,l(:,{'id','latitude','longitude', ...
    'neighbourhood_cleansed'}), ...
    'LeftKeys','listing_id','RightKeys','id');
joined.Properties.VariableNames{end} = 'ngh';
% Discard listings with a NaN sentiment score
joined(isnan(joined.sentiment),:) = [];
% Discretize the sentiment scores into buckets
joined.cat = discretize(joined.sentiment,0:0.25:1, ...
    'categorical',{'< 0.25','< 0.50','< 0.75','<=1.00'});
% Remove undefined categories
cats = categories(joined.cat);
joined(isundefined(joined.cat),:) = [];
% Variable for color
colorlist = winter(length(cats));
% Generate the plot
latlim = [42.300 42.386];
lonlim = [-71.1270 -71.0174];
load boston_map.mat
figure
imagesc(lonlim,latlim,map)
hold on
gscatter(joined.longitude,joined.latitude,joined.cat,colorlist,'o')
hold off
dar = [1, cosd(mean(latlim)), 1];
daspect(dar)
set(gca,'ydir','normal');
axis([lonlim,latlim])
title('Sentiment Scores by Boston Airbnb Listing')
[g,ngh] = findgroups(joined(:,'ngh'));
ngh.Properties.VariableNames{end} = 'name';
ngh.lat = splitapply(@mean,joined.latitude,g);
ngh.lon = splitapply(@mean,joined.longitude,g);
% Annotations
text(ngh.lon(2),ngh.lat(2),ngh.name(2),'Color','w')
text(ngh.lon(4),ngh.lat(4),ngh.name(4),'Color','w')
text(ngh.lon(6),ngh.lat(6),ngh.name(6),'Color','w')
text(ngh.lon(11),ngh.lat(11),ngh.name(11),'Color','w')
text(ngh.lon(13),ngh.lat(13),ngh.name(13),'Color','w')
text(ngh.lon(17),ngh.lat(17),ngh.name(17),'Color','w')
text(ngh.lon(18),ngh.lat(18),ngh.name(18),'Color','w')
text(ngh.lon(22),ngh.lat(22),ngh.name(22),'Color','w')
```

In this post, I focused on word embeddings and sentiment analysis as an example of new features available in Text Analytics Toolbox. Hopefully you saw that the toolbox makes advanced text processing techniques very accessible. You can do more with word embeddings besides sentiment analysis, and the toolbox offers many more features besides word embeddings, such as Latent Semantic Analysis or Latent Dirichlet Allocation.

Hopefully I have more opportunities to discuss those other interesting features in Text Analytics Toolbox in the future.

Get a free trial version to play with it and let us know what you think here!

Get the MATLAB code

Published with MATLAB® R2017b


Today our guest blogger, David Garrison, will continue his series on MathWorks involvement in the 2017 solar eclipse.

- Part 1: The Citizen CATE Experiment
- Part 2: Training the Volunteers
- Part 3: Rehearsing for the Eclipse
**Part 4: Imaging the Eclipse**

Here is Part 4 of the Series.

In Part 1 of this series, I discussed MathWorks participation in the Citizen CATE Experiment - a citizen science project to image the 2017 solar eclipse. In Part 2 of this series, I described the volunteers, the equipment they will be using, and how they are being trained. In Part 3 of this series, I talked about how the volunteers were rehearsing for the eclipse and the MATLAB software being used to capture the images from the eclipse.

In this final post, I will describe what happened on the day of the eclipse and show some images of the corona during totality.

Weather was always a concern for viewing the eclipse. Fortunately, eclipse day dawned bright and clear. We set up in a public park in the town of Hartsville, Tennessee near the mid-line of the band of totality. The town had planned events for Eclipse day so we were joined in the park by a few hundred townspeople.

We started by setting up the telescope, camera, and software using the procedures we had perfected through all the previous practice sessions. The pre-eclipse steps included recording the GPS data, aligning the telescope, checking the telescope tracking, and adjusting the focus to get the most precise images possible. Here is a picture of my son, William, using the MATLAB software to monitor live images of the sun before the eclipse began.

Our goal was to be completely finished with the preliminary steps before the partial phase of the eclipse began. Here is an example image taken during those preparations. This image shows a number of big sunspots visible on the surface of the sun.

The partial phase began with *first contact* around 12:00 pm CDT. First contact is when the moon begins to obscure the sun. The partial phase was expected to last about one and a half hours. Here is an image of the partial eclipse about 25 minutes after first contact.

A number of interesting phenomena happened during the partial phase. First, the temperature began to drop. From first contact to totality, the temperature dropped about 7 degrees F. As the eclipse approached totality, the light around us began to dim. About halfway through the partial phase, we observed crescent-shaped images of the sun from light passing through the trees. This occurs because the trees act like thousands of little pin-hole cameras.

Totality began about 1:30 pm CDT and lasted for about 2 minutes and 30 seconds. As the last of the sun's disk disappeared behind the moon, there was an audible gasp from the crowd as the corona appeared as a white halo around the sun. I heard people around us use words like "beautiful", "other-worldly", and even "spiritual" to describe the appearance of the sun during totality. As a first-timer, I didn't know what to expect. Now that I have experienced my first total solar eclipse, I can honestly say that it is the most amazing natural phenomenon that I have ever experienced.

While we were mesmerized by totality, the MATLAB program was acquiring the images just as it was designed to do. Here are two images taken during totality. The first is from my site in Tennessee. The second is from Andrei's site in South Carolina. Notice the apparent difference in the shape of the prominences at 3 and 4 o'clock.

Of the 68 Citizen CATE teams, 58 reported clear weather in their locations at the time of the eclipse. Most of the teams have uploaded their data to the National Solar Observatory site. Dr. Penn and his team have already begun to analyze the data. A first look movie of the combined data from the Citizen CATE sites has been posted to the Citizen CATE website.

With this final post on the 2017 solar eclipse, I'd like to thank Loren for letting me hijack her blog for this series. I'd also like to thank Harrison Roberts and Andrei Ursache, my fellow MATLAB Citizen CATE volunteers, for their help during the creation of these posts. Finally, I'd like to thank Dr. Matthew Penn and the National Solar Observatory for allowing us to be a part of this exciting project.

Remember, there is another total solar eclipse that will pass through the United States on April 8th, 2024. Can't wait...

If you have an interest in amateur Astronomy or have any questions or comments about the Citizen CATE experiment, please let us know here.

*Copyright 2017 The MathWorks, Inc.*


Published with MATLAB® R2017b


Today I'd like to introduce a guest blogger, Stephen Doe, who works for the MATLAB Documentation team here at MathWorks. Stephen will discuss different methods for grouping data, and performing calculations on those groups, using Tables.

Tables are convenient containers for column-oriented data. The variables in a table can have different data types, but must have the same number of rows. You can access table data by row, by variable, or by variable name. And you can specify groups within table variables, to perform calculations on those groups. Today I will show several different ways to apply functions to groups in tables.
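As a quick illustration of those three access patterns, here is a minimal sketch using a small throwaway table built in place (the variable names are made up for this example):

```
% A tiny example table, just to show the indexing styles
T = table([1;2;3],{'a';'b';'c'},'VariableNames',{'Num','Str'});
T(1,:)       % by row: returns a one-row table
T(:,'Num')   % by variable: returns a one-variable table
T.Num        % by variable name (dot notation): returns the underlying array
```

Parenthesis-style indexing always returns another table, while dot notation extracts the variable's data as an ordinary array.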

First, let's create a table. You can create a table using the `table` function, the Import Tool, or the `readtable` function. I'm going to use `readtable` to read data from a sample file that ships with MATLAB. The file `outages.csv` contains simulated data for electric power outages over a period of 12 years in the United States. Read the table. To display the first five rows, use table indexing.

```
T = readtable('outages.csv','Format','%C%D%f%f%D%C');
T(1:5,:)
```

```
ans =
  5×6 table
     Region         OutageTime        Loss     Customers     RestorationTime          Cause
    _________    ________________    ______    __________    ________________    _______________
    SouthWest    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    winter storm
    SouthEast    2003-01-23 00:49    530.14    2.1204e+05    NaT                 winter storm
    SouthEast    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    winter storm
    West         2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    equipment fault
    MidWest      2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    severe storm
```

There are six columns in the file. Since I want to specify the data types of the table variables, I use the `Format` name-value pair argument. `T` has two categorical variables (specified with `%C`), two datetime variables (`%D`) and two numeric variables (`%f`).

Once data is in a table, you can perform calculations on different table variables. One way is to use dot notation to refer to table variables by name and perform calculations. For example, calculate the total number of customers affected by power outages using the `sum` function, omitting any `NaN` values that might be in `T.Customers`.

```
totalCustomers = sum(T.Customers,'omitnan')
```

totalCustomers = 1.903e+08

You also can apply a function to every table variable, using the `varfun` function. `varfun` is particularly useful for applying a function and returning the results in another table.

However, you might want to apply a function only to variables of a particular type. You can specify the type of variable with the `vartype` function. Here I'm going to subscript into `T` for the numeric variables and calculate the mean of each one. And since the variables have some `NaN` values, I'm going to wrap `mean` in an anonymous function so I can use the `'omitnan'` flag.

```
T2 = T(:,vartype('numeric'));
omean = @(x) mean(x,'omitnan');
meanT = varfun(omean,T2)
```

```
meanT =
  1×2 table
    Fun_Loss    Fun_Customers
    ________    _____________
     563.89      1.6693e+05
```

A more interesting view of the table is to group the data, and start looking for differences between the groups. For example, what is the difference in the mean power loss between the Northeast and Southwest regions?

To answer this question, you can start by specifying one or more table variables as *grouping variables*. The grouping variables define groups into which you split the other table variables. Then you can apply a function to each group within each table variable and collect the results. For more on grouping variables, see Grouping Variables to Split Data.

You can use grouping variables with the `varfun` function I described above. Specify `Region` as the grouping variable in this table. It's a categorical variable, making it especially convenient to use as a grouping variable. Specify `Loss` and `Customers` as the input variables, and calculate the mean values by region for these two variables. I'm going to use the same `omean` function I specified above.

```
meanByRegion = varfun(omean,T,'GroupingVariables','Region',...
    'InputVariables',{'Loss','Customers'})
```

```
meanByRegion =
  5×4 table
     Region      GroupCount    Fun_Loss    Fun_Customers
    _________    __________    ________    _____________
    MidWest         142         1137.7      2.4015e+05
    NorthEast       557         551.65      1.4917e+05
    SouthEast       389         495.35      1.6776e+05
    SouthWest        26         493.88      2.6975e+05
    West            354         433.37      1.5201e+05
```

Another way to apply a grouping variable to the data in a table variable is to use the `accumarray` function. If I convert `T.Region` to a numeric array, and extract `T.Loss` from `T`, then I can calculate mean power loss per region using `accumarray`, just as I did with `varfun`.

```
regions = double(T.Region);
A = accumarray(regions,T.Loss,[],omean)
```

```
A =
    1137.7
    551.65
    495.35
    493.88
    433.37
```

`accumarray` operates on arrays. So to use `accumarray` on a table variable, you must extract it from the table. Also, the grouping variable must be numeric, and the output argument is an array, not a table. While you can use `accumarray` on table variables, it lacks many of the conveniences that `varfun` provides.

You can specify more than one grouping variable using `varfun`. For example, I'll specify `Region` and `Cause` as grouping variables. The groups are defined as combinations of regions and causes.

```
meanByRegionAndCause = varfun(omean,T,'GroupingVariables',{'Region','Cause'},...
    'InputVariables',{'Loss','Customers'});
meanByRegionAndCause(1:5,:)
```

```
ans =
  5×5 table
    Region          Cause           GroupCount    Fun_Loss    Fun_Customers
    _______    ________________    __________    ________    _____________
    MidWest    attack                  12              0              0
    MidWest    energy emergency        19         536.09          57603
    MidWest    equipment fault          9         343.37          29704
    MidWest    severe storm            31         1055.7     4.3584e+05
    MidWest    thunder storm           32         941.07     1.3301e+05
```

Another simple calculation is the mean duration of the power outages by region. The basis of this calculation is the difference between the `OutageTime` and `RestorationTime` table variables.

You might think to use `varfun` for this calculation too. The difficulty is that `varfun` applies a function to each table variable separately. But this calculation requires both variables. In other words, we want to apply a function that requires two input arguments.

The solution is to use the `rowfun` function. With `rowfun`, all the table variables are taken as input arguments. If the table has `N` variables, then the function you apply must accept `N` input arguments.

To perform this calculation, I'll start by defining an anonymous function that subtracts one input argument from the other, and then calculates the mean. Then I will calculate the mean duration of power outages by region using `rowfun`.

```
meanDiff = @(a,b) mean(abs(a-b),'omitnan');
meanOutageTimes = rowfun(meanDiff,T,'GroupingVariables','Region',...
    'InputVariables',{'OutageTime','RestorationTime'})
```

```
meanOutageTimes =
  5×3 table
     Region      GroupCount      Var3
    _________    __________    _________
    MidWest         142        819:14:44
    NorthEast       557        581:02:18
    SouthEast       389         40:49:49
    SouthWest        26         59:31:07
    West            354        673:27:12
```

Note that `OutageTime` and `RestorationTime` are datetime variables. You can perform arithmetic with datetime variables, just as you can with numeric variables.

To rename a variable, you can access the `VariableNames` property of the table and change its name there. Tables contain metadata, such as variable names and descriptions, in a property called `Properties`.

```
meanOutageTimes.Properties.VariableNames{'Var3'} = 'mean_OutageTime';
meanOutageTimes(1:5,:)
```

```
ans =
  5×3 table
     Region      GroupCount    mean_OutageTime
    _________    __________    _______________
    MidWest         142           819:14:44
    NorthEast       557           581:02:18
    SouthEast       389            40:49:49
    SouthWest        26            59:31:07
    West            354           673:27:12
```

`varfun` and `rowfun` are suitable when you want to perform the same calculation with the table variables. But what if you want to calculate different quantities? For example, you might want to calculate the mean power loss, but also the minimum duration of power outages, by region. In that case, you can perform the calculations and tabulate the results using the `findgroups` and `splitapply` functions.

These functions work a little differently from `varfun` and `rowfun`. The `findgroups` function returns numeric indices that correspond to the grouping variables you specify. As a second output, it also returns a table of the groups. Then you can call `splitapply` to apply a function to each group within the variables.

For example, I'll calculate the mean power loss, the minimum outage time, and the maximum number of customers affected, by region. I will specify the groups once using `findgroups`. Then I can use the groups in multiple calls to `splitapply`.

I'm also going to switch gears a little bit, by removing all rows with `NaN` or missing values. I'll use the `rmmissing` function to remove these rows, so that I don't have to specify the `'omitnan'` flag in every function call.

```
T = rmmissing(T);
[G,results] = findgroups(T(:,'Region'));
meanLoss = splitapply(@mean,T.Loss,G);
minOutageTime = splitapply(@min,T.RestorationTime - T.OutageTime,G);
maxCustomers = splitapply(@max,T.Customers,G)
```

```
maxCustomers =
     3.295e+06
    5.9689e+06
    2.2249e+06
    1.8202e+06
      4.26e+06
```

The output of `splitapply` is an array. `maxCustomers` is a 5-by-1 numeric array. However, you can tabulate the results by adding them as variables to the table `results`, the second output from `findgroups`.

```
results.meanLoss = meanLoss;
results.minOutageTime = minOutageTime;
results.maxCustomers = maxCustomers
```

```
results =
  5×4 table
     Region      meanLoss    minOutageTime    maxCustomers
    _________    ________    _____________    ____________
    MidWest       907.19       00:00:00        3.295e+06
    NorthEast     383.86       00:00:00       5.9689e+06
    SouthEast     508.34       00:00:00       2.2249e+06
    SouthWest     541.66       00:28:00       1.8202e+06
    West          429.73       00:00:00         4.26e+06
```

So far, I've worked with a table that is small enough to fit into memory. But what if there is so much data that it won't all fit into memory? I might still want to treat it as a table and use the same functions.

That can easily be done by creating a *tall table*. A tall table is a kind of a tall array in MATLAB. Tall arrays and tall tables know how to read the data in one chunk at a time, perform the calculations you want, and then gather up the output at the end. The syntax for working with a tall table is very similar to that for working with a table.

For more information on tall arrays and tall tables, see Tall Arrays.

To show how this works, I'll create a tall table out of `outages.csv`. Of course it's a bit silly to create a tall table out of a file that fits in memory, but this illustrates how to work with a much larger table.

Instead of calling `readtable`, I'll start by calling `datastore` to read `outages.csv` into a datastore.

Then I'll call `tall` on `ds`, to create a tall table out of the datastore. Now I have a tall table that looks and acts very much like the table I worked with in the previous sections. One difference you see is that the tall table is displayed as an M-by-6 table, showing that the number of rows is not yet known.

```
ds = datastore('outages.csv');
T = tall(ds)
```

```
Starting parallel pool (parpool) using the 'local' profile ... connected to 2 workers.

T =
  M×6 tall table
      Region         OutageTime        Loss     Customers     RestorationTime           Cause
    ___________    ________________    ______    __________    ________________    _________________
    'SouthWest'    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'winter storm'
    'SouthEast'    2003-01-23 00:49    530.14    2.1204e+05    NaT                 'winter storm'
    'SouthEast'    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'winter storm'
    'West'         2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    'equipment fault'
    'MidWest'      2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    'severe storm'
    'West'         2003-06-18 02:49         0             0    2003-06-18 10:54    'attack'
    'West'         2004-06-20 14:39    231.29           NaN    2004-06-20 19:16    'equipment fault'
    'West'         2002-06-06 19:28    311.86           NaN    2002-06-07 00:51    'equipment fault'
    :              :                   :         :             :                   :
    :              :                   :         :             :                   :
```

Now I can use `findgroups` and `splitapply` to perform some of the same calculations as before. Again I will wrap `mean` in an anonymous function so I can ignore `NaN` values, as I did above.

```
omean = @(x) mean(x,'omitnan');
T.Region = categorical(T.Region);
T.Cause = categorical(T.Cause);
G = findgroups(T.Region);
meanLoss = splitapply(omean,T.Loss,G);
meanCustomers = splitapply(omean,T.Customers,G)
```

```
meanCustomers =
  M×1 tall double column vector
    ?
    ?
    ?
    :
    :
```

Instead of seeing the output, we see a trail of question marks. What does this mean? Simply that I have not performed any calculations yet. At this point, I have only defined the calculations I want to perform on the tall table. No calculations are performed until I *gather* the results, by calling the `gather` function. The output of `gather` is not a tall array and must fit into memory.

```
meanLoss = gather(meanLoss);
meanCustomers = gather(meanCustomers)
```

```
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 4: Completed in 4 sec
- Pass 2 of 4: Completed in 2 sec
- Pass 3 of 4: Completed in 2 sec
- Pass 4 of 4: Completed in 2 sec
Evaluation completed in 25 sec
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 2 sec
Evaluation completed in 5 sec

meanCustomers =
    2.4015e+05
    1.4917e+05
    1.6776e+05
    2.6975e+05
    1.5201e+05
```

Now let's tabulate the results in a table. Since `meanLoss` and `meanCustomers` are not tall arrays, `results` is not a tall table. Add the regions as another table variable.

```
results = table(meanLoss,meanCustomers)
region = unique(T.Region);
results.region = gather(region)
```

```
results =
  5×2 table
    meanLoss    meanCustomers
    ________    _____________
     1137.7      2.4015e+05
     551.65      1.4917e+05
     495.35      1.6776e+05
     493.88      2.6975e+05
     433.37      1.5201e+05

Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 1 sec
Evaluation completed in 2 sec

results =
  5×3 table
    meanLoss    meanCustomers     region
    ________    _____________    _________
     1137.7      2.4015e+05      MidWest
     551.65      1.4917e+05      NorthEast
     495.35      1.6776e+05      SouthEast
     493.88      2.6975e+05      SouthWest
     433.37      1.5201e+05      West
```

Here I have shown you several ways you can group data in tables for calculations. To summarize, you can use one or more grouping variables to specify groups within the other table variables. Then you can apply functions to the groups in the table variables.

To apply:

- A function to a single table variable and return an array, use `accumarray`.
- The same function to each table variable and return a table, use `varfun`.
- A function that requires all the table variables as input arguments and return a table, use `rowfun`.
- Different functions to different table variables and build a table of results, use `findgroups` and `splitapply`.

How about you? Have you done analysis with tables where `accumarray`, `varfun`, `rowfun`, `findgroups` and `splitapply` might have helped? Do you have cases where you need tall tables? Let us know here.


Published with MATLAB® R2017a


Today our guest blogger, David Garrison, will continue his series on MathWorks involvement in the 2017 solar eclipse.

- Part 1: The Citizen CATE Experiment
- Part 2: Training the Volunteers
**Part 3: Rehearsing for the Eclipse**
- Part 4: Imaging the Eclipse

Here is Part 3 of the Series.

In Part 1 of this series, I discussed MathWorks participation in the Citizen CATE Experiment - a citizen science project to image the 2017 solar eclipse. In Part 2 of this series, I described the volunteers, the equipment they will be using, and how they are being trained.

In this post, I will be joined by Andrei Ursache - another of our MathWorks Citizen CATE volunteers. Andrei is an Application Engineer who works with Image Acquisition Toolbox and Image Processing Toolbox. I will start by giving a brief description of what the volunteer teams are doing to prepare for the eclipse. Andrei will then discuss the MATLAB Solar Eclipse App that will be used by the volunteers and what will happen during and after totality.

The volunteer teams have been working hard to prepare for the eclipse. Each team has been asked to practice capturing images for both solar and lunar observation. The purpose of these practice sessions is to become familiar with the equipment, the software used to capture the images, and the observation protocol.

The complete hardware and software setup consists of the following:

- Windows 10 laptop
- 5 megapixel Point Grey (FLIR) USB3 monochrome camera (Grasshopper3 GS3-U3-51S5M with a 12-bit Sony IMX250 sensor and 2448 x 2048 resolution)
- Daystar 80 mm diameter, 500 mm focal length refractor telescope
- Celestron CG4 equatorial mount with motor drives for tracking
- Arduino microcontroller
- GPS module with antenna
- MATLAB Solar Eclipse App

The software is described in the sections below.

Hi, this is Andrei. As a volunteer for the Citizen CATE Experiment, I wrote the MATLAB Solar Eclipse App which will be used by the volunteer teams to capture the August 2017 total solar eclipse at each observation site on the totality path.

The software is a MATLAB app and uses functionality from Image Acquisition Toolbox, Image Processing Toolbox, and Parallel Computing Toolbox. A standalone executable application, built with MATLAB Compiler, has been installed on each of the laptop computers used by the volunteer teams. The app user interface is workflow-oriented based on the solar observation protocol put together by the Citizen CATE scientists. The graphical user interface consists of multiple tabs, each of them focused on a specific task, such as Alignment, Focus, Calibration, and Totality. As an example, here is a screenshot of the Focus tab, which is used to fine-tune the telescope focus:

A live view of the video stream will be captured by the telescope camera which is connected to the laptop via a USB3 cable. The live view will allow the volunteers to zoom in on a region of interest in order to fine tune the telescope focus. Image Acquisition Toolbox `videoinput` functionality will control the camera, transfer the acquired images into the MATLAB workspace, and provide a preview. A live image histogram, calculated with the `imhist` function in Image Processing Toolbox, will provide visual feedback for optimizing the exposure time. In the histogram, the x-axis shows the pixel intensity values (0-65535 for 16-bit scaled data) and the y-axis shows the number of pixels at each pixel value. The image histogram shows if there are overexposed pixels and will allow the volunteers to choose an appropriate exposure time. A focus quality indicator and a line profile will provide visual feedback for fine-tuning the telescope focus.

During the 2-3 minutes of totality, the camera will capture a stream of images of the Sun's corona. Because the corona's brightness can vary significantly as you move away from the Sun's surface, high dynamic range (HDR) images are required to cover its full brightness range. The camera's exposure time will be controlled by a varying pulse-width TTL signal, output by the Arduino as shown in the picture below. Those exposure times are 0.4, 1.3, 4, 13, 40, 130, 400, and 1300 milliseconds. The multi-exposure sequence will be continuously repeated during totality. Each sequence will then be combined into a single HDR image. For 2.5 minutes of totality, each volunteer site will create about 75 HDR images.

This might be a complicated programming exercise in other languages but in MATLAB it can be achieved with a few lines of code:

```
v = videoinput('pointgrey', 1, 'F7_Mono16_2448x2048_Mode7');
triggerconfig(v, 'hardware', 'risingEdge', 'externalTriggerMode1-Source0');
start(v)
frames = getdata(v);
montage(frames(:,:,:,1:8), 'Size', [2 4])
```

Here is a series of images from a recent lunar observation practice session.

Acquired image frames are saved to disk as TIF format files using the `imwrite` function:

```
for ii = 1:size(frames,4)
    filename = sprintf('frame_%d.tif', ii);
    imwrite(frames(:,:,:,ii), filename, 'tiff');
end
```

In the app, we do not want the image saving operation to delay the execution of other code. Saving to disk is done in a parallel worker using the `parfeval` function from Parallel Computing Toolbox. For an example of how to simultaneously acquire images and save them to disk, see this MATLAB Answers post.

While acquiring the images, the software also logs the frame timestamps. In order to synchronize the timestamps from different observation sites, the software gets GPS time information from the GPS module. The GPS module is transmitting NMEA strings (lines of ASCII text) to the computer via a virtual COM port, as in the example below. The NMEA sentences contain information such as the GPS date and time, latitude, and longitude.

```
$GNZDA,180755.000,20,07,2017,,*47
$GPGGA,180755.000,4217.9848,N,07121.0476,W,1,18,0.6,83.6,M,-33.8,M,,0000*5A
$GNRMC,180755.000,A,4217.9848,N,07121.0476,W,0.00,0.03,200717,,,A*61
```

We make use of the MATLAB `serial` function to communicate with and transfer data from the GPS module. For an example of how to log GPS NMEA strings to a text file, see this MATLAB Answers post.

After the eclipse, I will post the app to the MATLAB File Exchange for anyone who wants to see how it all works.

After totality ends, the multiple exposure images taken during the eclipse will be processed to create a set of high dynamic range (HDR) images as described above. To create a high-dynamic range image from a multi-exposure image sequence, we use the `makehdr` function in Image Processing Toolbox. The `tonemap` function is used to compress the HDR image for viewing on a computer screen, which has a much lower dynamic range.

filenames = cellstr("frame_" + (1:8) + ".tif");
exposures = [0.4 1.3 4 13 40 130 400 1300];
hdr = makehdr(filenames,'RelativeExposure',exposures/exposures(3),...
    'MinimumLimit',1000,'MaximumLimit',54000);
hdr8 = tonemap(hdr);
imshow(hdr8);

Here is an HDR image which was obtained from combining a cycle of 8 multiple exposure images from a lunar practice session and another one created during the March 2016 eclipse in Indonesia.

After the eclipse, all the data taken by the volunteer teams will be uploaded to a server at the National Solar Observatory. The HDR images from all volunteer sites will then be put together to create a 90 minute video of totality.

If you have an interest in amateur Astronomy or have any questions or comments about the Citizen CATE experiment, please let us know here.

That's all for now. In the last post in this series, I'll tell you about the day of the eclipse. I'll describe what happened during totality and show you some images of the corona taken on that day.

One final note. The project sponsors and volunteers have prepared very carefully for the big event. Working with the volunteer teams, they've tried to account for all contingencies. However, there is one variable that we cannot control - weather. We just have to hope that mother nature cooperates on that day.

Get the MATLAB code

Published with MATLAB® R2017a

A lot of information is shared on the web and a lot of people are interested in taking advantage of it. It can be used to enrich existing data, for example. However, that information is buried in HTML tags and it is not easy to extract the useful parts. Today's guest blogger, Toshi Takeuchi shows us how he uses MATLAB for web scraping to harvest useful data from the web and then uses fuzzy string match to enrich existing data.... read more >>

A lot of information is shared on the web and a lot of people are interested in taking advantage of it. It can be used to enrich existing data, for example. However, that information is buried in HTML tags and it is not easy to extract the useful parts. Today's guest blogger, Toshi Takeuchi shows us how he uses MATLAB for web scraping to harvest useful data from the web and then uses fuzzy string match to enrich existing data.

- NIPS 2015 Papers
- Scraping Data from a Web Page
- Matching Scraped Data to Database Table
- Fuzzy String Matching
- Validating Fuzzy Match Results
- Updating Missing Values with Fuzzy Match Result
- Reviewing Unmatched
- Substring Match
- Updating Missing Values with Substring Match Result
- Visualizing Paper Author Affiliation
- Summary

Web scraping is actually pretty easy with MATLAB thanks to new string functions introduced in R2016b.

I am going to use as an example the same data used in Text Mining Machine Learning Research Papers with MATLAB.

If you would like to follow along, please download

- the source of this post by clicking on "Get the MATLAB code" at the bottom of this page
- the data from Kaggle's NIPS 2015 Papers page
- my custom function levenshtein.m
- my custom script to generate GIF animation animateNIPS2015.m

Here I am using `sqlite` in Database Toolbox to load data from a SQLite file. If you don't have Database Toolbox, you can try `readtable` to read CSV files. The `Authors` table only lists names, but I want to enrich it with authors' affiliations to see which organizations are active in this academic conference.

db = 'output/database.sqlite';                              % database file
conn = sqlite(db,'readonly');                               % create connection
Authors = fetch(conn,'SELECT * FROM Authors');              % get data with SQL command
Papers = fetch(conn,'SELECT * FROM Papers');                % get data with SQL command
PaperAuthors = fetch(conn,'SELECT * FROM PaperAuthors');    % get data with SQL command
close(conn)                                                 % close connection
Authors = cell2table(Authors,'VariableNames',{'ID','Name'});    % convert to table
Papers = cell2table(Papers,'VariableNames', ...                 % convert to table
    {'ID','Title','EventType','PdfName','Abstract','PaperText'});
PaperAuthors = cell2table(PaperAuthors,'VariableNames', ...     % convert to table
    {'ID','PaperID','AuthorID'});
head(Authors)

ans =
  8×2 table
     ID             Name
    ___    ______________________
    178    'Yoshua Bengio'
    200    'Yann LeCun'
    205    'Avrim Blum'
    347    'Jonathan D. Cohen'
    350    'Samy Bengio'
    521    'Alon Orlitsky'
    549    'Wulfram Gerstner'
    575    'Robert C. Williamson'

Luckily, there is an HTML file that lists each paper with its authors and their affiliations. Each list item starts with `<i><span class="larger-font">` and ends with `</b><br><br>`; titles and authors are separated by `</span></i><br><b>`, multiple co-authors by semicolons, and finally the names and affiliations by commas.

dbtype output/accepted_papers.html 397:400

397   <div><div><h3>NIPS 2015 Accepted Papers</h3><p><br></p>
398   <i><span class="larger-font">Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing</span></i><br><b>
399   Nihar Shah*, UC Berkeley; Dengyong Zhou, MSR</b><br><br><i><span class="larger-font">Learning with Symmetric Label Noise: The Importance of Being Unhinged</span></i><br><b>
400   Brendan van Rooyen, NICTA; Aditya Menon*, NICTA; Robert Williamson, NICTA</b><br><br><i><span class="larger-font">Algorithmic Stability and Uniform Generalization</span></i><br><b>

Since I have the HTML file locally, I use `fileread` to load text from it. If you want to scrape a web page directly, you would use `webread` instead. The imported text is converted into a string to take advantage of built-in string functions.

html = string(fileread('output/accepted_papers.html')); % load text from file
% html = string(webread('https://nips.cc/Conferences/2015/AcceptedPapers'));

Usually, scraping data from a web page or other unstructured text data sources requires regular expressions, and many people find them powerful but very difficult to use. String functions in MATLAB like `extractBetween`, `extractBefore`, `extractAfter`, `erase`, and `replace` make it ridiculously simple!

pattern1 = '<div><h3>NIPS 2015 Accepted Papers</h3><p><br></p>'; % start of list
pattern2 = '</div> <!--div class="col-xs-12 col-sm-9"-->';       % end of list
list = extractBetween(html, pattern1, pattern2);                 % extract list
pattern1 = '<i><span class="larger-font">';                      % start of list item
pattern2 = '</b><br><br>';                                       % end of list item
listitems = extractBetween(list, pattern1, pattern2);            % extract list items
pattern1 = ['</span></i><br><b>' newline];                       % end of title
titles = extractBefore(listitems,pattern1);                      % extract titles
namesorgs = extractAfter(listitems,pattern1);                    % extract names orgs
namesorgs = erase(namesorgs,'*');                                % erase *
namesorgs = erase(namesorgs,'"');                                % erase "
namesorgs = replace(namesorgs,'  ',' ');                         % remove double space
disp([titles(1:2) namesorgs(1:2)])

"Double or Nothing: Multiplicati…" "Nihar Shah, UC Berkeley; Dengyo…" "Learning with Symmetric Label N…" "Brendan van Rooyen, NICTA; Adit…"

Since multiple co-authors are still contained in a single string, let's separate them into a list of co-authors and their affiliations. When you split a string, you get a varying number of substrings depending on the row. So we need to use `arrayfun` with the `UniformOutput` option set to `false` to split it row by row, trim the results with `strtrim`, and unnest the cell array with `vertcat` to get the list of co-authors with their affiliations.

namesorgs = replace(namesorgs,'&amp;','&');                     % revert escaped &
namesorgs = erase(namesorgs,[char(194) ' </b><span><strong>']); % remove extra tags
namesorgs = replace(namesorgs,['</strong></span><b>' ...        % replace missing semicolon
    char(194)],';');
coauth = arrayfun(@(x) strtrim(split(x,';')), namesorgs, ...    % split by semicolon
    'UniformOutput', false);                                    % and trim white space
coauth = vertcat(coauth{:});                                    % unnest cell array
coauth(1:5)

ans =
  5×1 string array
    "Nihar Shah, UC Berkeley"
    "Dengyong Zhou, MSR"
    "Brendan van Rooyen, NICTA"
    "Aditya Menon, NICTA"
    "Robert Williamson, NICTA"

You now see how easy web scraping is with MATLAB.

Now that we have the list of names with affiliations, we just have to match it to the `Authors` table by name, right? Unfortunately, we see a lot of missing values because the names didn't match even with the `contains` partial match function.

The real hard part now is what to do after you scraped the data from the web.

authors = Authors.Name;                        % author names
names = strtrim(extractBefore(coauth,','));    % extract and trim names
org = strings(length(authors),1);              % initialize accumulator
for ii = 1:length(authors)                     % for each name in |authors|
    res = coauth(contains(names,authors(ii))); % find match in |names|
    if isempty(res)                            % if no match
        org(ii) = missing;                     % mark it missing
    end
    res = extractAfter(res,',');               % extract after comma
    res = strtrim(res);                        % remove white space
    res = unique(res);                         % remove duplicates
    res(strlength(res) == 0) = [];             % remove empty strings
    if length(res) == 1                        % if single string
        org(ii) = res;                         % use it as is
    elseif length(res) > 1                     % if multiple strings
        org(ii) = join(res,';');               % join them with semicolon
    else                                       % otherwise
        org(ii) = missing;                     % mark it missing
    end
end
head(table(authors, org, 'VariableNames',{'Name','Org'}))

ans =
  8×2 table
            Name                             Org
    ______________________    _____________________________________
    'Yoshua Bengio'           "U. Montreal"
    'Yann LeCun'              "New York University"
    'Avrim Blum'              <missing>
    'Jonathan D. Cohen'       <missing>
    'Samy Bengio'             "Google Research"
    'Alon Orlitsky'           "University of California, San Diego"
    'Wulfram Gerstner'        "EPFL"
    'Robert C. Williamson'    <missing>

For example, the partial match doesn't work if a middle initial is missing or a nickname is used instead of the full name. There can be other irregularities. Yikes, we are dealing with unstructured data!

[authors(4)   coauth(contains(names,'Jonathan Cohen'));
 authors(8)   coauth(contains(names,'Robert Williamson'));
 authors(406) coauth(contains(names,'Sanmi Koyejo'));
 authors(440) coauth(contains(names,'Danilo Rezende'));
 authors(743) coauth(contains(names,'Bill Dally'));
 authors(769) coauth(contains(names,'Julian Yarkony'))]

ans =
  6×2 string array
    "Jonathan D. Cohen"         "Jonathan Cohen, Princeton Unive…"
    "Robert C. Williamson"      "Robert Williamson, NICTA"
    "Oluwasanmi O. Koyejo"      "Sanmi Koyejo, Stanford University"
    "Danilo Jimenez Rezende"    "Danilo Rezende, Google DeepMind"
    "William Dally"             "Bill Dally , Stanford Universit…"
    "Julian E. Yarkony"         "Julian Yarkony, Dr."

What can we do when the exact match approach doesn't work? Maybe we could come up with various rules to match strings using regular expressions, but that is very time consuming. Let's revisit the 40-year-old Algorithm That Cannot Be Improved to solve this problem. I created a new custom function `levenshtein` for this example. It calculates the edit distance, which measures the minimum number of edit operations required to transform one string into another, as a way to quantify how similar or dissimilar they are. For more details of this algorithm, please check out the blog post linked above.

Converting Sunday to Saturday requires 3 edit operations.

levenshtein('sunday', 'saturday')

ans = 3
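For readers who want a feel for what is happening inside `levenshtein.m`, here is a minimal Wagner-Fischer sketch of the edit distance. This is my own stripped-down version, not Toshi's actual function:

```matlab
function d = editDistanceSketch(a, b)
% Classic dynamic-programming edit distance (insert/delete/substitute cost 1)
a = char(a); b = char(b);
m = numel(a); n = numel(b);
D = zeros(m+1, n+1);
D(:,1) = 0:m;                 % cost of deleting all of a
D(1,:) = 0:n;                 % cost of inserting all of b
for i = 1:m
    for j = 1:n
        cost = a(i) ~= b(j);  % 0 if the characters match, 1 otherwise
        D(i+1,j+1) = min([D(i,j+1)+1, D(i+1,j)+1, D(i,j)+cost]);
    end
end
d = D(m+1, n+1);
end
```

Saved as `editDistanceSketch.m`, it reproduces the result above: `editDistanceSketch('sunday', 'saturday')` returns 3.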

Perhaps it is easier to understand if I show how similar they are as a match rate rather than the number of edit operations?

levenshtein('sunday', 'saturday', 'ratio')

ans = 0.78571
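The two values above are consistent with a simple normalization of the edit distance by the combined string length. Note that this formula is my inference from the numbers shown; the actual `'ratio'` option may be implemented differently:

```matlab
% Match rate consistent with the values above: (m + n - d) / (m + n)
m = strlength("sunday");       % 6
n = strlength("saturday");     % 8
d = 3;                         % edit distance from above
ratio = (m + n - d) / (m + n)  % 11/14 = 0.78571
```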

Now we can find "Jonathan Cohen" in the top 3 matches for "Jonathan D. Cohen".

fhandle = @(s) levenshtein(authors(4), s, 'ratio');    % function handle
ratios = arrayfun(fhandle, extractBefore(coauth,',')); % get match rates
[~,idx] = sort(ratios,'descend');                      % rank by match rate
[repmat(authors(4),[3,1]) coauth(idx(1:3)) ratios(idx(1:3))]

ans =
  3×3 string array
    "Jonathan D. Cohen"    "Jonathan Cohen, Pr…"    "0.90323"
    "Jonathan D. Cohen"    "Jonathan Bassen, s…"    "0.8125"
    "Jonathan D. Cohen"    "Jonathan Vacher, U…"    "0.8125"

Let's try this approach for the first 10 missing names with the `ignoreCase` option enabled. It looks like we can be fairly confident about the result as long as the maximum match rate is 0.8 or higher.

org(org == 'Dr.') = missing;              % remove salutation
missing_ids = find(ismissing(org));       % get missing ids
matches = strings(1,3);                   % initialize accumulator
for ii = 1:10                             % for each missing id
    cid = missing_ids(ii);                % current id
    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase');
    ratios = arrayfun(fhandle, names);    % get match rates
    [~,idx] = max(ratios);                % get index of max value
    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)]; % update accumulator
end
matches

matches =
  10×3 string array
    "Avrim Blum"              "Manuel Blum, Unive…"     "0.71429"
    "Jonathan D. Cohen"       "Jonathan Cohen, Pr…"     "0.90323"
    "Robert C. Williamson"    "Robert Williamson,…"     "0.91892"
    "Juergen Schmidhuber"     "Jürgen Schmidhuber,"     "0.94595"
    "Wolfgang Maass"          "Wolfgang Maass,"         "1"
    "Robert E. Schapire"      "Robert Schapire, M…"     "0.90909"
    "Tomás Lozano-Pérez"      "Tomás Lozano-Pér…"       "1"
    "Dit-Yan Yeung"           "Dit Yan Yeung, HKUST"    "0.96154"
    "Geoffrey J. Gordon"      "Geoff Gordon, CMU"       "0.8"
    "Brendan J. Frey"         "Brendan Frey, U. T…"     "0.88889"

Now we can apply this approach to update the missing values. With the 0.8 cut-off, 89.3% of the missing values are identified!

for ii = 1:length(missing_ids)               % for each missing id
    cid = missing_ids(ii);                   % current id
    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase');
    ratios = arrayfun(fhandle, names);       % get match rates
    [~,idx] = max(ratios);                   % get index of max value
    if ratios(idx) >= 0.8                    % if max is at least 0.8
        res = extractAfter(coauth(idx),','); % get org name
        res = strtrim(res);                  % trim white space
        if strlength(res) == 0               % if null string
            org(cid) = 'UNK';                % unknown
        else                                 % otherwise
            org(cid) = res;                  % update org
        end
    end
end
fprintf('Missing reduced by %.1f%%\n', ...
    (1 - sum(ismissing(org))/length(missing_ids))*100)

Missing reduced by 89.3%

When you review what was not matched, you can see that there were no good matches except for "John W. Fisher III", "Oluwasanmi O. Koyejo", and "Danilo Jimenez Rezende". They have much lower match rates and didn't make the cut, but we don't necessarily want to ease the cut-off. Is there an alternative?

missing_ids = find(ismissing(org));       % get missing ids
matches = strings(1,3);                   % initialize accumulator
for ii = 1:10                             % for each missing id thru 10th
    cid = missing_ids(ii);                % current id
    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase');
    ratios = arrayfun(fhandle, names);    % get match rates
    [~,idx] = max(ratios);                % get index of max value
    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)]; % update accumulator
end
matches

matches =
  10×3 string array
    "Avrim Blum"              "Manuel Blum, Unive…"    "0.7619"
    "John W. Fisher III"      "John Fisher, MIT"       "0.75862"
    "Özgür Şimşek"            "Ozgur Simsek, Max …"    "0.71429"
    "Joseph J. Lim"           "Joseph Salmon, Tel…"    "0.76923"
    "Pascal Fua"              "Pascal Vincent, U.…"    "0.70833"
    "François Fleuret"        "Matthias Feurer, U…"    "0.71875"
    "Oluwasanmi O. Koyejo"    "Sanmi Koyejo, Stan…"    "0.75"
    "Samet Oymak"             "James Kwok, Hong K…"    "0.71429"
    "Danilo Jimenez Rez…"     "Danilo Rezende, Go…"    "0.77778"
    "Kafui Dzirasa"           "Rahul Krishnan, Ne…"    "0.66667"

Instead of trying to match the whole string, we can reorder the tokens and find the most similar substring of the longer string based on the shorter string, using the `ignoreOrder` and `partial` options:

- `ignoreOrder` breaks strings into tokens, reorders the tokens by finding the union and intersection of the tokens, and computes the edit distance.
- `partial` finds the substring within the longer string that is closest to the shorter string.

This clearly gives higher match rates in some specific cases. "John W. Fisher III" can be reordered to "John Fisher III W.", and the first two tokens (a substring) match "John Fisher", which makes it a 100% match.
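To make the `partial` idea concrete, here is a toy sliding-window sketch. It is my own illustration with a crude per-character distance, not the logic inside `levenshtein.m`:

```matlab
% Slide the shorter string along the longer one and keep the best window
shortStr = 'John Fisher';                % shorter string
longStr  = 'John Fisher III W.';         % reordered longer string
n = numel(shortStr);
best = 0;
for k = 1:numel(longStr)-n+1
    window = longStr(k:k+n-1);           % candidate substring of length n
    d = sum(window ~= shortStr);         % crude mismatch count for the sketch
    best = max(best, (2*n - d)/(2*n));   % normalized match rate
end
best                                     % 1: an exact substring match exists
```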

matches = strings(1,3);                   % initialize accumulator
for ii = 1:10                             % for each missing id thru 10th
    cid = missing_ids(ii);                % current id
    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase', ...
        'ignoreOrder', 'partial');
    ratios = arrayfun(fhandle, names);    % get match rates
    [~,idx] = max(ratios);                % get index of max value
    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)]; % update accumulator
end
matches

matches =
  10×3 string array
    "Avrim Blum"              "Manuel Blum, Unive…"    "0.75"
    "John W. Fisher III"      "John Fisher, MIT"       "1"
    "Özgür Şimşek"            "Ozgur Simsek, Max …"    "0.66667"
    "Joseph J. Lim"           "Joseph Wang,"           "0.81818"
    "Pascal Fua"              "Pascal Vincent, U.…"    "0.85"
    "François Fleuret"        "Tian Tian, Tsinghu…"    "0.75"
    "Oluwasanmi O. Koyejo"    "Sanmi Koyejo, Stan…"    "0.79167"
    "Samet Oymak"             "James Hensman, The…"    "0.77273"
    "Danilo Jimenez Rez…"     "Danilo Rezende, Go…"    "1"
    "Kafui Dzirasa"           "Kai Fan, Duke Univ…"    "0.71429"

Substring match is a tricky choice - it can produce more false positives. So let's use a match rate of 0.9 or above as the cut-off. With this approach, the missing values were further reduced by 35.5%.

for ii = 1:length(missing_ids)               % for each missing id
    cid = missing_ids(ii);                   % current id
    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase', ...
        'ignoreOrder', 'partial');
    ratios = arrayfun(fhandle, names);       % get match rates
    [~,idx] = max(ratios);                   % get index of max value
    if ratios(idx) >= 0.9                    % if max is at least 0.9
        res = extractAfter(coauth(idx),','); % get org name
        res = strtrim(res);                  % trim white space
        if strlength(res) == 0               % if null string
            org(cid) = 'UNK';                % unknown
        else                                 % otherwise
            org(cid) = res;                  % update org
        end
    end
end
fprintf('Missing reduced by %.1f%%\n', ...
    (1 - sum(ismissing(org))/length(missing_ids))*100)

Missing reduced by 35.5%

Now we can use the matched data to visualize the graph formed by paper-author affiliations. The `PaperAuthors` table can be seen as an edge list where papers and authors represent nodes of a graph and each row represents an edge. We just need to replace authors with their affiliations so that each node represents a paper or the affiliation of its authors, and each edge represents the authorship of the paper.

- A paper has multiple incoming edges if it is co-authored by scholars from multiple organizations,
- and those papers are also connected to other nodes if their co-authors also contribute to other papers.
- Papers authored within a single organization would be isolated, so they are removed from the graph to simplify the visualization.
- The resulting graph represents the largest connected component of the overall graph.
- Node size and color reflect outdegree - how many papers are contributed by the members of a given organization.

This visualization shows that we have more work to do - for example, "University of Texas Austin" and "UT Austin" are the same organization but we have two separate nodes because organization names are not standardized, and that means some organizations are probably under-represented. Can you improve this visualization?

Af = table(Authors.ID,org, 'VariableNames',{'AuthorID','Org'}); % affiliation
Af(Af.Org == 'UNK' | ismissing(Af.Org),:) = [];          % remove UNK & missing
T = innerjoin(PaperAuthors(:,2:3), Af);                  % join with PaperAuthors
T = innerjoin(T,Papers(:,[1,2]), 'Keys', 1);             % join with Papers
[T,~,idx] = unique(T(:,[3,4]),'rows');                   % remove duplicate rows
w = accumarray(idx,1);                                   % count duplicates
s = cellstr(T.Org);                                      % convert to cellstr
t = cellstr(T.Title);                                    % convert to cellstr
G = digraph(s,t, w);                                     % create directed graph
bins = conncomp(G,'type', 'weak', 'OutputForm', 'cell'); % get connected comps
[~, idx] = max(cellfun(@length, bins));                  % find largest comp
G = subgraph(G, bins{idx});                              % subgraph largest comp
figure                                                   % new figure
colormap cool                                            % use cool colormap
msize = 10*(outdegree(G) + 3)./max(outdegree(G));        % marker size
ncol = outdegree(G) + 3;                                 % node colors
named = outdegree(G) > 7;                                % nodes to label
h = plot(G, 'MarkerSize', msize, 'NodeCData', ncol);     % plot graph
layout(h,'force3','Iterations',30)                       % change layout
labelnode(h, find(named), G.Nodes.Name(named));          % add node labels
title('NIPS 2015 Papers - Author Affiliation Graph')     % add title
axis tight off                                           % set axis
set(gca,'clipping','off')                                % turn off clipping
zoom(1.3)                                                % zoom in

Web scraping can be a very useful skill to have to collect information from the web, and MATLAB makes it very easy to extract information from a web page. The resulting data is often unstructured, but you can deal with it using techniques like fuzzy string matching.

I hope this example gives you a lot of new ideas. Give it a try and let us know how your experience goes here!

Get the MATLAB code

Published with MATLAB® R2017a

Today our guest blogger, David Garrison, will continue his series on MathWorks involvement in the 2017 solar eclipse.... read more >>

Today our guest blogger, David Garrison, will continue his series on MathWorks involvement in the 2017 solar eclipse.

- Part 1: The Citizen CATE Experiment
- **Part 2: Training the Volunteers**
- Part 3: Rehearsing for the Eclipse
- Part 4: Imaging the Eclipse

Here is Part 1 of the Series.

Here is Part 2 of the Series.

In Part 1 of this series I discussed the total solar eclipse that will occur on August 21, 2017 where the path of totality will pass over the continental United States. The National Solar Observatory is running a citizen science project called the Citizen CATE Experiment. The project will have volunteer teams spread out along the path of totality to capture images of the sun's corona during the eclipse. This will give solar astronomers an unprecedented amount of data that they can use to study coronal dynamics to try to understand how temperatures in the corona can be orders of magnitude higher than on the sun's surface. MathWorks is one of the sponsors of this project and several MathWorks staff are among the volunteers.

In this post, I'm going to talk about the project volunteers, the equipment they will be using on the day of the eclipse, and how the volunteer teams are being trained.

As I mentioned, the Citizen CATE Experiment will use a fleet of telescopes spread across the United States to observe and record data from the eclipse. Those telescopes will be operated by a group of volunteers at each site. Each state on the eclipse path has a coordinator who is organizing their state's volunteers to make sure they have the equipment and training they need for the eclipse. The states on the eclipse path are Oregon, Idaho, Wyoming, Nebraska, Kansas, Missouri, Kentucky, Tennessee, North Carolina, and South Carolina.

The project's original goal was to recruit 60 volunteer teams. Those would be funded by money from corporate sponsors and from grants from NASA and the National Science Foundation. Currently, the number of teams is up to 68. Those extra 8 teams are all self-funded.

This is a citizen science project so most of the volunteers are amateurs. There will be a few professional astronomers -- professors and PhD students -- among the volunteers. The rest will be university groups, high schools, astronomy clubs and individuals. For example, Wyoming, which has the largest number of sites at 11, has 7 sites operated by teams from local high schools. Wyoming also has a site that will be operated by an astronomy club coming from New York.

One of the great things about this project is the participation of young people who are interested in science and technology. Here are a couple of pictures of some of the younger volunteers. On the left is Arianna Roberts, daughter of Harrison Roberts one of our MathWorks volunteers. Harrison and his family will be observing from a site in South Carolina. On the right is me with my two sons, Thomas and William. We will be observing from a site in Tennessee. Our third MathWorks staff member who will be volunteering is Andrei Ursache. Andrei and his two daughters, Elena and Corina, will be observing from a site in South Carolina.

Each Citizen CATE team will use identical equipment at each observing site. That will ensure that the data from all the sites can be combined to get images for 90 minutes of totality as the eclipse makes its way across the continental United States.

**The Telescopes:** The telescopes are 80 mm refractors manufactured and donated by DayStar Filters. What this means is that the telescope has an aperture of 80 millimeters (about 3.1 inches). The *aperture* is a measure of a telescope's light gathering capacity. Usually, the larger the aperture the better -- more light gathering capability for viewing dim objects. For solar astronomy, an 80 mm aperture telescope is plenty big enough. A refracting telescope is one that uses a lens to focus the light. The other major type is a reflecting telescope, which uses a mirror to focus the light. Each telescope will be equipped with a white-light, full aperture filter. The filter can only be removed when the sun is fully eclipsed.

**The Mount and Drive Systems:** The telescopes sit on a German equatorial mount manufactured and donated by Celestron. The advantage of this type of mount is that it can be set up to follow an astronomical object as it moves across the sky (due to the earth's rotation). The mount has two motors which slowly move the telescope to keep the target in view over time.

**The Cameras:** Cameras will be attached to each telescope to capture images of the eclipse during the totality phase. The cameras are Point Grey Grasshopper3 5.0 MP Mono USB3 Vision cameras. These cameras are monochrome with a resolution of 2448 by 2048 pixels (about 5 Megapixels). They use a Sony IMX250 CMOS sensor. They have a maximum frame rate of about 75 frames per second and an exposure range of 0.006 microseconds to 32 seconds. Camera control and image capture will be done using MathWorks Image Acquisition Toolbox.

**The Software:** Each volunteer will also have a laptop that will be running an application written with MATLAB and Image Acquisition Toolbox. During the period of totality, the camera will cycle through a series of different exposure times. Each cycle will take approximately 2 seconds. After the eclipse is over, Image Processing Toolbox will be used to create a high-dynamic range (HDR) image from each multiple exposure cycle. A site with 2.5 minutes of totality will get about 75 HDR images.

Here is a picture of the telescope assembled with the camera attached.

During the months of April and May, the state coordinators organized training for all of their state's volunteer teams. We attended our training at Tennessee Technological University with about ten other volunteer teams.

The first thing we had to learn was how to assemble the tripod, mount, and telescope. That included installing the motors that allow the telescope to track the sun. Then we learned how to attach the camera and connect it to the PC to capture the images. As soon as we finished, we were told to disassemble everything and do it all again. The trainers wanted to make sure we really learned how everything works.

After lunch, we took all the equipment outside to learn to track the sun and use the cameras. Surprisingly, the hardest thing to do was to "capture the sun". That means aiming the telescope at the sun. As soon as you put a white light filter on the telescope, everything goes dark -- the sun is the only thing bright enough to penetrate the filter. So you have to manually move the telescope around until the sun comes into view. Once that's done, the motors on the telescope mount will track the sun.

The last step in the training was to disassemble everything and pack it up for shipping. We shipped our equipment back to Massachusetts so we could practice for the eclipse. We'll ship it back to Tennessee a few days before the eclipse.

If you have an interest in amateur Astronomy or have any questions or comments about the Citizen CATE experiment, please let us know here.

That's all for now. In my next post I'll describe how the teams are rehearsing for the eclipse. I'll talk in some detail about how MathWorks software is being used to capture and post-process images from the eclipse. Finally, I'll show some images from trial runs performed by some of the teams.

Get the MATLAB code

Published with MATLAB® R2017a

I recently started following a new blog by Martin Trauth, called MATLAB Recipes for Earth Sciences. Even if you are not an earth scientist (which, by the way, is part of my scientific lineage), you may find many useful nuggets there.... read more >>

I recently started following a new blog by Martin Trauth, called MATLAB Recipes for Earth Sciences. Even if you are not an earth scientist (which, by the way, is part of my scientific lineage), you may find many useful nuggets there.

Martin is a geoscience professor at University of Potsdam. He's written well-regarded textbooks, one with the same title as his blog (which he currently posts to at a very healthy pace, one I cannot sustain!).

Martin's posts cover a huge range of topics, many of which are not at all specific to, though helpful to, geoscientists. Some notable titles I read recently include:

- Missing Version History of MATLAB Functions
- Counting Flamingos with MATLAB
- Better Avoid Running Means
- Detecting Change Points in Time Series with MATLAB
- Calculating 3D Point Clouds From Stereo Images Using MATLAB
- Classical Linear Regression of Log-Transformed Data

In addition to posts on computation, Martin's blog also addresses working with Lego Mindstorms hardware with MATLAB. And more.

I think Martin's site is one you might want to start following. Do you know of other sites in a similar vein? Let me know here.

Get the MATLAB code

Published with MATLAB® R2017a