
Of course the data we collect is always perfect - NOT! Maybe yours is different. What can go wrong? So many things: instruments drift, web sites go down, power goes out, ... So what can you do if you have gaps in your data and the analysis you want to perform won't tolerate them? You might decide you need to fill in missing values.

We've been working on supplying functionality that makes dealing with missing data easier for a long time, starting with the introduction of `NaN` values right in the beginning. In a floating point array, NaNs act as placeholders. That's great, but what can you do from there?

Some functions, or variants of them, behave differently when your array contains `NaN` values. For example, `mean` returns `NaN` if any element is `NaN`, while `mean(x,'omitnan')` ignores the `NaN` values.
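A quick sketch of that difference (values chosen for illustration):

```
A = [1 NaN 3 4];
mean(A)             % NaN - a single NaN propagates through the result
mean(A,'omitnan')   % 2.6667 - the mean of the non-NaN values only
```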

We first helped you figure out whether you have missing values, and later added the ability to fill and remove them. More recently, we added the ability to mark missing values even when you don't know the datatype of the array. This makes it easier to supply `NaN`, `NaT` (not a time), and the analogous missing values for `categorical` and `string` arrays, without needing to know which one is appropriate - as can happen with the different columns of a `table`.
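As a minimal sketch of that workflow (table contents invented for illustration), note how the same functions handle a `double` column and a `categorical` column in one call:

```
T = table([1; NaN; 3], categorical({'a'; ''; 'b'}), ...
    'VariableNames',{'v','c'});
ismissing(T)                       % flags NaN and <undefined> alike
Tfill = fillmissing(T,'previous'); % fill each gap with the preceding value
Trem  = rmmissing(T);              % or drop rows containing any missing data
T.v(2) = missing;                  % 'missing' converts to NaN, NaT, etc. as needed
```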

Do you use the functionality to deal with missing values? If so, tell us how. If not, please tell us what is missing!

You can let us know here.

Get the MATLAB code

Published with MATLAB® R2018b


Today's guest blogger is Josh Meyer, a Technical Writer for the MATLAB Math and Big Data teams. He is going to discuss a common issue encountered in scattered data interpolation, and how to fix it!

A common issue that can arise while performing interpolation on scattered data is that the resulting functional surface seems to be of lower quality than you would expect - for example, it does not seem to pass through all of the sample data points. This post explores how and why the scaling of scattered data can affect the interpolation results.

Let's jump right into an example so you can see what I'm talking about. Imagine you just collected data from sensors to sample some values at several points. Ultimately you want to fit a surface to the data so that you can approximate the value of the underlying function at points where you don't have data. So, you start by plotting.

```
rng default
x = rand(500,1)/100;
y = 2.*(rand(500,1)-0.5).*90;
vals = (x.*100).^2;
ptColor = [.6 .07 .07];
plot3(x,y,vals,'.','Color',ptColor)
grid
xlabel('x')
ylabel('y')
zlabel('v(x,y)')
title('Scattered Data Points')
```

Next, you use `scatteredInterpolant` to create an interpolant for the data. This computes an interpolating function for the observed points, allowing you to query the function anywhere within its convex hull. You create a grid of query points, evaluate the interpolant at those points, and plot the functional surface. All done!

```
F = scatteredInterpolant(x,y,vals);
X = linspace(min(x),max(x),25);
Y = linspace(min(y),max(y),25);
[xq, yq] = meshgrid(X,Y);
zq = F(xq,yq);
hold on
triColor = [0.68 0.88 0.95];
surf(xq,yq,zq,'FaceColor',triColor)
hold off
xlabel('x')
ylabel('y')
zlabel('v(x,y)')
title('Scattered Data with Interpolated Surface')
```

... but wait, the result is definitely not what you expected! What are all of those "folds" in the surface? And why does it look like the surface doesn't pass through all the points? Interpolated surfaces should pass through *all* of the data points!

Let's take a quick detour to talk about scattered data interpolation before coming back to this problematic data.

Unlike gridded interpolation, where the point locations are well-defined, scattered data presents different challenges. To find the interpolated value at a given query point, you need to use the values of nearby points. But how can you do that when the data is scattered around? The location of one point can't be used to predict the location of another. And comparing the locations of all the points ad nauseam to determine which are close to a given query point is not a very efficient approach.

```
X = [-1.5 3.2; 1.8 3.3; -3.7 1.5;
-1.5 1.3; 0.8 1.2; 3.3 1.5;
-4.0 -1.0; -2.3 -0.7; 0 -0.5;
2.0 -1.5; 3.7 -0.8; -3.5 -2.9;
-0.9 -3.9; 2.0 -3.5; 3.5 -2.25];
V = X(:,1).^2 + X(:,2).^2;
plot(X(:,1),X(:,2),'r*')
```

To solve this, MATLAB first computes a Delaunay triangulation of the scattered data. This creates a series of triangles out of the data points, such that the circumscribed circles created by the vertices of each triangle do not enclose any points. The computed Delaunay triangulation is unique, up to trivial symmetries. And the best part is, the triangulation can be easily queried to determine what points are closest to a given query point.

This plot shows the Delaunay triangulation and the circumscribed circles for the scattered data. Notice that the red data points lie on the boundaries of one or more circumscribed circles, but none of them lie in the interior of one of the circles.

```
tr = delaunayTriangulation(X(:,1),X(:,2));
[C,r] = circumcenter(tr);
a = C(:,1);
b = C(:,2);
pos = [a-r, b-r, 2*r, 2*r];
hold on
triplot(tr)
for k = 1:length(r)
    rectangle('Position',pos(k,:),'Curvature',[1 1],'EdgeColor',triColor)
end
xlabel('x')
ylabel('y')
title('Delaunay Triangulation of Scattered Data')
hold off
```

Armed with the triangulation of the data, finding the value of the interpolated surface at a given query point becomes a problem of querying the triangulation structure to determine which triangle encloses the query point. Then, the data points comprising the vertices of that triangle can be used to calculate the value of the interpolated surface at the query point, depending on which interpolation method is being used (nearest-neighbor, linear, etc...).
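For linear interpolation, that calculation amounts to a barycentric weighting of the three vertex values. Here is a minimal sketch (triangle and values invented for illustration) using `pointLocation`, which returns both the enclosing triangle and the barycentric coordinates of the query point:

```
P = [0 0; 1 0; 0 1];                      % triangle vertices
V = [1; 2; 3];                            % values at the vertices
tr = delaunayTriangulation(P);
[ti,bc] = pointLocation(tr,[0.25 0.25]);  % enclosing triangle + barycentric weights
vq = bc * V(tr.ConnectivityList(ti,:)')   % weighted average of vertex values: 1.75
```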

This plot shows a query point in a 2-D triangulation. To find the interpolated value at this query point, MATLAB uses the vertices of the enclosing triangle. By repeating this calculation at many different query points, a functional surface for the data forms. This functional surface allows you to make predictions at points where you did not collect data.

```
triplot(tr)
hold on
plot(X(:,1),X(:,2),'r*')
trisurf(tr.ConnectivityList,X(:,1),X(:,2),V, ...
    'FaceColor',triColor, ...
    'FaceAlpha',0.9)
axis([-4, 4, -4, 4, 0, 25]);
grid on
plot3(-2.6,-2.6,0,'*b','LineWidth',1.6)
plot3([-2.6 -2.6]',[-2.6 -2.6]',[0 13.52]','-b','LineWidth',1.6)
hold off
view(322.5, 30);
text(-2.0, -2.6,'Xq', ...
    'FontWeight','bold', ...
    'HorizontalAlignment','center', ...
    'BackgroundColor','none');
xlabel('x')
ylabel('y')
zlabel('v(x,y)')
title('Interpolation of Query Point')
```

Now that we know `scatteredInterpolant` uses a Delaunay triangulation of the data to perform its calculations, let's examine the underlying triangulation of the data we are having trouble interpolating.

```
tri = delaunayTriangulation(x,y);
triplot(tri)
hold on
plot(x,y,'r*')
hold off
xlabel('x')
ylabel('y')
title('Triangulation of Original Data')
```

What a mess! There are only a few small pockets of reasonable triangles, with most of them being long and thin, connecting points that are far away from each other. It looks like the triangulation *does* pass through all of the data points, while the poor interpolated surface we saw earlier does not. We would expect that interpolated surface to look better.

Based on this triangulation, MATLAB has the unenviable job of determining which of those long, thin triangles each query point lies in. And this brings us to ...

The poor interpolation results are caused by the very different scales of the data (the x-axis ranges from [0, 0.01] and the y-axis from [-100, 100]) and the resulting long, thin triangles in the triangulation. After MATLAB classifies the query points into their corresponding triangles, it then uses the triangle vertices to find the value of the interpolated surface. Since the long, thin triangles have vertices that are far away from each other, this results in the local values of the interpolated surface being based on the values of far away points, rather than neighboring points.

Here is a GIF showing what happens to a triangulation as the *x*-data is scaled, while the *y*-data is constant. What starts out as a well-defined triangulation quickly becomes full of many long, thin triangles.

The data is effectively being shoved into a smaller and smaller area along the x-axis. If the axes are kept equal, the above animation looks like this.

Clearly, the triangulation underpinning the scattered interpolation calculation is sensitive to distortions in the data. The algorithm expects the *x,y*-coordinates to be reasonably scaled relative to each other. But, why?

This behavior stems from the use of circumscribed circles in the Delaunay triangulation algorithm. When you squeeze the data in one direction, you are effectively altering the radii of the circumscribed circles. Taken to the extreme, as the points get squeezed together and approach a straight line in one dimension, the radii of the circumscribed circles go to infinity, since the circumcenter - the point equidistant from all three vertices of a triangle - recedes to infinity as those vertices become collinear. The distribution of triangles is then increasingly determined by the proximity of the points in the *y* direction, which is why points that have close *y* values but very different *x* values become triangle vertices.
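You can watch this happen numerically with a single triangle (coordinates chosen for illustration). As the x-coordinates are squeezed and the three vertices approach a vertical line, the circumradius grows without bound:

```
for s = [1 0.1 0.01]
    P = [0 0; 1 0.5; 0 1] .* [s 1];   % squeeze the x-coordinates only
    [~,r] = circumcenter(delaunayTriangulation(P));
    fprintf('x-scale %.2f -> circumradius %.3f\n', s, r)
end
```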

If we add the circumscribed circles to the above animation, you can see that the radius of the circles blows up as the data gets squeezed.

The solution to this issue is typically to remove the distortions in the data by normalizing. Normalization can improve the interpolation results in some cases when the triangulation has many thin triangles, but in others it can compromise the accuracy of the solution. Whether to use normalization is a judgment made based on the nature of the data being interpolated.

**Benefits**: Normalizing your data can potentially improve the interpolation result when the independent variables have different units and substantially different scales. In this case, scaling the inputs to have similar magnitudes might improve the numerical aspects of the interpolation. An example where normalization would be beneficial is if `x` represents engine speed in RPMs from 500 to 3500, and `y` represents engine load from 0 to 1. The scales of `x` and `y` differ by a few orders of magnitude, and they have different units.

**Cautions**: Use caution when normalizing your data if the independent variables have the same units, even if the scales of the variables are different. With data of the same units, normalization distorts the solution by adding a directional bias, which affects the underlying triangulation and ultimately compromises the accuracy of the interpolation. An example where normalization is erroneous is if both `x` and `y` represent locations and have units of meters. Scaling `x` and `y` unequally is not recommended because 10 m due East should be spatially the same as 10 m due North.

For our problem, let's assume `x` and `y` have different units and normalize them so that they have similar magnitudes. You can use the relatively new `normalize` function to do this; by default it uses Z-scores of the data. This transforms the data so that the original mean $\mu$ becomes 0, and the original standard deviation $\sigma$ becomes 1:

$$x' = \frac{\left(x - \mu\right)}{\sigma}.$$

This normalization is very common and is also called *standardization*.

```
x = normalize(x);
y = normalize(y);
```

Now that the data is normalized, let's take a look at the triangulation.

```
tri = delaunayTriangulation(x,y);
triplot(tri)
hold on
plot(x,y,'r*')
hold off
xlabel('x')
ylabel('y')
title('Triangulation of Normalized Data')
```

This is much better! The triangles are mostly well defined. There are only a few pockets of long, thin triangles near the edges. Let's take a look at the scattered interpolation surface now.

```
X = linspace(min(x),max(x),25);
Y = linspace(min(y),max(y),25);
[xq, yq] = meshgrid(X,Y);
F = scatteredInterpolant(x,y,vals);
zq = F(xq,yq);
plot3(x,y,vals,'.','Color',ptColor)
hold on
surf(xq,yq,zq,'FaceColor',triColor)
hold off
grid on
xlabel('x')
ylabel('y')
zlabel('v(x,y)')
title('Normalized Data with Interpolated Surface')
```

In this case, normalizing the sample points permits `scatteredInterpolant` to compute a better triangulation, and thus a better interpolated surface.

Normalization is just one form of data scaling. Z-score normalization shifts the data so that the mean becomes 0 and scales it so that the standard deviation becomes 1, while preserving the shape of the distribution.

You can also `rescale` the data, changing the extent of its largest and smallest values (for example, making 0 the smallest value and 1 the largest value). Rescaling likewise preserves the shape of the distribution, but squeezes or expands it along the number line, changing both the mean and the standard deviation.
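Here is a small sketch of `rescale` in action (numbers invented for illustration):

```
v = [3 7 11];
rescale(v)        % -> [0 0.5 1]: the minimum maps to 0, the maximum to 1
rescale(v,-1,1)   % -> [-1 0 1]: or map into any interval you choose
```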

Data scaling is important to pay attention to. If the data is not properly scaled, it can hide relationships in the data or distort their strength. The results of some algorithms can vary a lot depending on the exact scaling used. We just saw how the triangulation of scattered data is sensitive to scaling, but it can cause problems in other algorithms as well. For example, machine learning algorithms typically depend on the Euclidean distance between points, so they are sensitive to the scaling of the data. Without proper scaling, some features will have too large a contribution compared to others, skewing the results.

But this is a broad topic best covered in another post!

Do you use scattered data interpolation regularly in your work? If so tell us about it in the comments!

I'd like to thank Cosmin Ionita, Damian Sheehy, and Konstantin Kovalev for their valuable insights while I was writing this post.



#shelovesmatlab

Today's guest blogger is Penny Anderson, Senior Engineer Manager, MATLAB Products. And she is going to answer the big question today, WHY?

- A Suggestion for WHY
- What Does The WHY Function Do?
- The Names in WHY
- Adding Female Names to WHY
- New WHY
- Paula Loves MATLAB
- More About Paula Burgi
- Can you tell us how you came to be in this field?
- How did you decide to become a researcher?
- Who are some of the women in your personal and professional life who have influenced you the most? What was their contribution to your growth?
- How do MATLAB and other MathWorks tools help you in your work?
- Why Do You Love MATLAB?

On July 12th, 2012 a suggestion came in through our MATLAB Central forum

*The "why" function on MATLAB assumes the gender of users. In fact, both men and women use MATLAB. For example, the proper noun section contains solely names generally reserved for men. It would be appreciated if this were remedied.*

This comment came from a MATLAB user Paula Burgi, then an undergraduate student of Geoscience at Smith College. When we asked her about it more recently, Paula recalled:

*It was actually during my first summer doing scientific research during college. I was learning how to use Matlab, and came across the “why” command while trying to teach myself some Matlab basics. At the time I was under the impression that coding languages were only very serious and practical tools, no humor allowed. So it was a fun surprise to see that Matlab had these built-in commands intended just to give the user a laugh. However, since I was attending Smith College (an all-women’s institution), it did not escape me that almost all the names that were implemented in the “why” command were male. It was a sad reminder that the rest of the world was not filled with female scientists and computer programmers, contrary to my experience at Smith. Having both male and female names seemed like a simple but effective measure to make science feel just a little bit more inclusive, so I decided that I would send an email to Matlab making this suggestion.*

Four days later MathWorks Employee #1 Loren Shure turned the comment into an enhancement request and it landed on my desk. At the time, I was managing a couple of groups of software developers – all of them men – working on MATLAB. Because of my management responsibilities, I didn’t get to do much fun coding work anymore. So, when this small but intriguing project caught my attention, I jumped on it.

I took a look at the `why` function, which is really a simple demo that builds up arbitrarily long sentences leveraging a random number generator and a fixed set of nouns, adjectives, verbs, proper nouns and so on. Here are a couple fun examples:

```
rng(27); why
```

A good kid obeyed a very young and terrified mathematician.

```
rng(111); why
```

Loren suggested it.

Sure enough, the proper names used to construct the sentences included:

- Cleve for Cleve Moler, the original author of MATLAB and co-founder of MathWorks
- Jack for Jack Little, our President and CEO, founder of MathWorks
- Bill for Bill McKeeman, a compiler expert and the author of the why function
- Joe for Joe Hicklin, our chief scientist and all-around idea instigator
- Pete, for G.W. “Pete” Stewart, a numerical analyst and long-time colleague of Cleve Moler
- Loren for Loren Shure, MathWorks Employee #1, actually a woman, but with a more conventional male spelling to her name
- Damian for Damian Packer, one of the original authors of Handle Graphics in MATLAB
- And Barney for the purple dinosaur?

“Hmm”, I thought. “She’s right. There are no obvious female names. I can fix that!”

I added my own name – Penny. After 16 years of being in software development and management at the very core of the math capabilities in MATLAB, I was one of the most senior women in Engineering.

Even though I was the only woman in my group at that time, I was actually unusual for the software development industry in that I had already had three female managers in my years at MathWorks, including my original hiring manager Loren Shure. Since Loren’s name was already on the list, I added the other two names – Mary Ann and Nausheen – as well.

Finally, I added the name – Mara – of another long-time colleague who was also a manager of a software development group at MathWorks.

Soon I was ready to submit the change to our continuous integration system. After passing all tests, the new improved version of the why demo went out to MATLAB users world-wide with Release R2013a.

And now Paula and all MATLAB users may be amused by reasons like:

```
rng(5); why
```

Because Mara wanted it.

```
rng(13); why
```

Mary Ann wanted it that way.

```
rng(44); why
```

Nausheen told me to.

```
rng(75); why
```

Penny knew it was a good idea.

Over the years I thought of Paula’s request from time to time. Finally, in 2018 as our grassroots group of Women at MathWorks was spreading **#shelovesmatlab** and we were getting ready to attend the Grace Hopper Celebration of Women in Computing, I reached out to Paula by email, after tracking her down to her new role as a PhD student at Cornell. To my pleasant surprise she responded enthusiastically.

*What a nice email to get! I'm surprised and delighted to hear that my comments years ago are still being considered today. I tried it out for myself, and sure enough I got Mary Ann, Penny, and even Mara!*

*As a graduate student in geophysics, I still use MATLAB every day! It's definitely my preferred data analysis software.*

We took the opportunity to interview Paula about her work and her love of MATLAB.

Paula is currently a PhD student at Cornell University, studying geophysics and remote sensing. She received a B.A. at Smith College in Geosciences and Astronomy, then went on to work as a research assistant at the Earth Observatory of Singapore. There, she helped with active seismic imaging of the Main Frontal Thrust in Nepal, and conducted research on shallow megathrust fault geometry in Bangladesh. She now studies how remote sensing techniques can be improved to study earth deformation.

Installing GPS stations to track fault movement in Sagaing, Myanmar.

I’ve always been interested in earth science, especially earthquakes. I grew up in Oak Park, IL, but traveled to Switzerland often during my childhood because my father is Swiss. The stark contrast between the very flat Midwest and the dramatic topography of the Swiss Alps had me pondering about geological processes at a young age. When I learned that researching the earth was a job, I never looked back! I ended up studying tectonics, particularly earthquakes. I have done both geological fieldwork to look at effects of past earthquakes, as well as geophysical work using geodetic data to track current deformation of the earth’s crust. Now, I use satellite radar data (InSAR) to observe deformation due to earthquakes, volcanoes, and many other sources. I am currently assessing how vegetation can affect these measurements (figure below).

Example of satellite radar data (InSAR), showing the signal introduced by vegetation loss in Oregon (rectangular features).

I decided in high school that I wanted to pursue science. I loved my high school physics class; the teacher really wanted students to conceptualize what we were learning in class so that we could apply it to everyday life. I also took geology and astronomy (which ended up being my majors in college); both were very self-guided classes, much like real research! What finally cemented my desire to become a researcher was a summer research internship I did in college. I used GPS data to deduce information about the subduction zone offshore Chile. This internship was my first real experience with research (and Matlab!), and I absolutely loved the everyday challenges of taking a large-scale research question, breaking it down, and posing it in terms of a script.

As cliche as it may sound, my mother had a huge impact on my personal growth as a scientist. She is a doctor, graduating from medical school when only ~25% of her class was women. Her curiosity about the natural world definitely rubbed off on me, like when she would get out a microscope to enthusiastically look at the different kinds of mold that would grow on expired food items. She led by example when it came to pursuing things that you are passionate about, particularly when most people doubt you’ll be able to achieve it. I’ve had two excellent female academic advisors, one at the Earth Observatory of Singapore and the other during my current PhD research. Both of these women have shown me that it is very much possible to have a balanced work-home life, and that it does not make you any “less” of a scientist.

My undergraduate thesis looked at GPS and geologic data before and after a large earthquake in Costa Rica. I used this data to model where the earthquake occurred, and to gain insight into how our modeling techniques can bias our output. My main question was: how do different fault models affect the estimated earthquake location and magnitude? Determining earthquake location and magnitude from surface displacement data is a classic inverse problem. I used Matlab’s built-in matrix inversion functions to determine the earthquake parameters. I also used Matlab’s mapping toolbox to visualize the data and models. Finally, I used Matlab’s parallel computing toolbox, which sped up calculations by a lot! I found that the different fault models resulted in up to 30% more energy released by an earthquake. The most realistic fault model resulted in a slip distribution that appeared to be controlled by small variations in the surface of the fault. I learned all I know to this point about computational geophysics from Matlab. I still use Matlab almost every day, and continue to grow as a scientist because of it!

Fault slip from the 2012 Costa Rica Earthquake, derived from GPS data using Matlab’s built-in matrix inversion functions.

Please share this post with the #shelovesmatlab hashtag. Know an amazing woman who loves our tools you’d like us to blog about? Let us know here.



"All the noise, noise, noise, NOISE!"

-- The Grinch

Today's guest blogger is Tom Lane. Tom has been a MathWorks developer since 1999, working primarily on the Statistics and Machine Learning Toolbox. He'd like to share with you a couple of issues that MATLAB users repeatedly encounter.

The topic for today is curve fitting. Let's look at a simple exponential function:

```
rng default
x = rand(10,1);
y = 10*exp(-5*x);
```

We can plot this, but many of the values are smooshed up against the `X` axis. The `semilogy` function can help with that, and it also turns the relationship into a straight line.

```
subplot(1,2,1);
plot(x,y,'x');
subplot(1,2,2);
semilogy(x,y,'x');   % log(y) = -5*x + log(10)
```

Suppose we have the `X` and `Y` values, and we can see or guess the functional form, but we don't know the constant values 10 and 5. We can estimate them from the data. We can do this using either the original data shown at the left, or the `log(Y)` transformed data shown at the right.

```
p1 = fitnlm(x,y,'y ~ b1*exp(b2*x)',[1 1])
p2 = polyfit(x,log(y),1); p2(2) = exp(p2(2))
```

```
p1 = 

Nonlinear regression model:
    y ~ b1*exp(b2*x)

Estimated Coefficients:
          Estimate    SE    tStat    pValue
          ________    __    _____    ______
    b1       10       0      Inf       0
    b2       -5       0     -Inf       0

Number of observations: 10, Error degrees of freedom: 8
Root Mean Squared Error: 0
R-Squared: 1,  Adjusted R-Squared 1
F-statistic vs. zero model: Inf, p-value = 0

p2 =

   -5.0000   10.0000
```

Both fits give the same coefficients.

Some time ago, a MATLAB user reported that he was fitting this curve to his own data, and getting different parameter estimates from the ones given by other software. They weren't dramatically different, but larger than could be attributed to rounding error. They were different enough to raise suspicion. If we add a little noise to `log(y)`, we can reproduce what the user saw.

```
y = exp(log(y) + randn(size(y))/10);
p1 = fitnlm(x,y,'y ~ b1*exp(b2*x)',[1 1])
p2 = polyfit(x,log(y),1); p2(2) = exp(p2(2))
xx = linspace(0,1)';
subplot(1,1,1)
plot(x,y,'x', xx,predict(p1,xx),'r-')
```

```
p1 = 

Nonlinear regression model:
    y ~ b1*exp(b2*x)

Estimated Coefficients:
          Estimate      SE        tStat       pValue  
          ________    _______    _______    __________
    b1     10.015     0.34113      29.36    1.9624e-09
    b2    -4.8687     0.23021    -21.149     2.625e-08

Number of observations: 10, Error degrees of freedom: 8
Root Mean Squared Error: 0.141
R-Squared: 0.997,  Adjusted R-Squared 0.996
F-statistic vs. zero model: 1.91e+03, p-value = 1.91e-11

p2 =

   -4.8871   10.0008
```

Which estimates to believe? Well, it turns out that once you add noise, these models are no longer equivalent. Adding noise to the original data is one thing. Adding noise to the log data gives noise values that, back on the original scale, grow with the value of `y`. We can see the difference between the two more easily if we generate a larger set of data.

```
rng default
X = rand(100,1);
Y = 10*exp(-5*X);
subplot(1,2,1);
Y1 = Y + randn(size(Y))/5;
plot(X,Y1,'x', xx,10*exp(-5*xx),'r-');
subplot(1,2,2);
Y2 = exp(log(Y) + randn(size(Y))/10);
plot(X,Y2,'x', xx,10*exp(-5*xx),'r-');
```

It's hard to see at the top of the plots, but near `y=0` we can see that the noise is larger on the left than on the right. On the left the noise is additive. On the right, the noise is multiplicative.

Which model is correct or appropriate? We'd have to understand the data to decide that. One clue is that if negative values are plausible when the curve approaches zero, then an additive model may be appropriate. If the noise is more plausibly described in terms like +/- 10%, then the multiplicative model may be appropriate.

Now, not all models are easily transformed this way. A multiplicative model may still be appropriate even when no such simple transformation exists. Fortunately, the `fitnlm` function from the Statistics and Machine Learning Toolbox has a feature that lets you specify the so-called "error model" directly.

```
p1 = fitnlm(x,y,'y ~ b1*exp(b2*x)',[1 1],'ErrorModel','proportional')
```

```
p1 = 

Nonlinear regression model:
    y ~ b1*exp(b2*x)

Estimated Coefficients:
          Estimate      SE       tStat       pValue  
          ________    ______    _______    __________
    b1     9.9973     0.8188      12.21    1.8787e-06
    b2    -4.8771     0.1162    -41.973     1.144e-10

Number of observations: 10, Error degrees of freedom: 8
Root Mean Squared Error: 0.121
R-Squared: 0.974,  Adjusted R-Squared 0.97
F-statistic vs. zero model: 344, p-value = 1.75e-08
```

This model isn't exactly the same as the ones before. It's modeling additive noise, but with a scale factor that increases with the size of the fitted function. But rest assured, it takes into account the different noise magnitudes, so it may be useful for data having that characteristic.

Now that I have your attention, though, I'd like to address another topic related to curve fitting. This time let's consider the Weibull curve.

```
rng default
x = wblrnd(5,2,100,1);
subplot(1,1,1)
histogram(x, 'BinEdges',0:14, 'Normalization','pdf')
```

The Weibull density has this form:
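In symbols, with scale parameter $A$ and shape parameter $B$, the density is

$$f(x) = \frac{B}{A}\left(\frac{x}{A}\right)^{B-1} e^{-\left(x/A\right)^{B}}, \qquad x \ge 0.$$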

```
X = linspace(0,20);
A = 5; B = 2;
Y = (B/A) * (X/A).^(B-1) .* exp(-(X/A).^B);
hold on
plot(X,Y)
hold off
```

Suppose we don't know the parameters `A` and `B`. Once again, there are two ways of estimating them. First, we could get the bin centers and bin heights, and use curve fitting to estimate the parameters.

```
heights = histcounts(x, 'BinEdges',0:14, 'Normalization','pdf');
centers = (0.5:1:13.5)';
fitnlm(centers,heights,@(params,x)wblpdf(x,params(1),params(2)),[2 2])
```

```
ans = 

Nonlinear regression model:
    y ~ wblpdf(x,params1,params2)

Estimated Coefficients:
               Estimate      SE       tStat       pValue  
               ________    _______    ______    __________
    params1     4.6535     0.18633    24.975    1.0287e-11
    params2     1.9091     0.10716    17.815    5.3598e-10

Number of observations: 14, Error degrees of freedom: 12
Root Mean Squared Error: 0.0182
R-Squared: 0.937,  Adjusted R-Squared 0.932
F-statistic vs. zero model: 197, p-value = 6.62e-10
```

The alternative is not to treat this like a curve fitting problem, but to treat it like a distribution fitting problem instead. After all, there is no need for us to artificially bin the data before fitting. Let's just fit the data as we have it.

```
fitdist(x,'weibull')
```

```
ans = 

  WeibullDistribution

  Weibull distribution
    A = 4.76817   [4.29128, 5.29806]
    B = 1.96219   [1.68206, 2.28898]
```

It's much simpler to call the distribution fitting function than to set this up as a curve fitting function. But in case that doesn't convince you, I would like to introduce the concept of statistical efficiency. Notice that the distribution fitting parameters are closer in this case to the known values 5 and 2. Is that just a coincidence? A method is statistically more efficient than another if it can get the same accuracy using less data. Let's have a contest. We will fit both of these models 1000 times, collect the estimates of `A`, and see which one is more variable.

```
AA = zeros(1000,2);
for j = 1:1000
    x = wblrnd(5,2,100,1);
    heights = histcounts(x, 'BinEdges',0:14, 'Normalization','pdf');
    f = fitnlm(centers,heights,@(params,x)wblpdf(x,params(1),params(2)),[2 2]);
    p = fitdist(x,'weibull');
    AA(j,1) = f.Coefficients.Estimate(1);
    AA(j,2) = p.A;
end
mean_AA = mean(AA)
std_AA = std(AA)
```

```
mean_AA =
    5.0145    4.9973

std_AA =
    0.3028    0.2566
```

We have a winner! Both methods have a mean that is close to the known value of 5, but the values produced by distribution fitting are less variable.

When you approach a curve fitting problem, I recommend you consider a few things:

- Is this a curve fitting problem at all? If it involves the relationship between two variables, it's curve fitting. If it involves the distribution of a single variable, try to approach it as a distribution fitting problem.
- Is the function one that can transform to a simpler form, such as a linear relationship? If so, consider doing that to make the fitting process more efficient.
- However, consider the noise. Does the noise seem additive or multiplicative? Does the noise vary with the magnitude of `Y`? Are negative `Y` values plausible? These questions can help you decide whether to fit on the original or transformed scale, and whether to specify an error model.
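To make the transformation point concrete, here is a hypothetical sketch (the model `y = A*exp(b*x)` and all values are invented for illustration). With multiplicative noise, taking logs turns the model into a linear fit and stabilizes the noise at the same time:

```
% Hypothetical model with multiplicative noise: y = A*exp(b*x)*noise
x = (1:50)';
y = 3*exp(0.1*x) .* exp(0.05*randn(size(x)));
% On the log scale the model is linear: log(y) = log(A) + b*x
p = polyfit(x, log(y), 1);
b_est = p(1)        % estimate of b, should be near 0.1
A_est = exp(p(2))   % estimate of A, should be near 3
```

If the noise had instead been additive on the original scale, fitting `log(y)` would distort it, and fitting on the original scale (or specifying an error model) would be the better choice.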

What criteria do you use to choose a model for fitting your data? Let us know here.


Published with MATLAB® R2018b


*Today's guest post comes from Sean de Wolski, one of Loren's fellow Application Engineers. You might recognize him from MATLAB answers and the pick of the week blog!*

One of my colleagues approached me last month and asked for help debugging an error with `splitapply`. The `splitapply` function takes group information and applies a function to each group in the data (sort of like a pivot table). Note that everything here also applies to the lower-level but more powerful function `accumarray`.

The documentation provides numerous simple examples for what `splitapply` does so check them out if you're not familiar with it.
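If you want a feel for the mechanics before reading on, here is a minimal sketch (the vectors are made up for illustration):

```
% Group 1 holds the first two values, group 2 the rest.
g = [1; 1; 2; 2; 2];
v = [10; 20; 3; 4; 5];
m = splitapply(@mean, v, g)   % returns [15; 4]
```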

Here is an anonymized version of the data set my colleague had. The first three variables are categorical identifiers, and the fourth contains data associated with them.

```
load('Bears.mat')
disp(Bears)
```

```
     Candy        Animal        Sports                    Bytes
    ________    __________    ___________    ________________________________

    Cinnamon    Black         Brown            -0.8095     0.40391        2
    Cinnamon    Polar         Brown            -2.9443    0.096455        3
    Cinnamon    Polar         Brown             1.4384     0.13197        9
    Cinnamon    Sloth         Chicago          0.32519     0.94205        1
    Cinnamon    Sloth         Chicago         -0.75493     0.95613        5
    Cinnamon    Sloth         Chicago           1.3703     0.57521        2
    Cinnamon    Sloth         Chicago          -1.7115     0.05978       10
    Cinnamon    Sloth         Chicago         -0.10224     0.23478        8
    Cinnamon    Sun           Baylor          -0.24145     0.35316        6
    Cinnamon    Sun           Baylor           0.31921     0.82119        5

    ... (display continues for the remaining rows)
```

The operation he was trying to perform was the NaN-omitted mean of `Bytes`, grouped by two of the categories.

```
[animalcandy, animal, candy] = findgroups(Bears.Animal,Bears.Candy);
meanbyte = splitapply(@(x)mean(x, 'omitnan'), Bears.Bytes, animalcandy);
```

```
Error using vertcat
Dimensions of arrays being concatenated are not consistent.

Error in splitapply>localapply (line 257)
        finalOut{curVar} = vertcat(funOut{:,curVar});

Error in splitapply (line 132)
    varargout = localapply(fun,splitData,gdim,nargout);

Error in mainDebuggingGroupedOps (line 32)
meanbyte = splitapply(@(x)mean(x, 'omitnan'), Bears.Bytes, animalcandy);
```

Hmm, I've seen that error before, but what does it have to do with this? How do we debug it? One could put a breakpoint at the anonymous function `@(x)mean(x, 'omitnan')` and then step with the debugger until the error occurs.

This would likely work for a small number of groups, but as the number of groups grows, it would take many steps, one for each function evaluation. You'd also likely have to do it twice: a second time after the error occurs. Setting the debugger to stop on errors may work for `splitapply`, but not for `accumarray`, which is built in; even with `splitapply`, it may not stop you in a useful spot.
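For reference, the stop-on-errors setting mentioned above is toggled like this:

```
dbstop if error    % pause in the workspace where any error is thrown
% ... reproduce the failure ...
dbclear if error   % turn it off again when done
```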

A trick I like to use is to replace the function handle with `@(x){x}`. This takes whatever is provided and packs it into a cell, so you can see exactly what is being passed into each function evaluation for each group.

```
bytecell = splitapply(@(x){x}, Bears.Bytes, animalcandy);
disp(bytecell)
```

```
    [14×3 double]
    [ 1×3 double]
    [ 5×3 double]
    [ 2×3 double]
    [78×3 double]
    [ 6×3 double]
    [ 8×3 double]
    [ 3×3 double]
    [ 3×3 double]
    [ 7×3 double]
```

From here we can see that the second cell has only one row. Since `mean` operates along the first non-singleton dimension, it reduces that row to a scalar by averaging across it, while the remaining cells yield 1×3 row vectors from taking the mean of each column. A scalar can't be concatenated with those row vectors, so we get the error.
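You can reproduce the dimension pitfall with a couple of lines that don't need the table at all (the matrices here are made up):

```
% mean works along the first non-singleton dimension:
a = mean(ones(5,3))   % 1x3 row vector (column means)
b = mean(ones(1,3))   % scalar (mean of the single row)
% vertcat(a, b) then fails: 1x3 and 1x1 don't concatenate.
```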

The fix for this is simple: pass the dimension argument to `mean` to force it to always take column means. Then rebuild the table with the labels.

```
meanbyte = splitapply(@(x)mean(x, 1, 'omitnan'), Bears.Bytes, animalcandy);
disp(table(animal, candy, meanbyte))
```

```
      animal       candy                 meanbyte
    __________    ________    _________________________________

    Spectacled    Gummy        0.075572    0.55805         5
    Spectacled    Cinnamon       1.1093    0.45092         9
    Sun           Gummy         -0.1449    0.42005       6.6
    Sun           Cinnamon      0.03888    0.58718       5.5
    Sloth         Gummy       -0.081113    0.51523    5.2949
    Sloth         Cinnamon     -0.28948    0.55249    5.8333
    Black         Gummy        -0.45653     0.4153     5.125
    Black         Cinnamon     -0.33481    0.40734    1.3333
    Polar         Gummy         0.62722    0.58464    7.6667
    Polar         Cinnamon     -0.13228    0.32043    4.7143
```

In this case, the fix was fairly discernible from a quick inspection. If it were not, we could loop over the cell and evaluate the function on each element to see where the error occurs. If the error occurs on a specific cell's data, the loop will stop there and we can investigate. If it's on the concatenation step, that will be obvious at the end.

```
fun = @(x)mean(x, 'omitnan');
meanbytecell = cell(size(bytecell));
for ii = 1:numel(bytecell)
    meanbytecell{ii} = fun(bytecell{ii});
end
disp(meanbytecell)
```

```
    [1×3 double]
    [   3.5201]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
```

And now it is obvious why these won't concatenate.

I find looping over the cell in this manner to be much easier than looping over the original data set and trying to identify which elements are in which groups and indexing correctly.

I'm also a big fan of using `splitapply`/`accumarray` with cell output for making objects or plots based on grouped data when the object can't be returned directly. Continuing this example, we'll create a histogram for each group of original data, wrapping `histogram` in `{}`.

```
figure
axes('ColorOrder', parula(numel(animal)))
hold on
h = splitapply(@(x){histogram(x)}, Bears.Bytes, animalcandy);
legend([h{:}], compose("%s/%s", animal, candy));
```

As an aside, development has been working to make grouped operations easier over the last few releases with a collection of new functions:

- `varfun` - Apply function to table input variables based on grouping ones
- `groupsummary` - Group summary statistics
- `grouptransform` - Transform based on groups
- `retime` - Aggregate based on time

Doing this same operation with `varfun` would look like this:

```
meanbytetable = varfun(@(x)mean(x, 1, 'omitnan'), Bears, ...
    'GroupingVariables', {'Animal', 'Candy'}, ...
    'InputVariables', {'Bytes'});
disp(meanbytetable)
```

```
      Animal       Candy      GroupCount              Fun_Bytes
    __________    ________    __________    _________________________________

    Spectacled    Gummy           14         0.075572    0.55805         5
    Spectacled    Cinnamon         1           1.1093    0.45092         9
    Sun           Gummy            5          -0.1449    0.42005       6.6
    Sun           Cinnamon         2          0.03888    0.58718       5.5
    Sloth         Gummy           78        -0.081113    0.51523    5.2949
    Sloth         Cinnamon         6         -0.28948    0.55249    5.8333
    Black         Gummy            8         -0.45653     0.4153     5.125
    Black         Cinnamon         3         -0.33481    0.40734    1.3333
    Polar         Gummy            3          0.62722    0.58464    7.6667
    Polar         Cinnamon         7         -0.13228    0.32043    4.7143
```

Do you work with grouped data functions? Let us know here.



*Today we have two guest bloggers, Lisa Kempler and Pradeep Ramamoorthy, who work at MathWorks in Natick, Massachusetts supporting and developing online tools for researchers. Their post talks about a relatively new code-sharing platform, Code Ocean.*

Code Ocean is a cloud-based platform aimed at furthering computational reproducibility and open research. The site, accessible via a web browser, enables researchers to share code and data associated with their published research. Visitors to the site can view and run the code, thereby verifying that the code produces the results described in the original research paper. The platform supports a variety of programming languages, including MATLAB.

In an effort to provide easier access to code and data, Code Ocean recently announced the ability to export compute capsules:

https://codeocean.com/blog/post/new-compute-capsules-now-exportable-from-code-ocean

Users can now download these code capsules, or containers -- encapsulations of code, data and computational environment – in order to reuse and build on the published research and code, including the computational environment. MATLAB users who download compute capsules containing MATLAB code can run the code and view the associated results on their local computers.

Reproducible Research (RR) has long been a big push by publishers and funding organizations that want to ensure that research is sufficiently vetted. There are two major benefits of good, verifiable research: 1) publications provide high-value information, and 2) researchers doing follow-on research can confidently build on the work of their peers.

Researchers’ desire to leverage historical research has led to a movement around open science, or, more broadly, open research. The primary goal of the openness is the same as the underlying drivers of RR; if you make sure the results can be reproduced, then it’s reasonable to build on those results. However, “open” takes it a step further. Openness pushes RR beyond proving the validity of the research to Reuse – a requirement to make the research approaches and resulting artifacts broadly accessible.

Using Code Ocean, published authors can reproduce and verify their research results. However, Code Ocean's main value to researchers is the ability to reuse the work of their published peers.

Although the ability to download the code, associated data, and related graphical and numeric output gives researchers a huge head-start, buy-in by researchers who submit for publication is still limited. In a recent article (An empirical analysis of journal policy effectiveness for computational reproducibility), Stodden et al demonstrate the lack of engagement in RR by most researchers. The study deemed 56 of 204 published papers computationally reproducible, even after multiple attempts to get additional information from authors of the remaining 148. The study’s finding, 25% compliance for published papers that are inherently computational, tells us that the norm is still 1) non-reproducibility and 2) not-so-transparent paths to Reuse for most published computational research.

Having MATLAB language support on Code Ocean makes it easier for researchers to share their work. Using these tested outputs, MATLAB users can create new research, and transfer their learnings to new innovations and products in science and industry. Code Ocean’s easy-upload and sharing platform holds the possibility of increasing RR compliance (and, in turn, Reuse-ability), as publishers, authors, and follow-on researchers see the value in sharing.

Compute capsules are the foundational units on Code Ocean. They encapsulate the elements required to reproduce and reuse research – code, data, documentation and a specification of the computational environment. Researchers create a compute capsule associated with their research, and visitors open these capsules to examine and run the code.

Let’s say you’re a researcher working in the field of neuroscience. You hear about ongoing research and development of models for simulating brain fibers.

Once you log in to the Code Ocean website (setting up an account is quick, and free), you can explore the curated gallery of published compute capsules, or search for relevant terms. If you search for ‘fiber’ or ‘brain’, you see relevant results, as shown below:

The first search result – Fiber Source Separation – looks promising and might be what you were looking for. Clicking on the link will take you to the Code Ocean IDE, which allows you to interact with the code, look at supporting documents and visualizations, and run the code on Code Ocean's cloud platform.

To export this capsule, just select the ‘Export’ option from the ‘Capsule’ menu.

Selecting this option should initialize the download process. Once it completes, you can extract the downloaded package. REPRODUCING.md, below, is your read-me file, with the steps needed to reproduce the results of the capsule. The next step, unpacking the capsule, requires you to install Docker and have some experience using it.

The ability to view and reuse code associated with published research is a big plus. Having the bi-directional linkage between the code and the published article, from papers on publisher sites to the code and from Code Ocean capsules back to the papers, makes it easy to find and use the different related components. If you have a published paper with associated MATLAB code, consider uploading it to Code Ocean. Or visit Code Ocean to view and download research-related MATLAB code.

Have you used Code Ocean (or similar platforms) for your research and code-sharing needs? Let us know here.



Do you ever need to explain to others unfamiliar with your work what it's about? One situation I frequently face is explaining machine learning to audiences who want to learn more about it but are not yet particularly conversant in it. This is regardless of the audience, be they students, professors, researchers, or folks working in government and industry as scientists and engineers. So, what do I do?

A few years ago I wanted to find a way to explain machine learning in a way that would make it understandable and fun. I came up with an explanation that illustrates what's going on in machine learning without any of the mathematical details.

Most people I know learned regression somewhere along the way, often in a stats class, or perhaps they were exposed to clustering and classification using the famous Fisher Iris dataset (here are our classification and clustering examples). I tried this approach a few times and, as much as I like flowers and sepal length and petal width, I thought I could do better to make machine learning concepts easier to grasp.

I came up with the idea of using animals. Who doesn't like dogs, cats, and birds, or at least some of these? Well, it worked out well: over the last couple of years I've shown this over a hundred times and always get positive feedback from the audience. A few months ago some colleagues asked me what I was showing customers these days. When I described my animal story, they were pretty excited and thought I should record a video to help others understand this machine learning area that everyone seems to want to know more about.

Well, to cut to the chase (spoiler alert: a cheetah is the regression winner), you can now watch the video.

**Does This Explanation Help?**

Did you find this explanation useful, either for yourself, or to pass along to others? What other concepts have you needed to "illustrate"? Let me know here.



My guest blogger today is Sebastian Gross. He is an engineer in our academia team in Munich, Germany, and works with our university customers to make them successful in research and teaching. In today’s post he introduces TopoToolbox[1,2].

Thank you, Loren, for giving me this opportunity to write about TopoToolbox. This collection of functions offers quantitative methods for the analysis of relief and flow pathways in digital elevation models while not using a GIS environment. One application is to learn about flooding risks in a particular geographical area.

The idea to write a blog post about it came to me when reading news such as ‘Flooding has slammed every Iowa county since 1988, some as many as 17 times’[3]. Headlines like these make you think. Even closer to home for me, severe flooding was reported in six German federal states and seven other countries in 2013[4].

Most of us have seen some degree of flooding - a street submerged in water, a flooded field, or a river that has left its usual boundaries. Just this year, on a private vacation trip, my wife and I encountered a bridge that had collapsed due to a river’s force, and we had to continue to an archeological site on foot. Research suggests that the risk is increasing because of climate change[5]. So, I was tempted to take a closer look at flooding.

A while back during the European Geoscience Union General Assembly (EGU) in Vienna, Austria, in April 2018, I met Wolfgang Schwanghart who is a geomorphologist at the University of Potsdam in Germany. Wolfgang is one of the authors of TopoToolbox. I had previously met the second author Dirk Scherler as well. Dirk works at the German Research Center for Geosciences (GFZ) in Potsdam and one of his research fields is geomorphology.

TopoToolbox is available via File Exchange[6] as a zip file, or from the project’s website[1] in several formats, including a toolbox file (.mltbx) for simple installation and management. The toolbox comes with an introduction document in Live Script format, which lets you easily move along the code sections, read the rich comments, and view the results inline.

You can import topological data from opentopography.org with the command:

```
DEM = readopentopo();
```

```
-------------------------------------
readopentopo process:
         DEM type: SRTMGL3
          API url: http://opentopo.sdsc.edu/otr/getdem
  Local file name: C:\Users\loren\AppData\Local\Temp\tp6995bbab_ef43_46a2_bac9_b701b896fb61.tif
             Area: 2.4e+03 sqkm
-------------------------------------
Starting download: 21-Sep-2018 11:27:03
Download finished: 21-Sep-2018 11:27:13
Reading DEM: 21-Sep-2018 11:27:13
GRIDobj cannot derive a map projection structure. This is either because
the grid is in a geographic coordinate system or because geotiff2mstruct
cannot identify the projected coordinate system used. TopoToolbox assumes
that horizontal and vertical units of DEMs are the same. It is recommended
to use a projected coordinate system, preferably UTM WGS84. Use the
function GRIDobj/reproject2utm to reproject your grid.
DEM read: 21-Sep-2018 11:27:14
Temporary file deleted
Done: 21-Sep-2018 11:27:14
-------------------------------------
```

The process also returns a warning with additional information. We will run the suggested reprojection later.

If you do not specify anything else, the returned data is from the area around Fresno, California. By chance, I passed through that area a few years back, but here I used data from the Bavarian Danube river area highlighted in the article[4].

The dataset of southern Bavaria can be loaded with

```
load('bavaria_dem.mat');
```

To prepare our data for use with TopoToolbox, I follow the warning message's advice and run `reproject2utm` .

```
DEM = reproject2utm(DEM,90);
```

The results so far can be displayed with the simple command `imagesc`.

```
imagesc(DEM);
```

However, we can achieve better highlighting of ridges, slopes, and mountains (hillshading) with `imageschs`.

```
imageschs(DEM);
```

Now, back to the original idea of looking at water pathways. Using the `fillsinks` command, holes in the map are filled automatically.

```
DEMf = fillsinks(DEM);
```

With the prepared elevation model, we can calculate the flow direction which is stored in a `FLOWobj` . These flow directions are the basis for deriving river networks stored in another object, the `STREAMobj`.

```
FD = FLOWobj(DEMf);
S = STREAMobj(FD,'minarea',1000);
```

We can display the network with the `plot` command.

```
plot(S);
```

Finally, the water accumulation in our network can be calculated using the `FLOWobj` .

```
A = flowacc(FD);
```

And the resulting water accumulation can be displayed in a graph easily.

```
imageschs(DEM,dilate(sqrt(A),ones(5)),'colormap',flowcolor, ...
    'colorbarylabel','Flow accumulation [sqrt(# of pixels)]', ...
    'ticklabel','nice');
```

So, why don’t you head over and try the area where you live? The command `readopentopo` can be used with the interactive switch `readopentopo('interactive',true)` to let you choose your area of interest freely.

You get a map window and can select your favorite area. When you finish selecting the area in the map window of the interactive mode, you can confirm it by clicking a button. It works like a charm.

Let us know how this went for you here!

[1] TopoToolbox – MATLAB-based software for topographic analysis (website, accessed: July 18th, 2018)

[4] 2013 European floods, Wikipedia, (article, accessed: July 18th, 2018)

[6] TopoToolbox (File Exchange, accessed: July 18th, 2018)



Today I'd like to introduce a guest blogger, Stephen Doe, who works for the MATLAB Documentation team here at MathWorks. In today's post, Stephen shows us new functions for displaying, arranging, and plotting data in tables and timetables.

In R2013b, MATLAB® introduced the `table` data type, as a convenient container for column-oriented data. And in R2016b, MATLAB introduced the `timetable` data type, which is a table that has timestamped rows.

From the beginning, these data types offered advantages over cell arrays and structures. But over the course of several releases, the table and graphics development teams have added many new functions for tables and timetables. These functions add convenient ways to display and arrange tabular data. Also, they offer new ways to make plots or charts directly from tables, without the intermediate step of peeling out variables. As of R2018b, MATLAB boasts many new functions to help you make more effective use of tables and timetables.

To begin, I will use the `readtable` function to read data from a sample file that ships with MATLAB. The file `outages.csv` contains simulated data for electric power outages over a period of 12 years in the United States. The call to `readtable` returns a table, `T`, with six variables and 1468 rows, so I will suppress the output using a semicolon.

```
T = readtable('outages.csv');
```

One typical way to examine the data in a large table is to display the first few rows of the table. You can use indexing to access a subset of rows (and/or a subset of variables, for that matter). For example, this syntax returns the first three rows of `T`.

```
T(1:3,:)
```

```
ans =

  3×6 table

      Region          OutageTime        Loss     Customers     RestorationTime         Cause
    ___________    ________________    ______    __________    ________________    ______________

    'SouthWest'    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'winter storm'
    'SouthEast'    2003-01-23 00:49    530.14    2.1204e+05                 NaT    'winter storm'
    'SouthEast'    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'winter storm'
```

I have a confession to make: I have written many table examples, using that syntax. And occasionally, I **still** catch myself starting with code like `T(3,:)`, which accesses only one row.

Happily, in R2016b we added the `head` function to return the top rows of a table. Here's the call to return the first three rows using the `head` function.

```
head(T,3)
```

```
ans =

  3×6 table

      Region          OutageTime        Loss     Customers     RestorationTime         Cause
    ___________    ________________    ______    __________    ________________    ______________

    'SouthWest'    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'winter storm'
    'SouthEast'    2003-01-23 00:49    530.14    2.1204e+05                 NaT    'winter storm'
    'SouthEast'    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'winter storm'
```

Similarly, `tail` returns the bottom rows of a table. (If you do not specify the number of rows, then `head` and `tail` return eight rows.)
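For instance, with the table `T` read above:

```
tail(T)      % bottom eight rows (the default)
tail(T,3)    % bottom three rows
```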

After examining your table, you might find that you want to organize your table by moving related variables next to each other. For example, in `T` you might want to move `Region` and `Cause` so that they are together.

One way to move table variables is by indexing. But if you use indexing and want to keep all the variables, then you must specify them all in order, as shown in this syntax.

```
T = T(:,{'OutageTime','Loss','Customers','RestorationTime','Region','Cause'})
```

You also can use numeric indices. While more compact, this syntax is less readable.

```
T = T(:,[2:5 1 6])
```

When your table has many variables, it is awkward to move variables using indexing. Starting in R2018a, you can use the `movevars` function instead. Using `movevars`, you only have to specify the variables of interest. Move the `Region` variable so it is before `Cause`.

```
T = movevars(T,'Region','Before','Cause');
head(T,3)
```

```
ans =

  3×6 table

        OutageTime        Loss     Customers     RestorationTime       Region           Cause
    ________________    ______    __________    ________________    ___________    ______________

    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'SouthWest'    'winter storm'
    2003-01-23 00:49    530.14    2.1204e+05                 NaT    'SouthEast'    'winter storm'
    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'SouthEast'    'winter storm'
```

It is also likely that you want to add data to your table. For example, let's calculate the duration of the power outages in `T`. Specify the format to display the duration in days.

```
OutageDuration = T.RestorationTime - T.OutageTime;
OutageDuration.Format = 'dd:hh:mm:ss';
```

It is easy to add `OutageDuration` to the end of a table using dot notation.

```
T.OutageDuration = OutageDuration;
```

However, you might want to add it at another location in `T`. In R2018a, you can use the `addvars` function. Add `OutageDuration` so that it is after `OutageTime`.

```
T = addvars(T,OutageDuration,'After','OutageTime');
head(T,3)
```

```
ans =

  3×7 table

        OutageTime       OutageDuration     Loss     Customers     RestorationTime       Region           Cause
    ________________    ______________    ______    __________    ________________    ___________    ______________

    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    2002-02-07 16:50    'SouthWest'    'winter storm'
    2003-01-23 00:49             NaN      530.14    2.1204e+05                 NaT    'SouthEast'    'winter storm'
    2003-02-07 21:15     09:10:59:00       289.4    1.4294e+05    2003-02-17 08:14    'SouthEast'    'winter storm'
```

Now, let's remove `RestorationTime`. You can easily remove variables using dot notation and an empty array.

T.RestorationTime = [];

However, in R2018a there is also a function to remove table variables. To remove `RestorationTime`, use the `removevars` function.

```
T = removevars(T,'RestorationTime');
head(T,3)
```

```
ans =

  3×6 table

       OutageTime        OutageDuration     Loss     Customers       Region           Cause
    ________________    ______________    ______    __________    ___________    ______________

    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    'SouthWest'    'winter storm'
    2003-01-23 00:49             NaN      530.14    2.1204e+05    'SouthEast'    'winter storm'
    2003-02-07 21:15     09:10:59:00       289.4    1.4294e+05    'SouthEast'    'winter storm'
```

If your table contains dates and times in a `datetime` array, you can easily convert it to a timetable using the `table2timetable` function. In this example, `table2timetable` converts the values in `OutageTime` to *row times*. Row times are time stamps that label the rows of a timetable.

TT = table2timetable(T); head(TT,3)

```
ans =

  3×5 timetable

       OutageTime        OutageDuration     Loss     Customers       Region           Cause
    ________________    ______________    ______    __________    ___________    ______________

    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    'SouthWest'    'winter storm'
    2003-01-23 00:49             NaN      530.14    2.1204e+05    'SouthEast'    'winter storm'
    2003-02-07 21:15     09:10:59:00       289.4    1.4294e+05    'SouthEast'    'winter storm'
```

When you display a timetable, it looks very similar to a table. One important difference is that a timetable has fewer variables than you might expect by glancing at the display. `TT` has five variables, not six. The vector of row times, `OutageTime`, is not considered a timetable variable, since its values label the rows. However, you can still access the row times using dot notation, as in `TT.OutageTime`. You can use the vector of row times as an input argument to a function. For example, you can use it as the *x*-axis of a plot.
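As a quick sketch of that last point, here is one way you might plot a variable against the row times (the axis labels here are my own, not from the running example):

```matlab
% Plot outage loss against the row times of the timetable
plot(TT.OutageTime, TT.Loss, '.')
xlabel('Outage Time')
ylabel('Loss')
```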

The row times of a timetable do not have to be ordered. If you want to be sure that the rows of a timetable are sorted by the row times, use the `sortrows` function.

TT = sortrows(TT); head(TT,3)

```
ans =

  3×5 timetable

       OutageTime        OutageDuration     Loss     Customers      Region           Cause
    ________________    ______________    ______    __________    ___________    ______________

    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    'SouthWest'    'winter storm'
    2002-03-05 17:53     04:20:48:00      96.563    2.8666e+05    'MidWest'      'wind'
    2002-03-16 06:18     02:17:05:00      186.44    2.1275e+05    'MidWest'      'severe storm'
```

Now I will show you why I converted `T` to a timetable. Starting in R2018b, you can plot the variables of a table or timetable in a *stacked plot*. In a stacked plot, the variables are plotted in separate *y*-axes, but using a common *x*-axis. And if you make a stacked plot from a timetable, the *x*-values are the row times.

To plot the variables of `TT`, use the `stackedplot` function. The function plots variables that can be plotted (such as numeric, `datetime`, and `categorical` arrays) and ignores variables that cannot be plotted. `stackedplot` also returns an object whose properties let you customize the stacked plot.

s = stackedplot(TT)

```
s = 
  StackedLineChart with properties:

         SourceTable: [1468×5 timetable]
    DisplayVariables: {'OutageDuration'  'Loss'  'Customers'}
               Color: [0 0.4470 0.7410]
           LineStyle: '-'
           LineWidth: 0.5000
              Marker: 'none'
          MarkerSize: 6

  Use GET to show all properties
```
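Because `stackedplot` returns the chart object, you can adjust its appearance after plotting. A small sketch using the properties shown above (the specific values are arbitrary choices for illustration):

```matlab
% Illustrative customization of the returned StackedLineChart object
s.LineStyle = ':';
s.Marker = 'o';
s.MarkerSize = 3;
```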

One thing you can tell right away from this plot is that there must be a few timetable rows with bad data. There is one point for a power outage that supposedly lasted for over 9,000 days (or 24 years), which would mean it ended some time in the 2040s.
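If you wanted to screen out such suspicious rows before analyzing further, one possible approach is a logical index on the duration. This is my own sketch, not part of the original analysis, and the 1000-day cutoff is an arbitrary choice:

```matlab
% Hypothetical cleanup: drop rows whose outage duration is implausibly long
tooLong = TT.OutageDuration > days(1000);
TTclean = TT(~tooLong,:);
```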

The `stackedplot` function ignored the `Region` and `Cause` variables, because these variables are cell arrays of character vectors. You might want to convert these variables to a different, and more useful, data type. While you can convert variables one at a time, there is now a more convenient way to convert all table variables of a specified data type.
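For comparison, converting each variable individually with dot notation might look like this sketch (shown as an alternative, not as a step in the running example):

```matlab
% Convert each cell array of character vectors, one at a time
TT.Region = categorical(TT.Region);
TT.Cause  = categorical(TT.Cause);
```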

Starting in R2018b, you can convert table variables in place using the `convertvars` function. For example, identify all the cell arrays of character vectors in `TT` (using `iscellstr`) and convert them to `categorical` arrays. Now `Region` and `Cause` contain discrete values assigned to categories. Categorical values are displayed without any quotation marks.

```
TT = convertvars(TT,@iscellstr,'categorical');
head(TT,3)
```

```
ans =

  3×5 timetable

       OutageTime        OutageDuration     Loss     Customers     Region        Cause
    ________________    ______________    ______    __________    _________    ____________

    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    SouthWest    winter storm
    2002-03-05 17:53     04:20:48:00      96.563    2.8666e+05    MidWest      wind
    2002-03-16 06:18     02:17:05:00      186.44    2.1275e+05    MidWest      severe storm
```

If your table or timetable has variables with values that belong to a finite set of discrete categories, then there are other interesting plots that you can make. Starting in R2017a, you can make a *heat map* of any two variables that contain discrete values using the `heatmap` function. For example, make a heat map of the `Region` and `Cause` variables to visualize where and why outages occur. Again, `heatmap` returns an object so you can customize the plot.

h = heatmap(TT,'Region','Cause')

```
h = 
  HeatmapChart (Count of Cause vs. Region) with properties:

      SourceTable: [1468×5 timetable]
        XVariable: 'Region'
        YVariable: 'Cause'
    ColorVariable: ''
      ColorMethod: 'count'

  Use GET to show all properties
```
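Using the returned object, you can tweak the chart afterward. A brief sketch (the title text is my own):

```matlab
% Illustrative customization of the HeatmapChart object
h.Title = 'Power Outages by Region and Cause';
h.ColorbarVisible = 'off';
```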

You also can make a pie chart of any `categorical` variable (as of R2014b), using the `pie` function. However, you cannot call `pie` on a table. So, to make a pie chart of the power outages by region, use dot notation to access the `Region` variable.

pie(TT.Region)

MATLAB also has other functions to reorganize variables in more complex ways, and to join tables. I won't show them all in action, but I will describe some of them briefly. All these functions work with both tables and timetables.

R2018a includes functions to:

- Reorient rows to become variables (`rows2vars`)
- Invert a nested table-in-table (`inner2outer`)
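As a quick sketch of the first of these (output not shown; with mixed-type variables like those in `T`, the reoriented variables become cell arrays):

```matlab
% Reorient the table: each original row becomes a variable
Tflipped = rows2vars(T);
head(Tflipped, 3)
```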

And from the original release of tables in R2013b, there are functions to:

- Join tables together (`join`, `innerjoin`, `outerjoin`)
- Stack table variables into taller variables, and vice versa (`stack`, `unstack`)

Let's table discussion of these new functions for now. But we are eager to hear about your reactions to them. Do they help you make more effective use of tables and timetables? Please let us know here.


Published with MATLAB® R2018b

I'd like to introduce this week's guest blogger Alan Weiss. Alan writes documentation for mathematical toolboxes here at MathWorks.

Hi, folks. This post is about a new solver in Global Optimization Toolbox, `surrogateopt`. A few of its key characteristics:

- It is best suited to optimizing expensive functions, meaning functions that are time-consuming to evaluate. Typically, you use `surrogateopt` to optimize a simulation or ODE or other expensive objective function.
- It is inherently a global solver. The longer you let it run, the more it searches for a global solution.
- It requires no start point, just problem bounds. Of course, you can specify start points if you like.

The solver is closest in concept to a Statistics and Machine Learning Toolbox solver, `bayesopt`, which also builds an internal surrogate model of an expensive objective function.

Let's try to minimize a simple function of two variables:

$$f(x) = \left\|\frac{x}{5}\right\|^2 - 4\,\operatorname{sech}\!\left(\|x - x_0\|^2\right) - 6\,\operatorname{sech}\!\left(\|x - x_1\|^2\right),$$

where $x_0 = (1,2)$ and $x_1 = (-3,-5)$.

```
x0 = [1,2];
x1 = [-3,-5];
fun = @(x)sum((x/5).^2,2)-4*sech(sum((x-x0).^2,2))-6*sech(sum((x-x1).^2,2));
[X,Y] = meshgrid(linspace(-10,10));
Z = fun([X(:),Y(:)]);
Z = reshape(Z,size(X));
surf(X,Y,Z)
view([53,3])
```

You can see that there are three local minima: the global minimum near [-3,-5], one that is nearly as good near [1,2], and a poor one near [0,0]. You can see this because MATLAB® evaluated 10,000 points for the plot. The value of the function is cheap to compute at each of those points, but evaluating an expensive objective 10,000 times would be impractical. To simulate an expensive function, the helper `slowfun` computes the same objective and then pauses for half a second on every evaluation.

To search for a global minimum, set finite bounds and call `surrogateopt`.

```
type slowfun
```

```
function y = slowfun(x,x0,x1)
y = sum((x/5).^2,2)-4*sech(sum((x-x0).^2,2))-6*sech(sum((x-x1).^2,2));
pause(0.5)
```

Set bounds and run the optimization of `slowfun`, timing the computation.

```
fun = @(x)slowfun(x,x0,x1);
lb = [-10,-10];
ub = -lb;
rng default % for reproducibility
tic
[xsol,fval] = surrogateopt(fun,lb,ub)
```

Surrogateopt stopped because it exceeded the function evaluation limit set by 'options.MaxFunctionEvaluations'.

```
xsol = 1×2

    0.8779    1.7580
```

fval = -3.8348

toc

Elapsed time is 114.326822 seconds.

To try to find a better solution, run the solver again. To run faster, set the `UseParallel` option to `true`.

```
options = optimoptions('surrogateopt','UseParallel',true);
tic
[xsol,fval] = surrogateopt(fun,lb,ub,options)
```

Surrogateopt stopped because it exceeded the function evaluation limit set by 'options.MaxFunctionEvaluations'.

```
xsol = 1×2

   -2.8278   -4.7159
```

fval = -4.7542

toc

Elapsed time is 20.130099 seconds.

This time, `surrogateopt` found a point near the global minimum, with a lower objective function value. And thanks to the parallel function evaluations, the run took a fraction of the time.

The `surrogateopt` algorithm alternates between two phases.

**Construct Surrogate**. In this phase, the algorithm takes a fixed number of quasirandom points within the problem bounds and evaluates the objective function at those points. It then interpolates and extrapolates these points to a smooth function called the surrogate in the entire bounded region. In subsequent phases, the surrogate interpolates the evaluated values at all the quasirandom points where the objective has been evaluated. The interpolation function is a radial basis function, because these functions are computationally inexpensive to create and evaluate. Radial basis functions also make it inexpensive to add new points to the surrogate.
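To make the surrogate idea concrete, here is a toy sketch of fitting a radial basis function interpolant through a handful of evaluated points. This is my own illustration, not `surrogateopt`'s actual implementation, which adds safeguards beyond this bare-bones version:

```matlab
% Toy RBF surrogate: interpolate objective values with phi(r) = r^3
x0 = [1,2];  x1 = [-3,-5];
obj = @(x)sum((x/5).^2,2)-4*sech(sum((x-x0).^2,2))-6*sech(sum((x-x1).^2,2));
n = 30;
Xs = 20*rand(n,2) - 10;                  % sample points in the bounds [-10,10]^2
ys = obj(Xs);                            % objective values at the samples
dist = @(A,B) sqrt(max(sum(A.^2,2) + sum(B.^2,2).' - 2*(A*B.'), 0));
w = (dist(Xs,Xs).^3) \ ys;               % solve for the RBF weights
surrogate = @(q) (dist(q,Xs).^3) * w;    % cheap-to-evaluate approximation
```

Once fitted, `surrogate` can be evaluated at thousands of candidate points for roughly the cost of one matrix multiply, which is the whole point of the approach.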

**Search for Minimum**. This phase is more complicated than the **Construct Surrogate** phase, and is detailed below.

In theory, there are two main considerations in searching for a minimum of the objective: refining an existing point to get a better value of the objective function, and searching in places that have not yet been evaluated in hopes of finding a better global minimum.

After constructing a merit function that balances these two considerations, the solver evaluates it at many sample points near the *incumbent*, meaning the point that has the lowest objective function value among points evaluated since the start of the most recent surrogate construction phase. These sample points are distributed according to a multidimensional normal distribution, centered on the incumbent and truncated to stay within bounds, with standard deviation in each coordinate proportional to the difference between those coordinate bounds. The standard deviations are multiplied by a scale `sigma` that the algorithm adjusts as the search proceeds.
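A schematic of this sampling step might look like the following sketch (clamping to the bounds stands in for proper truncation, and all the numbers are illustrative):

```matlab
% Sample candidate points around the incumbent, scaled to the bounds
incumbent = [0.9, 1.8];  lb = [-10,-10];  ub = [10,10];  sigma = 0.2;
sd = sigma*(ub - lb);                          % per-coordinate standard deviation
samples = incumbent + sd.*randn(1000,2);       % normal, centered on the incumbent
samples = min(max(samples, lb), ub);           % keep samples within the bounds
```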

After evaluating the merit function on the sample points, the solver chooses the point with the lowest merit function value, and evaluates the objective function at that point. This point is called an **adaptive point**. The algorithm updates the surrogate (interpolation) to include this point.

- If the objective function value at the adaptive point is sufficiently lower than the value at the incumbent, then the adaptive point becomes the incumbent. This is termed a successful search.
- If the objective function value at the adaptive point is not sufficiently lower than the value at the incumbent, then the search is termed unsuccessful.

The scale `sigma` changes based on the record of successful and unsuccessful searches.

- If there have been three successful searches since the last scale change, then the scale is doubled: `sigma` becomes `2*sigma`.
- If there have been `max(5,nvar)` unsuccessful searches since the last scale change, where `nvar` is the number of problem dimensions, then the scale is halved: `sigma` becomes `sigma/2`.
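In code form, the update rule just described is roughly this sketch (my own pseudocode-style MATLAB, not the toolbox source; the counter values are examples):

```matlab
% Sketch of the sigma update rule
sigma = 0.2;  nvar = 2;
numSuccess = 3;  numFail = 0;                  % example counters
if numSuccess >= 3
    sigma = 2*sigma;   numSuccess = 0;         % double after 3 successes
elseif numFail >= max(5, nvar)
    sigma = sigma/2;   numFail = 0;            % halve after max(5,nvar) failures
end
```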

Generally, unsuccessful searches become more common as the search closes in on a minimum, so the scale tends to shrink and the sample points concentrate near the incumbent. When the scale becomes small enough, the solver returns to the **Construct Surrogate** phase; this event is called a *surrogate reset*.

For details, see the Surrogate Optimization Algorithm section in the documentation.

You can optionally use the built-in `surrogateoptplot` plot function to monitor the solver's progress.

Run the same problem as before, this time using the `surrogateoptplot` plot function.

```
options.PlotFcn = 'surrogateoptplot';
options.UseParallel = false;
rng(4) % For reproducibility
[xsol,fval] = surrogateopt(fun,lb,ub,options)
```

Surrogateopt stopped because it exceeded the function evaluation limit set by 'options.MaxFunctionEvaluations'.

```
xsol = 1×2

   -2.7998   -4.7109
```

fval = -4.7532

Here is how to interpret the plot. Starting from the left, follow the green line, which represents the best function value found. The solver reaches the vicinity of the second-best point at around evaluation number 30. Around evaluation number 70, there is a surrogate reset. After this, follow the dark blue incumbent line, which represents the best objective function found since the previous surrogate reset. This line approaches the second-best point after evaluation number 100. There is another surrogate reset before evaluation number 140. Just before evaluation number 160, the solver reaches the vicinity of the global optimum. The solver continues to refine this solution until the end of the evaluations, number 200. Notice that after evaluation number 190, all of the adaptive samples are at or near the global optimum, showing the shrinking scale. There is a similar discussion of interpreting the `surrogateoptplot` display in the documentation.

I hope that this brief description gives you some understanding of the `surrogateopt` solver and how its algorithm works.

Did you enjoy seeing how to use surrogate optimization? Do you have problems that might be addressed by this new solver? Tell us about it here.
