
I have been a little bit slow in announcing some of the coolest new features and products in the most recent release, R2019a. Why should you even care? What does Stateflow do? It allows you to show the logic behind the work you are doing. You may say, "I can already do that in MATLAB with if/else statements, switch/case statements, etc." And you can. But as you add extra conditions, the nesting of the code creates a lot of mental overhead, at least for my ability to fully comprehend it. With state charts, you can encapsulate that extra conditional behavior with much less clutter and far more clarity.

You can get a quick idea from this picture, which shows how to design the system for a lamp with the option of blinking the light at different rates.
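To make the contrast concrete, here is a rough sketch (my own example, with hypothetical state names, not the chart from the picture) of how lamp logic like this tends to look when hand-rolled with switch/case code, which is exactly the bookkeeping a state chart absorbs for you:

```matlab
% Hypothetical hand-rolled state logic for a lamp.
% Each button press cycles: off -> on -> blinkSlow -> blinkFast -> off
function state = lampStep(state, buttonPressed)
    if buttonPressed
        switch state
            case 'off',       state = 'on';
            case 'on',        state = 'blinkSlow';
            case 'blinkSlow', state = 'blinkFast';
            case 'blinkFast', state = 'off';
            otherwise,        state = 'off';   % recover from a bad state
        end
    end
end
```

Every new mode means touching every branch of code like this; a chart keeps each state and transition visually separate.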

Guy beat me to the punch announcing Stateflow charts for MATLAB. Rather than replicate his wonderful post, I encourage you to read it. I also encourage you to watch a short video which, I hope, will give you some ideas.

One of the main MATLAB applications where we see state charts helping is the logic that controls Apps. I am quite sure there are others. Do you see a place in your workflow where a state chart could help? Let me know here.


Published with MATLAB® R2019a


I'm pleased to announce the introduction of a new product today, the Performance Review Toolbox. It's so new, there isn't even a web page or marketing material for it.

Do you need to write an annual self-review? We do at MathWorks. Needless to say, if you've been doing a similar job for the past few years, it can be challenging to figure out what's new to say. MATLAB can help you, using Text Analytics Toolbox along with machine and deep learning from Statistics and Machine Learning Toolbox and Deep Learning Toolbox. Simply make available the proper document directories from your work in the previous year, as well as relevant threads in email and internal 'groups', source code control, bug/enhancement reporting systems, etc. With the Performance Review Toolbox, you can use MATLAB to mine all of these sources for good information and do affinity analysis to see what the big themes have been. After generating the outline for your document, you can then modify it to your liking. With the new outline in hand, you then use MATLAB to generate a draft of your self-review.

Another part of the review process is providing peer review comments. In a similar way to writing your review, the Performance Review Toolbox can assist you by again searching the proper sources for interactions you've had in the past year working with the individual for whom you need to write a peer quote. Using sentiment analysis tools, you similarly generate an outline with the top plusses and areas for improvement. After modifying as necessary, generate the quote!

If you are a manager, I am sorry: version 1 of the Performance Review Toolbox does not automatically generate reviews of your employees for you. It's a much-requested feature.

Early reviews of this new product are promising.

Intrigued? Please first note the post's date, then let us know here.


Published with MATLAB® R2019a


Ever need to create a vector of dates using some sort of pattern? Perhaps these will be used as the `edges` argument for a histogram, with each bin covering a month.

What's the best way to create a `datetime` array where each element is the first of the month? And what does "best" even mean: fewest keystrokes, fewest function calls, most readable, most flexible, most maintainable, etc.? Suppose I want to produce something like this:

2018-Jan-01, 2018-Feb-01, 2018-Mar-01, ...

In fact, there are lots of suitable ways. Here are a few.

Just use `datetime` and specify a vector of months. This works:

```
mydates1 = datetime(2018,1:12,1)
```

```
mydates1 = 
  1×12 datetime array
Columns 1 through 5
   01-Jan-2018   01-Feb-2018   01-Mar-2018   01-Apr-2018   01-May-2018
Columns 6 through 10
   01-Jun-2018   01-Jul-2018   01-Aug-2018   01-Sep-2018   01-Oct-2018
Columns 11 through 12
   01-Nov-2018   01-Dec-2018
```

While I'm at it, I can find the number of days in each month I have, using `eomday`.

```
numDays = eomday(year(mydates1),month(mydates1));
bar(mydates1,numDays)
```

If the months in question span years, you can do this fairly compactly using `calmonths`.

```
mydates2 = datetime(2017,1,1):calmonths(1):datetime(2018,7,1)
```

```
mydates2 = 
  1×19 datetime array
Columns 1 through 5
   01-Jan-2017   01-Feb-2017   01-Mar-2017   01-Apr-2017   01-May-2017
Columns 6 through 10
   01-Jun-2017   01-Jul-2017   01-Aug-2017   01-Sep-2017   01-Oct-2017
Columns 11 through 15
   01-Nov-2017   01-Dec-2017   01-Jan-2018   01-Feb-2018   01-Mar-2018
Columns 16 through 19
   01-Apr-2018   01-May-2018   01-Jun-2018   01-Jul-2018
```

In addition, you can go beyond 12 months and `datetime` still works as expected, without needing `calmonths`.

```
mydates3 = datetime(2018,1:24,1);
```

or

```
mydates4 = datetime(2017,7:18,1)
```

```
mydates4 = 
  1×12 datetime array
Columns 1 through 5
   01-Jul-2017   01-Aug-2017   01-Sep-2017   01-Oct-2017   01-Nov-2017
Columns 6 through 10
   01-Dec-2017   01-Jan-2018   01-Feb-2018   01-Mar-2018   01-Apr-2018
Columns 11 through 12
   01-May-2018   01-Jun-2018
```

However, keeping track of the relative month offsets from the start has its own mental overhead for me.

Try this instead. Find the right start date, and then add on the correct number of `calmonths` from there.

```
mydates5 = datetime(2017,7,1) + calmonths(0:11)
```

```
mydates5 = 
  1×12 datetime array
Columns 1 through 5
   01-Jul-2017   01-Aug-2017   01-Sep-2017   01-Oct-2017   01-Nov-2017
Columns 6 through 10
   01-Dec-2017   01-Jan-2018   01-Feb-2018   01-Mar-2018   01-Apr-2018
Columns 11 through 12
   01-May-2018   01-Jun-2018
```
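One more option worth knowing about, if you already have arbitrary dates in hand rather than a pattern to generate: `dateshift` can snap each date to the start of its month. This is my own sketch, not from the original list above.

```matlab
% Some arbitrary dates scattered through the year
d = datetime(2017,7,15) + caldays(0:40:400);
% Snap each one to the first of its month
mydates6 = dateshift(d,'start','month')
```

This is handy when the dates come from data rather than from a formula.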

Do you have another way you like to produce lists of dates? What's your preference for these sorts of situations? Let me know here.


Published with MATLAB® R2018b


Of course the data we collect is always perfect - NOT! Maybe yours is different. What can go wrong? So many things. Instruments drift, web sites go down, power goes out, ... So what can you do if you have gaps in your data, and the analysis you want to perform won't tolerate that condition? You might decide you need to fill in missing values.

We've been working for a long time on supplying functionality that makes dealing with missing data easier, starting with `NaN` values at the very beginning. In a floating-point array, NaNs act as placeholders. That's great, but what can you do from there?

Some functions, or variants of them, work differently if your array contains any `NaN` values, e.g., `mean`.
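As a quick illustration (my own sketch, not from the post), `mean` propagates `NaN` by default, but you can ask it to ignore missing values:

```matlab
v = [1 2 NaN 4];
mean(v)             % NaN - a single missing value poisons the result
mean(v,'omitnan')   % 2.3333 - the mean of the observed values only
```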

We first helped you figure out if you have missing values. And later added the ability to fill and remove missing values. More recently, we added the ability to mark missing values, even if you don't know the datatype of the array. This makes it easier to supply `NaN`, `NaT` (not a time) values, and similarly for `categorical` and `string` arrays, without needing to know which one is appropriate - as may happen with different columns in a `table`.
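Here is a minimal sketch of that workflow, using my own toy data; the functions in question are `ismissing`, `fillmissing`, and `standardizeMissing`:

```matlab
x = [1 NaN 3 NaN 5];
ismissing(x)                 % locate the gaps: [0 1 0 1 0]
fillmissing(x,'linear')      % fill by interpolation: [1 2 3 4 5]

% Mark a sentinel value as missing without hard-coding NaN vs. NaT, etc.
y = standardizeMissing([1 -99 3],-99)   % [1 NaN 3]
```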

Do you use the functionality to deal with missing values? If so, tell us how. If not, please tell us what is missing!

You can let us know here.


Published with MATLAB® R2018b


Today's guest blogger is Josh Meyer, a Technical Writer for the MATLAB Math and Big Data teams. He is going to discuss a common issue encountered in scattered data interpolation, and how to fix it!

A common issue that can arise while performing interpolation on scattered data is that the resulting functional surface seems to be of lower quality than you would expect; for example, the surface may not seem to pass through all of the sample data points. This post explores how and why the scaling of scattered data can affect the interpolation results.

Let's jump right into an example so you can see what I'm talking about. Imagine you just collected data from sensors to sample some values at several points. Ultimately you want to fit a surface to the data so that you can approximate the value of the underlying function at points where you don't have data. So, you start by plotting.

```
rng default
x = rand(500,1)/100;
y = 2.*(rand(500,1)-0.5).*90;
vals = (x.*100).^2;
ptColor = [.6 .07 .07];
plot3(x,y,vals,'.','Color',ptColor)
grid
xlabel('x')
ylabel('y')
zlabel('v(x,y)')
title('Scattered Data Points')
```

Next, you use `scatteredInterpolant` to create an interpolant for the data. This computes an interpolating function for the observed points, allowing you to query the function anywhere within its convex hull. You create a grid of query points, evaluate the interpolant at those points, and plot the functional surface. All done!

```
F = scatteredInterpolant(x,y,vals);
X = linspace(min(x),max(x),25);
Y = linspace(min(y),max(y),25);
[xq, yq] = meshgrid(X,Y);
zq = F(xq,yq);
hold on
triColor = [0.68 0.88 0.95];
surf(xq,yq,zq,'FaceColor',triColor)
hold off
xlabel('x')
ylabel('y')
zlabel('v(x,y)')
title('Scattered Data with Interpolated Surface')
```

... but wait, the result is definitely not what you expected! What are all of those "folds" in the surface? And why does it look like the surface doesn't pass through all the points? Interpolated surfaces should pass through *all* of the data points!

Let's take a quick detour to talk about scattered data interpolation before coming back to this problematic data.

Unlike gridded interpolation where the point locations are well-defined, scattered data presents different challenges. To find the interpolated value at a given query point, you need to use the values of nearby points. But how can you do that when the data is scattered around? The location of a point can't be used to predict the location of another point. And comparing the locations of all the points ad nauseam to determine which are close to a given query point is not a very efficient approach.

```
X = [-1.5 3.2; 1.8 3.3; -3.7 1.5;
-1.5 1.3; 0.8 1.2; 3.3 1.5;
-4.0 -1.0; -2.3 -0.7; 0 -0.5;
2.0 -1.5; 3.7 -0.8; -3.5 -2.9;
-0.9 -3.9; 2.0 -3.5; 3.5 -2.25];
V = X(:,1).^2 + X(:,2).^2;
plot(X(:,1),X(:,2),'r*')
```

To solve this, MATLAB first computes a Delaunay triangulation of the scattered data. This creates a series of triangles out of the data points, such that the circumscribed circles created by the vertices of each triangle do not enclose any points. The computed Delaunay triangulation is unique, up to trivial symmetries. And the best part is, the triangulation can be easily queried to determine what points are closest to a given query point.

This plot shows the Delaunay triangulation and the circumscribed circles for the scattered data. Notice that the red data points lie on the boundaries of one or more circumscribed circles, but none of them lie in the interior of one of the circles.

```
tr = delaunayTriangulation(X(:,1),X(:,2));
[C,r] = circumcenter(tr);
a = C(:,1);
b = C(:,2);
pos = [a-r, b-r, 2*r, 2*r];
hold on
triplot(tr)
for k = 1:length(r)
    rectangle('Position',pos(k,:),'Curvature',[1 1],'EdgeColor',triColor)
end
xlabel('x')
ylabel('y')
title('Delaunay Triangulation of Scattered Data')
hold off
```

Armed with the triangulation of the data, finding the value of the interpolated surface at a given query point becomes a problem of querying the triangulation structure to determine which triangle encloses the query point. Then, the data points comprising the vertices of that triangle can be used to calculate the value of the interpolated surface at the query point, depending on which interpolation method is being used (nearest-neighbor, linear, etc...).

This plot shows a query point in a 2-D triangulation. To find the interpolated value at this query point, MATLAB uses the vertices of the enclosing triangle. By repeating this calculation at many different query points, a functional surface for the data forms. This functional surface allows you to make predictions at points where you did not collect data.

```
triplot(tr)
hold on
plot(X(:,1),X(:,2),'r*')
trisurf(tr.ConnectivityList,X(:,1),X(:,2),V,...
    'FaceColor',triColor, ...
    'FaceAlpha',0.9)
axis([-4, 4, -4, 4, 0, 25]);
grid on
plot3(-2.6,-2.6,0,'*b','LineWidth',1.6)
plot3([-2.6 -2.6]',[-2.6 -2.6]',[0 13.52]','-b','LineWidth',1.6)
hold off
view(322.5, 30);
text(-2.0, -2.6,'Xq',...
    'FontWeight','bold', ...
    'HorizontalAlignment','center',...
    'BackgroundColor','none');
xlabel('x')
ylabel('y')
zlabel('v(x,y)')
title('Interpolation of Query Point')
```

Now that we know `scatteredInterpolant` uses a Delaunay triangulation of the data to perform its calculations, let's examine the underlying triangulation of the data we are having trouble interpolating.

```
tri = delaunayTriangulation(x,y);
triplot(tri)
hold on
plot(x,y,'r*')
hold off
xlabel('x')
ylabel('y')
title('Triangulation of Original Data')
```

What a mess! There are only a few small pockets of reasonable triangles, with most of them being long and thin, connecting points that are far away from each other. It looks like the triangulation *does* pass through all of the data points, while the poor interpolated surface we saw earlier does not. We would expect that interpolated surface to look better.

Based on this triangulation, MATLAB has the unenviable job of determining which of those long, thin triangles each query point lies in. And this brings us to ...

The poor interpolation results are caused by the very different scales of the data (the x-axis ranges from [0, 0.01] and the y-axis from [-100, 100]) and the resulting long, thin triangles in the triangulation. After MATLAB classifies the query points into their corresponding triangles, it then uses the triangle vertices to find the value of the interpolated surface. Since the long, thin triangles have vertices that are far away from each other, this results in the local values of the interpolated surface being based on the values of far away points, rather than neighboring points.

Here is a GIF showing what happens to a triangulation as the *x*-data is scaled, while the *y*-data is constant. What starts out as a well-defined triangulation quickly becomes full of many long, thin triangles.

The data is effectively being shoved into a smaller and smaller area along the x-axis. If the axes are kept equal, the above animation looks like this.

Clearly, the triangulation underpinning the scattered interpolation calculation is sensitive to distortions in the data. The algorithm expects the *x,y*-coordinates to be reasonably scaled relative to each other. But, why?

This behavior stems from the use of circumscribed circles in the Delaunay triangulation algorithm. When you squeeze the data in one direction, you are effectively altering the radii of the circumscribed circles. Taken to the extreme, as the points get squeezed together and become a straight line in one dimension, the radii of the circumscribed circles go to infinity, since the locus of points equidistant from the center of the circle becomes a straight line. The distribution of triangles is increasingly due to the proximity of the points in the *y* direction, which is why points that have close *y* values but very different *x* values become the triangle vertices.

If we add the circumscribed circles to the above animation, you can see that the radius of the circles blows up as the data gets squeezed.

The solution to this issue is typically to remove the distortions in the data by normalizing. Normalization can improve the interpolation results in some cases when the triangulation has many thin triangles, but in others it can compromise the accuracy of the solution. Whether to use normalization is a judgment made based on the nature of the data being interpolated.

**Benefits**: Normalizing your data can potentially improve the interpolation result when the independent variables have different units and substantially different scales. In this case, scaling the inputs to have similar magnitudes might improve the numerical aspects of the interpolation. An example where normalization would be beneficial is if `x` represents engine speed in RPM from 500 to 3500, and `y` represents engine load from 0 to 1. The scales of `x` and `y` differ by a few orders of magnitude, and they have different units.

**Cautions**: Use caution when normalizing your data if the independent variables have the same units, even if the scales of the variables are different. With data of the same units, normalization distorts the solution by adding a directional bias, which affects the underlying triangulation and ultimately compromises the accuracy of the interpolation. An example where normalization is erroneous is if both `x` and `y` represent locations and have units of meters. Scaling `x` and `y` unequally is not recommended because 10 m due East should be spatially the same as 10 m due North.

For our problem, let's assume `x` and `y` have different units and normalize them so that they have similar magnitudes. You can use the relatively new `normalize` function to do this; by default it uses Z-scores of the data. This transforms the data so that the original mean $\mu$ becomes 0, and the original standard deviation $\sigma$ becomes 1:

$$x' = \frac{\left(x - \mu\right)}{\sigma}.$$

This normalization is very common and is also called *standardization*.

```
x = normalize(x);
y = normalize(y);
```

Now that the data is normalized, let's take a look at the triangulation.

```
tri = delaunayTriangulation(x,y);
triplot(tri)
hold on
plot(x,y,'r*')
hold off
xlabel('x')
ylabel('y')
title('Triangulation of Normalized Data')
```

This is much better! The triangles are mostly well defined. There are only a few pockets of long, thin triangles near the edges. Let's take a look at the scattered interpolation surface now.

```
X = linspace(min(x),max(x),25);
Y = linspace(min(y),max(y),25);
[xq, yq] = meshgrid(X,Y);
F = scatteredInterpolant(x,y,vals);
zq = F(xq,yq);
plot3(x,y,vals,'.','Color',ptColor)
hold on
surf(xq,yq,zq,'FaceColor',triColor)
hold off
grid on
xlabel('x')
ylabel('y')
zlabel('v(x,y)')
title('Normalized Data with Interpolated Surface')
```

In this case, normalizing the sample points permits `scatteredInterpolant` to compute a better triangulation, and thus a better interpolated surface.

Normalization is just one form of data scaling. Z-score normalization shifts the data so that its mean becomes 0 and scales it so that its standard deviation becomes 1, without changing the shape of the distribution.

You can also `rescale` the data, changing the extent of its largest and smallest values (for example, making 0 the smallest value and 1 the largest value). Rescaling also preserves the shape of the distribution, but it squeezes or expands the data along the number line, changing both the mean and the spread.
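For instance (my own sketch, not from the post), `rescale` maps data into [0, 1] by default, or into any interval you specify:

```matlab
x = [10 20 50 100];
rescale(x)        % [0 0.1111 0.4444 1.0000] - mapped to [0, 1]
rescale(x,-1,1)   % the same shape, mapped to [-1, 1]
```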

Data scaling is important to pay attention to. If the data is not properly scaled, it can hide relationships in the data or distort their strength. The results of some algorithms can vary a lot depending on the exact scaling used. We just saw how the triangulation of scattered data is sensitive to scaling, but it can cause problems in other algorithms as well. For example, machine learning algorithms typically depend on the Euclidean distance between points, so they are sensitive to the scaling of the data. Without proper scaling, some features will have too large a contribution compared to others, skewing the results.

But this is a broad topic best covered in another post!

Do you use scattered data interpolation regularly in your work? If so, tell us about it in the comments!

I'd like to thank Cosmin Ionita, Damian Sheehy, and Konstantin Kovalev for their valuable insights while I was writing this post.


Published with MATLAB® R2018b


#shelovesmatlab

Today's guest blogger is Penny Anderson, Senior Engineer Manager, MATLAB Products. And she is going to answer the big question today, WHY?

- A Suggestion for WHY
- What Does The WHY Function Do?
- The Names in WHY
- Adding Female Names to WHY
- New WHY
- Paula Loves MATLAB
- More About Paula Burgi
- Can you tell us how you came to be in this field?
- How did you decide to become a researcher?
- Who are some of the women in your personal and professional life who have influenced you the most? What was their contribution to your growth?
- How do MATLAB and other MathWorks tools help you in your work?
- Why Do You Love MATLAB?

On July 12th, 2012, a suggestion came in through our MATLAB Central forum:

*The "why" function on MATLAB assumes the gender of users.* *In fact, both men and women use MATLAB.* *For example, the proper noun section* *contains solely names generally reserved for men.* *It would be appreciated if this were remedied.*

This comment came from MATLAB user Paula Burgi, then an undergraduate student of Geoscience at Smith College. When we asked her about it more recently, Paula recalled:

*It was actually during my first summer doing scientific research during college. I was learning how to use Matlab, and came across the “why” command while trying to teach myself some Matlab basics. At the time I was under the impression that coding languages were only very serious and practical tools, no humor allowed. So it was a fun surprise to see that Matlab had these built-in commands intended just to give the user a laugh. However, since I was attending Smith College (an all-women’s institution), it did not escape me that almost all the names that were implemented in the “why” command were male. It was a sad reminder that the rest of the world was not filled with female scientists and computer programmers, contrary to my experience at Smith. Having both male and female names seemed like a simple but effective measure to make science feel just a little bit more inclusive, so I decided that I would send an email to Matlab making this suggestion.*

Four days later MathWorks Employee #1 Loren Shure turned the comment into an enhancement request and it landed on my desk. At the time, I was managing a couple of groups of software developers – all of them men – working on MATLAB. Because of my management responsibilities, I didn’t get to do much fun coding work anymore. So, when this small but intriguing project caught my attention, I jumped on it.

I took a look at the `why` function, which is really a simple demo that builds up arbitrarily long sentences leveraging a random number generator and a fixed set of nouns, adjectives, verbs, proper nouns and so on. Here are a couple fun examples:

rng(27); why

A good kid obeyed a very young and terrified mathematician.

rng(111); why

Loren suggested it.

Sure enough, the proper names used to construct the sentences included:

- Cleve for Cleve Moler, the original author of MATLAB and co-founder of MathWorks
- Jack for Jack Little, our President and CEO, founder of MathWorks
- Bill for Bill McKeeman, a compiler expert and the author of the why function
- Joe for Joe Hicklin, our chief scientist and all-around idea instigator
- Pete, for G.W. “Pete” Stewart, a numerical analyst and long-time colleague of Cleve Moler
- Loren for Loren Shure, MathWorks Employee #1, actually a woman, but with a more conventional male spelling to her name
- Damian for Damian Packer, one of the original authors of Handle Graphics in MATLAB
- And Barney for the purple dinosaur?

“Hmm”, I thought. “She’s right. There are no obvious female names. I can fix that!”

I added my own name – Penny. After 16 years of being in software development and management at the very core of the math capabilities in MATLAB, I was one of the most senior women in Engineering.

Even though I was the only woman in my group at that time, I was actually unusual for the software development industry in that I had already had three female managers in my years at MathWorks, including my original hiring manager Loren Shure. Since Loren’s name was already on the list, I added the other two names – Mary Ann and Nausheen – as well.

Finally, I added the name – Mara – of another long-time colleague who was also a manager of a software development group at MathWorks.

Soon I was ready to submit the change to our continuous integration system. After passing all tests, the new improved version of the why demo went out to MATLAB users world-wide with Release R2013a.

And now Paula and all MATLAB users may be amused by reasons like:

rng(5); why

Because Mara wanted it.

rng(13); why

Mary Ann wanted it that way.

rng(44); why

Nausheen told me to.

rng(75); why

Penny knew it was a good idea.

Over the years I thought of Paula’s request from time to time. Finally, in 2018 as our grassroots group of Women at MathWorks was spreading **#shelovesmatlab** and we were getting ready to attend the Grace Hopper Celebration of Women in Computing, I reached out to Paula by email, after tracking her down to her new role as a PhD student at Cornell. To my pleasant surprise she responded enthusiastically.

*What a nice email to get! I'm surprised and delighted to hear that my* *comments years ago are still being considered today. I tried it out for* *myself, and sure enough I got Mary Ann, Penny, and even Mara!*

*As a graduate student in geophysics, I still use MATLAB every day! It's* *definitely my preferred data analysis software.*

We took the opportunity to interview Paula about her work and her love of MATLAB.

Paula is currently a PhD student at Cornell University, studying geophysics and remote sensing. She received a B.A. at Smith College in Geosciences and Astronomy, then went on to work as a research assistant at the Earth Observatory of Singapore. There, she helped with active seismic imaging of the Main Frontal Thrust in Nepal, and conducted research on shallow megathrust fault geometry in Bangladesh. She now studies how remote sensing techniques can be improved to study earth deformation.

Installing GPS stations to track fault movement in Sagaing, Myanmar.

I’ve always been interested in earth science, especially earthquakes. I grew up in Oak Park, IL, but traveled to Switzerland often during my childhood because my father is Swiss. The stark contrast between the very flat Midwest and the dramatic topography of the Swiss Alps had me pondering about geological processes at a young age. When I learned that researching the earth was a job, I never looked back! I ended up studying tectonics, particularly earthquakes. I have done both geological fieldwork to look at effects of past earthquakes, as well as geophysical work using geodetic data to track current deformation of the earth’s crust. Now, I use satellite radar data (InSAR) to observe deformation due to earthquakes, volcanoes, and many other sources. I am currently assessing how vegetation can affect these measurements (figure below).

Example of satellite radar data (InSAR), showing the signal introduced by vegetation loss in Oregon (rectangular features).

I decided in high school that I wanted to pursue science. I loved my high school physics class; the teacher really wanted students to conceptualize what we were learning in class so as to apply it to everyday life. I also took geology and astronomy (both of which ended up being my majors in college), both of which were very self-guided classes, much like real research! What finally cemented my desire to become a researcher was a summer research internship I did in college. I used GPS data to deduce information about the subduction zone offshore Chile. This internship was my first real experience with research (and Matlab!), and I absolutely loved the everyday challenges of taking a large-scale research question, breaking it down, and posing it in terms of a script.

As cliche as it may sound, my mother had a huge impact on my personal growth as a scientist. She is a doctor, graduating from medical school when only ~25% of her class was women. Her curiosity about the natural world definitely rubbed off on me, like when she would get out a microscope to enthusiastically look at the different kinds of mold that would grow on expired food items. She led by example when it came to pursuing things that you are passionate about, particularly when most people doubt you’ll be able to achieve it. I’ve had two excellent female academic advisors, one at the Earth Observatory of Singapore and the other during my current PhD research. Both of these women have shown me that it is very much possible to have a balanced work-home life, and that it does not make you any “less” of a scientist.

My undergraduate thesis looked at GPS and geologic data before and after a large earthquake in Costa Rica. I used this data to model where the earthquake occurred, and to gain insight into how our modeling techniques can bias our output. My main question was: How do different fault models affect the earthquake location and magnitude? Determining earthquake location and magnitude using surface displacement data is a classic inverse problem. I used Matlab’s linear inverse functions to determine the earthquake parameters. I also used Matlab’s mapping toolbox to visualize the data and models. Finally, I used Matlab’s parallel computing toolbox, which sped up calculations by a lot! I found that the different fault models resulted in up to 30% more energy released by an earthquake. The most realistic fault model resulted in a slip distribution that appeared to be controlled by small variations in the surface of the fault. I learned all I know to this point about computational geophysics from Matlab. I still use Matlab almost every day, and continue to grow as a scientist because of it!

Fault slip from the 2012 Costa Rica Earthquake, derived from GPS data using Matlab’s built-in matrix inversion functions.

Please share this post with the #shelovesmatlab hashtag. Know an amazing woman who loves our tools you’d like us to blog about? Let us know here.


Published with MATLAB® R2018b

"All the noise, noise, noise, NOISE!"

-- The Grinch

Today's guest blogger is Tom Lane. Tom has been a MathWorks developer since 1999, working primarily on the Statistics and Machine Learning Toolbox. He'd like to share with you a couple of issues that MATLAB users repeatedly encounter.... read more >>

]]>"All the noise, noise, noise, NOISE!"

-- The Grinch

Today's guest blogger is Tom Lane. Tom has been a MathWorks developer since 1999, working primarily on the Statistics and Machine Learning Toolbox. He'd like to share with you a couple of issues that MATLAB users repeatedly encounter.

The topic for today is curve fitting. Let's look at a simple exponential function:

```
rng default
x = rand(10,1);
y = 10*exp(-5*x);
```

We can plot this, but many of the values are smooshed up against the `X` axis. The `semilogy` function can help with that, and it also turns the relationship into a straight line.

```
subplot(1,2,1); plot(x,y,'x');
subplot(1,2,2); semilogy(x,y,'x');   % log(y) = -5*x + log(10)
```

Suppose we have the `X` and `Y` values, and we can see or guess the functional form, but we don't know the constant values 10 and 5. We can estimate them from the data. We can do this using either the original data shown at the left, or the `log(Y)` transformed data shown at the right.

```
p1 = fitnlm(x,y,'y ~ b1*exp(b2*x)',[1 1])
p2 = polyfit(x,log(y),1); p2(2) = exp(p2(2))
```

```
p1 = 

Nonlinear regression model:
    y ~ b1*exp(b2*x)

Estimated Coefficients:
          Estimate    SE    tStat    pValue
          ________    __    _____    ______
    b1       10       0      Inf       0  
    b2       -5       0     -Inf       0  

Number of observations: 10, Error degrees of freedom: 8
Root Mean Squared Error: 0
R-Squared: 1,  Adjusted R-Squared 1
F-statistic vs. zero model: Inf, p-value = 0

p2 =
   -5.0000   10.0000
```

Both fits give the same coefficients.

Some time ago, a MATLAB user reported that he was fitting this curve to his own data, and getting different parameter estimates from the ones given by other software. They weren't dramatically different, but larger than could be attributed to rounding error. They were different enough to raise suspicion. If we add a little noise to `log(y)`, we can reproduce what the user saw.

```
y = exp(log(y) + randn(size(y))/10);
p1 = fitnlm(x,y,'y ~ b1*exp(b2*x)',[1 1])
p2 = polyfit(x,log(y),1); p2(2) = exp(p2(2))
xx = linspace(0,1)';
subplot(1,1,1)
plot(x,y,'x', xx,predict(p1,xx),'r-')
```

```
p1 = 

Nonlinear regression model:
    y ~ b1*exp(b2*x)

Estimated Coefficients:
          Estimate      SE        tStat       pValue  
          ________    _______    _______    __________
    b1     10.015     0.34113      29.36    1.9624e-09
    b2    -4.8687     0.23021    -21.149     2.625e-08

Number of observations: 10, Error degrees of freedom: 8
Root Mean Squared Error: 0.141
R-Squared: 0.997,  Adjusted R-Squared 0.996
F-statistic vs. zero model: 1.91e+03, p-value = 1.91e-11

p2 =
   -4.8871   10.0008
```

Which estimates to believe? Well, it turns out that once you add noise, these models are no longer equivalent. Adding noise to the original data is one thing. Adding noise to the log data gives noise values that, back on the original scale, grow with the value of `y`. We can see the difference between the two more easily if we generate a larger set of data.

```
rng default
X = rand(100,1);
Y = 10*exp(-5*X);
subplot(1,2,1);
Y1 = Y + randn(size(Y))/5;
plot(X,Y1,'x', xx,10*exp(-5*xx),'r-');
subplot(1,2,2);
Y2 = exp(log(Y) + randn(size(Y))/10);
plot(X,Y2,'x', xx,10*exp(-5*xx),'r-');
```

It's hard to see at the top of the plots, but near `y=0` we can see that the noise is larger on the left than on the right. On the left the noise is additive. On the right, the noise is multiplicative.

Which model is correct or appropriate? We'd have to understand the data to decide that. One clue is that if negative values are plausible when the curve approaches zero, then an additive model may be appropriate. If the noise is more plausibly described in terms like +/- 10%, then the multiplicative model may be appropriate.

Now, not all models are easily transformed this way. A multiplicative model may still be appropriate even when no such simple transformation exists. Fortunately, the `fitnlm` function from the Statistics and Machine Learning Toolbox has a feature that lets you specify the so-called "error model" directly.

```
p1 = fitnlm(x,y,'y ~ b1*exp(b2*x)',[1 1],'ErrorModel','proportional')
```

```
p1 = 

Nonlinear regression model:
    y ~ b1*exp(b2*x)

Estimated Coefficients:
          Estimate      SE       tStat       pValue  
          ________    ______    _______    __________
    b1     9.9973     0.8188      12.21    1.8787e-06
    b2    -4.8771     0.1162    -41.973     1.144e-10

Number of observations: 10, Error degrees of freedom: 8
Root Mean Squared Error: 0.121
R-Squared: 0.974,  Adjusted R-Squared 0.97
F-statistic vs. zero model: 344, p-value = 1.75e-08
```

This model isn't exactly the same as the ones before. It's modeling additive noise, but with a scale factor that increases with the size of the fitted function. But rest assured, it takes into account the different noise magnitudes, so it may be useful for data having that characteristic.

Now that I have your attention, though, I'd like to address another topic related to curve fitting. This time let's consider the Weibull curve.

```
rng default
x = wblrnd(5,2,100,1);
subplot(1,1,1)
histogram(x, 'BinEdges',0:14, 'Normalization','pdf')
```

The Weibull density has this form:

```
X = linspace(0,20);
A = 5; B = 2;
Y = (B/A) * (X/A).^(B-1) .* exp(-(X/A).^B);
hold on
plot(X,Y)
hold off
```

Suppose we don't know the parameters `A` and `B`. Once again, there are two ways of estimating them. First, we could get the bin centers and bin heights, and use curve fitting to estimate the parameters.

```
heights = histcounts(x, 'BinEdges',0:14, 'Normalization','pdf');
centers = (0.5:1:13.5)';
fitnlm(centers,heights,@(params,x)wblpdf(x,params(1),params(2)),[2 2])
```

```
ans = 

Nonlinear regression model:
    y ~ wblpdf(x,params1,params2)

Estimated Coefficients:
               Estimate      SE       tStat       pValue  
               ________    _______    ______    __________
    params1     4.6535     0.18633    24.975    1.0287e-11
    params2     1.9091     0.10716    17.815    5.3598e-10

Number of observations: 14, Error degrees of freedom: 12
Root Mean Squared Error: 0.0182
R-Squared: 0.937,  Adjusted R-Squared 0.932
F-statistic vs. zero model: 197, p-value = 6.62e-10
```

The alternative is not to treat this like a curve fitting problem, but to treat it like a distribution fitting problem instead. After all, there is no need for us to artificially bin the data before fitting. Let's just fit the data as we have it.

```
fitdist(x,'weibull')
```

```
ans = 

  WeibullDistribution

  Weibull distribution
    A = 4.76817   [4.29128, 5.29806]
    B = 1.96219   [1.68206, 2.28898]
```

It's much simpler to call the distribution fitting function than to set this up as a curve fitting function. But in case that doesn't convince you, I would like to introduce the concept of statistical efficiency. Notice that the distribution fitting parameters are closer in this case to the known values 5 and 2. Is that just a coincidence? A method is statistically more efficient than another if it can get the same accuracy using less data. Let's have a contest. We will fit both of these models 1000 times, collect the estimates of `A`, and see which one is more variable.

```
AA = zeros(1000,2);
for j=1:1000
    x = wblrnd(5,2,100,1);
    heights = histcounts(x, 'BinEdges',0:14, 'Normalization','pdf');
    f = fitnlm(centers,heights,@(params,x)wblpdf(x,params(1),params(2)),[2 2]);
    p = fitdist(x,'weibull');
    AA(j,1) = f.Coefficients.Estimate(1);
    AA(j,2) = p.A;
end
mean_AA = mean(AA)
std_AA = std(AA)
```

```
mean_AA =
    5.0145    4.9973

std_AA =
    0.3028    0.2566
```

We have a winner! Both estimates have a mean that is close to the known value of 5, but the values produced by distribution fitting are less variable.

When you approach a curve fitting problem, I recommend you consider a few things:

- Is this a curve fitting problem at all? If it involves the relationship between two variables, it's curve fitting. If it involves the distribution of a single variable, try to approach it as a distribution fitting problem.
- Is the function one that can transform to a simpler form, such as a linear relationship? If so, consider doing that to make the fitting process more efficient.
- However, consider the noise. Does the noise seem additive or multiplicative? Does the noise vary with the magnitude of `Y`? Are negative `Y` values plausible? These questions can help you decide whether to fit on the original or transformed scale, and whether to specify an error model.

What criteria do you use to choose a model for fitting your data? Let us know here.

Get the MATLAB code

Published with MATLAB® R2018b

*Today's guest post comes from Sean de Wolski, one of Loren's fellow Application Engineers. You might recognize him from MATLAB answers and the pick of the week blog!*

One of my colleagues approached me last month and asked for help debugging an error with `splitapply`. `splitapply` takes group information and applies a function to each group in the data (sort of like a pivot table). Note that everything here also applies to the lower-level but more powerful function `accumarray`.

The documentation provides numerous simple examples of what `splitapply` does, so check them out if you're not familiar with it.
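As a quick refresher, here is a toy example (my own made-up data, not from the documentation): `findgroups` assigns each row a group number, and `splitapply` calls the function once per group.

```matlab
% Toy data: two animals, four weights.
animal = categorical(["dog"; "cat"; "dog"; "cat"]);
weight = [30; 4; 25; 5];

g = findgroups(animal);                    % one group number per row
meanWeight = splitapply(@mean, weight, g)
% Groups follow category order (cat, dog), so this returns [4.5; 27.5]
```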

Here is an anonymized version of the data set my colleague had. The first three variables are categorical identifiers, and the fourth holds some data associated with them.

```
load('Bears.mat')
disp(Bears)
```

```
     Candy        Animal        Sports                     Bytes
    ________    __________    ___________    ____________________________________

    Cinnamon    Black         Brown           -0.8095      0.40391     2
    Cinnamon    Polar         Brown           -2.9443      0.096455    3
    Cinnamon    Polar         Brown            1.4384      0.13197     9
    Cinnamon    Sloth         Chicago          0.32519     0.94205     1
    Cinnamon    Sloth         Chicago         -0.75493     0.95613     5
    Cinnamon    Sloth         Chicago          1.3703      0.57521     2
    Cinnamon    Sloth         Chicago         -1.7115      0.05978    10
    Cinnamon    Sloth         Chicago         -0.10224     0.23478     8
    Cinnamon    Sun           Baylor          -0.24145     0.35316     6
    Cinnamon    Sun           Baylor           0.31921     0.82119     5
    Cinnamon    Polar         Brown            0.31286     0.015403    1
    Cinnamon    Polar         Brown           -0.86488     0.043024    7
    Cinnamon    Black         Brown           -0.030051    0.16899     1
    Cinnamon    Black         Brown           -0.16488     0.64912     1
    Cinnamon    Polar         Brown            0.62771     0.73172     6
    Cinnamon    Polar         Brown            1.0933      0.64775     1
    Cinnamon    Spectacled    Coast Guard      1.1093      0.45092     9
    Cinnamon    Sloth         Chicago         -0.86365     0.54701     9
    Gummy       Sun           Baylor           0.077359    0.29632     8
    Gummy       Sun           Baylor          -1.2141      0.74469     2
    Gummy       Sun           Baylor          -1.1135      0.18896     7
    Gummy       Sun           Baylor          -0.0068493   0.68678     6
    Gummy       Sun           Baylor           1.5326      0.18351    10
    Gummy       Sloth         Chicago         -0.76967     0.36848     7
    Gummy       Sloth         Chicago          0.37138     0.62562     9
    Gummy       Spectacled    Coast Guard     -0.22558     0.78023     5
    Gummy       Spectacled    Coast Guard      1.1174      0.081126    5
    Gummy       Spectacled    Coast Guard     -1.0891      0.92939     9
    Gummy       Sloth         Chicago          0.032557    0.77571     1
    Gummy       Sloth         Chicago          0.55253     0.48679     2
    Gummy       Spectacled    Coast Guard      1.1006      0.43586     2
    Gummy       Spectacled    Coast Guard      1.5442      0.44678     4
    Gummy       Spectacled    Coast Guard      0.085931    0.30635     9
    Gummy       Spectacled    Coast Guard     -1.4916      0.50851     9
    Gummy       Spectacled    Coast Guard     -0.7423      0.51077     1
    Gummy       Spectacled    Coast Guard     -1.0616      0.81763     4
    Gummy       Spectacled    Coast Guard      2.3505      0.79483     6
    Gummy       Sloth         Chicago         -0.6156      0.64432     5
    Gummy       Spectacled    Coast Guard      0.74808     0.37861     7
    Gummy       Sloth         Chicago         -0.19242     0.81158     7
    Gummy       Spectacled    Coast Guard      0.88861     0.53283     3
    Gummy       Spectacled    Coast Guard     -0.76485     0.35073     5
    Gummy       Spectacled    Coast Guard     -1.4023      0.939       1
    Gummy       Black         Brown           -1.4224      0.87594    10
    Gummy       Sloth         Chicago          0.48819     0.55016     2
    Gummy       Sloth         Chicago         -0.17738     0.62248     2
    Gummy       Sloth         Chicago         -0.19605     0.58704     4
    Gummy       Sloth         Chicago          1.4193      0.20774     2
    Gummy       Sloth         Chicago          0.29158     0.30125     5
    Gummy       Sloth         Chicago          0.19781     0.47092     4
    Gummy       Polar         Brown            1.5877      0.23049    10
    Gummy       Polar         Brown           -0.80447     0.84431    10
    Gummy       Sloth         Chicago          0.69662     0.19476     1
    Gummy       Black         Brown            0.83509     0.22592     8
    Gummy       Black         Brown           -0.24372     0.17071     3
    Gummy       Sloth         Chicago          0.21567     0.22766     5
    Gummy       Black         Brown           -1.1658      0.4357      6
    Gummy       Sloth         Chicago         -1.148       0.3111     10
    Gummy       Sloth         Chicago          0.10487     0.92338     5
    Gummy       Sloth         Chicago          0.72225     0.43021    10
    Gummy       Sloth         Chicago          2.5855      0.18482     4
    Gummy       Sloth         Chicago         -0.66689     0.90488     8
    Gummy       Sloth         Chicago          0.18733     0.97975     7
    Gummy       Sloth         Chicago         -0.082494    0.43887     6
    Gummy       Sloth         Chicago         -1.933       0.11112     7
    Gummy       Sloth         Chicago         -0.43897     0.25806     7
    Gummy       Sloth         Chicago         -1.7947      0.40872     2
    Gummy       Sloth         Chicago          0.84038     0.5949      2
    Gummy       Sloth         Chicago         -0.88803     0.26221    10
    Gummy       Sloth         Chicago          0.10009     0.60284     2
    Gummy       Sloth         Chicago         -0.54453     0.71122     1
    Gummy       Sloth         Chicago          0.30352     0.22175     6
    Gummy       Sloth         Chicago         -0.60033     0.11742     9
    Gummy       Sloth         Chicago          0.48997     0.29668     7
    Gummy       Sloth         Chicago          0.73936     0.31878     2
    Gummy       Sloth         Chicago          1.7119      0.42417     4
    Gummy       Sloth         Chicago         -0.19412     0.50786     5
    Gummy       Sloth         Chicago         -2.1384      0.085516   10
    Gummy       Sloth         Chicago         -0.83959     0.26248     2
    Gummy       Sloth         Chicago          1.3546      0.80101     9
    Gummy       Sloth         Chicago         -1.0722      0.02922     7
    Gummy       Sloth         Chicago          0.96095     0.92885     4
    Gummy       Sloth         Chicago          0.12405     0.73033     2
    Gummy       Sloth         Chicago          1.4367      0.48861     5
    Gummy       Sloth         Chicago         -1.9609      0.57853     5
    Gummy       Black         Brown           -0.1977      0.23728     2
    Gummy       Sloth         Chicago         -1.2078      0.45885     6
    Gummy       Sloth         Chicago          2.908       0.96309     3
    Gummy       Sloth         Chicago          0.82522     0.54681     4
    Gummy       Sloth         Chicago          1.379       0.52114     6
    Gummy       Sloth         Chicago         -1.0582      0.23159     3
    Gummy       Sloth         Chicago         -0.46862     0.4889      3
    Gummy       Black         Brown           -0.27247     0.62406     7
    Gummy       Polar         Brown            1.0984      0.67914     3
    Gummy       Sloth         Chicago         -0.27787     0.39552     9
    Gummy       Sloth         Chicago          0.70154     0.36744    10
    Gummy       Sloth         Chicago         -2.0518      0.98798     8
    Gummy       Black         Brown           -0.35385     0.037739    4
    Gummy       Sloth         Chicago         -0.82359     0.88517     6
    Gummy       Sloth         Chicago         -1.5771      0.91329     2
    Gummy       Sloth         Chicago          0.50797     0.79618    10
    Gummy       Sloth         Chicago          0.28198     0.098712    9
    Gummy       Sloth         Chicago          0.03348     0.26187     9
    Gummy       Sloth         Chicago         -1.3337      0.33536     3
    Gummy       Sloth         Chicago          1.1275      0.67973     6
    Gummy       Sloth         Chicago          0.35018     0.13655     1
    Gummy       Sloth         Chicago         -0.29907     0.72123     5
    Gummy       Sloth         Chicago          0.02289     0.10676     4
    Gummy       Sloth         Chicago         -0.262       0.65376     2
    Gummy       Sloth         Chicago         -1.7502      0.49417     2
    Gummy       Sloth         Chicago         -0.28565     0.77905     5
    Gummy       Black         Brown           -0.83137     0.71504     1
    Gummy       Sloth         Chicago         -0.97921     0.90372     6
    Gummy       Sloth         Chicago         -1.1564      0.89092     5
    Gummy       Sloth         Chicago         -0.53356     0.33416     7
    Gummy       Sloth         Chicago         -2.0026      0.69875     7
    Gummy       Sloth         Chicago          0.96423     0.19781     7
    Gummy       Sloth         Chicago          0.52006     0.030541    1
    Gummy       Sloth         Chicago         -0.020028    0.74407     1
    Gummy       Sloth         Chicago         -0.034771    0.50002     4
    Gummy       Sloth         Chicago         -0.79816     0.47992     6
    Gummy       Sloth         Chicago          1.0187      0.90472     7
    Gummy       Sloth         Chicago         -0.13322     0.60987     5
    Gummy       Sloth         Chicago         -0.71453     0.61767     9
    Gummy       Sloth         Chicago          1.3514      0.85944     8
    Gummy       Sloth         Chicago         -0.22477     0.80549    10
    Cinnamon    Polar         Brown           -0.58903     0.57672     6
```

The operation he was trying to perform was the NaN-omitted mean of Bytes, grouped by two of the categories.

```
[animalcandy, animal, candy] = findgroups(Bears.Animal,Bears.Candy);
meanbyte = splitapply(@(x)mean(x, 'omitnan'), Bears.Bytes, animalcandy);
```

```
Error using vertcat
Dimensions of arrays being concatenated are not consistent.

Error in splitapply>localapply (line 257)
        finalOut{curVar} = vertcat(funOut{:,curVar});

Error in splitapply (line 132)
    varargout = localapply(fun,splitData,gdim,nargout);

Error in mainDebuggingGroupedOps (line 32)
meanbyte = splitapply(@(x)mean(x, 'omitnan'), Bears.Bytes, animalcandy);
```

Hmm, I've seen that error before, but what does it have to do with this? How do we debug it? One option is to put a breakpoint in the anonymous function `@(x)mean(x, 'omitnan')` and then step with the debugger until the error occurs.

This would likely work for a small number of groups, but as the number of groups gets larger, it would take lots of steps, one for each function evaluation. You'd also likely have to do it twice, a second time after the error occurs. Setting the debugger to stop on errors may work for `splitapply`, but not for `accumarray`, which is built in, and even with `splitapply` it may not stop you in a useful spot.

A trick I like to use is to replace the function handle with `@(x){x}`. This takes whatever is provided and packs it into a cell, so you can see exactly what is being passed into each function evaluation for each group.

```
bytecell = splitapply(@(x){x}, Bears.Bytes, animalcandy);
disp(bytecell)
```

```
    [14×3 double]
    [ 1×3 double]
    [ 5×3 double]
    [ 2×3 double]
    [78×3 double]
    [ 6×3 double]
    [ 8×3 double]
    [ 3×3 double]
    [ 3×3 double]
    [ 7×3 double]
```

From here we can see that the second cell has only one row. Since `mean` operates along the first non-singleton dimension, it reduces that single row to a scalar by averaging across the row, whereas the rest of the cells produce row vectors from taking the mean of each column. A scalar can't be concatenated with those row vectors, so we get the error.

The fix is simple: pass the dimension argument to `mean` to force it to always take the column mean. Then rebuild the table with the labels.

```
meanbyte = splitapply(@(x)mean(x, 1, 'omitnan'), Bears.Bytes, animalcandy);
disp(table(animal, candy, meanbyte))
```

```
      animal       candy                   meanbyte              
    __________    ________    ___________________________________

    Spectacled    Gummy         0.075572     0.55805          5  
    Spectacled    Cinnamon        1.1093     0.45092          9  
    Sun           Gummy          -0.1449     0.42005        6.6  
    Sun           Cinnamon       0.03888     0.58718        5.5  
    Sloth         Gummy        -0.081113     0.51523     5.2949  
    Sloth         Cinnamon      -0.28948     0.55249     5.8333  
    Black         Gummy         -0.45653      0.4153      5.125  
    Black         Cinnamon      -0.33481     0.40734     1.3333  
    Polar         Gummy          0.62722     0.58464     7.6667  
    Polar         Cinnamon      -0.13228     0.32043     4.7143  
```

In this case, the fix was fairly discernible from a quick inspection. If it were not, we could loop over the cell and evaluate the function on each element to see where the error occurs. If the error occurs on a specific cell's data, the loop will stop there and we can investigate. If it's on the concatenation step, that will be obvious at the end.

```
fun = @(x)mean(x, 'omitnan');
meanbytecell = cell(size(bytecell));
for ii = 1:numel(bytecell)
    meanbytecell{ii} = fun(bytecell{ii});
end
disp(meanbytecell)
```

```
    [1×3 double]
    [    3.5201]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
```

And now it is obvious why these won't concatenate.

I find looping over the cell in this manner to be much easier than looping over the original data set and trying to identify which elements are in which groups and indexing correctly.

I'm also a big fan of using `splitapply`/`accumarray` with cell output for making objects or plots based on grouped data where the object can't be returned directly. Continuing this example, we'll create a histogram for each group of the original data, wrapping `histogram` in `{}`.

```
figure
axes('ColorOrder', parula(numel(animal)))
hold on
h = splitapply(@(x){histogram(x)}, Bears.Bytes, animalcandy);
legend([h{:}], compose("%s/%s", animal, candy));
```

As an aside, development has been working to make grouped operations easier over the last few releases with a collection of new functions:

- `varfun` - Apply a function to table input variables, based on grouping variables
- `groupsummary` - Group summary statistics
- `grouptransform` - Transform based on groups
- `retime` - Aggregate based on time

Doing this same operation with `varfun` would look like this:

```
meanbytetable = varfun(@(x)mean(x, 1, 'omitnan'), Bears, ...
    'GroupingVariables', {'Animal', 'Candy'}, ...
    'InputVariables', {'Bytes'});
disp(meanbytetable)
```

```
      Animal       Candy      GroupCount                 Fun_Bytes              
    __________    ________    __________    ___________________________________

    Spectacled    Gummy           14          0.075572     0.55805          5  
    Spectacled    Cinnamon         1            1.1093     0.45092          9  
    Sun           Gummy            5           -0.1449     0.42005        6.6  
    Sun           Cinnamon         2           0.03888     0.58718        5.5  
    Sloth         Gummy           78         -0.081113     0.51523     5.2949  
    Sloth         Cinnamon         6          -0.28948     0.55249     5.8333  
    Black         Gummy            8          -0.45653      0.4153      5.125  
    Black         Cinnamon         3          -0.33481     0.40734     1.3333  
    Polar         Gummy            3           0.62722     0.58464     7.6667  
    Polar         Cinnamon         7          -0.13228     0.32043     4.7143  
```
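A `groupsummary` version might look like the sketch below. I'm writing this from memory of the signature, so treat it as an assumption rather than gospel; `groupsummary` also accepts a function handle as the method, which should reproduce the NaN-omitted column means.

```matlab
% Sketch (unverified): groupsummary with a function-handle method.
meanbytetable2 = groupsummary(Bears, {'Animal', 'Candy'}, ...
    @(x)mean(x, 1, 'omitnan'), 'Bytes');
disp(meanbytetable2)
```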

Do you work with grouped data functions? Let us know here.

Get the MATLAB code

Published with MATLAB® R2018b

*Today we have two guest bloggers, Lisa Kempler and Pradeep Ramamoorthy, who work at MathWorks in Natick, Massachusetts supporting and developing online tools for researchers. Their post talks about a relatively new code-sharing platform, Code Ocean.*

Code Ocean is a cloud-based platform aimed at furthering computational reproducibility and open research. The site, accessible via a web browser, enables researchers to share code and data associated with their published research. Visitors to the site can view and run the code, thereby verifying that the code produces the results described in the original research paper. The platform supports a variety of programming languages, including MATLAB.

In an effort to provide easier access to code and data, Code Ocean recently announced the ability to export compute capsules:

https://codeocean.com/blog/post/new-compute-capsules-now-exportable-from-code-ocean

Users can now download these compute capsules, or containers -- encapsulations of code, data, and the computational environment -- in order to reuse and build on the published research and code. MATLAB users who download compute capsules containing MATLAB code can run the code and view the associated results on their local computers.

Reproducible Research (RR) has been a big push for a long time by publishers and funding organizations that want to ensure that research is sufficiently vetted. There are two major benefits of good, verifiable research: 1) publications provide high-value information, and 2) those researchers doing follow-on research can confidently build on the work of their peers.

Researchers' desire to leverage historical research has led to a movement around open science, or, more broadly, open research. The primary goal of this openness is the same as the underlying driver of RR: if you make sure the results can be reproduced, then it's reasonable to build on those results. However, "open" takes it a step further. Openness pushes RR beyond proving the validity of the research to Reuse, a requirement to make the research approaches and resulting artifacts broadly accessible.

Using Code Ocean, published authors can reproduce and verify their research results. However, Code Ocean’s main value to researchers is this ability to reuse the work of their published peers.

Although the ability to download the code, associated data, and related graphical and numeric output gives researchers a huge head start, buy-in by researchers who submit for publication is still limited. In a recent article (An empirical analysis of journal policy effectiveness for computational reproducibility), Stodden et al. demonstrate the lack of engagement in RR by most researchers. The study deemed 56 of 204 published papers computationally reproducible, even after multiple attempts to get additional information from the authors of the remaining 148. The study's finding, roughly 25% compliance for published papers that are inherently computational, tells us that the norm is still 1) non-reproducibility and 2) not-so-transparent paths to Reuse for most published computational research.

Having MATLAB language support on Code Ocean makes it easier for researchers to share their work. Using these tested outputs, MATLAB users can create new research, and transfer their learnings to new innovations and products in science and industry. Code Ocean’s easy-upload and sharing platform holds the possibility of increasing RR compliance (and, in turn, Reuse-ability), as publishers, authors, and follow-on researchers see the value in sharing.

Compute capsules are the foundational units on Code Ocean. They encapsulate the elements required to reproduce and reuse research – code, data, documentation and a specification of the computational environment. Researchers create a compute capsule associated with their research, and visitors open these capsules to examine and run the code.

Let’s say you’re a researcher working in the field of neuroscience. You hear about ongoing research and development of models for simulating brain fibers.

Once you log in to the Code Ocean website (setting up an account is quick, and free), you can explore the curated gallery of published compute capsules, or search for relevant terms. If you search for ‘fiber’ or ‘brain’, you see relevant results, as shown below:

The first search result, Fiber Source Separation, looks promising and might be what you were looking for. Clicking on the link takes you to the Code Ocean IDE, which allows you to interact with the code, look at supporting documents and visualizations, and run the code on Code Ocean's cloud platform.

To export this capsule, just select the ‘Export’ option from the ‘Capsule’ menu.

Selecting this option initializes the download process. Once the download completes, you can extract the package. REPRODUCING.md, shown below, is your read-me file, with the steps needed to reproduce the results of the capsule. The next step, unpacking the capsule, requires Docker to be installed and some experience using it.

The ability to view and reuse code associated with published research is a big plus. Having the bi-directional linkage between the code and the published article, from papers on publisher sites to the code and from Code Ocean capsules back to the papers, makes it easy to find and use the different related components. If you have a published paper with associated MATLAB code, consider uploading it to Code Ocean. Or visit Code Ocean to view and download research-related MATLAB code.

Have you used Code Ocean (or similar platforms) for your research and code-sharing needs? Let us know here.

Get the MATLAB code

Published with MATLAB® R2018b

Do you ever need to explain to people unfamiliar with your work what it's about? One situation I frequently face is explaining machine learning to audiences who want to learn more about it but are not yet particularly conversant in it. This is true regardless of the audience, be they students, professors, researchers, or folks working in government and industry as scientists and engineers. So, what do I do?

A few years ago I wanted to find a way to explain machine learning in a way that would make it understandable and fun. I came up with an explanation that illustrates what's going on in machine learning without any of the mathematical details.

Most people I know learned regression somewhere along the way, often in a stats class, or perhaps they were exposed to clustering and classification using the famous Fisher Iris dataset (here are our classification and clustering examples). I tried this approach a few times and, as much as I like flowers and sepal length and petal width, I thought I could do better to make machine learning concepts easier to grasp.

I came up with the idea of using animals. Who doesn't like dogs, cats, and birds, or at least some of these? It worked out well: over the last couple of years I've shown this over a hundred times and always get positive feedback from the audience. A few months ago some colleagues asked me what I was showing customers these days. When I described my animal story, they were pretty excited and thought I should record a video to help others understand this machine learning area that everyone seems to want to know more about.

Well, to cut to the chase (spoiler alert: a cheetah is the regression winner), you can now watch the video.

**Does This Explanation Help?**

Did you find this explanation useful, either for yourself, or to pass along to others? What other concepts have you needed to "illustrate"? Let me know here.

Get the MATLAB code

Published with MATLAB® R2018b