We all use Fourier analysis every day without even knowing it. Cell phones, disc drives, DVDs, and JPEGs all involve fast finite Fourier transforms. This post, which describes touch-tone telephone dialing, is the first of three posts about the computation and interpretation of FFTs. The posts are adapted from chapter 8 of my book, *Numerical Computing with MATLAB* .

Touch-tone telephone dialing is an example of everyday use of Fourier analysis. The basis for touch-tone dialing is the Dual Tone Multi-Frequency (DTMF) system. *Synthesis* is the generation of analog tones to represent digits in phone numbers. *Analysis* is the decoding of these tones to retrieve the digits. The program `touchtone`, which is available here, or which is included with my NCM App, demonstrates both these aspects of DTMF.

The telephone dialing pad acts as a 4-by-3 matrix. Associated with each row and column is a frequency. These basic frequencies are

fr = [697 770 852 941];
fc = [1209 1336 1477];

These frequencies were chosen so that none is an integer multiple of any other; in fact, the ratio between adjacent frequencies is close to 21/19. This avoids interfering overtones.
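As a quick sanity check, here is a small Python sketch (my own, not part of `touchtone`) confirming that adjacent DTMF frequencies all have ratios close to 21/19:

```python
# Check (not part of touchtone): adjacent DTMF frequencies
# all have ratios close to 21/19.
fr = [697, 770, 852, 941]     # row frequencies
fc = [1209, 1336, 1477]       # column frequencies

target = 21 / 19              # about 1.1053
ratios = [hi / lo for row in (fr, fc) for lo, hi in zip(row, row[1:])]
assert all(abs(r - target) < 0.01 for r in ratios)
print([round(r, 4) for r in ratios])
```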

If `s` is a character that labels one of the buttons on the keypad, the corresponding row index `k` and column index `j` can be found with

switch s
   case '*'
      k = 4; j = 1;
   case '0'
      k = 4; j = 2;
   case '#'
      k = 4; j = 3;
   otherwise
      d = s - '0';
      j = mod(d-1,3) + 1;
      k = (d-j)/3 + 1;
end
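The same mod/divide trick can be sketched in Python (the `button_index` helper is my own name, just to make the arithmetic explicit):

```python
# Hypothetical Python version of the keypad index arithmetic.
def button_index(s):
    if s == '*': return 4, 1
    if s == '0': return 4, 2
    if s == '#': return 4, 3
    d = ord(s) - ord('0')     # digit value, as in d = s-'0'
    j = (d - 1) % 3 + 1       # column index
    k = (d - j) // 3 + 1      # row index
    return k, j

print(button_index('5'))      # -> (2, 2)
```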

An important parameter in digital sound is the sampling rate.

Fs = 32768

A vector of sample times covering one-fourth of a second at this sampling rate is

t = 0:1/Fs:0.25;

The tone generated by the button in position `(k,j)` is obtained by superimposing the two fundamental tones with frequencies `fr(k)` and `fc(j)`.

y1 = sin(2*pi*fr(k)*t);
y2 = sin(2*pi*fc(j)*t);
y = (y1 + y2)/2;

If your computer is equipped with a sound card, the statement

sound(y,Fs)

plays the tone.

For example, this is the display produced by `touchtone` for the "1" button. The top subplot depicts the two underlying frequencies and the bottom subplot shows a portion of the signal obtained by averaging the sine waves with those frequencies.

How is it possible to determine a phone number by listening to the signal generated? This involves the FFT, the Finite Fourier Transform, or, more likely, some specialized version of the FFT adapted to this task. I will discuss the FFT in my next post.

The data file `touchtone.mat` contains a recording of a telephone being dialed. The file is available from

http://www.mathworks.com/moler/ncm/touchtone.mat

Go to the `Home` tab above the MATLAB command window, click the `Open` button, and paste this URL into the `File name` field. This will load a structure `y` into your MATLAB workspace. Then the statement

y

will produce

y = 
    sig: [1x74800 int8]
     fs: 8192

This shows that `y` has two fields, an integer vector `y.sig`, of length 74800, containing the signal, and a scalar `y.fs`, with the value 8192, which is the sample rate.

max(abs(y.sig))

reveals that the elements of the signal are bounded in absolute value by 127. So the statements

Fs = y.fs;
y = double(y.sig)/128;

save the sample rate, rescale the vector and convert it to double precision. The statements

n = length(y);
t = (0:n-1)/Fs;

reproduce the sample times of the recording. The last component of `t` is `9.1307`, indicating that the recording lasts a little over 9 seconds. The next figure is a plot of the entire signal.

This recording is noisy. You can even see small spikes on the graph at the times the buttons were clicked. It is easy to see that 11 digits were dialed, but, on this scale, it is impossible to determine the specific digits.

Here is the magnitude of the FFT of the signal, which is the key to determining the individual digits.

p = abs(fft(y));
f = (0:n-1)*(Fs/n);
plot(f,p)
axis([500 1700 0 600])

The *x* -axis corresponds to frequency. The `axis` settings limit the display to the range of the DTMF frequencies. There are seven peaks, corresponding to the seven basic frequencies. This overall FFT shows that all seven frequencies are present someplace in the signal, but it does not help determine the individual digits.

The `touchtone` program also lets you break the signal into 11 equal segments and analyze each segment separately. The next figure is the display from the first segment.

For this segment, there are only two peaks, indicating that only two of the basic frequencies are present in this portion of the signal. These two frequencies come from the "1" button. You can also see that the waveform of a short portion of the first segment is similar to the waveform that our synthesizer produces for the "1" button. So we can conclude that the number being dialed in `touchtone` starts with a 1. We leave it as an exercise to continue the analysis with `touchtone` and identify the complete phone number.
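The whole round trip can be sketched in a few lines of pure Python. This is not the `touchtone` program itself, just an illustrative decoder of my own: it synthesizes a keypad tone and then recovers the button by measuring the DFT magnitude at each of the seven basic frequencies (the helper names and the quarter-second duration are my choices):

```python
# Illustrative DTMF round trip in pure Python (not the touchtone program).
import math

FR = [697, 770, 852, 941]          # row frequencies
FC = [1209, 1336, 1477]            # column frequencies
KEYS = ["123", "456", "789", "*0#"]

def dtmf_tone(key, fs=8192, dur=0.25):
    # superimpose the row and column tones for the given button
    k, j = next((r, c) for r, row in enumerate(KEYS)
                       for c, ch in enumerate(row) if ch == key)
    n = int(fs * dur)
    return [(math.sin(2*math.pi*FR[k]*t/fs) +
             math.sin(2*math.pi*FC[j]*t/fs)) / 2 for t in range(n)]

def dft_mag(x, f, fs):
    # magnitude of the DFT of x at a single frequency f
    re = sum(v * math.cos(2*math.pi*f*t/fs) for t, v in enumerate(x))
    im = sum(v * math.sin(2*math.pi*f*t/fs) for t, v in enumerate(x))
    return math.hypot(re, im)

def decode(x, fs=8192):
    # pick the row and column frequency with the largest magnitudes
    k = max(range(4), key=lambda i: dft_mag(x, FR[i], fs))
    j = max(range(3), key=lambda i: dft_mag(x, FC[i], fs))
    return KEYS[k][j]

print(decode(dtmf_tone("5")))      # -> 5
```

In the real signal, of course, each segment would first have to be cut out of the recording, as `touchtone` does.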

`touchtone` has a hidden "Easter egg" that, as far as I know, has never been uncovered. If anyone finds it, submit a comment and let us know.

Cleve Moler, *Numerical Computing with MATLAB*, Electronic Edition, MathWorks, <http://www.mathworks.com/moler/index_ncm.html>; Print Edition, SIAM Revised Reprint, SIAM, 2008, 334 pp., <http://bookstore.siam.org/ot87>.

Get the MATLAB code

Published with MATLAB® R2014a

For several years we thought Hadamard matrices showed maximum element growth for Gaussian elimination with complete pivoting. We were wrong.

I want to continue the discussion from my previous blog post of element growth during Gaussian elimination. Suppose we are solving a system of linear equations involving a matrix $A$ of order $n$. Let $a_{i,j}^{(k)}$ denote the elements in the matrix after the $k$ th step of the elimination. Recall that the *growth factor* for the pivoting process we are using is the quantity

$$ \rho_n = { \max_{i,j,k} |a_{i,j}^{(k)}| \over \max_{i,j} |a_{i,j}| } $$

Complete pivoting is intended to control the growth factor by using both row and column interchanges at each step in the reduction to bring the largest element in the entire unreduced matrix into the pivot position. In his 1962 analysis of roundoff error in Gaussian elimination, J. H. Wilkinson proved that the growth factor for complete pivoting satisfies

$$ \rho_n \le g(n) $$

where the *growth function* is

$$ g(n) = \left( n \cdot 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)} \right)^{1/2} \approx 1.8 \, n^{1/2} \, n^{\frac{1}{4} \log{n}} $$

Wilkinson's analysis makes it clear that the growth can never actually be anywhere near this large.

We saw that the growth factor for partial pivoting is $2^{n-1}$. For complete pivoting, it's much, much less. To get a feeling for how much, express $g(n)$ as a MATLAB one-liner

g = @(n) sqrt(n*prod((2:n).^(1./(1:(n-1)))))

g = 
    @(n)sqrt(n*prod((2:n).^(1./(1:(n-1)))))

Check out $n = 100$

g(100)

ans =
   3.5703e+03

So here $g(n) \approx 35n$.
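For readers without MATLAB at hand, here is the same one-liner as a Python sketch (it uses `math.prod`, available in Python 3.8 and later):

```python
# Wilkinson's bound on the complete pivoting growth factor,
# translated from the MATLAB one-liner above.
import math

def g(n):
    return math.sqrt(n * math.prod(k ** (1 / (k - 1)) for k in range(2, n + 1)))

print(g(100))    # about 3.5703e+03, matching the MATLAB value
```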

Jacques Hadamard was a French mathematician who lived from 1865 until 1963. He made contributions in many fields, ranging from number theory to partial differential equations. The matrices named after him have entries +1 and -1 and mutually orthogonal rows and columns. The $n$ -dimensional parallelotope spanned by the rows of a Hadamard matrix has the largest possible volume among all parallelotopes spanned by matrices with entries bounded by one. We are interested in Hadamard matrices in the MATLAB world because they are the basis for Hadamard transforms, which are closely related to Fourier transforms. They are also applicable to error correcting codes and in statistics to estimation of variance.

So MATLAB already has a `hadamard` function. Here is the 8-by-8 Hadamard matrix.

H = hadamard(8)

H =
     1     1     1     1     1     1     1     1
     1    -1     1    -1     1    -1     1    -1
     1     1    -1    -1     1     1    -1    -1
     1    -1    -1     1     1    -1    -1     1
     1     1     1     1    -1    -1    -1    -1
     1    -1     1    -1    -1     1    -1     1
     1     1    -1    -1    -1    -1     1     1
     1    -1    -1     1    -1     1     1    -1

Check that the rows are perpendicular to each other.

H'*H

ans =
     8     0     0     0     0     0     0     0
     0     8     0     0     0     0     0     0
     0     0     8     0     0     0     0     0
     0     0     0     8     0     0     0     0
     0     0     0     0     8     0     0     0
     0     0     0     0     0     8     0     0
     0     0     0     0     0     0     8     0
     0     0     0     0     0     0     0     8

The orthogonality of the rows leads to element growth during Gaussian elimination. If we call for the LU factorization of `H`, no pivoting actually takes place, but the same result would be produced by complete pivoting that settles ties in favor of the incumbent.

[L,U] = lu(H)

L =
     1     0     0     0     0     0     0     0
     1     1     0     0     0     0     0     0
     1     0     1     0     0     0     0     0
     1     1     1     1     0     0     0     0
     1     0     0     0     1     0     0     0
     1     1     0     0     1     1     0     0
     1     0     1     0     1     0     1     0
     1     1     1     1     1     1     1     1
U =
     1     1     1     1     1     1     1     1
     0    -2     0    -2     0    -2     0    -2
     0     0    -2    -2     0     0    -2    -2
     0     0     0     4     0     0     0     4
     0     0     0     0    -2    -2    -2    -2
     0     0     0     0     0     4     0     4
     0     0     0     0     0     0     4     4
     0     0     0     0     0     0     0    -8

This, then, is one matrix at $n = 8$ where

$$ \rho_n = n $$

For almost 30 years everybody believed that Hadamard matrices were the extreme cases and that the growth factor $\rho_n$ for complete pivoting with real matrices was equal to $n$.

At various times in the 1960s and '70s I personally did numerical experiments looking for matrices with large pivot growth. The computers I had at the time limited the order of my matrices to a few dozen. I never found any cases of $\rho_n > n$.

One of the people who worked on the problem was Leonard Tornheim at Chevron Research in California. He also did a lot of numerical experiments. He found a complex 3-by-3 matrix with $\rho_3 > 3$, and proved that for real matrices, $\rho_3 = 2\frac{1}{4}$.

In 1968, Colin Cryer made an official conjecture that $\rho_n$ was equal to $n$ for real matrices.

A 1988 paper by Jane Day and Brian Peterson provides a good survey of what was known up to that time about element growth with complete pivoting.

Then, in 1991, Nick Gould, from Oxford, surprised us all by announcing the discovery of real matrices for which the complete pivoting $\rho_n$ is greater than $n$. For an $n$ -by- $n$ matrix, with $n$ between 13 and 16, Nick set up a large, sparse, nonlinear optimization problem involving roughly $n^3/3$ variables, and tackled it with the LANCELOT package that he and his colleagues Andy Conn and Philippe Toint had developed. Most of his paper is about a 13-by-13 matrix with

$$ \rho_{13} = 13.0205 $$

This just barely exceeds the $\rho_n = n$ conjecture. But he also found some slightly larger matrices that do a better job.

$$ \rho_{14} = 14.5949 $$

$$ \rho_{15} = 16.1078 $$

$$ \rho_{16} = 18.0596 $$

The 16-by-16 is particularly dramatic because it is not a Hadamard matrix.

So, as far as I know, this is where the matter still stands today. Nobody is interested in running larger experiments. I have no idea what $\rho_{100}$ or $\rho_{1000}$ might be. Of course, we don't really need to know because we don't use complete pivoting in practice.

This reminds me of a story that I like to tell. I'm not sure it's actually true.

Peter Businger was a grad student at Stanford in Math and Computer Science at the same time I was in the early '60s. He worked with Gene Golub to write the first published Algol procedure for the QR matrix factorization using Householder reflections. Peter left Stanford to join Joe Traub's group at Bell Labs that was producing what they called the Port library of mathematical software. They were importing programs from everybody else, running the routines on Bell Labs' machines, thoroughly testing the software, deciding which codes were best, and producing a consolidated library.

Peter was in charge of matrix computation. He checked out all the imported linear equation solvers, including mine. He decided none of them was satisfactory, because of this pivoting growth question. So he wrote a program of his own that did partial pivoting and monitored the growth. If the growth was too large, the program switched to complete pivoting.

At the time, Bell Labs was trying to patent software, so Peter named his new subroutine "P^8". This stood for "Peter's Pseudo Perfect Perfect Partial Pivot Picker, Patent Pending". In my opinion, that is one of the all-time great subroutine names.

The experience inspired Peter to go to law school at night and change jobs. He switched to the Bell Labs patent department.

I've lost contact with Peter, so I can't ask him if he really did name the routine P^8. If he didn't, he should have.

Les Foster, whom I mentioned in the last post for finding examples of exponential growth with partial pivoting, promotes what he calls "Rook Pivoting". Search down the column for the largest element, as in partial pivoting, but then also search along that row for the largest element. This is intermediate between partial and complete pivoting. Les is able to prove some results about pivot growth. But the algorithm does not generalize well to a block form.

I was present at a milestone in the history of Hadamard matrices. The orthogonality conditions readily imply that the order $n$ of a Hadamard matrix must be 1, 2, or a multiple of 4. But the question remains, does a Hadamard matrix of order $n = 4k$ exist for every $k$? This is an open question in number theory.

It is fairly easy to create Hadamard matrices of some sizes. A nifty way to generate a Hadamard matrix of order a power of two is the recursion.

H = 1;
for k = 1:p   % produces orders 2, 4, 8, ..., 2^p
   H = [H H; H -H];
end

If `A` and `B` are Hadamard matrices, then so is

kron(A,B)
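For illustration, here is a toolbox-free Python sketch (the helper names are my own) of the same doubling recursion, with a check that $H^T H = nI$, i.e. that the rows are mutually orthogonal:

```python
# Pure-Python sketch of Sylvester's doubling recursion H = [H H; H -H],
# plus an orthogonality check equivalent to H'*H = n*I.
def sylvester_hadamard(p):
    H = [[1]]
    for _ in range(p):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def is_hadamard(H):
    n = len(H)
    return all(sum(a * b for a, b in zip(H[i], H[j])) == (n if i == j else 0)
               for i in range(n) for j in range(n))

H = sylvester_hadamard(3)     # the 8-by-8 case shown above
print(is_hadamard(H))         # -> True
```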

By 1961, with these and other similar constructions, Hadamard matrices were known for all orders $n \le 100$ that are multiples of four, except $n = 92$. In 1961 I had a summer job at JPL, Caltech's Jet Propulsion Lab. My office was in a temporary trailer and my trailer mate was Len Baumert. Len proudly showed me a color graphic that he had just made and that he was proposing for the cover of *Scientific American*. It was a 92-by-92 matrix made up from 26-by-26 blocks of alternating light and dark cells representing +1 and -1. The graphic didn't make the cover of *Scientific American*, but I kept my copy for a long time.

Len had done a computation on JPL's machines that filled in the last value of $n$ less than 100. His paper with his colleagues Sol Golomb, a professor at USC, and Marshall Hall, Jr., a professor at Caltech, was published in the prestigious Bulletin of the AMS. It turns out that I had taken a course on difference sets, the mathematics involved in generating the matrix, from Hall at Caltech.

Here is a MATLAB picture of `Baumert92`. You can get the function from this link.

H = Baumert92;
pcolor92(H);

Let's check that we get the expected pivot growth.

[L,U] = lu(H);
unn = U(end,end)

unn =
   -92

This reference to the book by Trefethen and Bau should have been included in my previous post about partial pivoting. Lecture 22 has an explanation of the stability of partial pivoting in practice.

Lloyd N. Trefethen and David Bau, III, *Numerical Linear Algebra*, <http://bookstore.siam.org/ot50>, SIAM, 1997, 362pp.

Peter Businger, "Monitoring the numerical stability of Gaussian elimination", <http://link.springer.com/article/10.1007/BF02165006>, Numerische Mathematik 16, 4, 1971, 360-361.

Jane Day and Brian Peterson, "Growth in Gaussian elimination", <http://www.jstor.org/stable/2322755>, American Mathematical Monthly, 95, 6, 1988, 489-513.

Nick Gould, "On growth in Gaussian elimination with complete pivoting", <http://www.numerical.rl.ac.uk/people/nimg/pubs/Goul91_simax.pdf>, SIAM Journal on Matrix Analysis and Applications 12, 1991, 351-364.

Leslie Foster, "The growth factor and efficiency of Gaussian elimination with rook pivoting", <http://www.math.sjsu.edu/~foster/gerpwithcor.pdf>, Journal of Computational and Applied Mathematics, 86, 1997, 177-194.

Leslie Foster, "LURP: Gaussian elimination with rook pivoting", matlabcentral/fileexchange/37721-rook-pivoting, MATLAB Central File Exchange, 2012.

Leonard Baumert, S. W. Golomb, and Marshall Hall, Jr, "Discovery of an Hadamard Matrix of Order 92", <http://dx.doi.org/10.1090%2FS0002-9904-1962-10761-7>, Bulletin of the American Mathematical Society 68 (1962), 237-238.

In rare cases, Gaussian elimination with partial pivoting is unstable. But the situations are so unlikely that we continue to use the algorithm as the foundation for our matrix computations.

I almost hesitate to bring this up. Gaussian elimination with partial pivoting is potentially unstable. We know of a particular test matrix, and have known about it for years, where the solution to simultaneous linear equations computed by our iconic backslash operator is less accurate than we typically expect. The solution is contaminated by unacceptably large roundoff errors associated with large elements encountered during the elimination process. The offending matrix is generated by the following function, which is a special case of the `gfpp` function in Nick Higham's collection of MATLAB test matrices.

```
type gfpp
```

function A = gfpp(n)
% A = gfpp(n)  Pivot growth of partial pivoting.
A = eye(n,n) - tril(ones(n,n),-1);
A(:,n) = 1;

Here is the 8-by-8.

n = 8
A = gfpp(n)

n =
     8
A =
     1     0     0     0     0     0     0     1
    -1     1     0     0     0     0     0     1
    -1    -1     1     0     0     0     0     1
    -1    -1    -1     1     0     0     0     1
    -1    -1    -1    -1     1     0     0     1
    -1    -1    -1    -1    -1     1     0     1
    -1    -1    -1    -1    -1    -1     1     1
    -1    -1    -1    -1    -1    -1    -1     1

Gaussian elimination with partial pivoting does not actually do any pivoting with this particular matrix. The first row is added to each of the other rows to introduce zeroes in the first column. This produces twos in the last column. As similar steps are repeated to create an upper triangular `U`, elements in the last column double with each step. The element growth in the final triangular factor `U` is `2^(n-1)`.
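The doubling is easy to watch outside MATLAB as well. Here is a pure-Python sketch of my own: it builds the same matrix and runs Gaussian elimination without pivoting, which is exactly what partial pivoting does on this matrix, and returns the last pivot.

```python
# Sketch (pure Python): build gfpp(n) and run Gaussian elimination
# without pivoting to watch the last column double at every step.
def gfpp(n):
    # identity, minus ones below the diagonal, with a final column of ones
    return [[1 if j == n - 1 or i == j else (-1 if i > j else 0)
             for j in range(n)] for i in range(n)]

def last_pivot(A):
    # forward elimination; returns the last diagonal element of U
    n = len(A)
    A = [row[:] for row in A]
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
    return A[-1][-1]

print(last_pivot(gfpp(8)))   # -> 128.0, which is 2^(n-1)
```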

[L,U] = lu(A);
U

U =
     1     0     0     0     0     0     0     1
     0     1     0     0     0     0     0     2
     0     0     1     0     0     0     0     4
     0     0     0     1     0     0     0     8
     0     0     0     0     1     0     0    16
     0     0     0     0     0     1     0    32
     0     0     0     0     0     0     1    64
     0     0     0     0     0     0     0   128

So if the matrix order `n` is set to the number of bits in the floating point word (plus one), the last diagonal element of `U` will grow to be `1/eps`.

n = 53
A = gfpp(n);
[L,U] = lu(A);
unn = U(n,n)

n =
    53
unn =
   4.5036e+15

If we try to use backslash to solve a linear system involving this matrix and a random right hand side, the back substitution will begin with `U(n,n)`. The roundoff errors can be expected to be as large as `eps(U(n,n))` and swamp the solution itself.

b = randn(n,1);
x = A\b;

Ordinarily, we expect the solution computed by Gaussian elimination to have a residual on the order of roundoff error. But in this case, the residual is many orders of magnitude larger. Backslash has failed.

r = b - A*x;
relative_residual = norm(r)/(norm(A)*norm(x))

relative_residual = 3.7579e-05

Anything we do to this matrix to provoke pivoting during the elimination will prevent the growth of the elements in `U`. One possibility is to swap the first two rows.

A = gfpp(n);
A([1 2],:) = A([2 1],:);
[L,U] = lu(A);
unn = U(n,n)

unn = 2

That's much better. Now use backslash and look at the residual.

x = A\b;
r = b - A*x;
relative_residual = norm(r)/(norm(A)*norm(x))

relative_residual = 1.7761e-17

So the residual is now on the order of roundoff relative to `A` and `x`. Backslash works as expected.

Another way to provoke pivoting is to perturb the matrix with a little white noise.

A = gfpp(n);
A = A + randn(n,n);
[L,U] = lu(A);
unn = U(n,n)

unn = -3.8699

So `U` is well behaved. Let's see about backslash.

x = A\b;
r = b - A*x;
relative_residual = norm(r)/(norm(A)*norm(x))

relative_residual = 1.2870e-16

Again the residual is on the order of roundoff and backslash works as expected.

Suppose we are using Gaussian elimination with any pivoting process to solve a system of linear equations involving a matrix $A$. Let $a_{i,j}^{(k)}$ denote the elements in the matrix after the $k$ th step of the elimination. The *growth factor* for that pivoting process is the quantity

$$ \rho = { \max_{i,j,k} |a_{i,j}^{(k)}| \over \max_{i,j} |a_{i,j}| } $$

This growth factor is a crucial quantity in Wilkinson's roundoff error analysis. If $\rho$ is not too large, then the elimination algorithm is stable. Unfortunately, the matrix we have just seen, `gfpp`, shows that the growth factor for partial pivoting is at least $2^{n-1}$.

At the time he was presenting his analysis in the early 1970s, Wilkinson knew about this matrix. But he said it was a very special case. He reported:

It is our experience that any substantial increase in size of elements of successive $A_n$ is extremely uncommon even with partial pivoting. ... No example which has arisen naturally has in my experience given an increase as large as 16.

During the development of LINPACK in the late 1970s, a colleague at Argonne, J. T. Goodman, and I did some experiments with random matrices to check for element growth with partial pivoting. We used several different element distributions and what were then matrices of modest order, namely values of $n$ between 10 and 50. The largest growth factor we found was for a matrix of 0's and 1's of order 40. The value was $\rho$ = 23. We never saw growth as large as linear, $\rho = n$, let alone exponential, $\rho = 2^{n-1}$.

Nick Trefethen and Rob Schreiber published an important paper, "Average-case Stability of Gaussian Elimination", in 1990. They present both theoretical and probabilistic models, and the results of extensive experiments. They show that for many different types of matrices, the typical growth factor for partial pivoting does not increase exponentially with matrix order $n$. It does not even grow linearly. It actually grows like $n^{2/3}$. Their final section, "Conclusions", begins:

Is Gaussian elimination with partial pivoting stable on average? Everything we know on the subject indicates that the answer is emphatically yes, and that one needs no hypotheses beyond statistical properties to account for the success of this algorithm during nearly half a century of digital computation.

The Higham Brothers, Nick and Des Higham, published a paper in 1989 about growth factors in Gaussian elimination. Most of their paper applies to any kind of pivoting and involves growth that is $O(n)$. But they do have one theorem that applies only to partial pivoting and that characterizes all matrices that have $\rho = 2^{n-1}$. Essentially these matrices have to have the behavior we have seen in `gfpp`, namely no actual pivoting and each step adds a row to all the later rows, doubling the values in the last column in the process.

In two different papers, Stephen Wright in 1993 and Les Foster in 1994, the authors present classes of matrices encountered in actual computational practice for which Gaussian elimination with partial pivoting blows up with exponential growth in the elements in the last few columns of `U`. Wright's application is a two-point boundary value problem for an ordinary differential equation. The discrete approximation produces a band matrix. Foster's application is a Volterra integral equation model for population dynamics. The discrete approximation produces a lower trapezoidal matrix. In both cases the matrices are augmented by several final columns. Certain choices of parameters cause the elimination to proceed with no pivoting and result in exponential growth in the final columns. The behavior is similar to `gfpp` and the worst case growth in the Highams' theorem.

Complete pivoting requires $O(n^3)$ comparisons instead of the $O(n^2)$ required by partial pivoting. More importantly, the repeated access to the entire matrix completely changes the memory access patterns. With modern computer storage hierarchy, including cache and virtual memory, this is an all-important consideration.

The growth factor $\rho$ with complete pivoting is interesting and I plan to investigate it in my next blog.

`lugui` is one of the demonstration programs included with Numerical Computing with MATLAB. I will also describe it in my next blog. For now, I will just use it to generate a graphic for this blog. Here is the result of a factorization, the lower triangular factor `L` in green and the upper triangular factor `U` in blue.

```
A = gfpp(6);
lugui(A,'partial')
```

Nicholas J. Higham and Desmond J. Higham, "Large Growth Factors in Gaussian Elimination with Pivoting", <http://www.maths.manchester.ac.uk/~higham/papers/hihi89.pdf>, SIAM J. Matrix Anal. Appl. 10, 155-164, 1989.

Lloyd N. Trefethen and Robert S. Schreiber, "Average-case Stability of Gaussian Elimination", <https://courses.engr.illinois.edu/cs591mh/trefethen/trefethen_schreiber.pdf>, SIAM J. Matrix Anal. Appl. 11, 335-360, 1990.

Stephen J. Wright, "A Collection of Problems for Which Gaussian Elimination with Partial Pivoting is Unstable", <ftp://info.mcs.anl.gov/pub/tech_reports/reports/P266.ps.Z>, SIAM J. Sci. Statist. Comput. 14, 231-238, 1993.

Leslie Foster, "Gaussian Elimination Can Fail In Practice", <http://www.math.sjsu.edu/~foster/geppfail.pdf>, SIAM J. Matrix Anal. Appl. 15, 1354-1362, 1994.

Denormal floating point numbers and gradual underflow are an underappreciated feature of the IEEE floating point standard. Double precision denormals are so tiny that they are rarely numerically significant, but single precision denormals can be in the range where they affect some otherwise unremarkable computations. Historically, gradual underflow proved to be very controversial during the committee deliberations that developed the standard.

My previous post was mostly about normalized floating point numbers. Recall that normalized numbers can be expressed as

$$ x = \pm (1 + f) \cdot 2^e $$

The *fraction* or *mantissa* $f$ satisfies

$$ 0 \leq f < 1 $$

$f$ must be representable in binary using at most 52 bits for double precision and 23 bits for single precision.

The exponent $e$ is an integer in the interval

$$ -e_{max} < e \leq e_{max} $$

where $e_{max} = 1023$ for double precision and $e_{max} = 127$ for single precision.

The finiteness of $f$ is a limitation on *precision*. The finiteness of $e$ is a limitation on *range*. Any numbers that don't meet these limitations must be approximated by ones that do.

Double precision floating point numbers are stored in a 64-bit word, with 52 bits for $f$, 11 bits for $e$, and 1 bit for the sign of the number. The sign of $e$ is accommodated by storing $e+e_{max}$, which is between $1$ and $2^{11}-2$.

Single precision floating point numbers are stored in a 32-bit word, with 23 bits for $f$, 8 bits for $e$, and 1 bit for the sign of the number. The sign of $e$ is accommodated by storing $e+e_{max}$, which is between $1$ and $2^{8}-2$.

The two extreme values of the exponent field, all zeroes and all ones, are special cases. All zeroes signifies a denormal floating point number, the subject of today's post. All ones, together with a zero fraction, denotes infinity, or `Inf`. And all ones, together with a nonzero fraction, denotes Not-A-Number, or `NaN`.

My program `floatgui`, available here, shows the distribution of the positive numbers in a model floating point system with variable parameters. The parameter $t$ specifies the number of bits used to store $f$, so $2^t f$ is an integer. The parameters $e_{min}$ and $e_{max}$ specify the range of the exponent.

If you look carefully at the output from `floatgui` shown in the previous post you will see a gap around zero. This is especially apparent in the logarithmic plot, because the logarithmic distribution can never reach zero.

Here is output for slightly different parameters, $t = 3$, $e_{min} = -5$, and $e_{max} = 2$. Now, however, the gap around zero has been filled in with a spot of green. Those are the denormals.

Zoom in on the portion of these toy floating point numbers less than one-half. Now you can see the individual green denormals -- there are eight of them in this case.

Denormal floating point numbers are essentially roundoff errors in normalized numbers near the underflow limit, `realmin`, which is $2^{-e_{max}+1}$. They are equally spaced, with a spacing of `eps*realmin`. Zero is naturally included as the smallest denormal.

Suppose that `x` and `y` are two distinct floating point numbers near to, but larger than, `realmin`. It may well be that their difference, `x - y`, is smaller than `realmin`. For example, in the small `floatgui` system pictured above, `eps = 1/8` and `realmin = 1/32`. The quantities `x = 6/128` and `y = 5/128` are between `1/32` and `1/16`, so they are both above underflow. But `x - y = 1/128` underflows to produce one of the green denormals.

Before the IEEE standard, and even on today's systems that do not comply with the standard, underflows would simply be set to zero. So it would be possible to have the MATLAB expression

x == y

be false, while the expression

x - y == 0

be true.

On machines where underflow flushes to zero and division by zero is fatal, this code fragment can produce a division by zero and crash.

if x ~= y
   z = 1/(x-y);
end
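With actual IEEE doubles the same effect is easy to demonstrate from Python's standard library. In this sketch of mine, two normalized numbers just above `realmin` have a difference that is strictly between zero and `realmin`, so the difference is a denormal rather than zero:

```python
# Gradual underflow with IEEE doubles: the difference of two normal
# numbers near realmin is a nonzero denormal.
import sys

realmin = sys.float_info.min      # smallest normalized double, 2^-1022
x = 1.50 * realmin                # both x and y are normalized numbers
y = 1.25 * realmin
d = x - y                         # 2^-1024: below realmin

print(x != y)                     # -> True
print(d == 0)                     # -> False: gradual underflow keeps it nonzero
print(0 < d < realmin)            # -> True: d is a denormal
```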

Of course, denormals can also be produced by multiplications and divisions that produce a result between `eps*realmin` and `realmin`. In decimal these ranges are

format compact
format short e
[eps*realmin realmin]
[eps('single')*realmin('single') realmin('single')]

ans =
  4.9407e-324  2.2251e-308
ans =
   1.4013e-45   1.1755e-38

Denormal floating point numbers are stored without the implicit leading one bit,

$$ x = \pm f \cdot 2^{-e_{max}+1} $$

The *fraction* $f$ satisfies

$$ 0 \leq f < 1 $$

And $f$ is represented in binary using 52 bits for double precision and 23 bits for single precision. Note that zero naturally occurs as a denormal.

When you look at a double precision denormal with `format hex` the situation is fairly clear. The rightmost 13 hex characters are the 52 bits of the fraction. The leading bit is the sign. The other 12 bits of the first three hex characters are all zero because they represent the biased exponent, which is zero because $e_{max}$ and the exponent bias were chosen to complement each other. Here are the two largest and two smallest nonzero double precision denormals.

format hex
[(1-eps); (1-2*eps); 2*eps; eps; 0]*realmin
format long e
ans

ans =
   000fffffffffffff
   000ffffffffffffe
   0000000000000002
   0000000000000001
   0000000000000000
ans =
    2.225073858507201e-308
    2.225073858507200e-308
    9.881312916824931e-324
    4.940656458412465e-324
                         0
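The same hex view can be reproduced with Python's standard library when MATLAB is not at hand (the `hexfloat` helper name is my own):

```python
# Reproduce MATLAB's `format hex` view of a double: the big-endian
# bytes of the IEEE bit pattern, printed as 16 hex characters.
import struct
import sys

def hexfloat(x):
    return struct.pack('>d', x).hex()

realmin = sys.float_info.min
eps = sys.float_info.epsilon
print(hexfloat(realmin))               # -> 0010000000000000
print(hexfloat((1 - eps) * realmin))   # largest denormal -> 000fffffffffffff
print(hexfloat(5e-324))                # smallest denormal -> 0000000000000001
```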

The situation is slightly more complicated with single precision because 23 is not a multiple of four. The fraction and exponent fields of a single precision floating point number -- normal or denormal -- share the bits in the third character of the hex display; the biased exponent gets one bit and the first three bits of the 23 bit fraction get the other three. Here are the two largest and two smallest nonzero single precision denormals.

format hex
e = eps('single');
r = realmin('single');
[(1-e); (1-2*e); 2*e; e; 0]*r
format short e
ans

ans =
   007fffff
   007ffffe
   00000002
   00000001
   00000000
ans =
   1.1755e-38
   1.1755e-38
   2.8026e-45
   1.4013e-45
            0

Think of the situation this way. The normalized floating point numbers immediately to the right of `realmin`, between `realmin` and `2*realmin`, are equally spaced with a spacing of `eps*realmin`. If just as many numbers, with just the same spacing, are placed to the left of `realmin`, they fill in the gap between there and zero. These are the denormals. They require a slightly different format to represent and slightly different hardware to process.
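This spacing argument can be checked numerically. The facts are properties of IEEE arithmetic rather than of MATLAB, so here is a quick sketch in Python (using `math.ulp`, available in Python 3.9 and later):

```python
import math
import sys

# realmin and eps for IEEE double precision
realmin = sys.float_info.min        # 2^-1022
eps = sys.float_info.epsilon        # 2^-52

# The spacing just above realmin is eps*realmin.
assert math.ulp(realmin) == eps * realmin

# The denormals to the left of realmin keep that same spacing,
# so the smallest positive denormal is eps*realmin = 2^-1074.
smallest = eps * realmin
assert smallest == 2.0**-1074

# Halving realmin lands exactly on a denormal, not on zero.
assert 0.0 < realmin / 2 < realmin
```
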

The IEEE Floating Point Committee was formed in 1977 in Silicon Valley. The participants included representatives of semiconductor manufacturers who were developing the chips that were to become the basis for the personal computers that are so familiar today. As I said in my previous post, the committee was a remarkable case of cooperation among competitors.

Velvel Kahan was the most prominent figure in the committee meetings. He was not only a professor of math and computer science at UC Berkeley, he was also a consultant to Intel, involved in the design of their math coprocessor, the 8087. Several of Velvel's students were also involved, not only current students from the campus, but also graduates who were by then working for some of the participating companies.

A proposed standard, written by Kahan, one of his students at Berkeley, Jerome Coonen, and a visiting professor at Berkeley, Harold Stone, known as the KCS draft, reflected the Intel design and was the basis for much of the committee's work.

The committee met frequently, usually in the evening in conference rooms of companies on the San Francisco peninsula. There were also meetings in Austin, Texas, and on the East Coast. The meetings often lasted until well after midnight.

Membership was based on regular attendance. I was personally involved only when I was visiting Stanford, so I was not an official member. But I do remember telling a colleague from Sweden who was coming to the western United States for the first time that there were three sites that he had to be sure to see: the Grand Canyon, Las Vegas, and the IEEE Floating Point Committee.

The denormals in the KCS draft were something new to most of the committee. Velvel said he had experimented with them at the University of Toronto, but that was all. Standards efforts are intended to regularize existing practice, not introduce new designs. Besides, implementing denormals would require additional hardware, and additional transistors were a valuable resource in emerging designs. Some experts claimed that including denormals would slow down all floating point arithmetic.

A mathematician from DEC, Mary Payne, led the opposition to KCS. DEC wanted a looser standard that would embrace the floating point that was already available on their VAX. The VAX format was similar to, but not the same as, the KCS proposal. And it did not include denormals.

Discussion went on for a couple of years. Letters of support for KCS from Don Knuth and Jim Wilkinson did not settle the matter. Finally, DEC engaged G. W. (Pete) Stewart, from the University of Maryland. In what must have been a surprise to DEC, Pete also said he thought that the KCS proposal was a good idea. Eventually the entire committee voted to accept a revised version.

Denormal floating point numbers are still the unwanted step children in today's floating point family. I think it is fair to say that the numerical analysis community has failed to make a strong argument for their importance. It is true that they do make some floating point error analyses more elegant. But with magnitudes around $10^{-308}$ the double precision denormals are hardly ever numerically significant in actual computation. Only the single precision denormals around $10^{-38}$ are potentially important.

Outside of MATLAB itself, we encounter processors that have IEEE format for the floating point numbers, but do not conform to the 754 standard when it comes to processing. These processors usually flush underflow to zero and so we can expect different numerical results for any calculations that might ordinarily produce denormals.

We still see processors today that handle denormals with microcode or software. Execution time of MATLAB programs that encounter denormals can degrade significantly on such processors.

The Wikipedia page on denormals has the macros for setting trap handlers to flush underflows to zero in C or Java programs. I hate to think what might happen to MATLAB Mex files with such macros. Kids, don't try this at home.

William Kahan, Home Page.

Charles Severance, An Interview with the Old Man of Floating-Point, Reminiscences for *IEEE Computer*, February 20, 1998.

Wikipedia page, Denormal number.

Get the MATLAB code

Published with MATLAB® R2014a


This is the first part of a two-part series about the single- and double precision floating point numbers that MATLAB uses for almost all of its arithmetic operations. (This post is adapted from section 1.7 of my book Numerical Computing with MATLAB, published by MathWorks and SIAM.)

If you look carefully at the definitions of fundamental arithmetic operations like addition and multiplication, you soon encounter the mathematical abstraction known as real numbers. But actual computation with real numbers is not very practical because it involves notions like limits and infinities. Instead, MATLAB and most other technical computing environments use floating point arithmetic, which involves a finite set of numbers with finite precision. This leads to the phenomena of *roundoff*, *underflow*, and *overflow*. Most of the time, it is possible to use MATLAB effectively without worrying about these details, but, every once in a while, it pays to know something about the properties and limitations of floating point numbers.

Thirty years ago, the situation was far more complicated than it is today. Each computer had its own floating point number system. Some were binary; some were decimal. There was even a Russian computer that used trinary arithmetic. Among the binary computers, some used 2 as the base; others used 8 or 16. And everybody had a different precision. In 1985, the IEEE Standards Board and the American National Standards Institute adopted the ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic. This was the culmination of almost a decade of work by a 92-person working group of mathematicians, computer scientists, and engineers from universities, computer manufacturers, and microprocessor companies.

A revision of IEEE 754-1985, known as IEEE 754-2008, was published in 2008. The revision combines the binary standards in 754 with the decimal standards in another document, 854. But as far as I can tell, the new features introduced in the revision haven't had much impact yet.

William M. Kahan was the primary architect of IEEE floating point arithmetic. I've always known him by his nickname "Velvel". I met him in the early 1960's when I was a grad student at Stanford and he was a young faculty member at the University of Toronto. He moved from Toronto to UC Berkeley in the late 1960's and spent the rest of his career at Berkeley. He is now an emeritus professor of mathematics and of electrical engineering and computer science.

Velvel is an eminent numerical analyst. Among many other accomplishments, he is the developer along with Gene Golub of the first practical algorithm for computing the singular value decomposition.

In the late 1970's microprocessor manufacturers including Intel, Motorola, and Texas Instruments were developing the chips that were to become the basis for the personal computer and eventually today's electronics. In a remarkable case of cooperation among competitors, they established a committee to develop a common standard for floating point arithmetic.

Velvel was a consultant to Intel, working with John Palmer, a Stanford grad who was the principal architect of the 8087 math coprocessor. The 8087 accompanied the 8086, which was to become the CPU of the IBM PC. Kahan and Palmer convinced Intel to allow them to release the specs for their math coprocessor to the IEEE committee. Those specs became the basis for the standard.

Velvel received the ACM Turing Award in 1989 for "his fundamental contributions to numerical analysis". He was named an ACM Fellow in 1994, and was inducted into the National Academy of Engineering in 2005.

All computers designed since 1985 use IEEE floating point arithmetic. This doesn't mean that they all get exactly the same results, because there is some flexibility within the standard. But it does mean that we now have a machine-independent model of how floating point arithmetic behaves.

MATLAB has traditionally used the IEEE double precision format. In recent years, the single precision format has also been available. This saves space and can sometimes be faster. There is also an extended precision format, which is not precisely specified by the standard and which may be used for intermediate results by the floating point hardware. This is one source of the lack of uniformity among different machines.

Most nonzero floating point numbers are normalized. This means they can be expressed as

$$ x = \pm (1 + f) \cdot 2^e $$

The quantity $f$ is the *fraction* or *mantissa* and $e$ is the *exponent*. The fraction satisfies

$$ 0 \leq f < 1 $$

The fraction $f$ must be representable in binary using at most 52 bits for double precision and 23 bits for single precision. In other words, for double, $2^{52} f$ is an integer in the interval

$$ 0 \leq 2^{52} f < 2^{52} $$

For single, $2^{23} f$ is an integer in the interval

$$ 0 \leq 2^{23} f < 2^{23} $$

For double precision, the exponent $e$ is an integer in the interval

$$ -1022 \leq e \leq 1023 $$

For single, the exponent $e$ is in the interval

$$ -126 \leq e \leq 127 $$

The finiteness of $f$ is a limitation on *precision*. The finiteness of $e$ is a limitation on *range*. Any numbers that don't meet these limitations must be approximated by ones that do.

Double precision floating point numbers are stored in a 64-bit word, with 52 bits for $f$, 11 bits for $e$, and 1 bit for the sign of the number. The sign of $e$ is accommodated by storing $e+1023$, which is between $1$ and $2^{11}-2$. The 2 extreme values for the exponent field, $0$ and $2^{11}-1$, are reserved for exceptional floating point numbers that I will describe later.

Single precision floating point numbers are stored in a 32-bit word, with 23 bits for $f$, 8 bits for $e$, and 1 bit for the sign of the number. The sign of $e$ is accommodated by storing $e+127$, which is between $1$ and $2^{8}-2$. Again, the 2 extreme values for the exponent field are reserved for exceptional floating point numbers.

The entire fractional part of a floating point number is not $f$, but $1+f$. However, the leading $1$ doesn't need to be stored. In effect, the IEEE formats pack $33$ or $65$ bits of information into a $32$ or $64$-bit word.
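The packing can be inspected directly. As an illustration outside MATLAB, the following Python sketch (the helper `fields` is hypothetical, written here just for the example) unpacks a double into its three fields:

```python
import struct

def fields(x):
    # Pack the double big-endian and pull out sign, biased exponent, fraction.
    bits = int.from_bytes(struct.pack('>d', x), 'big')
    sign = bits >> 63
    biased_e = (bits >> 52) & 0x7FF   # 11-bit exponent field, stores e + 1023
    frac = bits & ((1 << 52) - 1)     # 52-bit fraction field, stores 2^52 * f
    return sign, biased_e, frac

# 1.0 = +(1 + 0) * 2^0, so the stored exponent field is 0 + 1023
assert fields(1.0) == (0, 1023, 0)

# 6.5 = (1 + 0.625) * 2^2, stored exponent 2 + 1023 = 1025, sign bit set
sign, be, frac = fields(-6.5)
assert sign == 1 and be == 1025
```
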

My program `floatgui`, available here, shows the distribution of the positive numbers in a model floating point system with variable parameters. The parameter $t$ specifies the number of bits used to store $f$. In other words, $2^t f$ is an integer. The parameters $e_{min}$ and $e_{max}$ specify the range of the exponent, so $e_{min} \leq e \leq e_{max}$. Initially, `floatgui` sets $t = 3$, $e_{min} = -4$, and $e_{max} = 2$ and produces the following distribution.

Within each binary interval $2^e \leq x \leq 2^{e+1}$, the numbers are equally spaced with an increment of $2^{e-t}$. If $e = 0$ and $t = 3$, for example, the spacing of the numbers between $1$ and $2$ is $1/8$. As $e$ increases, the spacing increases.

It is also instructive to display the floating point numbers with a logarithmic scale. Here is the output from `floatgui` with `logscale` checked and $t = 5$, $e_{min} = -4$, and $e_{max} = 3$. With this logarithmic scale, it is more apparent that the distribution in each binary interval is the same.

A very important quantity associated with floating point arithmetic is highlighted in red by `floatgui`. MATLAB calls this quantity `eps`, which is short for *machine epsilon*.

`eps` is the distance from $1$ to the next larger floating point number.

For the `floatgui` model floating point system, `eps` = $2^{-t}$.

Before the IEEE standard, different machines had different values of `eps`. Now, for IEEE double precision the approximate decimal value of `eps` is

```
format short
format compact
eps_d = eps
```

eps_d = 2.2204e-16

And for single precision,

```
eps_s = eps('single')
```

eps_s = 1.1921e-07

Either `eps/2` or `eps` can be called the roundoff level. The maximum relative error incurred when the result of an arithmetic operation is rounded to the nearest floating point number is `eps/2`. The maximum relative spacing between numbers is `eps`. In either case, you can say that the roundoff level is about 16 decimal digits for double and about 7 decimal digits for single.

Actually `eps` is a function. `eps(x)` is the distance from `abs(x)` to the next larger in magnitude floating point number of the same precision as `x`.
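Outside MATLAB, Python's `math.ulp` (Python 3.9 and later) plays the same role as the function form of `eps`; a small sketch:

```python
import math

# ulp(1.0) is the gap from 1 to the next double: 2^-52, MATLAB's eps
assert math.ulp(1.0) == 2.0**-52

# The gap doubles with each binade: the spacing near 2 is 2^-51,
# and near 16 it is 16 times the spacing near 1.
assert math.ulp(2.0) == 2.0**-51
assert math.ulp(16.0) == 16 * 2.0**-52
```
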

A frequently occurring example is provided by the simple MATLAB statement

t = 0.1

The mathematical value $t$ stored in `t` is not exactly $0.1$ because expressing the decimal fraction $1/10$ in binary requires an infinite series. In fact,

$$ \frac{1}{10} = \frac{1}{2^4} + \frac{1}{2^5} + \frac{0}{2^6} + \frac{0}{2^7} + \frac{1}{2^8} + \frac{1}{2^9} + \frac{0}{2^{10}} + \frac{0}{2^{11}} + \frac{1}{2^{12}} + \cdots $$

After the first term, the sequence of coefficients $1, 0, 0, 1$ is repeated infinitely often. Grouping the resulting terms together four at a time expresses $1/10$ in a base 16, or *hexadecimal*, series.

$$ \frac{1}{10} = 2^{-4} \cdot \left(1 + \frac{9}{16} + \frac{9}{16^2} + \frac{9}{16^3} + \frac{9}{16^{4}} + \cdots\right) $$

Floating-point numbers on either side of $1/10$ are obtained by terminating the fractional part of this series after $52$ binary terms, or 13 hexadecimal terms, and rounding the last term up or down. Thus

$$ t_1 < 1/10 < t_2 $$

where

$$ t_1 = 2^{-4} \cdot \left(1 + \frac{9}{16} + \frac{9}{16^2} + \frac{9}{16^3} + \cdots + \frac{9}{16^{12}} + \frac{9}{16^{13}}\right) $$

$$ t_2 = 2^{-4} \cdot \left(1 + \frac{9}{16} + \frac{9}{16^2} + \frac{9}{16^3} + \cdots + \frac{9}{16^{12}} + \frac{10}{16^{13}}\right) $$

It turns out that $1/10$ is closer to $t_2$ than to $t_1$, so $t$ is equal to $t_2$. In other words,

$$ t = (1 + f) \cdot 2^e $$

where

$$ f = \frac{9}{16} + \frac{9}{16^2} + \frac{9}{16^3} + \cdots + \frac{9}{16^{12}} + \frac{10}{16^{13}} $$

$$ e = -4 $$

The MATLAB command

```
format hex
```

causes `t` to be displayed as

3fb999999999999a

The characters `a` through `f` represent the hexadecimal "digits" 10 through 15. The first three characters, `3fb`, give the hexadecimal representation of decimal $1019$, which is the value of the biased exponent $e+1023$ if $e$ is $-4$. The other 13 characters are the hexadecimal representation of the fraction $f$.
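The same sixteen hex characters can be produced outside MATLAB; for instance, in Python:

```python
import struct

# The 64 bits of 0.1, shown as 16 hex characters (big-endian)
hex64 = struct.pack('>d', 0.1).hex()
assert hex64 == '3fb999999999999a'

# float.hex shows the same number as 1.f * 2^e:
# fraction 999999999999a (rounded up in the last place), exponent -4
assert float.hex(0.1) == '0x1.999999999999ap-4'
```
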

In summary, the value stored in `t` is very close to, but not exactly equal to, $0.1$. The distinction is occasionally important. For example, the quantity

0.3/0.1

is not exactly equal to $3$ because the actual numerator is a little less than $0.3$ and the actual denominator is a little greater than $0.1$.

Ten steps of length `t` are not precisely the same as one step of length 1. MATLAB is careful to arrange that the last element of the vector

0:0.1:1

is exactly equal to $1$, but if you form this vector yourself by repeated additions of $0.1$, you will miss hitting the final $1$ exactly.
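Both of these effects are easy to reproduce; here is a Python sketch of the same two experiments:

```python
# 0.3/0.1 misses 3 because neither operand is stored exactly
q = 0.3 / 0.1
assert q != 3.0

# Ten repeated additions of 0.1 miss 1.0 ...
s = 0.0
for _ in range(10):
    s += 0.1
assert s != 1.0

# ... but only by a roundoff-sized amount
assert abs(s - 1.0) < 1e-15
```
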

What does the floating point approximation to the golden ratio, $\phi$, look like?

```
format hex
phi = (1 + sqrt(5))/2
```

phi = 3ff9e3779b97f4a8

The first hex digit, `3`, is `0011` in binary. The first bit is the sign of the floating point number; `0` is positive, `1` is negative. So `phi` is positive. The remaining bits of the first three hex digits contain $e+1023$. In this example, `3ff` in base 16 is $3 \cdot 16^2 + 15 \cdot 16 + 15 = 1023$ in decimal. So

$$ e = 0 $$

In fact, any floating point number between $1.0$ and $2.0$ has $e = 0$, so its `hex` output begins with `3ff`. The other 13 hex digits contain $f$. In this example,

$$ f = \frac{9}{16} + \frac{14}{16^2} + \frac{3}{16^3} + \cdots + \frac{10}{16^{12}} + \frac{8}{16^{13}} $$

With these values of $f$ and $e$,

$$ (1 + f) 2^e \approx \phi $$
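As a check, the fraction can be read back out of the hex display and reassembled; a Python sketch (the 13 hex digits are the ones shown above):

```python
import math

# The 52-bit fraction from the hex display of phi
f = int('9e3779b97f4a8', 16) / 2**52   # exact: f has at most 52 bits

# With e = 0, (1 + f) * 2^0 reproduces phi bit for bit
phi = (1 + math.sqrt(5)) / 2
assert 1.0 + f == phi
```
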

Another example is provided by the following code segment.

```
format long
a = 4/3
b = a - 1
c = 3*b
e = 1 - c
```

```
a = 1.333333333333333
b = 0.333333333333333
c = 1.000000000000000
e = 2.220446049250313e-16
```

With exact computation, `e` would be zero. But with floating point, we obtain double precision `eps`. Why?

It turns out that the only roundoff occurs in the division in the first statement. The quotient cannot be exactly $4/3$, except on that Russian trinary computer. Consequently the value stored in `a` is close to, but not exactly equal to, $4/3$. The subtraction `b = a - 1` produces a quantity `b` whose last bit is 0. This means that the multiplication `3*b` can be done without any roundoff. The value stored in `c` is not exactly equal to $1$, and so the value stored in `e` is not 0. Before the IEEE standard, this code was used as a quick way to estimate the roundoff level on various computers.
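Because the four statements use only IEEE operations, the experiment carries over verbatim to other languages; a Python sketch:

```python
# The same four statements, in Python double precision
a = 4 / 3          # the only rounded operation
b = a - 1          # exact: a and 1 are within a factor of 2 of each other
c = 3 * b          # exact, since the last bit of b is 0
e = 1 - c          # exact

# The result is exactly eps = 2^-52
assert e == 2.0**-52
```
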

The roundoff level `eps` is sometimes called "floating point zero," but that's a misnomer. There are many floating point numbers much smaller than `eps`. The smallest positive normalized floating point number has $f = 0$ and $e = -1022$. The largest floating point number has $f$ a little less than $1$ and $e = 1023$. MATLAB calls these numbers `realmin` and `realmax`. Together with `eps`, they characterize the standard system.

```
Double Precision    Binary            Decimal
eps                 2^(-52)           2.2204e-16
realmin             2^(-1022)         2.2251e-308
realmax             (2-eps)*2^1023    1.7977e+308
```

```
Single Precision    Binary            Decimal
eps                 2^(-23)           1.1921e-07
realmin             2^(-126)          1.1755e-38
realmax             (2-eps)*2^127     3.4028e+38
```

If any computation tries to produce a value larger than `realmax`, it is said to *overflow*. The result is an exceptional floating point value called *infinity* or `Inf`. It is represented by taking $e = 1024$ and $f = 0$ and satisfies relations like `1/Inf = 0` and `Inf+Inf = Inf`.

If any computation tries to produce a value that is undefined even in the real number system, the result is an exceptional value known as Not-a-Number, or `NaN`. Examples include `0/0` and `Inf-Inf`. `NaN` is represented by taking $e = 1024$ and $f$ nonzero.
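These exceptional values and their bit patterns can be examined directly; a Python sketch:

```python
import math
import struct

inf = math.inf

# Inf has biased exponent all ones (e = 1024) and f = 0
assert struct.pack('>d', inf).hex() == '7ff0000000000000'

# It obeys the relations in the text
assert 1 / inf == 0.0
assert inf + inf == inf

# Undefined operations yield NaN: e = 1024 with f nonzero
nan = inf - inf
assert math.isnan(nan)
# exponent field is all ones; the sign bit of a NaN may vary by platform
assert struct.pack('>d', nan).hex()[:3] in ('7ff', 'fff')
```
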

If any computation tries to produce a value smaller than `realmin`, it is said to *underflow*. This involves one of the most controversial aspects of the IEEE standard. It produces exceptional *denormal* or *subnormal* floating point numbers in the interval between `eps*realmin` and `realmin`. The smallest positive subnormal numbers in double and single precision are

```
format short
smallest_d = eps*realmin
smallest_s = eps('single')*realmin('single')
```

```
smallest_d = 4.9407e-324
smallest_s = 1.4013e-45
```

Any results smaller than this are set to 0.
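A sketch of underflow in action, in Python, where `sys.float_info` exposes the same constants as `eps` and `realmin`:

```python
import sys

# The smallest positive denormal: eps * realmin = 2^-1074
smallest = sys.float_info.epsilon * sys.float_info.min
assert smallest == 2.0**-1074

# Halving it underflows: the result rounds to 0
assert smallest / 2 == 0.0

# Multiplying back cannot recover the lost value
assert (smallest / 2) * 2 == 0.0
```
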

The controversy surrounding underflow and denormals will be the subject of my next blog post.



The nineteenth Householder Symposium, Householder XIX, was held June 8-13 at Sol Cress, a conference center near Spa, Belgium. If you have been following either the web or the newsletter edition of Cleve's Corner you know that the Gatlinburg/Householder series of conferences has played an important role in both my professional life and the history of MATLAB. I attended what turned out to be the third conference in the series, in Gatlinburg, Tennessee, when I was a graduate student in 1964. I have been to all 17 of the conferences that have been held since 1964. Here is a link to my News and Notes article about the Gatlinburg/Householder conferences.

The Householder Symposium has been organized over the years by an evolving volunteer committee. The symposium is not affiliated with any professional society. The organizing committee for Householder XIX was:

- Ilse Ipsen (chair), North Carolina State University, USA
- Jim Demmel, University of California at Berkeley, USA
- Alan Edelman, Massachusetts Institute of Technology, USA
- Heike Fassbender, Technical University of Braunschweig, Germany
- Volker Mehrmann, Technical University of Berlin, Germany
- Jim Nagy, Emory University, USA
- Yousef Saad, University of Minnesota, USA
- Valeria Simoncini, University of Bologna, Italy
- Zdenek Strakos, Charles University in Prague, Czech Republic
- Andy Wathen, Oxford University, UK

The local organizers in Belgium were:

- Paul Van Dooren (chair), Universite catholique de Louvain
- Pierre-Antoine Absil, Universite catholique de Louvain
- Karl Meerbergen, Katholieke Universiteit Leuven
- Annick Sartenaer, Universite de Namur
- Marc Van Barel, Katholieke Universiteit Leuven
- Sabine Van Huffel, Katholieke Universiteit Leuven

In order to facilitate interaction, attendance is limited. The organizing committee accepts applications and evaluates resumes. This time only about half of the 300 applicants were selected. This has been a controversial aspect of the conference, but I am in favor of limiting the number of participants.

A little over 100 of the attendees made it to the group photo. A few are wearing neckties and their best dresses in the photo because the conference banquet was scheduled shortly thereafter.

A list of the attendees is available on the symposium web site.

Everybody who attended the conference had an opportunity to give a talk or present a poster during the five day conference. There were a total of 26 half-hour plenary talks. Each day there were two sets of three parallel sessions of less formal 20 minute talks. A couple of days had late afternoon or evening poster sessions.

The complete program is available here. The overall theme is numerical linear algebra. Applications include partial differential equations, image processing, control theory, graph theory, model reduction, and gyroscopes.

The Householder Prize is given for the best PhD dissertation written during the three-year period preceding the meeting. A list of the winners of the prize in previous years is available here. The committee judging the dissertations for the prize this time was:

- Volker Mehrmann, Technical University of Berlin, Germany (chair)
- Michele Benzi, Emory University, USA
- Inderjit Dhillon, UT Austin, USA
- Howard Elman, University of Maryland, USA
- Francoise Tisseur, University of Manchester, UK
- Stephen Vavasis, University of Waterloo, Canada

Volker reported that 17 theses from 9 countries were submitted. From these, a short list of six finalists was selected:

- Grey Ballard, UC Berkeley
- Saifon Chaturantabut, Rice University
- Cedric Effenberger, EPF Lausanne
- Nicolas Gillis, UC Louvain
- Agnieszka Miedlar, TU Berlin
- Yuji Nakatsukasa, UC Davis

From this short list, the two winners of the 2014 Householder Prize were announced:

- Nicolas Gillis, UC Louvain
- Yuji Nakatsukasa, UC Davis

Nicolas is now an Assistant Professor at Universite de Mons in Belgium and Yuji is an Assistant Professor at the University of Tokyo.

The two winners gave talks about their thesis work in the last days of the meeting. They shared prize money of a little over $2000 obtained by passing a hat at the banquet dinner during Householder XVIII three years ago in Lake Tahoe. A hat was passed again at this meeting to fund the next Householder Prize. The location for the next Householder meeting has not yet been decided.

Traditionally the Wednesday afternoon of the Householder meeting is devoted to an excursion of some kind. The XIX excursion was to "Notre-Dame du Val-Dieu", an abbey founded in 1216 by Cistercian monks. There are no longer any monks living at the abbey -- it is now best known as a brewery. Their beer is Val-Dieu. We toured the abbey and the tiny brewery, which is operated by just five people. We then had the opportunity to taste three different beers and their wonderful cheese. Belgian abbey beers have about nine percent alcohol. It was a very pleasant afternoon.



Stiffness is a subtle concept that plays an important role in assessing the effectiveness of numerical methods for ordinary differential equations. (This article is adapted from section 7.9, "Stiffness", in *Numerical Computing with MATLAB*.)

Stiffness is a subtle, difficult, and important concept in the numerical solution of ordinary differential equations. It depends on the differential equation, the initial conditions, and the numerical method. Dictionary definitions of the word "stiff" involve terms like "not easily bent," "rigid," and "stubborn." We are concerned with a computational version of these properties.

A problem is stiff if the solution being sought varies slowly, but there are nearby solutions that vary rapidly, so the numerical method must take small steps to obtain satisfactory results.

Stiffness is an efficiency issue. If we weren't concerned with how much time a computation takes, we wouldn't be concerned about stiffness. Nonstiff methods can solve stiff problems; they just take a long time to do it.

Stiff ode solvers do more work per step, but take bigger steps. And we're not talking about modest differences here. For truly stiff problems, a stiff solver can be orders of magnitude more efficient, while still achieving a given accuracy.

A model of flame propagation provides an example. I learned about this example from Larry Shampine, one of the authors of the MATLAB ordinary differential equation suite. If you light a match, the ball of flame grows rapidly until it reaches a critical size. Then it remains at that size because the amount of oxygen being consumed by the combustion in the interior of the ball balances the amount available through the surface. The simple model is

$$ \dot{y} = y^2 - y^3 $$

$$ y(0) = \delta $$

$$ 0 \leq t \leq {2 \over \delta} $$.

The scalar variable $y(t)$ represents the radius of the ball. The $y^2$ and $y^3$ terms come from the surface area and the volume. The critical parameter is the initial radius, $\delta$, which is "small." We seek the solution over a length of time that is inversely proportional to $\delta$. You can see a dramatization by downloading `flame.m` from <http://www.mathworks.com/moler/ncmfilelist.html>.

At this point, I suggest that you start up MATLAB and actually run our examples. It is worthwhile to see them in action. I will start with `ode45`, the workhorse of the MATLAB ode suite. If $\delta$ is not very small, the problem is not very stiff. Try $\delta$ = 0.02 and request a relative error of $10^{-4}$.

```
delta = 0.02;
F = @(t,y) y^2 - y^3;
opts = odeset('RelTol',1.e-4);
ode45(F,[0 2/delta],delta,opts);
```

With no output arguments, `ode45` automatically plots the solution as it is computed. You should get a plot of a solution that starts at $y$ = 0.02, grows at a modestly increasing rate until $t$ approaches 50, which is $1/\delta$, then grows rapidly until it reaches a value close to 1, where it remains.

Now let's see stiffness in action. Decrease $\delta$ by more than two orders of magnitude. (If you run only one example, run this one.)

```
delta = 0.0001;
ode45(F,[0 2/delta],delta,opts);
```

You should see something like this plot, although it will take several seconds to complete the plot. If you get tired of watching the agonizing progress, click the stop button in the lower left corner of the window. Turn on zoom, and use the mouse to explore the solution near where it first approaches steady state. You should see something like the detail obtained with this `axis` command.

```
axis([.995e4 1.03e4 0.9998 1.0002])
last_plot = getframe(gcf);
```

Notice that `ode45` is doing its job. It's keeping the solution within $10^{-4}$ of its nearly constant steady state value. But it certainly has to work hard to do it. If you want an even more dramatic demonstration of stiffness, decrease $\delta$ or decrease the tolerance to $10^{-5}$ or $10^{-6}$.

This problem is not stiff initially. It only becomes stiff as the solution approaches steady state. This is because the steady state solution is so "rigid." Any solution near $y(t) = 1$ increases or decreases rapidly toward that solution. (I should point out that "rapidly" here is with respect to an unusually long time scale.)
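The rigidity has a simple numerical signature. Near $y = 1$ the model behaves like $\dot{y} = -(y-1)$, since the derivative of $y^2 - y^3$ at $y = 1$ is $-1$, so a fixed-step forward Euler method multiplies the deviation from 1 by $1 - h$ at each step and is unstable for $h > 2$. Here is a Python sketch of that instability (plain fixed-step Euler, not the adaptive `ode45` algorithm itself):

```python
def f(y):
    # flame model right-hand side
    return y**2 - y**3

def forward_euler_dev(h, steps, y0=1.01):
    # largest deviation from the steady state y = 1 along the way
    y, dev = y0, abs(y0 - 1)
    for _ in range(steps):
        y = y + h * f(y)
        dev = max(dev, abs(y - 1))
        if dev > 10:          # stop once divergence is obvious
            break
    return dev

# h = 0.5 is inside the stability limit: the deviation shrinks
assert forward_euler_dev(0.5, 40) <= 0.011

# h = 2.5 is outside it: the deviation grows
assert forward_euler_dev(2.5, 40) > 0.1
```
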

What can be done about stiff problems? You don't want to change the differential equation or the initial conditions, so you have to change the numerical method. Methods intended to solve stiff problems efficiently do more work per step, but can take much bigger steps. Stiff methods are *implicit*. At each step they use MATLAB matrix operations to solve a system of simultaneous linear equations that helps predict the evolution of the solution. For our flame example, the matrix is only 1 by 1, but even here, stiff methods do more work per step than nonstiff methods.

Let's compute the solution to our flame example again, this time with `ode23s`. The " *s* " in the name is for "stiff."

ode23s(F,[0 2/delta],delta,opts);

Here is the zoom detail.

axis([.995e4 1.03e4 0.9998 1.0002])

You can see that `ode23s` takes many fewer steps than `ode45`. This is actually an easy problem for a stiff solver. In fact, `ode23s` takes only 99 steps and uses just 412 function evaluations, while `ode45` takes 3040 steps and uses 20179 function evaluations. Stiffness even affects graphical output. The print files for the `ode45` figures are much larger than those for the `ode23s` figures.

Imagine you are returning from a hike in the mountains. You are in a narrow canyon with steep slopes on either side. An explicit algorithm would sample the local gradient to find the descent direction. But following the gradient on either side of the trail will send you bouncing back and forth across the canyon, as with `ode45`. You will eventually get home, but it will be long after dark before you arrive. An implicit algorithm would have you keep your eyes on the trail and anticipate where each step is taking you. It is well worth the extra concentration.

All numerical methods for stiff odes are *implicit*. The simplest example is the *backward Euler* method, which involves finding the unknown $v$ in

$$ v = y_n + h f(t_{n+1},v) $$

and then setting $y_{n+1}$ equal to that $v$. This is usually a nonlinear system of equations whose solution requires at least an approximation to the Jacobian, the matrix of partial derivatives

$$ J = {\partial F \over \partial y} $$

By default the partial derivatives in the Jacobian are computed by finite differences. This can be quite costly in terms of function evaluations. If a procedure for computing the Jacobian is available, it can be provided. Or, if the sparsity pattern is known, it can be specified. The blocks in a Simulink diagram, for example, are only sparsely connected to each other. Specifying a sparse Jacobian initiates sparse linear equation solving.

Newton's method for computing the $v$ in the backward Euler method is an iteration. Start perhaps with $v^0 = y_n$. Then, for $k = 0,1,...$, solve the linear system of equations

$$ (I - hJ) u^k = y_n + h f(t_{n+1},v^k) - v^k $$

for the correction $u^k$ . Set

$$ v^{k+1} = v^k + u^k $$

When you are satisfied that the $v^k$ have converged, let $y_{n+1}$ be the limit.
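Here is what that looks like for the scalar flame equation $\dot{y} = y^2 - y^3$ from the beginning of this post. This is only a sketch in Python, with an analytic Jacobian and a crude fixed step size, not what `ode15s` or `ode23s` actually do:

```python
def backward_euler_step(f, dfdy, t1, yn, h, tol=1e-12, maxit=20):
    # Solve v = yn + h*f(t1, v) for v by Newton's method,
    # starting the iteration at the previous value yn.
    v = yn
    for _ in range(maxit):
        r = yn + h*f(t1, v) - v          # residual
        u = r / (1.0 - h*dfdy(t1, v))    # correction from (1 - hJ) u = r
        v = v + u
        if abs(u) <= tol:
            break
    return v

# Flame equation and its Jacobian (a 1-by-1 "matrix" here).
f = lambda t, y: y**2 - y**3
dfdy = lambda t, y: 2*y - 3*y**2

delta = 0.01
y, h = delta, 1.0
for n in range(200):                     # integrate out to t = 200
    y = backward_euler_step(f, dfdy, (n + 1)*h, y, h)
# After ignition near t = 1/delta = 100, y settles at the steady state 1.
```

Even with a step size of 1, far larger than an explicit method could tolerate after ignition, the implicit step marches straight to the steady state.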

Stiff ODE solvers may not use Newton's method itself, but whatever method is used to find the solution, $y_{n+1}$, at the forward time step can ultimately be traced back to Newton's method.

`ode15s` employs two variants of a method that is quite different from the single step methods that I've described so far in this series on ode solvers. Linear multistep methods save solution values from several time steps and use all of them to advance to the next step.

Actually, `ode15s` can be compared to the other multistep method in the suite, `ode113`. One saves values of the solution, $y_n$, while the other saves values of the function, $F(t_n,y_n)$. One includes the unknown value at the new time step $y_{n+1}$ in the formulation, thereby making it implicit, while the other does not. Both methods can vary the order as well as the step size. As their names indicate, `ode15s` allows the order to vary between 1 and 5, while `ode113` allows the order to vary between 1 and 13.

A property specified via `odeset` switches `ode15s` between two variants of a linear multistep method, BDF, Backward Differentiation Formulas, and NDF, Numerical Differentiation Formulas. BDFs were introduced for stiff odes in 1971 by Bill Gear. Gear's student, Linda Petzold, extended the ideas to DAEs, differential-algebraic equations, and produced DASSL, software whose successors are still in widespread use today. NDFs, which are the default for `ode15s`, include an additional term in the memory and consequently can take larger steps with the same accuracy, especially at lower order.

`ode23s` is a single step, implicit method described in the paper by Shampine and Reichelt referenced below. It uses a second order formula to advance the step and a third order formula to estimate the error. It recomputes the Jacobian with each step, thereby making it quite expensive in terms of function evaluations. If you can supply an analytic Jacobian then `ode23s` is a competitive choice.

`ode23t` and `ode23tb` are implicit methods based on the trapezoid rule and the second order BDF. The origins of the methods go back to the 1985 paper referenced below by a group at the old Bell Labs working on electronic device and circuit simulation. Mike Hosea and Larry Shampine made extensive modifications and improvements described in their 1996 paper when they implemented the methods in MATLAB.

Stiff ODE solvers are not actually using MATLAB's iconic backslash operator on a full system of linear equations, but they are using its component parts, LU decomposition and solution of the resulting triangular systems.

Let's look at the statistics generated by `ode23s` when it solves the flame problem. We'll run it again, avoiding the plot by asking for output, but then ignoring the output and just looking at the stats.

opts = odeset('stats','on','reltol',1.e-4);
[~,~] = ode23s(F,[0 2/delta],delta,opts);

99 successful steps
7 failed attempts
412 function evaluations
99 partial derivatives
106 LU decompositions
318 solutions of linear systems

We see that at every step `ode23s` is computing a Jacobian, finding the LU decomposition of a matrix involving that Jacobian, and then using L and U to solve three linear systems.

Now how about the primary stiff solver, `ode15s`.

[~,~] = ode15s(F,[0 2/delta],delta,opts);

140 successful steps
39 failed attempts
347 function evaluations
2 partial derivatives
53 LU decompositions
342 solutions of linear systems

We see that `ode15s` takes more steps than `ode23s`, but requires only two Jacobians. It does only half as many LU decompositions, but then uses each LU decomposition for twice as many linear equation solutions.

We certainly can't draw any conclusions about the relative merits of these two solvers from this one example, especially since the Jacobian in this case is only a 1-by-1 matrix.

Cleve Moler, *Numerical Computing with MATLAB*, Electronic Edition, MathWorks, <http://www.mathworks.com/moler/index_ncm.html>; Print Edition, SIAM Revised Reprint, SIAM, 2008, 334 pp., <http://www.ec-securehost.com/SIAM/ot87.html>.

Lawrence F. Shampine and Mark W. Reichelt, "The MATLAB ODE Suite", SIAM Journal on Scientific Computing, 18 (1997), pp.1-22, <http://www.mathworks.com/help/pdf_doc/otherdocs/ode_suite.pdf>

Lawrence F. Shampine, Mark W. Reichelt, and Jacek A. Kierzenka, "Solving Index-1 DAEs in MATLAB and Simulink", SIAM Review, 41 (1999), pp. 538-552. <http://epubs.siam.org/doi/abs/10.1137/S003614459933425X>

M. E. Hosea and L. F. Shampine, "Analysis and Implementation of TR-BDF2", Applied Numerical Mathematics, 20 (1996), pp. 21-37, <http://www.sciencedirect.com/science/article/pii/0168927495001158>

R. E. Bank, W. M. Coughran, Jr., W. Fichtner, E. H. Grosse, D. J. Rose, and R. K. Smith, "Transient simulation of silicon devices and circuits", IEEE Transactions on Computer-Aided Design, CAD-4 (1985), pp. 436-451.

I want to repeat this plot because it represents the post on the lead-in page at MATLAB Central.

imshow(last_plot.cdata,'border','tight')

Get the MATLAB code

Published with MATLAB® R2014a

The functions `ode23` and `ode45` are the principal MATLAB and Simulink tools for solving nonstiff ordinary differential equations.

The two functions `ode23` and `ode45` are single step ODE solvers. They are also known as Runge-Kutta methods. Each step is almost independent of the previous steps. Two important pieces of information are passed from one step to the next. The step size $h$ expected to achieve a desired accuracy is passed from step to step. And, in a strategy known as FSAL, for First Same as Last, the final function value at the end of a successful step is used as the initial function value at the following step.

The BS23 algorithm is due to Larry Shampine and his student Przemyslaw Bogacki. Przemyslaw is now a Professor at Old Dominion University. The " *23* " in the function name indicates that two simultaneous single-step formulas, one of second order and one of third order, are involved.

The method has three stages, but there are four slopes $s_i$ because, after the first step, the $s_1$ for one step is the $s_4$ from the previous step. The essentials are

$$ s_1 = f(t_n,y_n) $$

$$ s_2 = f\left(t_n+{h \over 2},y_n+{h \over 2} s_1\right) $$

$$ s_3 = f\left(t_n+{3 \over 4} h, y_n+{3 \over 4} h s_2\right) $$

$$ t_{n+1} = t_n + h $$

$$ y_{n+1} = y_n + {h \over 9} (2 s_1 + 3 s_2 + 4 s_3) $$

$$ s_4 = f(t_{n+1},y_{n+1}) $$

$$ e_{n+1} = {h \over 72} (-5 s_1 + 6 s_2 + 8 s_3 - 9 s_4) $$
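These formulas translate almost line for line into code. Here is a sketch of a single step (in Python for brevity; the names are mine, not those in `ode23`):

```python
def bs23_step(f, tn, yn, h):
    # One step of the Bogacki-Shampine (2,3) pair, following the
    # formulas above.  Returns t_{n+1}, y_{n+1}, the reusable slope
    # s4 (FSAL), and the error estimate e_{n+1}.
    s1 = f(tn, yn)
    s2 = f(tn + h/2, yn + h/2*s1)
    s3 = f(tn + 3*h/4, yn + 3*h/4*s2)
    tnp1 = tn + h
    ynp1 = yn + h/9*(2*s1 + 3*s2 + 4*s3)
    s4 = f(tnp1, ynp1)
    enp1 = h/72*(-5*s1 + 6*s2 + 8*s3 - 9*s4)
    return tnp1, ynp1, s4, enp1

# dy/dt = 2t has solution t^2; both formulas in the pair are exact
# for it, so the step is exact and the error estimate vanishes.
t1, y1, s4, e1 = bs23_step(lambda t, y: 2*t, 1.0, 1.0, 0.5)
# dy/dt = 3t^2 has solution t^3; the third order formula is still
# exact, but the embedded second order formula is not, so the
# error estimate is nonzero even though y is exact.
t2, y2, _, e2 = bs23_step(lambda t, y: 3*t**2, 1.0, 1.0, 0.5)
```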

Here is a graphical view of the steps.

ode23steps

These graphics show the starting situation and the three stages. We start at a point $(t_n,y_n)$ with an initial slope $s_1 = f(t_n,y_n)$ and an estimate of a good step size, $h$. Our goal is to compute an approximate solution $y_{n+1}$ at $t_{n+1} = t_n+h$ that agrees with the true solution $y(t_{n+1})$ to within the specified tolerances.

The first stage uses the initial slope $s_1$ to take an Euler step halfway across the interval. The function is evaluated there to get the second slope, $s_2$. This slope is used to take an Euler step three-quarters of the way across the interval. The function is evaluated again to get the third slope, $s_3$. A weighted average of the three slopes is used to step all the way across the interval to get a tentative value for $y_{n+1}$.

$$ s = {1 \over 9} (2 s_1 + 3 s_2 + 4 s_3) $$

The function is evaluated once more to get $s_4$. The error estimate then uses all four slopes.

$$ e_{n+1} = {h \over 72} (-5 s_1 + 6 s_2 + 8 s_3 - 9 s_4) $$

If the error is within the specified tolerance, the step is successful, the tentative value of $y_{n+1}$ is accepted, and $s_4$ becomes the $s_1$ of the next step. If the error is too large, then the tentative $y_{n+1}$ is rejected and the step must be redone. In either case, the error estimate $e_{n+1}$ provides the basis for determining the step size $h$ for the next step.
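The essence of that step size determination, for a method with a second order error estimate, is a one-third power law. This sketch (in Python; the safety factor and growth cap are illustrative choices, not the exact values used by `ode23`) shows the idea:

```python
def next_step_size(h, err, tol, p=1/3, safety=0.8, max_growth=5.0):
    # The error of the second order estimate scales like h^3, so
    # h*(tol/err)^(1/3) would just meet the tolerance.  A safety
    # factor and a cap on growth make the controller robust.
    # (The real ode23 controller differs in its details.)
    if err == 0.0:
        return h * max_growth
    return h * min(max_growth, safety * (tol / err)**p)

# An error 8x over tolerance cuts the step in half, times the safety factor.
h_fail = next_step_size(1.0, 8e-3, 1e-3)
# A tiny error lets the step grow, but only up to the cap.
h_grow = next_step_size(1.0, 1e-9, 1e-3)
```

The same rule is applied whether the step succeeded or failed; only the starting point for the next attempt differs.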

Notice that $y_{n+1}$ is computed before $s_4$. The final function evaluation is used for the error estimate, but not for the step itself. But this $s_4$ is the $s_1$ in the next step.

The coefficients in the error estimate $e_{n+1}$ result from the difference between the third order formula that is used to compute $y_{n+1}$ and an independent second order formula that involves the same function values. That second order result is not actually computed because its value is not needed.

All the solvers in the suite provide an interpolant that can generate values approximating the solution to the differential equation to the desired accuracy anywhere in the interval without requiring further evaluation of the function defining the ode. In the case of `ode23` this interpolant happens to be "free". It turns out that the Hermite cubic polynomial defined by the four values

$$ y_n, \ F(t_n,y_n), \ y_{n+1}, \ F(t_{n+1}, \ y_{n+1}) $$

does the job. For other solvers in the suite, providing the accompanying interpolant is an important aspect of the algorithm derivation.
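To see how those four values determine the interpolant, here is a sketch (in Python) of the cubic Hermite polynomial that matches the solution value and slope at both ends of the step:

```python
def hermite_interp(tn, yn, fn, tnp1, ynp1, fnp1, t):
    # Cubic Hermite interpolant through (tn, yn) and (tnp1, ynp1)
    # with slopes fn and fnp1 at the two ends.
    h = tnp1 - tn
    s = (t - tn)/h                      # normalized position in [0, 1]
    h00 = (1 + 2*s)*(1 - s)**2          # standard Hermite basis
    h10 = s*(1 - s)**2
    h01 = s**2*(3 - 2*s)
    h11 = s**2*(s - 1)
    return h00*yn + h10*h*fn + h01*ynp1 + h11*h*fnp1

# Data sampled from y = t^3 (so f = 3t^2); since the solution is
# itself a cubic, the interpolant reproduces it exactly in between.
y_mid = hermite_interp(1.0, 1.0, 3.0, 2.0, 8.0, 12.0, 1.5)
```

The two function values $F(t_n,y_n)$ and $F(t_{n+1},y_{n+1})$ are the $s_1$ and $s_4$ already computed during the step, which is why the interpolant costs nothing extra.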

Before today's version of `ode45`, there was an earlier one. In a 1969 NASA report, Erwin Fehlberg introduced a so-called six stage Runge-Kutta method that requires six function evaluations per step. These function values can be combined with one set of coefficients to produce a fifth-order accurate approximation and with another set of coefficients to produce an independent fourth-order accurate approximation. Comparing these two approximations provides an error estimate and resulting step size control.

Notice that it takes six stages to get fifth order. It is not possible to get fifth order with only five function evaluations per step. The combinatorial complexity of the Taylor series in two variables for $F(t,y)$ overpowers the information available from the function evaluations. In fact, if you continue to investigate the development of Runge-Kutta methods, you will find that, for example, with ten stages it is only possible to achieve seventh order.

In the early 1970's, Shampine and his colleague H. A. (Buddy) Watts at Sandia Laboratories published a Fortran code, RKF45, based on Fehlberg's algorithm. In 1977, we made RKF45 the ODE solver in our textbook *Computer Methods for Mathematical Computations*, by Forsythe, Malcolm and Moler. This book is now out of print, so I can't provide a decent link to it. But the Fortran source code for RKF45 is still available from netlib.

One thing that I will always remember about RKF45 is ${12 \over 13}$. The fourth function evaluation, $s_4$, is made at the point

$$ t_n + {12 \over 13} h $$

I doubt that you will ever come across ${12 \over 13}$ anyplace else in mathematical software.

RKF45 became the basis for the first version of ODE45 in MATLAB in the early 1980s and for early versions of Simulink. The Fehlberg (4,5) pair did a terrific job for almost fifteen years until the late 1990s, when Shampine and MathWorker Mark Reichelt modernized the suite and introduced a more efficient algorithm.

The new `ode45` introduced in the late 1990s is based on an algorithm of Dormand and Prince. It uses six stages, employs the FSAL strategy, provides fourth and fifth order formulas, has local extrapolation and a companion interpolant.

The new `ode45` is so effective that it is the only function in the suite where the default value of the `refine` argument is set to 4. This means that the step size the algorithm naturally wants to choose is so large that the output is more widely spaced than most people prefer. So the interpolant is called up to produce more finely spaced output at no additional cost measured in function evaluations.

The codes for the two routines `ode23` and `ode45` are very similar. The most important place they differ is the portion that sets the key parameters. Here are the parameters in `ode23`.

dbtype 200:209 ode23

200   % Initialize method parameters.
201   pow = 1/3;
202   A = [1/2, 3/4, 1];
203   B = [
204   1/2         0               2/9
205   0           3/4             1/3
206   0           0               4/9
207   0           0               0
208   ];
209   E = [-5/72; 1/12; 1/9; -1/8];

The parameter `pow` is used in the step size calculation. We see that `ode23` is a third order method. The array `A` gives the fractions for each partial step. The array `B` provides the weights for the slopes. And the array `E` provides the coefficients in the error estimate. Here are the corresponding parameters for the Dormand-Prince algorithm used in `ode45`.

dbtype 201:213 ode45

201   % Initialize method parameters.
202   pow = 1/5;
203   A = [1/5, 3/10, 4/5, 8/9, 1, 1];
204   B = [
205   1/5         3/40    44/45   19372/6561      9017/3168       35/384
206   0           9/40    -56/15  -25360/2187     -355/33         0
207   0           0       32/9    64448/6561      46732/5247      500/1113
208   0           0       0       -212/729        49/176          125/192
209   0           0       0       0               -5103/18656     -2187/6784
210   0           0       0       0               0               11/84
211   0           0       0       0               0               0
212   ];
213   E = [71/57600; 0; -71/16695; 71/1920; -17253/339200; 22/525; -1/40];

`ode23` is a three-stage, third-order, Runge-Kutta method. `ode45` is a six-stage, fifth-order, Runge-Kutta method. `ode45` does more work per step than `ode23`, but can take much larger steps. For differential equations with smooth solutions, `ode45` is often more accurate than `ode23`. In fact, it may be so accurate that the interpolant is required to provide the desired resolution. That's a good thing.

`ode45` is the anchor of the differential equation suite. The MATLAB documentation recommends `ode45` as the first choice. And Simulink blocks set `ode45` as the default solver. But I have a fondness for `ode23`. I like its simplicity. I particularly like it for graphics. The natural step size that `ode23` chooses is frequently just right for display purposes. For example, the plot of the Lorenz chaotic attractor at the end of my previous post is done with `ode23` choosing the step size.

So give `ode23` a try.


MATLAB and Simulink have a powerful suite of routines for the numerical solution of ordinary differential equations. Today's post offers an introduction. Subsequent posts will examine several of the routines in more detail.

MATLAB started its life as a "Matrix Laboratory." The capability for the numerical solution of ordinary differential equations was soon added. This, together with the block diagram programming environment, provides the multidomain simulation companion product Simulink. The linear equation solutions from the matrix library underlie the stiff solvers in the ode suite.

Larry Shampine is a highly regarded mathematician and computational scientist. He is a Professor, now Emeritus, at Southern Methodist University in Dallas. Before SMU, Larry was with Sandia National Laboratory in Albuquerque. He has been a consultant to the MathWorks for many years. He has written several books and many research and expository papers. His software, originally in Fortran and more recently in MATLAB, has been the basis for our ODE codes, as well as others in the industry. Two of Larry's PhD students, Jacek Kierzenka and Michael Hosea, are on the staff of the MathWorks.

There are seven routines in the MATLAB suite. Their names all begin with " `ode` ". This is followed by two or three digits, and then perhaps one or two letters. The digits indicate the *order*. Roughly, higher order routines work harder and deliver higher accuracy per step. A suffix "s", "t", or "tb" designates a method for *stiff* equations. Here is the list of the functions in the suite.

- `ode45` --- The first choice for most nonstiff problems.
- `ode23` --- Less stringent accuracy requirements than `ode45`.
- `ode113` --- More stringent accuracy requirements than `ode45`.
- `ode15s` --- The first choice for most stiff problems.
- `ode23s` --- Less stringent accuracy requirements than `ode15s`.
- `ode23t` --- Moderately stiff problems without numerical damping.
- `ode23tb` --- Less stringent accuracy requirements than `ode15s`.

Here is a simple code, `ode2`, which I will use to discuss *order* and the digits in the suite naming convention. It uses a fixed step size and evaluates the function defining the differential equation twice per step. The first function evaluation at the start of the step provides the slope for a half-step to the midpoint of the interval. The second function evaluation at the midpoint provides the slope for the step all the way across the interval.

```
type ode2
```

function ode2(F,t0,h,tfinal,y0)
% ODE2  A simple ODE solver.
%   ODE2(F,t0,h,tfinal,y0) uses a midpoint rule
%   with fixed step size h on the interval
%      t0 <= t <= tfinal
%   to solve
%      dy/dt = F(t,y)
%   with y(t0) = y0.
y = y0;
for t = t0 : h : tfinal-h
   s1 = F(t,y);
   ymid = y + h*s1/2;
   s2 = F(t+h/2, ymid);
   y = y + h*s2;
   disp(y)
end

At first glance you might think that the "2" in the name `ode2` just comes from the fact that the function `F(t,y)` is evaluated twice each step. But the reason for the 2 is deeper than that. It is actually because this is a *second order* solver. Let's see what *second order* means. Try solving one of the world's easiest differential equations,

$$ dy/dt = 1 $$

With initial condition $y(0) = 0$, the solution is

$$ y(t) = t $$

I'm going to solve this on the interval $0 \le t \le 10$ with a step size $h = 1$, so the result displays as integers.

```
format short
F = @(t,y) 1;
ode2(F,0,1,10,0)
```

1 2 3 4 5 6 7 8 9 10

Well, we got that right. Now try

$$ dy/dt = 2t $$

With initial condition $y(0) = 0$, the solution is

$$ y(t) = t^2 $$

F = @(t,y) 2*t;
ode2(F,0,1,10,0)

1 4 9 16 25 36 49 64 81 100

Again, we got that right. So try

$$ dy/dt = 3t^2 $$

Do we generate

$$ y(t) = t^3 $$

F = @(t,y) 3*t.^2;
ode2(F,0,1,10,0)

0.7500 7.5000 26.2500 63 123.7500 214.5000 341.2500 510 726.7500 997.5000

No, we do not get $t^3$ exactly.

So, our simple function `ode2` computes the exact answer for an ODE whose solution is a polynomial in $t$ of degree 2, but not 3. This is what we mean by a solver of *second* order.

Here's a homework exercise. Write the other simple `ode2`, the one based on the trapezoid rule. Evaluate `F(t,y)` at the start of the interval. Then take a step all the way across. Evaluate `F(t,y)` a second time at the other end of the interval. Then use the average of the two slopes to take the actual step. You should find that the trapezoid version of `ode2` has the same behavior as the midpoint version. It solves ODEs whose solutions are polynomials of degree 1 and 2 exactly, but not polynomials of degree 3.
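If you want to check your answer, here is a sketch of the trapezoid version (in Python rather than MATLAB, and returning the values instead of displaying them):

```python
def ode2_trap(F, t0, h, tfinal, y0):
    # Trapezoid-rule analogue of ode2: average the slope at the
    # start of the step and the slope at a predicted endpoint.
    y = y0
    out = []
    nsteps = round((tfinal - t0) / h)
    for n in range(nsteps):
        t = t0 + n*h
        s1 = F(t, y)                 # slope at the start of the step
        s2 = F(t + h, y + h*s1)      # slope at a predicted endpoint
        y = y + h*(s1 + s2)/2        # step with the average slope
        out.append(y)
    return out

# Exact when the solution is t^2 ...
y2 = ode2_trap(lambda t, y: 2*t, 0, 1, 10, 0)
# ... but not when it is t^3 (the first value comes out 1.5, not 1).
y3 = ode2_trap(lambda t, y: 3*t**2, 0, 1, 10, 0)
```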

In both of these cases, it turns out that "2" is both the number of function evaluations per step and the order. This is not always the case. In later posts, we will see that as we strive for higher accuracy and more efficient methods, the number of function evaluations required increases faster than the order.

The classical Runge-Kutta method was developed independently by a famous German applied mathematician, Carl Runge, in 1895, and another German applied mathematician, who was not quite as famous, Wilhelm Kutta, in 1901. It was used extensively for hand computations and is still probably in widespread use on digital computers today. You can see from this MATLAB version, a function called `ode4`, that it evaluates `F(t,y)` four times per step.

```
type ode4
```

function ode4(F,t0,h,tfinal,y0)
% ODE4  Classical Runge-Kutta ODE solver.
%   ODE4(F,t0,h,tfinal,y0) uses the classical Runge-Kutta
%   method with fixed step size h on the interval
%      t0 <= t <= tfinal
%   to solve
%      dy/dt = F(t,y)
%   with y(t0) = y0.
y = y0;
for t = t0 : h : tfinal-h
   s1 = F(t,y);
   s2 = F(t+h/2, y+h*s1/2);
   s3 = F(t+h/2, y+h*s2/2);
   s4 = F(t+h, y+h*s3);
   y = y + h*(s1 + 2*s2 + 2*s3 + s4)/6;
   disp(y)
end

You will find that this function integrates $1$, $t$, $t^2$, $t^3$, and $t^4$ exactly. But how about $t^5$?

format long e
F = @(t,y) 5*t^4;
ode4(F,0,1,10,0)

1.041666666666667e+00 3.208333333333334e+01 2.431250000000000e+02 1.024166666666667e+03 3.125208333333333e+03 7.776250000000000e+03 1.680729166666666e+04 3.276833333333333e+04 5.904937500000000e+04 1.000004166666667e+05

The exact solution would be the integers, $t^5$, but we didn't quite get them. So, classical Runge-Kutta computes the exact answer for an ODE whose solution is a polynomial in $t$ of degree 4, but not 5. This is what we mean by a solver of *fourth* order, and is the reason why this function is called `ode4`.

The name of `ode4` has only one digit, not two, and classical Runge-Kutta does not qualify for a spot in the MATLAB suite because the method has a fixed step size. There is no error estimate. Modern numerical methods, and modern mathematical software, include error estimates and automatic step size control. This is the subject of my next blogs.

I want to include some graphics in today's blog, so here let's use `ode23` to plot the three components of the Lorenz chaotic differential equation described in my previous blog post.

```
type lorenzplot
lorenzplot
```

function lorenzplot
% LORENZPLOT  Plot the three components of the solution
%    to the Lorenz equations.
sigma = 10;
beta = 8/3;
rho = 28;
eta = sqrt(beta*(rho-1));
A = [ -beta    0     eta
         0  -sigma  sigma
      -eta   rho     -1 ];
v0 = [rho-1 eta eta]';
y0 = v0 + [3 2 -4]';
tspan = [0 30];
[t,y] = ode23(@lorenzeqn, tspan, y0);
plot(t,[y(:,1)+15 y(:,2) y(:,3)-40]);
axis([0 30 -80 80])
set(gca,'ytick',[-40 0 40],'yticklabel','y3|y2|y1')
xlabel('t')
title('Lorenz Attractor')

% ------------------------------------

   function ydot = lorenzeqn(t,y)
      A(1,3) = y(2);
      A(3,1) = -y(2);
      ydot = A*y;
   end
end


Changing the value of a parameter in the equations that produce the famous Lorenz chaotic attractor yields nonlinear ordinary differential equations that have periodic solutions.

(This section is adapted from chapter 7 of my book Numerical Computing with MATLAB, published by MathWorks and SIAM.)

The Lorenz chaotic attractor was first described in 1963 by Edward Lorenz, an M.I.T. mathematician and meteorologist who was interested in fluid flow models of the earth's atmosphere. An excellent reference is the book by Colin Sparrow cited at the end of this blog.

I have chosen to express the Lorenz equations in a somewhat unusual way involving a matrix-vector product:

$$ \dot{y} = A y $$

Here $y$ is a vector-valued function of $t$ with three components and $A$ is a 3-by-3 matrix that depends on $y$. Seven of the nine elements in $A$ are constant, but the other two depend on $y_2(t)$:

$$A = [\ -\beta \ \ 0 \ \ y_2 \ ; \ \ 0 \ \ -\sigma \ \ \sigma \ ; \ \ -y_2 \ \ \rho \ \ -1 \ ]$$

The first component of the solution, $y_1(t)$, is related to the convection in the atmospheric flow, while the other two components are related to horizontal and vertical temperature variation. The parameter $\sigma$ is the Prandtl number, $\rho$ is the normalized Rayleigh number, and $\beta$ depends on the geometry of the domain. The most popular values of the parameters are $\sigma = 10$, $\beta = 8/3$, and $\rho = 28$. These are outside the ranges associated with the earth's atmosphere.

The deceptively simple nonlinearity introduced by the presence of $y_2$ in the system matrix $A$ changes everything. There are no random aspects to these equations, so the solutions $y(t)$ are completely determined by the parameters and the initial conditions, but their behavior is very difficult to predict. For some values of the parameters, the orbit of $y(t)$ in three-dimensional space is known as a *strange attractor*. It is bounded, but not periodic and not convergent. It never intersects itself. It ranges chaotically back and forth around two different points, or attractors. For other values of the parameters, the solution might converge to a fixed point, diverge to infinity, or oscillate periodically.

Think of $\eta = y_2$ as a free parameter, restrict $\rho$ to be greater than one, and study the matrix

$$A = [\ -\beta \ \ 0 \ \ \eta \ ; \ \ 0 \ \ -\sigma \ \ \sigma \ ; \ \ -\eta \ \ \rho \ \ -1 \ ]$$

It turns out that $A$ is singular if and only if

$$ \eta = \pm \sqrt{\beta (\rho-1)} $$

The corresponding null vector, normalized so that its second component is equal to $\eta$, is

$$ [\ \rho-1 \ \ \eta \ \ \eta \ ]^T $$

With the two different signs for $\eta$, this defines two points in three-dimensional space. These points are fixed points for the differential equation. If either of these points is taken as the initial condition, then the initial derivative $\dot{y}(0) = 0$ and so $y(t)$ never changes.
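These claims are easy to verify numerically. Here is a quick check (in Python with NumPy, using the popular parameter values):

```python
import numpy as np

sigma, beta, rho = 10.0, 8/3, 28.0
eta = np.sqrt(beta*(rho - 1))           # the singular value of eta

# System matrix with y2 frozen at eta; note the -eta in the (3,1) position.
A = np.array([[-beta,    0.0,    eta],
              [  0.0, -sigma,  sigma],
              [ -eta,    rho,   -1.0]])

# Null vector normalized so that its second component equals eta.
v = np.array([rho - 1, eta, eta])

residual = A @ v                        # should be the zero vector
singular = abs(np.linalg.det(A))        # and A should be singular
```

Since $A v = 0$, starting the integration at $y(0) = v$ gives $\dot{y}(0) = 0$, and the solution stays put, which is exactly the fixed point behavior described above.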

Fix $\sigma = 10$, $\beta = 8/3$, and investigate the effect of the parameter $\rho$. For $\rho$ less than about 24.7, the fixed points are stable. If the initial value is close enough, the orbit will spiral in to the fixed point. However, for larger values of $\rho$ these points are unstable. If $y(t)$ does not start at one of these points, it will never reach either of them. For most values of $\rho$ greater than 24.7, including the popular $\rho = 28$, the orbit is chaotic.

In his comprehensive study of the Lorenz equations, Colin Sparrow found four exceptional values of $\rho$ where a stable periodic orbit arises out of the chaos. If the initial point happens to be anywhere on the orbit, the solution will return to that point in finite time. If the initial point is not on the orbit, the trajectory will converge towards the periodic orbit. This behavior is unusual for nonlinear equations.

Each orbit has what might be called a characteristic *signature*. Each of the four orbits has a different signature. Use "+" and "-" to denote the sign of $\eta$ in the null vector. In the three-dimensional plots below, the red dots are these attractors. The point with the "+" is on the upper left, while the "-" is on the lower right.

This is the most complex orbit. Its signature is +--+--. It circles the "+" fixed point once and then the "-" fixed point twice. Then it circles the "+" fixed point once again and the "-" point twice again on a slightly different trajectory before repeating the orbit.

The signature is ++-. Circle "+" twice, then "-" once.

The signature is ++--. Circle each fixed point twice before heading towards the other.

The simplest orbit. The signature is +-. No fuss, no bother.

In case you haven't seen it before, here is a plot for the value of $\rho$ that is usually studied, $\rho = 28$. This is the well known Lorenz butterfly. To really appreciate it, you should literally get hold of a three-dimensional plot and view it from different angles. This trajectory goes on forever, never intersecting itself, and never getting too close to, or too far away from, the two unstable fixed points. Most values of $\rho$ near $\rho = 28$ produce similar chaotic behavior.

If you want to see these orbits in action, in 3-D, download and run lorenzgui.m.

Colin Sparrow, *The Lorenz Equations: Bifurcations, Chaos, and Strange Attractors*, Springer, 1982, <http://link.springer.com/book/10.1007%2F978-1-4612-5767-7>

James Gleick on Chaos: Making a New Science, YouTube interview, https://www.youtube.com/watch?v=3orIIcKD8p4

James Gleick, *Chaos: Making a New Science*, Penguin Books, 1987, <http://www.penguin.com/book/chaos-by-james-gleick/9780143113454>.
