The floating point arithmetic format that occupies 128 bits of storage is known as *binary128* or *quadruple precision*. This blog post describes an implementation of quadruple precision programmed entirely in the MATLAB language.

The IEEE 754 standard, published in 1985, defines formats for floating point numbers that occupy 32 or 64 bits of storage. These formats are known as *binary32* and *binary64*, or more frequently as *single* and *double precision*. For many years MATLAB used only double precision and it remains our default format. Single precision has been added gradually over the last several years and is now also fully supported.

A revision of IEEE 754, published in 2008, defines two more floating point formats. One, *binary16* or *half precision*, occupies only 16 bits and was the subject of my previous blog post. It is primarily intended to reduce storage and memory bandwidth requirements. Since it provides only "half" precision, its use for actual computation is problematic.

The other new format introduced in IEEE 754-2008 is *binary128* or *quadruple precision*. It is intended for situations where the accuracy or range of double precision is inadequate.

I see two descriptions of quadruple precision software implementations on the Web.

I have not used either package, but judging by their Web pages, they both appear to be complete and well supported.

The MATLAB Symbolic Math Toolbox provides `vpa`, arbitrary precision decimal floating point arithmetic, and `sym`, exact rational arithmetic. Both provide accuracy and range well beyond quadruple precision, but do not specifically support the 128-bit IEEE format.

My goal here is to describe a prototype of a MATLAB object, `fp128`, that implements quadruple precision with code written entirely in the MATLAB language. It is not very efficient, but it does allow experimentation with the 128-bit format.

There are other floating point formats beyond double precision. *Long double* usually refers to the 80-bit extended precision floating point registers available with the Intel x86 architecture and described as *double extended* in IEEE 754. This provides the same exponent range as quadruple precision, but much less accuracy.

*Double double* refers to the use of a pair of double precision values. The exponent field and sign bit of the second double are ignored, so this is effectively a 116-bit format. Both the exponent range and the precision are more than double but less than quadruple.
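A rough sketch of the pair-of-doubles idea (my own illustration, unrelated to the packages above): the classic error-free TwoSum transformation captures in a second double the roundoff that a single addition discards.

```python
def two_sum(a, b):
    # Error-free transformation: a + b == s + err exactly, where s
    # is the rounded double precision sum and err is the roundoff it lost.
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

# 2^-60 vanishes in an ordinary double addition to 1.0,
# but the second double of the pair recovers it exactly.
s, err = two_sum(1.0, 2.0**-60)
```

Double-double arithmetic builds its operations out of transformations like this one, which is why its precision is roughly twice that of a single double.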

The format of a floating point number is characterized by two parameters, `p`, the number of bits in the fraction and `q`, the number of bits in the exponent. I will compare four precisions, *half*, *single*, *double*, and *quadruple*. The four pairs of characterizing parameters are

p = [10, 23, 52, 112];

q = [5, 8, 11, 15];

With these values of `p` and `q`, and with one more bit for the sign, the total number of bits in the word, `w`, is a power of two.

```
format shortg
w = p + q + 1
```

w = 16 32 64 128

**Normalized numbers**

Most floating point numbers are *normalized*, and are expressed as

$$ x = \pm (1+f)2^e $$

The *fraction* $f$ is in the half open interval

$$ 0 \leq f < 1 $$

The binary representation of $f$ requires at most `p` bits. In other words $2^p f$ is an integer in the range

$$ 0 \leq 2^p f < 2^p $$

The *exponent* $e$ is an integer in the range

$$ -b+1 \leq e \leq b $$

The quantity $b$ is both the largest exponent and the `bias`.

$$ b = 2^{q-1} - 1 $$

b = 2.^(q-1)-1

b = 15 127 1023 16383

The fractional part of a normalized number is $1+f$, but only $f$ needs to be stored. That leading $1$ is known as the *hidden bit*.

**Subnormal**

There are two values of the exponent $e$ for which the biased exponent, $e+b$, reaches the smallest and largest values possible to represent in `q` bits. The smallest is

$$ e + b = 0 $$

The corresponding floating point numbers do not have a hidden leading bit. These are the *subnormal* or *denormal* numbers.

$$ x = \pm f 2^{-b} $$

**Infinity and Not-A-Number**

The largest possible biased exponent is

$$ e + b = 2^q - 1 $$

Quantities with this exponent field represent *infinities* and *NaN*, or *Not-A-Number*.

The percentage of floating point numbers that are exceptional because they are subnormal, infinity or NaN increases as the precision decreases. Exceptional exponents are only $2$ values out of $2^q$. For quadruple precision this is $2/2^{15}$, which is less than one one-hundredth of one percent.

Encode the sign bit with `s = 0` for nonnegative and `s = 1` for negative. And encode the exponent with an offsetting bias, `b`. Then a floating point number can be packed in `w` bits with

x = [s e+b 2^p*f]
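A sketch of that packing in Python (my own illustration; `pack` and its arguments are hypothetical names, shown here with the half precision parameters p = 10 and q = 5):

```python
def pack(sign, e, fbits, p, q):
    # Pack [s | e+b | 2^p * f] into a w = p + q + 1 bit word.
    b = 2**(q - 1) - 1            # the bias
    return (sign << (p + q)) | ((e + b) << p) | fbits

# Half precision: 1.0 is +(1+0)*2^0.
assert pack(0, 0, 0, 10, 5) == 0x3C00
# -2 is -(1+0)*2^1.
assert pack(1, 1, 0, 10, 5) == 0xC000
# realmax, 65504, has the largest normal exponent and an all-ones fraction.
assert pack(0, 15, 2**10 - 1, 10, 5) == 0x7BFF
```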

**epsilon**

If a real number cannot be expressed with a binary expansion requiring at most `p` bits, it must be approximated by a floating point number that does have such a binary representation. This is *roundoff error*. The important quantity characterizing precision is *machine epsilon*, or `eps`. In MATLAB, `eps(x)` is the distance from `x` to the next larger (in absolute value) floating point number (of that class). With no argument, `eps` is simply the difference between `1` and the next larger floating point number.

```
format shortg
eps = 2.^(-p)
```

eps = 0.00097656 1.1921e-07 2.2204e-16 1.9259e-34

This tells us that quadruple precision is good for about 34 decimal digits of accuracy, double for about 16 decimal digits, single for about 7, and half for about 3.
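The digit counts follow directly from `eps`; as a quick cross-check (a Python sketch, not part of the `fp128` code):

```python
import math

p = [10, 23, 52, 112]                    # fraction bits: half, single, double, quad
eps = [2.0**(-pk) for pk in p]
digits = [-math.log10(e) for e in eps]   # decimal digits of accuracy, -log10(eps)
# Rounded, this gives about 3, 7, 16, and 34 digits respectively.
```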

**realmax**

If a real number, or the result of an arithmetic operation, is too large to be represented, it *overflows* and is replaced by *infinity*. The largest floating point number that does not overflow is `realmax`. When I try to compute quadruple `realmax` with double precision, it overflows. I will fix this up in the table to follow.

realmax = 2.^b.*(2-eps)

realmax = 65504 3.4028e+38 1.7977e+308 Inf

**realmin**

*Underflow* and representation of very small numbers is more complicated. The smallest normalized floating point number is `realmin`. When I try to compute quadruple `realmin` it underflows to zero. Again, I will fix this up in the table.

realmin = 2.^(-b+1)

realmin = 6.1035e-05 1.1755e-38 2.2251e-308 0

**tiny**

But there are numbers smaller than `realmin`. IEEE 754 introduced the notion of *gradual underflow* and *denormal* numbers. In the 2008 revised standard their name was changed to *subnormal*.

Think of roundoff in numbers near underflow. Before 754, floating point numbers had the disconcerting property that `x` and `y` could be unequal, but their difference could underflow, so `x-y` becomes `0`. With 754 the gap between `0` and `realmin` is filled with numbers whose spacing is the same as the spacing between `realmin` and `2*realmin`. I like to call this spacing, and the smallest subnormal number, `tiny`.

tiny = realmin.*eps

tiny = 5.9605e-08 1.4013e-45 4.9407e-324 0
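Ordinary double precision arithmetic can demonstrate both effects (a Python illustration using the double precision values of `realmin` and `tiny` from the table):

```python
import sys

realmin = sys.float_info.min    # 2.2251e-308, smallest normalized double
tiny = 5e-324                   # realmin*eps, smallest subnormal double

# The gap between 0 and realmin is filled with subnormals, so unequal
# numbers near underflow keep a nonzero difference.
x = 1.5 * realmin
y = realmin
assert x != y and x - y != 0.0

# Below tiny, results really do underflow to zero.
assert tiny / 2 == 0.0
```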

**flintmax**

It is possible to do integer arithmetic with floating point numbers. I like to call such numbers *flints*. When we write the numbers $3$ and $3.0$, they are different descriptions of the same integer, but we think of one as fixed point and the other as floating point. The largest flint is `flintmax`.

flintmax = 2./eps

flintmax = 2048 1.6777e+07 9.0072e+15 1.0385e+34

Technically all the floating point numbers larger than `flintmax` are integers, but the spacing between them is larger than one, so it is not safe to use them for integer arithmetic. Only integer-valued floating point numbers between `0` and `flintmax` are allowed to be called flints.
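Double precision illustrates the boundary (a Python sketch):

```python
flintmax = 2.0**53     # 2/eps for double precision

# Above flintmax the spacing is 2, so adding 1 is lost to roundoff...
assert flintmax + 1 == flintmax
# ...while every integer at or below flintmax is represented exactly.
assert (flintmax - 1) + 1 == flintmax
```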

Let's collect all these anatomical characteristics together in a new MATLAB `table`. I have now edited the output and inserted the correct quadruple precision values.

```
T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];
T = table(T(:,1), T(:,2), T(:,3), T(:,4), ...
    'variablenames',{'half','single','double','quadruple'}, ...
    'rownames',{'w','p','q','b','eps','realmax','realmin', ...
    'tiny','flintmax'});
type Table.txt
```

```
               half         single        double       quadruple
           __________   __________   ___________   ___________
w                  16           32            64           128
p                  10           23            52           112
q                   5            8            11            15
b                  15          127          1023         16383
eps        0.00097656   1.1921e-07    2.2204e-16    1.9259e-34
realmax         65504   3.4028e+38   1.7977e+308   1.190e+4932
realmin    6.1035e-05   1.1755e-38   2.2251e-308   3.362e-4932
tiny       5.9605e-08   1.4013e-45   4.9407e-324   6.475e-4966
flintmax         2048   1.6777e+07    9.0072e+15    1.0385e+34
```

I am currently working on code for an object, `@fp128`, that could provide a full implementation of quadruple-precision arithmetic. The methods available so far are

methods(fp128)

```
Methods for class fp128:

abs        eq         le         mtimes       realmax    subsref
cond       fp128      lt         ne           realmin    svd
diag       frac       lu         norm         shifter    sym
disp       ge         max        normalize    sig        times
display    gt         minus      normalize2   sign       tril
double     hex        mldivide   plus         sqrt       triu
ebias      hypot      mpower     power        subsasgn   uminus
eps        ldivide    mrdivide   rdivide      subsindex
```

The code that I have for quadruple precision is much more complex than the code that I have for half precision. There I am able to "cheat" by converting half precision numbers to doubles and relying on traditional MATLAB arithmetic. I can't do that for quads.

The storage scheme for quads is described in the help entry for the constructor.

```
help @fp128/fp128
```

```
 fp128  Quad precision constructor.
    z = fp128(x) has three fields.
    x = s*(1+f)*2^e, where
    z.s, one uint8,
       s = (-1)^sg = 1-2*sg, sg = (1-s)/2.
    z.e, 15 bits, biased exponent, one uint16.
       b = 2^14-1 = 16383, eb = e + b,
       1 <= eb <= 2*b for normalized quads,
       eb = 0 for subnormal quads,
       eb = 2*b+1 = 32767 for infinity and NaN.
    z.f, 112 bits, nonnegative fraction, 4-vector of uint64s,
       each with 1/4-th of the bits, 0 <= f(k) < 2^28, 4*28 = 112.
       z.f represents sum(f .* pow2), pow2 = 2.^(-28*(1:4))

    Reference page in Doc Center
       doc fp128
```

Breaking the 112-bit fraction into four 28-bit pieces makes it possible to do arithmetic operations on the pieces without worrying about integer overflow. The core of the `times` code, which implements `x.*y`, is the convolution of the two fractional parts.

dbtype 45:53 @fp128/times

```
45    % The multiplication.
46    % z.f = conv(x.f,y.f);
47    % Restore hidden 1's.
48    xf = [1 x.f];
49    yf = [1 y.f];
50    zf = zeros(1,9,'uint64');
51    for k = 1:5
52        zf(k:k+4) = zf(k:k+4) + yf(k)*xf;
53    end
```
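For illustration, the same convolution can be written out in Python (my sketch; the limbs are Python integers rather than `uint64`, so overflow needs no special care):

```python
def conv_limbs(xf, yf):
    # Convolution of two 5-limb fractions (hidden 1 followed by four
    # 28-bit pieces), mirroring the loop in @fp128/times.
    zf = [0] * 9
    for k in range(5):
        for j in range(5):
            zf[k + j] += yf[k] * xf[j]
    return zf   # nine limbs, still to be renormalized

# 1.5 * 1.5: the fraction f = 0.5 is the single limb 2^27 (0.5 * 2^28).
zf = conv_limbs([1, 2**27, 0, 0, 0], [1, 2**27, 0, 0, 0])
value = sum(z * 2.0**(-28 * i) for i, z in enumerate(zf))
# value reconstructs the product, 2.25
```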

The result of the convolution, `zf`, is a `uint64` vector of length nine with 52-bit elements. It must be renormalized to fit the `fp128` storage scheme.

Addition and subtraction involve addition and subtraction of the fractional parts after they have been shifted so that the corresponding exponents are equal. Again, this produces temporary vectors that must be renormalized.

Scalar division, `y/x`, is done by first computing the reciprocal of the denominator, `1/x`, and then doing one final multiplication, `1/x * y`. The reciprocal is computed by a few steps of Newton iteration, starting with a scaled reciprocal, `1/double(x)`.
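The classic iteration is `r = r*(2 - x*r)`, which roughly doubles the number of correct digits per step. A Python sketch (ordinary doubles stand in for quads here, so the starting guess is already converged; in `fp128` the extra digits are earned by the steps):

```python
def recip(x, steps=3):
    # Newton iteration for r = 1/x: each step computes r*(2 - x*r).
    r = 1.0 / x          # in fp128 this is the scaled double precision start
    for _ in range(steps):
        r = r * (2.0 - x * r)
    return r
```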

The output for each example shows the three fields in hexadecimal -- one sign field, one biased exponent field, and one fraction field that is a vector with four entries displayed with seven hex digits. This is followed by a 36 significant digit decimal representation.

**One**

```
clear
format long
one = fp128(1)
```

```
one =
   0 3fff 0000000 0000000 0000000 0000000
   1.0
```

**eps**

eps = eps(one)

```
eps =
   0 3f8f 0000000 0000000 0000000 0000000
   0.000000000000000000000000000000000192592994438723585305597794258492732
```

**1 + eps**

one_plus_eps = one + eps

```
one_plus_eps =
   0 3fff 0000000 0000000 0000000 0000001
   1.00000000000000000000000000000000019
```

**2 - eps**

two_minus_eps = 2 - eps

```
two_minus_eps =
   0 3fff fffffff fffffff fffffff fffffff
   1.99999999999999999999999999999999981
```

**realmin**

rmin = realmin(one)

```
rmin =
   0 0001 0000000 0000000 0000000 0000000
   3.3621031431120935062626778173217526e-4932
```

**realmax**

rmax = realmax(one)

```
rmax =
   0 7ffe fffffff fffffff fffffff fffffff
   1.18973149535723176508575932662800702e4932
```

**Compute 1/10 with double, then convert to quadruple.**

dble_tenth = fp128(1/10)

```
dble_tenth =
   0 3ffb 9999999 99999a0 0000000 0000000
   0.100000000000000005551115123125782702
```

**Compute 1/10 with quadruple.**

quad_tenth = 1/fp128(10)

```
quad_tenth =
   0 3ffb 9999999 9999999 9999999 9999999
   0.0999999999999999999999999999999999928
```

**Double precision pi converted to quadruple.**

dble_pi = fp128(pi)

```
dble_pi =
   0 4000 921fb54 442d180 0000000 0000000
   3.14159265358979311599796346854418516
```

`pi` accurate to quadruple precision.

```
quad_pi = fp128(sym('pi'))
```

```
quad_pi =
   0 4000 921fb54 442d184 69898cc 51701b8
   3.1415926535897932384626433832795028
```

The 4-by-4 magic square from Durer's Melencolia I provides my first matrix example.

```
clear
M = fp128(magic(4));
```

Let's see how the 128-bit elements look in hex.

```
format hex
M
```

```
M =
   0 4003 0000000 0000000 0000000 0000000
   0 4001 4000000 0000000 0000000 0000000
   0 4002 2000000 0000000 0000000 0000000
   0 4001 0000000 0000000 0000000 0000000
   0 4000 0000000 0000000 0000000 0000000
   0 4002 6000000 0000000 0000000 0000000
   0 4001 c000000 0000000 0000000 0000000
   0 4002 c000000 0000000 0000000 0000000
   0 4000 8000000 0000000 0000000 0000000
   0 4002 4000000 0000000 0000000 0000000
   0 4001 8000000 0000000 0000000 0000000
   0 4002 e000000 0000000 0000000 0000000
   0 4002 a000000 0000000 0000000 0000000
   0 4002 0000000 0000000 0000000 0000000
   0 4002 8000000 0000000 0000000 0000000
   0 3fff 0000000 0000000 0000000 0000000
```

Check that the row sums are all equal. This matrix-vector multiply can be done exactly with the flints in the magic square.

```
e = fp128(ones(4,1))
Me = M*e
```

```
e =
   0 3fff 0000000 0000000 0000000 0000000
   0 3fff 0000000 0000000 0000000 0000000
   0 3fff 0000000 0000000 0000000 0000000
   0 3fff 0000000 0000000 0000000 0000000
Me =
   0 4004 1000000 0000000 0000000 0000000
   0 4004 1000000 0000000 0000000 0000000
   0 4004 1000000 0000000 0000000 0000000
   0 4004 1000000 0000000 0000000 0000000
```

I've overloaded `mldivide`, so I can solve linear systems and compute inverses. The actual computation is done by `lutx`, a "textbook" function that I wrote years ago, long before this quadruple-precision project, followed by the requisite solution of triangular systems. But now the MATLAB object system ensures that every individual arithmetic operation is done with IEEE 754 quadruple precision.

Let's generate a 3-by-3 matrix with random two-digit integer entries.

A = fp128(randi(100,3,3))

```
A =
   0 4002 0000000 0000000 0000000 0000000
   0 4001 8000000 0000000 0000000 0000000
   0 4004 b000000 0000000 0000000 0000000
   0 4005 3800000 0000000 0000000 0000000
   0 4005 7800000 0000000 0000000 0000000
   0 4002 a000000 0000000 0000000 0000000
   0 4004 c800000 0000000 0000000 0000000
   0 4004 7800000 0000000 0000000 0000000
   0 4000 0000000 0000000 0000000 0000000
```

I am going to use `fp128` backslash to invert `A`. So I need the identity matrix in quadruple precision.

I = fp128(eye(size(A)));

Now the overloaded backslash calls `lutx`, and then solves two triangular systems to produce the inverse.

X = A\I

```
X =
   0 3ff7 2fd38ea bcfb815 69cdccc a36d8a5
   1 3ff9 c595b53 8c842ee f26189c a0770d4
   0 3ffa c0bc8b7 4adcc40 4ea66ca 61f1380
   1 3ff7 a42f790 e4ad874 c358882 7ff988e
   0 3ffa 12ea8c2 3ef8c17 01c7616 5e03a5a
   1 3ffa 70d4565 958740b 78452d8 f32d866
   0 3ff9 2fd38ea bcfb815 69cdccc a36d8a7
   0 3ff3 86bc8e5 42ed82a 103d526 a56452f
   1 3ff6 97f9949 ba961b3 72d69d9 4ace666
```

Compute the residual.

```
AX = A*X
R = I - AX;
format short
RD = double(R)
```

```
AX =
   0 3fff 0000000 0000000 0000000 0000000
   0 3f90 0000000 0000000 0000000 0000000
   1 3f8d 0000000 0000000 0000000 0000000
   0 0000 0000000 0000000 0000000 0000000
   0 3fff 0000000 0000000 0000000 0000000
   0 3f8d 8000000 0000000 0000000 0000000
   1 3f8c 0000000 0000000 0000000 0000000
   1 3f8d 0000000 0000000 0000000 0000000
   0 3ffe fffffff fffffff fffffff ffffffb
RD =
   1.0e-33 *
        0         0    0.0241
  -0.3852         0    0.0481
   0.0481   -0.0722    0.4815
```

Both `AX` and `R` are what I expect from arithmetic that is accurate to about 34 decimal digits.

Although I get a different random `A` every time I publish this blog post, I expect that it has a modest condition number.

kappa = cond(A)

```
kappa =
   0 4002 7e97c18 91278cd 8375371 7915346
   11.9560249020758193065358323606886569
```

Since `A` is not badly conditioned, I can invert the computed inverse and expect to get close to the original integer matrix. The elements of the resulting `Z` are integers, possibly bruised with quadruple precision fuzz.

```
format hex
Z = X\I
```

```
Z =
   0 4002 0000000 0000000 0000000 0000000
   0 4001 8000000 0000000 0000000 0000000
   0 4004 b000000 0000000 0000000 0000004
   0 4005 37fffff fffffff fffffff ffffffc
   0 4005 77fffff fffffff fffffff ffffffe
   0 4002 a000000 0000000 0000000 0000001
   0 4004 c7fffff fffffff fffffff ffffffc
   0 4004 77fffff fffffff fffffff ffffffc
   0 3fff fffffff fffffff fffffff ffffffc
```

I have just nonchalantly computed `cond(A)`. Here is the code for the overloaded `cond`.

```
type @fp128/cond.m
```

```
function kappa = cond(A)
   sigma = svd(A);
   kappa = sigma(1)/sigma(end);
end
```

So it is correctly using the singular value decomposition. I also have `svd` overloaded. The SVD computation is handled by a 433 line M-file, `svdtx`, that, like `lutx`, was written before `fp128` existed. It was necessary to modify five lines in `svdtx`. The line

u = zeros(n,ncu);

had to be changed to

u = fp128(zeros(n,ncu));

Similarly for `v`, `s`, `e` and `work`. I should point out that the preallocation of the arrays is inherited from the LINPACK Fortran subroutine DSVDC. Without it, `svdtx` would not have required any modification to work correctly in quadruple precision.

Let's compute the full SVD.

[U,S,V] = svd(A)

```
U =
   1 3ffe 57d9492 76f3ea4 dc14bb3 15d42c1
   1 3ffe 75a77c4 8c7b469 2cac695 59be7fe
   1 3ffc 0621737 9b04c78 1c2109d 8736b46
   1 3ffb 38214c0 d75c84c 4bcf5ff f3cffd7
   1 3ffb a9281e3 e12dd3a d632d61 c8f6e60
   0 3ffe fbbccdc a571fa1 f5a588b fb0d806
   1 3ffe 79587db 4889548 f09ae4b cd0150c
   0 3ffe 59fae16 17bcabb 6408ba4 7b2a573
   0 3ff8 cde38fc e952ad5 8b526c2 780c2e5
S =
   0 4006 1f3ad79 d0b9b08 18b1444 030e4ef
   0 0000 0000000 0000000 0000000 0000000
   0 0000 0000000 0000000 0000000 0000000
   0 0000 0000000 0000000 0000000 0000000
   0 4004 a720ef6 28c6ec0 87f4c54 82dda2a
   0 0000 0000000 0000000 0000000 0000000
   0 0000 0000000 0000000 0000000 0000000
   0 0000 0000000 0000000 0000000 0000000
   0 4002 8061b9a 0e96d8c c2ef745 9ea4c9a
V =
   1 3ffb db3df03 9b5e1b3 5bf4478 0e42b0d
   1 3ffe b540007 4d4bc9e dc9461a 0de0481
   1 3ffe 03aaff4 d9cea2c e8ee2bc 2eba908
   0 3ffe fa73e09 9ef8810 a03d2eb 46ade00
   1 3ffa b316e2f fe9d3ae dfa9988 fbca927
   1 3ffc 184af51 f25fece 97bc0da 5ff13a2
   1 3ffb 706955f a877cbb b63f6dd 4e2150e
   0 3ffe 08fc1eb 7b86ef7 4af3c6c 732aae9
   1 3ffe b3aaead ef356e2 7cd2937 94b22a7
```

Reconstruct `A` from its quadruple precision SVD. It's not too shabby.

USVT = U*S*V'

```
USVT =
   0 4001 fffffff fffffff fffffff fffffce
   0 4001 7ffffff fffffff fffffff fffffc7
   0 4004 b000000 0000000 0000000 000000a
   0 4005 37fffff fffffff fffffff ffffff1
   0 4005 77fffff fffffff fffffff ffffff6
   0 4002 9ffffff fffffff fffffff fffffd2
   0 4004 c7fffff fffffff fffffff ffffff1
   0 4004 77fffff fffffff fffffff ffffff4
   0 3fff fffffff fffffff fffffff ffffe83
```

An interesting example is provided by a classic test matrix, the 8-by-8 Rosser matrix. Let's compare quadruple precision computation with the exact rational computation provided by the Symbolic Math Toolbox.

First, generate quad and sym versions of `rosser`.

```
R = fp128(rosser);
S = sym(rosser)
```

```
S =
[  611,  196, -192,  407,   -8,  -52,  -49,   29]
[  196,  899,  113, -192,  -71,  -43,   -8,  -44]
[ -192,  113,  899,  196,   61,   49,    8,   52]
[  407, -192,  196,  611,    8,   44,   59,  -23]
[   -8,  -71,   61,    8,  411, -599,  208,  208]
[  -52,  -43,   49,   44, -599,  411,  208,  208]
[  -49,   -8,    8,   59,  208,  208,   99, -911]
[   29,  -44,   52,  -23,  208,  208, -911,   99]
```

`R` is symmetric, but not positive definite, so its LU factorization requires pivoting.

```
[L,U,p] = lutx(R);
format short
p
```

p = 1 2 3 7 6 8 4 5

`R` is singular, so with exact computation `U(n,n)` would be zero. With quadruple precision, the diagonal of `U` is

```
format long e
diag(U)
```

```
ans =
   611.0
   836.126022913256955810147299509001582
   802.209942588471107300640276546225738
    99.0115741407236314604636000423592687
  -710.481057851148425133280246646085002
   579.272484693223512196223933017062413
    -1.2455924519190846395771824210276321
     0.000000000000000000000000000000215716190833522835766351129431653015
```

The relative size of the last diagonal element is zero to almost 34 digits.

double(U(8,8)/U(1,1))

ans = 3.530543221497919e-34

Compare this with symbolic computation, which, in this case, can compute an LU decomposition with exact rational arithmetic and no pivoting.

```
[L,U] = lu(S);
diag(U)
```

```
ans =
                  611
           510873/611
     409827400/510873
     50479800/2049137
        3120997/10302
  -1702299620/3120997
         255000/40901
                    0
```

As expected, with symbolic computation `U(8,8)` is exactly zero.

How about SVD?

r = svd(R)

```
r =
  1020.04901842999682384631379130551006
  1020.04901842999682384631379130550858
  1019.99999999999999999999999999999941
  1019.90195135927848300282241090227735
   999.999999999999999999999999999999014
   999.999999999999999999999999999998817
     0.0980486407215169971775890977220345302
     0.0000000000000000000000000000000832757192990287779822645036097560521
```

The Rosser matrix is atypical because its characteristic polynomial factors over the rationals. So, even though it is of degree 8, the singular values are the roots of quadratic factors.

s = svd(S)

```
s =
                   10*10405^(1/2)
                   10*10405^(1/2)
                             1020
  10*(1020*26^(1/2) + 5201)^(1/2)
                             1000
                             1000
  10*(5201 - 1020*26^(1/2))^(1/2)
                                0
```

The relative error of the quadruple precision calculation.

double(norm(r - s)/norm(s))

ans = 9.293610246879066e-34

About 33 digits.

Finally, verify that we've been working all this time with `fp128` and `sym` objects.

whos

```
  Name      Size     Bytes   Class    Attributes

  A         3x3       3531   fp128
  AX        3x3       3531   fp128
  I         3x3       3531   fp128
  L         8x8          8   sym
  M         4x4       6128   fp128
  Me        4x1       1676   fp128
  R         8x8      23936   fp128
  RD        3x3         72   double
  S         8x8          8   sym
  U         8x8          8   sym
  USVT      3x3       3531   fp128
  V         3x3       3531   fp128
  X         3x3       3531   fp128
  Z         3x3       3531   fp128
  ans       1x1          8   double
  e         4x1       1676   fp128
  kappa     1x1        563   fp128
  p         8x1         64   double
  r         8x1       3160   fp128
  s         8x1          8   sym
```

Get the MATLAB code

Published with MATLAB® R2017a

The floating point arithmetic format that requires only 16 bits of storage is becoming increasingly popular. Also known as *half precision* or *binary16*, the format is useful when memory is a scarce resource.

The IEEE 754 standard, published in 1985, defines formats for floating point numbers that occupy 32 or 64 bits of storage. These formats are known as *binary32* and *binary64*, or more frequently as *single* and *double precision*. For many years MATLAB used only double precision and it remains our default format. Single precision has been added gradually over the last several years and is now also fully supported.

A revision of IEEE 754, published in 2008, defines a floating point format that occupies only 16 bits. Known as *binary16*, it is primarily intended to reduce storage and memory bandwidth requirements. Since it provides only "half" precision, its use for actual computation is problematic. An interesting discussion of its utility as an image processing format with increased dynamic range is provided by Industrial Light and Magic. Hardware support for half precision is now available on many processors, including the GPU in the Apple iPhone 7. Here is a link to an extensive article about half precision on the NVIDIA GeForce GPU.

The format of a floating point number is characterized by two parameters, `p`, the number of bits in the fraction and `q`, the number of bits in the exponent. I will consider four precisions, *quarter*, *half*, *single*, and *double*. The quarter-precision format is something that I just invented for this blog post; it is not standard and actually not very useful.

The four pairs of characterizing parameters are

p = [4, 10, 23, 52];

q = [3, 5, 8, 11];

With these values of `p` and `q`, and with one more bit for the sign, the total number of bits in the word, `w`, is a power of two.

w = p + q + 1

w = 8 16 32 64

**Normalized numbers**

Most floating point numbers are *normalized*, and are expressed as

$$ x = \pm (1+f)2^e $$

The *fraction* $f$ is in the half open interval

$$ 0 \leq f < 1 $$

The binary representation of $f$ requires at most `p` bits. In other words $2^p f$ is an integer in the range

$$ 0 \leq 2^p f < 2^p $$

The *exponent* $e$ is an integer in the range

$$ -b+1 \leq e \leq b $$

The quantity $b$ is both the largest exponent and the `bias`.

$$ b = 2^{q-1} - 1 $$

b = 2.^(q-1)-1

b = 3 15 127 1023

The fractional part of a normalized number is $1+f$, but only $f$ needs to be stored. That leading $1$ is known as the *hidden bit*.

**Subnormal**

There are two values of the exponent $e$ for which the biased exponent, $e+b$, reaches the smallest and largest values possible to represent in `q` bits. The smallest is

$$ e + b = 0 $$

The corresponding floating point numbers do not have a hidden leading bit. These are the *subnormal* or *denormal* numbers.

$$ x = \pm f 2^{-b} $$

**Infinity and Not-A-Number**

The largest possible biased exponent is

$$ e + b = 2^q - 1 $$

Quantities with this exponent field represent *infinities* and *NaN*, or *Not-A-Number*.

The percentage of floating point numbers that are exceptional because they are subnormal, infinity or NaN increases as the precision decreases. Exceptional exponents are only $2$ values out of $2^q$. For double precision this is $2/2^{11}$, which is less than a tenth of a percent, but for half precision it is $2/2^5$, which is more than 6 percent. And fully one-fourth of all my toy quarter precision floating point numbers are exceptional.

Encode the sign bit with `s = 0` for nonnegative and `s = 1` for negative. And encode the exponent with an offsetting bias, `b`. Then a floating point number can be packed in `w` bits with

x = [s e+b 2^p*f]

**epsilon**

If a real number cannot be expressed with a binary expansion requiring at most `p` bits, it must be approximated by a floating point number that does have such a binary representation. This is *roundoff error*. The important quantity characterizing precision is *machine epsilon*, or `eps`. In MATLAB, `eps(x)` is the distance from `x` to the next larger (in absolute value) floating point number. With no argument, `eps` is simply the difference between `1` and the next larger floating point number.

```
format shortg
eps = 2.^(-p)
```

eps = 0.0625 0.00097656 1.1921e-07 2.2204e-16

This tells us that double precision is good for about 16 decimal digits of accuracy, single for about 7 decimal digits, half for about 3, and quarter for barely more than one.

**realmax**

If a real number, or the result of an arithmetic operation, is too large to be represented, it *overflows* and is replaced by *infinity*. The largest floating point number that does not overflow is

realmax = 2.^b.*(2-eps)

realmax = 15.5 65504 3.4028e+38 1.7977e+308

**realmin**

*Underflow* and representation of very small numbers is more complicated. The smallest normalized floating point number is

realmin = 2.^(-b+1)

realmin = 0.25 6.1035e-05 1.1755e-38 2.2251e-308

**tiny**

But there are numbers smaller than `realmin`. IEEE 754 introduced the notion of *gradual underflow* and *denormal* numbers. In the 2008 revised standard their name was changed to *subnormal*.

Think of roundoff in numbers near underflow. Before 754, floating point numbers had the disconcerting property that `x` and `y` could be unequal, but their difference could underflow, so `x-y` becomes `0`. With 754 the gap between `0` and `realmin` is filled with numbers whose spacing is the same as the spacing between `realmin` and `2*realmin`. I like to call this spacing, and the smallest subnormal number, `tiny`.

tiny = realmin.*eps

tiny = 0.015625 5.9605e-08 1.4013e-45 4.9407e-324

**flintmax**

It is possible to do integer arithmetic with floating point numbers. I like to call such numbers *flints*. When we write the numbers $3$ and $3.0$, they are different descriptions of the same integer, but we think of one as fixed point and the other as floating point. The largest flint is `flintmax`.

flintmax = 2./eps

flintmax = 32 2048 1.6777e+07 9.0072e+15

Technically all the floating point numbers larger than `flintmax` are integers, but the spacing between them is larger than one, so it is not safe to use them for integer arithmetic. Only integer-valued floating point numbers between `0` and `flintmax` are allowed to be called flints.

Let's collect all these anatomical characteristics together in a new MATLAB `table`.

```
T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];
T = table(T(:,1), T(:,2), T(:,3), T(:,4), ...
    'variablenames',{'quarter','half','single','double'}, ...
    'rownames',{'w','p','q','b','eps','realmax','realmin', ...
    'tiny','flintmax'});
disp(T)
```

```
            quarter      half       single       double
            ________  __________  __________  ___________
w                  8          16          32           64
p                  4          10          23           52
q                  3           5           8           11
b                  3          15         127         1023
eps           0.0625  0.00097656  1.1921e-07   2.2204e-16
realmax         15.5       65504  3.4028e+38  1.7977e+308
realmin         0.25  6.1035e-05  1.1755e-38  2.2251e-308
tiny        0.015625  5.9605e-08  1.4013e-45  4.9407e-324
flintmax          32        2048  1.6777e+07   9.0072e+15
```

Version 3.1 of Cleve's Laboratory includes code for objects `@fp8` and `@fp16` that begin to provide full implementations of quarter-precision and half-precision arithmetic.

The methods currently provided are

methods(fp16)

```
Methods for class fp16:

abs          eps     isfinite   mrdivide   rem         subsref
binary       eq      le         mtimes     round       svd
ctranspose   fix     lt         ne         sign        tril
diag         fp16    lu         norm       single      triu
disp         ge      max        plus       size        uminus
display      gt      minus      realmax    subsasgn
double       hex     mldivide   realmin    subsindex
```

These provide only partial implementations because the arithmetic is not done on the compact forms. We cheat. For each individual scalar operation, the operands are unpacked from their short storage into old fashioned doubles. The operation is then carried out by existing double precision code and the results returned to the shorter formats. This simulates the reduced precision and restricted range, but requires relatively little new code.
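The same cheat is easy to imitate in Python (my own sketch, not the MATLAB code), since the `struct` module supports the binary16 interchange format directly:

```python
import struct

def to_half_bits(x):
    # Round a double to binary16 storage ('e' format) and back to double.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def fp16_plus(x, y):
    # The @fp16/plus cheat: unpack to double, add with ordinary double
    # arithmetic, then round the result back into 16-bit storage.
    return to_half_bits(to_half_bits(x) + to_half_bits(y))

# 1/3 rounds to 0.333251953125 in half precision (as in the table below),
# and the half precision sum 1/3 + 1/3 gives 0.66650390625.
```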

All of the work is done in the constructors `@fp8/fp8.m` and `@fp16/fp16.m` and what we might call the "deconstructors" `@fp8/double.m` and `@fp16/double.m`. The constructors convert ordinary floating point numbers to reduced precision representations by packing as many of the 32 or 64 bits as will fit into 8 or 16 bit words. The deconstructors do the reverse by unpacking things.

Once these methods are available, almost everything else is trivial. The code for most operations is like this one for the overloaded addition.

```
type @fp16/plus.m
```

```
function z = plus(x,y)
   z = fp16(double(x) + double(y));
end
```

The Wikipedia page about half-precision includes several 16-bit examples with the sign, exponent, and fraction fields separated. I've added a couple more.

```
0 01111 0000000000 = 1
0 00101 0000000000 = 2^-10 = eps
0 01111 0000000001 = 1+eps = 1.0009765625 (next smallest float after 1)
1 10000 0000000000 = -2
0 11110 1111111111 = 65504 (max half precision) = 2^15*(2-eps)
0 00001 0000000000 = 2^-14 = r ~ 6.10352e-5 (minimum positive normal)
0 00000 1111111111 = r*(1-eps) ~ 6.09756e-5 (maximum subnormal)
0 00000 0000000001 = r*eps ~ 5.96046e-8 (minimum positive subnormal)
0 00000 0000000000 ~ r*eps/2 (underflow to zero)
0 00000 0000000000 = 0
1 00000 0000000000 = -0
0 11111 0000000000 = infinity
1 11111 0000000000 = -infinity
0 11111 1111111111 = NaN
0 01101 0101010101 = 0.333251953125 ~ 1/3
```

This provides my test suite for checking `fp16` operations on scalars.

clear
zero = fp16(0);
one = fp16(1);
eps = eps(one);
r = realmin(one);
tests = {'1','eps','1+eps','-2','2/r*(2-eps)', ...
    'r','r*(1-eps)','r*eps','r*eps/2', ...
    'zero','-zero','1/zero','-1/zero','zero/zero','1/3'};

Let's run the tests.

for t = tests(:)'
    x = eval(t{:});
    y = fp16(x);
    z = binary(y);
    w = double(y);
    fprintf(' %18s %04s %19.10g %19.10g %s\n', ...
        z,hex(y),double(x),w,t{:})
end

 0 01111 0000000000   3C00                 1                 1   1
 0 00101 0000000000   1400      0.0009765625      0.0009765625   eps
 0 01111 0000000001   3C01       1.000976563       1.000976563   1+eps
 1 10000 0000000000   C000                -2                -2   -2
 0 11110 1111111111   7BFF             65504             65504   2/r*(2-eps)
 0 00001 0000000000   0400   6.103515625e-05   6.103515625e-05   r
 0 00000 1111111111   03FF   6.097555161e-05   6.097555161e-05   r*(1-eps)
 0 00000 0000000001   0001   5.960464478e-08   5.960464478e-08   r*eps
 0 00000 0000000001   0001   5.960464478e-08   5.960464478e-08   r*eps/2
 0 00000 0000000000   0000                 0                 0   zero
 0 00000 0000000000   0000                 0                 0   -zero
 0 11111 0000000000   7C00               Inf               Inf   1/zero
 1 11111 0000000000   FC00              -Inf              -Inf   -1/zero
 1 11111 1111111111   FFFF               NaN               NaN   zero/zero
 0 01101 0101010101   3555      0.3333333333      0.3332519531   1/3

Most of the methods in `@fp8` and `@fp16` handle matrices. The 4-by-4 magic square from Dürer's Melencolia I provides my first example.

```
clear
format short
M = fp16(magic(4))
```

M =
    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1

Let's see how the packed 16-bit elements look in binary.

B = binary(M)

B = 4×4 string array
  Columns 1 through 3
    "0 10011 0000000000"    "0 10000 0000000000"    "0 10000 1000000000"
    "0 10001 0100000000"    "0 10010 0110000000"    "0 10010 0100000000"
    "0 10010 0010000000"    "0 10001 1100000000"    "0 10001 1000000000"
    "0 10001 0000000000"    "0 10010 1100000000"    "0 10010 1110000000"
  Column 4
    "0 10010 1010000000"
    "0 10010 0000000000"
    "0 10010 1000000000"
    "0 01111 0000000000"

Check that the row sums are all equal. This matrix-vector multiply can be done exactly with the flints in the magic square.

e = fp16(ones(4,1))
Me = M*e

e =
     1
     1
     1
     1
Me =
    34
    34
    34
    34

I've overloaded `mldivide`, so I can solve linear systems and compute inverses. The actual computation is done by `lutx`, a "textbook" function that I wrote years ago, long before this half-precision project. But now the MATLAB object system ensures that every individual arithmetic operation is done on unpacked `fp16` numbers.
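`lutx` itself is not listed in this post, but the textbook algorithm it implements, Gaussian elimination with partial pivoting, fits in a few lines. This Python sketch (mine, not Cleve's code) shows the kind of element-by-element operations that the object system reroutes through `fp16` arithmetic:

```python
def lu_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    A = [row[:] for row in A]        # work on copies
    b = b[:]
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))   # pivot row
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]    # multiplier
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):   # back substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

print(lu_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0]))   # [1.0, 1.0]
```

In the MATLAB version every subtraction, multiplication, and division here would round its result back to half precision.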

Let's generate a 5-by-5 matrix with random two-digit integer entries.

A = fp16(randi(100,5,5))

A =
    76    71    83    44    49
    75     4    70    39    45
    40    28    32    77    65
    66     5    96    80    71
    18    10     4    19    76

I am going to use `fp16` backslash to invert `A`. So I need the identity matrix in half precision.

I = fp16(eye(5))

I =
     1     0     0     0     0
     0     1     0     0     0
     0     0     1     0     0
     0     0     0     1     0
     0     0     0     0     1

Now the overloaded backslash calls `lutx` to compute the inverse.

X = A\I

X =
   -0.0044    0.0346    0.0125   -0.0254   -0.0046
    0.0140   -0.0116    0.0026   -0.0046   -0.0002
    0.0071   -0.0176   -0.0200    0.0238    0.0008
   -0.0060   -0.0020    0.0200    0.0006   -0.0125
    0.0003   -0.0052   -0.0072    0.0052    0.0174

Compute the residual.

AX = A*X
R = I - AX

AX =
    1.0000   -0.0011   -0.0002   -0.0007   -0.0001
   -0.0001    0.9990   -0.0001   -0.0007   -0.0001
   -0.0001   -0.0005    1.0000   -0.0003   -0.0002
   -0.0002   -0.0011   -0.0001    0.9995   -0.0002
   -0.0000   -0.0001    0.0001   -0.0001    1.0000
R =
         0    0.0011    0.0002    0.0007    0.0001
    0.0001    0.0010    0.0001    0.0007    0.0001
    0.0001    0.0005         0    0.0003    0.0002
    0.0002    0.0011    0.0001    0.0005    0.0002
    0.0000    0.0001   -0.0001    0.0001         0

Both `AX` and `R` are what I expect from arithmetic that is accurate to only about three decimal digits.

Although I get a different random `A` every time I publish this blog post, I expect that it has a modest condition number.

kappa = cond(A)

kappa = 15.7828

Since `A` is not badly conditioned, I can invert the computed inverse and expect to get close to the original integer matrix.

Z = X\I

Z =
   76.1250   71.0000   83.1250   44.1250   49.1250
   75.1250    4.0234   70.1875   39.1250   45.1250
   40.0625   28.0000   32.0625   77.0000   65.0625
   66.1250    5.0234   96.1875   80.1250   71.1250
   18.0156   10.0000    4.0156   19.0156   76.0000

I have just nonchalantly computed `cond(A)`. But `cond` isn't on the list of overloaded methods for `fp16`. I was pleasantly surprised to find that `matlab\matfun\cond.m` quietly worked on this new datatype. Here is the core of that code.

dbtype cond 34:43, dbtype cond 47

34    if p == 2
35        s = svd(A);
36        if any(s == 0)   % Handle singular matrix
37            c = Inf(class(A));
38        else
39            c = max(s)./min(s);
40            if isempty(c)
41                c = zeros(class(A));
42            end
43        end
47    end

So it is correctly using the singular value decomposition, and I have `svd` overloaded. The SVD computation is handled by a 433 line M-file, `svdtx`, that, like `lutx`, was written before `fp16` existed.

Let's compute the SVD again.

[U,S,V] = svd(A)

U =
   -0.5210   -0.4841    0.6802   -0.0315    0.1729
   -0.4260   -0.2449   -0.3572   -0.4561   -0.6504
   -0.4058    0.4683    0.1633    0.6284   -0.4409
   -0.5786    0.0268   -0.5620    0.1532    0.5703
   -0.2174    0.6968    0.2593   -0.6104    0.1658
S =
  267.5000         0         0         0         0
         0   71.1875         0         0         0
         0         0   55.5000         0         0
         0         0         0   37.3750         0
         0         0         0         0   16.9531
V =
   -0.4858   -0.3108   -0.0175   -0.3306   -0.7471
   -0.2063   -0.2128    0.9238    0.2195    0.1039
   -0.5332   -0.5205   -0.2920   -0.0591    0.5967
   -0.4534    0.2891   -0.2050    0.7993   -0.1742
   -0.4812    0.7095    0.1384   -0.4478    0.2126

Reconstruct `A` from its half precision SVD. It's not too shabby.

USVT = U*S*V'

USVT =
   75.9375   71.0000   83.0625   44.0313   49.0000
   75.0000    4.0117   70.0625   38.9688   45.0000
   40.0313   28.0469   32.0313   77.0625   65.0625
   66.0000    4.9688   96.0625   80.0000   71.0000
   18.0313   10.0234    4.0156   19.0313   76.0000

Finally, verify that we've been working all this time with `fp16` objects.

whos

  Name      Size            Bytes  Class     Attributes

  A         5x5               226  fp16
  AX        5x5               226  fp16
  B         4x4              1576  string
  I         5x5               226  fp16
  M         4x4               208  fp16
  Me        4x1               184  fp16
  R         5x5               226  fp16
  S         5x5               226  fp16
  U         5x5               226  fp16
  USVT      5x5               226  fp16
  V         5x5               226  fp16
  X         5x5               226  fp16
  Z         5x5               226  fp16
  e         4x1               184  fp16
  kappa     1x1                 8  double

I introduced a `calculator` in my blog post about Roman numerals. Version 3.1 of Cleve's Laboratory also includes a fancier version of the calculator that computes in four different precisions -- quarter, half, single, and double -- and displays the results in four different formats -- decimal, hexadecimal, binary, and Roman.

I like to demonstrate the calculator by clicking on the keys

1 0 0 0 / 8 1 =

because the decimal expansion is a repeating `.123456790`.

Thanks to MathWorkers Ben Tordoff, Steve Eddins, and Kiran Kintali who provided background and pointers to work on half precision.


Published with MATLAB® R2017a

A MATLAB object for arithmetic with Roman numerals provides an example of object oriented programming. I had originally intended this as my April Fools post, but I got fascinated and decided to make it the subject of a legitimate article.

I've always been interested in Roman numerals. In my former life as a professor, when I taught the beginning computer programming course, one of my projects involved Roman numerals.

Doing arithmetic with Roman numerals is tricky. What is IV + VI ? Stop reading this blog for a moment and compute this sum.

Did you find that IV + VI = X ? How did you do that? You probably converted IV and VI to decimal, did the addition using decimal arithmetic, then converted the result back to Roman. You computed 4 + 6 = 10. That is also how my `roman` object works. I have no idea how the Romans did it without decimal arithmetic to rely on.

Recall that Roman numerals are formed from seven letters

I, V, X, L, C, D, and M.

Their values, expressed in decimal, are

1, 5, 10, 50, 100, 500, and 1000.

A Roman numeral is just a string of these letters, usually in decreasing order, like MMXVII. The value of the string is the sum of the values of the individual letters, that is 1000+1000+10+5+1+1, which is this year, 2017. But sometimes the letters are out of order. If one letter is followed by another with higher value, then the value of the first letter is subtracted from, rather than added to, the sum. Two years from now will be MMXIX, which is 1000+1000+10-1+10 = 2019.

I decided to jazz things up a bit by extending Roman numerals to negative and fractional values. So I allow for a unary minus sign at the beginning of the string, and I allow lower case letters for fractions. The value of a lower case letter is the value of the corresponding upper case letter divided by 1000. Here are a few examples.

vi = 6/1000

-c = -100/1000 = -0.1

mmxvii = 2017/1000 = 2.017

These extensions introduce some aspects of floating point arithmetic to the system. Upper case letters evaluate to integers, equally spaced with an increment of one. Lower case letters evaluate to fractional values less than one (if you leave off 'm'), with an increment of 1/1000.

Here is a function that accepts any string, looks for the fourteen letters, and sums their positive or negative values.

```
type roman_eval_string
```

function n = roman_eval_string(s)
% Convert a string to the .n component of a Roman numeral.
   D = 'IVXLCDM';
   v = [1 5 10 50 100 500 1000];
   D = [D lower(D)];
   v = [v v/1000];
   n = 0;
   t = 0;
   for d = s
      k = find(d == D);
      if ~isempty(k)
         u = v(k);
         if t < u
            n = n - t;
         else
            n = n + t;
         end
         t = u;
      end
   end
   n = n + t;
   if ~isempty(s) && s(1)=='-'
      n = -n;
   end
end

Fractional values were obtained by adding just these two lines of code.

D = [D lower(D)];
v = [v v/1000];

Negative values come from the sign test at the end of the function.

Let's try it.

```
n = roman_eval_string('MMXIX')
```

n = 2019

This will evaluate any string. No attempt is made to check for "correct" strings.

```
n = roman_eval_string('DDDCDVIVIIIIIVIIIIC')
```

n = 2019

The subtraction rule is not always used. Clocks with Roman numerals for the hours sometimes denote 4 o'clock by IIII and sometimes by IV. So representations are not unique and correctness is elusive.

four = roman_eval_string('IIII')
four = roman_eval_string('IV')

four =
     4
four =
     4

Objects were introduced with MATLAB 5 in 1996. My first example of a MATLAB object was this `roman` object. I now have a directory `@roman` on my path. It includes all of the functions that define *methods* for the `roman` object. First and foremost is the *constructor*.

```
type @roman/roman
```

function r = roman(a)
%ROMAN Roman numeral class constructor.
%  r = ROMAN(a) converts a number or a string to a Roman numeral.
%
%  A roman object retains its double precision numeric value.
%  The string representations of classic Roman numerals use just upper case
%  letters.  Our "floating point" numerals use both upper and lower case.
%
%     I    1      i    1/1000 = .001
%     V    5      v    5/1000 = .005
%     X   10      x   10/1000 = .01
%     L   50      l   50/1000 = .05
%     C  100      c  100/1000 = .1
%     D  500      d  500/1000 = .5
%     M 1000      m 1000/1000 = 1
%
%  The value of a string is the sum of the values of its letters,
%  except a letter followed by one of higher value is subtracted.
%
%  Values >= decimal 4000 are represented by 'MMMM'.
%  Decimal 0 is represented by blanks.
%
%  Blog: http://blogs.mathworks.com/cleve/2017/04/24.
%  See also: calculator.

   if nargin == 0
      r.n = [];
      r = class(r,'roman');
      return
   elseif isa(a,'roman')
      r = a;
      return
   elseif isa(a,'char')
      a = roman_eval_string(a);
   end
   r.n = a;
   r = class(r,'roman');
end % roman

If the input `a` is already a `roman` object, the constructor just returns it. If `a` is a string, such as 'MMXIX', the constructor calls `roman_eval_string` to convert `a` to a number `n`.

Finally the constructor creates a `roman` object `r` containing `a` in its only field, the numeric value `r.n`. Consequently, we see that a `roman` object is just an ordinary double precision floating point number masquerading in this Homeric garb.

For example

r = roman(2019)

r = 'MMXIX'

Why did `roman(2019)` print `MMXIX` in that last example? That's because the object system calls upon `@roman/display`, which in turn calls `@roman/char`, to produce the output printed in the command window. Here is the crucial function `@roman/char` that converts the numerical field to its Roman representation.

```
type @roman/char
```

function sea = char(r)
% char  Generate Roman representation of Roman numeral.
%   c = CHAR(r) converts an @roman number or matrix to a
%   cell array of character strings.
   rn = r.n;
   [p,q] = size(rn);
   sea = cell(p,q);
   for k = 1:p
      for j = 1:q
         if isempty(rn(k,j))
            c = '';
         elseif isinf(rn(k,j)) || rn(k,j) >= 4000
            c = 'MMMM';
         else
            % Integer part
            n = fix(abs(rn(k,j)));
            f = abs(rn(k,j)) - n;
            c = roman_flint2rom(n);
            % Fractional part, thousandths.
            if f > 0
               fc = roman_flint2rom(round(1000*f));
               c = [c lower(fc)];
            end
            % Adjust sign
            if rn(k,j) < 0
               c = ['-' c];
            end
         end
         sea{k,j} = c;
      end
   end
end % roman/char

The heavy lifting is done by this function which generates the character representation of an integer.

```
type roman_flint2rom
```

function c = roman_flint2rom(x)
   D = {'','I','II','III','IV','V','VI','VII','VIII','IX'
        '','X','XX','XXX','XL','L','LX','LXX','LXXX','XC'
        '','C','CC','CCC','CD','D','DC','DCC','DCCC','CM'
        '','M','MM','MMM',' ',' ',' ',' ',' ',' '};
   n = max(fix(x),0);
   i = 1;
   c = '';
   while n > 0
      c = [D{i,rem(n,10)+1} c];
      n = fix(n/10);
      i = i + 1;
   end
end

The functions `roman_eval_string` and `roman_flint2rom` are essentially inverses of each other. One converts a string of letters to a number and the other converts a number to a string of letters.

Converting a string to a numeric value and then converting it back to a string enforces a canonical representation of the result. So a nonconventional Roman numeral gets rectified.
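The round-trip idea is easy to see in a few lines of Python (integer part only; the function names and tables are mine, mirroring `roman_eval_string` and `roman_flint2rom`):

```python
VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

def roman_value(s):
    """Sum letter values; a letter preceding a larger one is subtracted."""
    n, t = 0, 0
    for ch in s:
        u = VALUES[ch]
        n = n - t if t < u else n + t   # settle the previous letter
        t = u
    return n + t

DIGITS = [('','I','II','III','IV','V','VI','VII','VIII','IX'),
          ('','X','XX','XXX','XL','L','LX','LXX','LXXX','XC'),
          ('','C','CC','CCC','CD','D','DC','DCC','DCCC','CM'),
          ('','M','MM','MMM')]

def roman_string(n):
    """Convert a positive integer below 4000 to canonical Roman form."""
    c, i = '', 0
    while n > 0:
        c = DIGITS[i][n % 10] + c
        n //= 10
        i += 1
    return c

# Round-tripping rectifies a nonconventional numeral:
print(roman_string(roman_value('MMXVIIII')))   # MMXIX
```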

```
r = roman('MMXVIIII')
```

r = 'MMXIX'

The crucial intermediate quantity in the previous example was the numeric value 2019. That can be uncovered with a one-liner.

```
type @roman/double
```

function n = double(r)
%DOUBLE Convert Roman numeral to double.
%   n = double(r) is the numeric value of a Roman numeral.
   n = r.n;
end % roman/double

year = double(r)

year = 2019

Here are all the operations that I can currently do with the `roman` class. I've overloaded just a handful to provide a proof of concept.

methods(r)

Methods for class roman:

char, disp, display, double, minus, mldivide, mrdivide, mtimes, plus, roman

Binary arithmetic operations on `roman` objects are easy. Make sure both operands are `roman` and then do the arithmetic on the numeric fields.

```
type @roman/plus
```

function r = plus(p,q)
   p = roman(p);
   q = roman(q);
   r = roman(p.n + q.n);
end % roman/plus

So this is why IV + VI = X is just 4 + 6 = 10 under the covers.

r = roman('IV') + roman('VI')

r = 'X'

Did you notice that the output method `char` will handle matrices? Let's try one. Magic squares have integer elements. Here is the 4-by-4 from Dürer's Melencolia I.

M = roman(magic(4))

M =
    'XVI'     'II'      'III'     'XIII'
    'V'       'XI'      'X'       'VIII'
    'IX'      'VII'     'VI'      'XII'
    'IV'      'XIV'     'XV'      'I'

Check that its row sums are all the same.

e = roman(ones(4,1))
Me = M*e

e =
    'I'
    'I'
    'I'
    'I'
Me =
    'XXXIV'
    'XXXIV'
    'XXXIV'
    'XXXIV'

I've overloaded `mldivide`, so I can solve linear systems and compute inverses. All the elements of the 4-by-4 inverse Hilbert matrix are integers, but some are larger than 4000, so I'll scale the matrix by a factor of 2.

X = invhilb(4)/2
A = roman(X)

X =
           8         -60         120         -70
         -60         600       -1350         840
         120       -1350        3240       -2100
         -70         840       -2100        1400
A =
    'VIII'      '-LX'        'CXX'        '-LXX'
    '-LX'       'DC'         '-MCCCL'     'DCCCXL'
    'CXX'       '-MCCCL'     'MMMCCXL'    '-MMC'
    '-LXX'      'DCCCXL'     '-MMC'       'MCD'

Inverting and rescaling `A` should produce the Hilbert matrix itself, where all of the elements are familiar fractions. I'll need the identity matrix, suitably scaled.

I = roman(eye(4))/2

I =
    'd'    ''     ''     ''
    ''     'd'    ''     ''
    ''     ''     'd'    ''
    ''     ''     ''     'd'

Now I employ backslash to compute the inverse. Do you recognize the familiar fractions?

H = A\I

H =
    'm'            'd'            'cccxxxiii'    'ccl'
    'd'            'cccxxxiii'    'ccl'          'cc'
    'cccxxxiii'    'ccl'          'cc'           'clxvii'
    'ccl'          'cc'           'clxvii'       'cxliii'

Here is some homework: why is `H(1,1)` represented by `'m'`, when it should be `'I'`?

Finally, check the residual. It's all zero -- to the nearest thousandths.

R = I - A*H

R =
    ''     ''     ''     '-'
    ''     ''     ''     ''
    ''     ''     ''     ''
    ''     '-'    '-'    ''

Two gizmos that exhibit the `roman` object are included in Version 3.0 of Cleve's Laboratory. One is a calculator.

calculator(2017)

The clock captures the date and time whenever I publish this blog.

roman_clock_snapshot

OK, let's quit foolin' around and get back to serious business.


A report about a possible bug in `format bank` and a visit to a local hardware store made me realize that doing decimal arithmetic with binary floating point numbers is like tightening a European bolt with an American socket wrench.

It's mid-April and so those of us who file United States income taxes have that chore to do. Years ago, I used MATLAB to help with my taxes. I had a program named `form1040.m` that had one statement for each line on the tax form. I just had to enter my income and deductions. Then MATLAB would do all the arithmetic.

If we're really meticulous about our financial records, we keep track of things to the nearest penny. So line 28 of `form1040` might have been something like

interest = 48.35

interest = 48.3500

I didn't like those trailing zeros in the output. So I introduced

```
format bank
```

into MATLAB. Now

interest

interest = 48.35

is printed with just two digits after the decimal point.

`format bank` has turned out to be useful more broadly and is still in today's MATLAB.

We recently had a user ask about the rounding rule employed by `format bank`. What if a value falls halfway between two possible outputs? Which is chosen, and why?

Here's the example that prompted the user's query. Start with

```
format short
```

so we can see four decimal places.

x = (5:10:45)'/1000
y = 1+x
z = 2+x

x =
    0.0050
    0.0150
    0.0250
    0.0350
    0.0450
y =
    1.0050
    1.0150
    1.0250
    1.0350
    1.0450
z =
    2.0050
    2.0150
    2.0250
    2.0350
    2.0450

These values appear to fall halfway between pairs of two-digit decimal fractions. Let's see how the ties are broken.

```
format bank
x
y
z
```

x =
          0.01
          0.01
          0.03
          0.04
          0.04
y =
          1.00
          1.01
          1.02
          1.03
          1.04
z =
          2.00
          2.02
          2.02
          2.04
          2.04

Look at the last digits. What mysterious hand is at work here? Three of the `x`'s, `x(1)`, `x(3)`, and `x(4)`, have been rounded up. None of the `y`'s have been rounded up. Two of the `z`'s, `z(2)` and `z(4)`, have been rounded up.

Email circulated internally at MathWorks for a few days after this was reported suggesting various explanations. Is it a bug in `format bank`? In the I/O library? Does it depend upon which compiler was used to build MATLAB? Do we see the same behavior on the PC and the Mac? Has it always been this way? These are the usual questions that we ask ourselves when we see curious behavior.

Do you know what's happening?

Well, none of the suspects I just mentioned is the culprit. The fact is none of the numbers fall exactly on a midpoint. Think binary, not decimal. A value like 0.005 expressed as a decimal fraction cannot be represented exactly as a binary floating point number. Decimal fractions fall in between the binary fractions, and rarely fall precisely half way.
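You can check this claim with exact rational arithmetic; Python's standard library has it built in (a quick sketch, not part of the blog's MATLAB code):

```python
# The double closest to a decimal fraction is rarely the fraction itself.
# Fraction(x) recovers the exact binary value stored in the double x.
from fractions import Fraction

print(Fraction(0.005) > Fraction(5, 1000))      # True: stored a bit high
print(Fraction(1.005) < Fraction(1005, 1000))   # True: stored a bit low
```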

To understand what is going on, the Symbolic Toolbox function

sym(x,'e')

is your friend. The `'e'` flag is provided for this purpose. The `help` entry says

'e' stands for 'estimate error'. The 'r' form is supplemented by a term involving the variable 'eps' which estimates the difference between the theoretical rational expression and its actual floating point value. For example, sym(3*pi/4,'e') is 3*pi/4-103*eps/249.

To see how this works for the situation encountered here.

symx = sym(x,'e')
symy = sym(y,'e')
symz = sym(z,'e')

symx =
   eps/2133 + 1/200
    3/200 - eps/400
     eps/160 + 1/40
  (3*eps)/200 + 7/200
  9/200 - (3*eps)/400
symy =
  201/200 - (12*eps)/25
  203/200 - (11*eps)/25
      41/40 - (2*eps)/5
   207/200 - (9*eps)/25
   209/200 - (8*eps)/25
symz =
  401/200 - (12*eps)/25
  (14*eps)/25 + 403/200
      81/40 - (2*eps)/5
  (16*eps)/25 + 407/200
   409/200 - (8*eps)/25

The output is not as clear as I would like to see it, but I can pick off the sign of the error terms and find

x    +  -  +  +  -
y    -  -  -  -  -
z    -  +  -  +  -

Or, I can compute the signs with

```
format short
sigx = sign(double(symx - x))
sigy = sign(double(symy - y))
sigz = sign(double(symz - z))
```

sigx =
     1    -1     1     1    -1
sigy =
    -1    -1    -1    -1    -1
sigz =
    -1     1    -1     1    -1

The sign of the error term tells us whether the floating point numbers stored in `x`, `y` and `z` are larger or smaller than the anticipated decimal fractions. After this initial input, there is essentially no more roundoff error. `format bank` will round up or down from the decimal value accordingly. Again, it is just doing its job on the input it is given.
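The same effect shows up in any correctly rounded two-digit formatting, not just MATLAB's. For instance, in Python:

```python
# Two-digit formatting rounds the stored binary value, not the decimal
# literal, so these apparent ties split both ways.
print(f"{0.005:.2f}")   # 0.01  (stored double is a bit above 0.005)
print(f"{1.005:.2f}")   # 1.00  (stored double is a bit below 1.005)
print(f"{2.005:.2f}")   # 2.00  (stored double is a bit below 2.005)
```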

Photo credit: http://toolguyd.com/

A high-end socket wrench set includes both metric (left) and fractional inch (right) sizes. Again, think decimal and binary. Metric sizes of nuts and bolts and wrenches are specified in decimal fractions, while the denominators in fractional inch sizes are powers of two.

Conversion charts between the metric and fractional inch standards abound on the internet. Here is a link to one of them: Wrench Conversion Chart.

But we can easily compute our own conversion chart. And in the process compute `delta`, the relative error made when the closest binary wrench is used on a decimal bolt.

make_chart

  Inch          Metric         delta
  1/64   0.016
  1/32   0.031
  3/64   0.047    1mm  0.039   -0.191
  1/16   0.063
  5/64   0.078    2mm  0.079    0.008
  3/32   0.094
  7/64   0.109
  1/8    0.125    3mm  0.118   -0.058
  9/64   0.141
  5/32   0.156    4mm  0.157    0.008
  11/64  0.172
  3/16   0.188
  13/64  0.203    5mm  0.197   -0.032
  7/32   0.219
  15/64  0.234    6mm  0.236    0.008
  1/4    0.250
  9/32   0.281    7mm  0.276   -0.021
  5/16   0.313    8mm  0.315    0.008
  11/32  0.344    9mm  0.354    0.030
  3/8    0.375
  13/32  0.406   10mm  0.394   -0.032
  7/16   0.438
  15/32  0.469   12mm  0.472    0.008
  1/2    0.500
  9/16   0.563   14mm  0.551   -0.021
  5/8    0.625   16mm  0.630    0.008
  11/16  0.688   18mm  0.709    0.030
  3/4    0.750
  13/16  0.813   20mm  0.787   -0.032
  7/8    0.875   22mm  0.866   -0.010
  15/16  0.938   24mm  0.945    0.008
  1      1.000

Let's plot those relative errors. Except for the small sizes, where this set doesn't have enough wrenches, the relative error is only a few percent. But that's still enough to produce a damaging fit on a tight nut.

bar(k,d)
axis([0 25 -.05 .05])
xlabel('millimeters')
ylabel('delta')

You might notice that my conversion chart, like all such charts, and like the wrenches themselves, has a little bit of floating point character. The spacing of the entries is not uniform. The spacing between the binary values is 1/64, then 1/32, then 1/16. The spacing of the metric values is 1mm at the top of the chart and 2mm later on.

Suppose we want to tighten a 10mm nut and all we have are binary wrenches. The diameter of the nut in inches is

meter = 39.370079;
d = 10*meter/1000

d = 0.3937

Consulting our chart, we see that a 13/32 wrench is the best fit, but it's a little too large. 10mm lies between these two binary values.

[floor(32*d) ceil(32*d)]
b1 = 12/32;
b2 = 13/32;
[b1 d b2]'

ans =
    12    13
ans =
    0.3750
    0.3937
    0.4063

The fraction of the interval is

f = (d - b1)/(b2 - b1)

f = 0.5984

10mm is about 60% of the way from 12/32 inches to 13/32 inches.
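The little interval computation is ordinary arithmetic; here is a Python transcription using the same numbers as above:

```python
# Where does a 10mm nut fall between the neighboring binary wrench sizes?
meter = 39.370079            # inches per meter, as in the post
d = 10 * meter / 1000        # 10 mm expressed in inches
b1, b2 = 12/32, 13/32        # neighboring 1/32-inch sizes
f = (d - b1) / (b2 - b1)     # fractional position in the interval
print(round(f, 4))           # 0.5984
```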

Now let's turn to floating point numbers. What happens when we execute this MATLAB statement?

x = 1/10

x = 0.1000

One is divided by ten and the closest floating point number is stored in `x`. The same value is produced by the statement

x = 0.1

x = 0.1000

The resulting `x` lies in the interval between 1/16 and 1/8. The floating point numbers in this interval are uniformly spaced with a separation of

e = eps(1/16)

e = 1.3878e-17

This is `2^-56`. Let

e = sym(e)

e = 1/72057594037927936

This value `e` plays the role that 1/64 plays for my wrenches.

The result of `x = 1/10` lies between these two binary fractions.

b1 = floor(1/(10*e))*e
b2 = ceil(1/(10*e))*e

b1 =
7205759403792793/72057594037927936
b2 =
3602879701896397/36028797018963968

The tiny interval of length `e` is

c = [b1 1/10 b2]'

c =
  7205759403792793/72057594037927936
  1/10
  3602879701896397/36028797018963968

In decimal

vpa(c)

ans =
  0.099999999999999991673327315311326
  0.1
  0.10000000000000000555111512312578

Where in this interval does 1/10 lie?

f = double((1/10 - b1)/e)

f = 0.6000

So 1/10 is about 60% of the way from `b1` to `b2` and so is closer to `b2` than to `b1`. The two statements

x = 1/10
x = double(b2)

x =
    0.1000
x =
    0.1000

store exactly the same value in `x`.
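The whole computation can be repeated with exact rationals from the Python standard library (a sketch; `Fraction(0.1)` recovers the exact binary value stored in the double):

```python
from fractions import Fraction
import math

e = Fraction(1, 2**56)            # spacing of doubles in [1/16, 1/8)
k = Fraction(1, 10) / e           # 1/10 measured in units of e
b1 = math.floor(k) * e            # binary neighbor below 1/10
b2 = math.ceil(k) * e             # binary neighbor above 1/10
print(float((Fraction(1, 10) - b1) / e))   # 0.6
print(Fraction(0.1) == b2)                 # True: 1/10 rounds up to b2
```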

The 13/32 wrench is the closest tool in the binary collection to the 10mm nut.


I've long known that my Erdös Number is 3. This means that the length of the path on the graph of academic coauthorship between me and mathematician Paul Erdös is 3. Somewhat to my surprise, I recently discovered that I can also trace a chain of coauthorship to Donald J. Trump. My Trump number is 5.

Paul Erdös. Photo: http://www.iitvidya.com/paul-erdos.

The *collaborative distance* between two authors is the length of the path of coauthorship of scientific papers, books, and articles connecting the two. If A and B are coauthors, then the collaborative distance between them is 1. Furthermore, if B and C are also coauthors, then the collaborative distance between A and C is 2. And so on. If there is no chain of coauthorship, then the collaborative distance is infinite.

Paul Erdös (1911-1996) was the world's most prolific modern mathematician. He wrote 1,523 papers with 511 distinct coauthors. These 511 people have Erdös number equal to 1. And these authors have, in turn, written papers with over 11,000 other people. That means that over 11,000 people have Erdös number of 2. My estimate is that a few hundred thousand people have Erdös number of 3. I'm one of them.

There are two length three paths of coauthorship between me and Erdös. Both go through my thesis advisor, George Forsythe. I wrote a blog post about Forsythe a few years ago. Forsythe has an Erdös number of 2 in two different ways because he wrote papers with his thesis advisor, William Feller, and with Ernst Straus, both of whom had worked directly with Erdös.

An Erdös Number calculator is available at MathSciNet Collaboration Distance. More than you ever wanted to know about Erdös numbers is available at the Oakland University Erdös Number Project.

Steve Johnson is a buddy of mine who also has Erdös number of 3. His path goes through Jeff Ullman and Ron Graham to Paul Erdös. But that's not why I bring him up today. In the 1970's Steve was part of the group at Bell Labs that developed Unix. He wrote the Unix tool Yacc (Yet Another Compiler Compiler), as well as the original C compiler, PCC, (Portable C Compiler). He is coauthor, along with three other Unix guys, of a paper about C in the 1978 issue of the Bell System Technical Journal that was devoted entirely to Unix.

Steve and I wrote a paper about compiling MATLAB, although the references to that paper on the Internet have my name spelled incorrectly. It is through this paper that I was surprised to find that my collaborative distance from Donald J. Trump is only 5.

Finite Trump numbers are possible because Trump's famous book, "The Art of the Deal", was actually ghost-written by a free-lance writer named Tony Schwartz. And Schwartz has coauthored many articles and books with other people. Some of these articles might not exactly be classified as academic papers, but what the heck.

The coauthorship path between me and Trump is through articles by John Winslow Morgan, a professor in the Harvard Business School who specializes in the history of technology. He has written articles with Schwarz and with Dennis Ritchie, one of the originators of Unix.

So, the path of length 5 is Moler - Johnson - Ritchie - Morgan - Schwartz - Trump.

**Erdös Number**

Forsythe, G. E. and Straus, E. G. *On best conditioned matrices.* Proc. Amer. Math. Soc. 6, (1955). 340–345.

Erdös, P., Lovász, L., Simmons, A., and Straus, E. G. *Dissection graphs of planar point sets.* A survey of combinatorial theory. (Proc. Internat. Sympos., Colorado State Univ., Fort Collins, Colo., 1971), 139–149. North-Holland, Amsterdam, 1973.

Feller, William, and George E. Forsythe. *New matrix transformations for obtaining characteristic vectors*. Quarterly of Applied Mathematics 8.4 (1951), 325-331.

Erdös, Paul, William Feller, and Harry Pollard. *A property of power series with positive coefficients.* Bull. Amer. Math. Soc 55.2 (1949): 201-204.

Forsythe, G. E. and C. B. Moler, Computer Solution of Linear Algebraic Systems, (Series in Automatic Computation) XI + 148, Prentice Hall, Englewood Cliffs, N.J. 1967.

Forsythe, George E., Malcolm, Michael A. and Moler, Cleve B., Computer Methods for Mathematical Computations, (Series in Automatic Computation) XI + 259, Prentice Hall, Englewood Cliffs, N.J. 1977.

**Trump Number**

Johnson, S. C. and C. Mohler (Moler), *Compiling MATLAB*, Proceedings of the USENIX Symposium on Very High Level Languages (VHLL), 119-27, Santa Fe, New Mexico, October 1994. USENIX Association.

Ritchie, D. M., Johnson, S. C., Lesk, M. E. and Kernighan, B. W., *UNIX Time-Sharing System: The C Programming Language.* Bell System Technical Journal, 57: 1991–2019, 1978.

Ritchie, D. M. and Morgan, J. W, *The Origins of UNIX*, Harvard Business Review, 48: 28-35, 1985.

Morgan, John Winslow and Schwartz, Tony, *Does UNIX Have A Future?*, MIT Technology Review, 53: 1-8, 1992.

Trump, Donald J. and Schwartz, Tony, The Art of the Deal, Ballantine Books, (paperback), 384 pp., 2004.


A binary tree is an elegant way to represent and process Morse code. The new MATLAB graph object provides an elegant way to manipulate binary trees. A new app, `morse_tree`, based on this approach, is now available in version 2.40 of Cleve's Laboratory.

I have a chapter on Morse Code and binary trees in *Experiments with MATLAB*. Some of this blog post is taken from that chapter. But today's implementation is much more, as I said, elegant. In *EXM* I represented a binary tree as a cell array of cell arrays. Now I am using clever indexing into a linear character string, together with the MATLAB graph object. The resulting code is much shorter and more readable.

Here's all we have to do to get a picture of a binary tree. First we create the adjacency matrix and view its spy plot.

n = 31;
j = 2:n;
k = floor(j/2);
A = sparse(k,j,1,n,n);
spy(A)

Then we create and plot the directed graph. The MATLAB `digraph` object, and its supporting `plot` method, are doing all of the work.

G = digraph(A);
Gp = plot(G);
set(gca,'xtick',[],'ytick',[])

This is the complete binary tree with five levels, and consequently $2^5-1$ nodes. (In defiance of nature, computer scientists put the *root* of a tree at the top.)

Follow the nodes in numerical order, from top to bottom and left to right. You are doing a *breadth first traversal* of the structure. The *successors* of the node with index *j* have indices *2j* and *2j+1*. This makes old-fashioned linear indexing possible.

I want to save the coordinates of the nodes for use in the layout in my next plot.

```
x = Gp.XData;
y = Gp.YData;
```

Morse code is no longer important commercially, but it still has some avid fans among hobbyists. Morse code was invented over 150 years ago, not by Samuel F. B. Morse, but by his colleague, Alfred Vail. It has been in widespread use ever since. The code consists of short *dots*, '.', and longer *dashes*, '-', separated by short and long spaces. You may be familiar with the international distress signal, '... --- ...', the code for "SOS", abbreviating "Save Our Ships" or perhaps "Save Our Souls". But did you notice that some modern cell phones signal '... -- ...', the code for "SMS", indicating activity of the "Short Message Service"?

Until 2003, a license to operate an amateur radio in the US required minimal proficiency in Morse code. When I was in junior high school, I learned Morse code to get my ham license, and I've never forgotten it.

According to Wikipedia, in 2004, the International Telecommunication Union formally added a code for the ubiquitous email character, @, to the international Morse code standard. This was the first addition since World War I.

I could provide a table showing that '.-' is the code for A, '-...' the code for B, and so on. But I'm not going to do that, and my MATLAB program does not have such a table. As far as I am concerned, Morse code is *defined* by a MATLAB statement containing the 26 upper case letters of the English language, an asterisk, and four blanks.

```
morse = '*ETIANMSURWDKGOHVF L PJBXCYZQ ';
```

The asterisk at the start serves as the root of the tree. The four blanks will correspond to four missing nodes in the bottom level of the tree. So the binary tree is not going to be complete. The automatic layout in the `graph/plot` method does not draw a perfect tree. This is why I saved the node coordinates from the plot of the full tree. Let's remove the rows and columns of these potentially unlabeled nodes from the adjacency matrix, and the coordinates.

```
m = find(morse == ' ');
A(:,m) = [];
A(m,:) = [];
x(m) = [];
y(m) = [];
```

Convert the character array into a cell array of individual chars.

```
nodes = num2cell(morse);
nodes(m) = '';
```

Create the directed graph with these node labels.

G = digraph(A,nodes);

Plot the graph, using the layout saved from the full tree.

```
plot(G,'xdata',x,'ydata',y, ...
     'showarrows','off');
set(gca,'xtick',[],'ytick',[])
```

If you follow the links downward and emit a *dot* when you go left and a *dash* when you go right, you have Morse code. For example start at the root, go left once, then right once. You will reach A and will have generated '.-', the code for A.

Morse code is already present in the defining character vector `morse`, even before we create and plot a graph. The character with index `5` is `A`. The characters with indices `2*5` and `2*5+1` are `R` and `W`.

```
j = 5;
morse(j)
morse(2*j:2*j+1)
```

```
ans =
    'A'
ans =
    'RW'
```

Consequently, the Morse code for R and W is '.-.' and '.--'. This works everywhere in the first half of `morse`. The characters in the second half don't (yet) have successors.
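This implicit tree makes encoding a short exercise in any language. Here is a sketch of the same indexing idea in Python (my transcription, not part of the app): walk from a letter's index back up to the root, emitting a dot when the index is even (a left child) and a dash when it is odd (a right child).

```python
# Morse encoding via the linear breadth-first string, as in the blog post.
# The trailing blanks of the MATLAB string are omitted; they are not needed
# for encoding the 26 letters.
morse = "*ETIANMSURWDKGOHVF L PJBXCYZQ"

def encode(letter):
    """Morse code for one letter, using 1-based breadth-first indices."""
    j = morse.index(letter) + 1          # 1-based index, as in the MATLAB version
    code = ""
    while j > 1:
        # child 2j is reached by a dot, child 2j+1 by a dash
        code = ("." if j % 2 == 0 else "-") + code
        j //= 2
    return code
```

Reversing the walk, doubling the index for each dot and doubling-plus-one for each dash, decodes instead of encodes.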

Morse code can be extended by replacing the blanks in the basic tree with characters that are not in the 26 letter English alphabet and adding two more levels to the tree. Several different extensions have been in use at different times and in different regions of the world. The Wikipedia page for Morse code includes a binary tree with some commonly used extensions. The fifth level includes the 10 decimal digits, several non-Latin characters, and a few punctuation marks. The sixth level includes several punctuation marks.

Here is a screen shot from the new app, `morse_tree`, that is available in version 2.40 of Cleve's Laboratory. Clicking on the toggle labeled "extend" shows two more levels of the tree and some extensions, although not as many as Wikipedia's tree. Clicking on a node in our tree highlights that node while the sound of the corresponding code is played. This screen shot shows A highlighted.

Text entered in the box below the tree will be encoded and the sound played.

The speed of the generated code is set by the control currently labeled "5 wpm", for five words per minute.

morse_tree_extended

Here's a snap quiz: What is Morse code for the @ sign? Download `morse_tree` to check your answer.


The Lake Arrowhead Coauthor Graph came out of the Householder XII conference in 1993 at the UCLA conference center in the mountains north of San Bernardino. John Gilbert now remembers it as one of the first computational social network analyses he had ever seen. Today I revisit it using the new MATLAB graph object.

I wrote a blog post about the Lake Arrowhead Coauthor Graph a couple of years ago.

At the Householder XII conference in 1993 Nick Trefethen posted a flip chart with Gene Golub's name in the center. He invited everyone present to add their name to the chart and draw lines connecting their name with the names of all their coauthors. The diagram grew denser throughout the week. At the end of the conference it was a graph with 104 vertices (or people) and 211 edges.

Nick framed the original chart and has it on the wall of his office at Oxford. Another Nick, Nick Hale, took this photo.

John Gilbert, Rob Schreiber and I entered the names and coauthor connections into MATLAB, creating an adjacency matrix `A`. For a long time the data was available as a `.mat` file from John's old web site at Xerox PARC. But John left PARC years ago, so I am now making the data available in human-readable form at the end of this blog post. There are two files.

```
A = arrowhead_adjacency;
names = arrowhead_names;
```

The ordering of the authors in the original data is determined by how we happened to read them off the flip chart. Golub is first, and has the most connections to other authors. A symmetric reverse Cuthill-McKee ordering of the rest of the nodes puts coauthors near each other and reduces the bandwidth of the adjacency matrix.

```
r = [1 symrcm(A(2:end,2:end))+1];
A = A(r,r);
names = names(r);
spy(A)
```
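For readers outside MATLAB, SciPy exposes the same reordering as `scipy.sparse.csgraph.reverse_cuthill_mckee`, which plays the role of `symrcm`. A small sketch with made-up data, not the Arrowhead graph:

```python
# Reverse Cuthill-McKee with SciPy: reorder a scrambled path graph and
# watch the bandwidth of its adjacency matrix shrink.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    i, j = np.nonzero(M)
    return int(np.max(np.abs(i - j))) if i.size else 0

# A path graph whose node labels have been scrambled, widening the band.
perm = [3, 0, 6, 2, 7, 1, 5, 4]
n = len(perm)
A = np.zeros((n, n), dtype=int)
for a, b in zip(perm[:-1], perm[1:]):   # consecutive path edges, scrambled labels
    A[a, b] = A[b, a] = 1

r = reverse_cuthill_mckee(csr_matrix(A), symmetric_mode=True)
B = A[np.ix_(r, r)]                     # symmetric permutation, as in A(r,r)
assert bandwidth(B) < bandwidth(A)      # the band gets narrower
```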

Let's make a MATLAB graph and plot it with the `circle` layout. By default, the `graph/plot` method omits the node names if there are more than 100 of them. This removes clutter and is usually a good idea. The resulting unannotated circle layout is a pretty picture, but not very informative.

The RCM ordering places most of the graph edges, except those connecting to Golub, near the circumference of the circle.

```
G = graph(A,names);
plot(G,'layout','circle')
axis square
```

Move Golub to the center.

```
p = plot(G,'Layout','circle');
idGolub = findnode(G,'Golub');
p.XData(idGolub) = 0;
p.YData(idGolub) = 0;
axis square
```

Placing the rest of the names around the outside of the circle requires some custom processing.

```
axis off
t = text(p.XData,p.YData,names);
theta = linspace(0,360,numel(t)+1);
for k = 1:numel(t)
   t(k).Rotation = theta(k);
   t(k).String = [blanks(4) t(k).String];
   t(k).FontSize = 8;
end
```

This plot can now be compared with the plot we originally produced at Householder XII.

In this two-dimensional circle plot it's nearly impossible to follow the links from one author to his or her coauthors. 'Moler' is at about 5 o'clock on the circle. What is the degree of my node? Who are my coauthors?

A three-dimensional layout and judicious use of the mouse in conjunction with `cameratoolbar` makes it possible to examine this graph in detail.

```
p = plot(G,'layout','force3','nodelabelmode','auto');
% Setting the nodelabelmode says include all of the names.
axis off vis3d
set(gca,'clipping','off')
zoom(2)
gif_frame('arrowhead_name.gif')
gif_frame(2)
```

Let's highlight me and my coauthors.

```
me = findnode(G,'Moler')
friends = neighbors(G,me)'
color = get(gca,'colororder');
highlight(p,[me friends],'nodecolor',color(2,:),'markersize',6)
highlight(p,me,friends,'edgecolor',color(2,:),'linewidth',2)
gif_frame(2)
```

```
me =
    88

friends =
    61    62    63    68    75    77    89    98
```

Make my node the center of this little universe.

```
set(p,'xdata',p.XData - p.XData(me), ...
      'ydata',p.YData - p.YData(me), ...
      'zdata',p.ZData - p.ZData(me))
gif_frame(2)
```

Now zoom and rotate. This is much better done in person with `cameratoolbar`. The best we can do in this blog is an animated gif.

```
% Zoom
for k = 1:4
   zoom(sqrt(2))
   gif_frame
end
% Rotate
[a,e] = view;
for k = 1:4
   view(a,e-12*k)
   gif_frame
end
gif_frame(5)
```

Here is the animated .gif centered on me and my coauthors.

And here we zoom in on Paul VanDooren by replacing

me = findnode(G,'Moler')

with

me = findnode(G,'VanDooren')

```
function A = arrowhead_adjacency
   k = [ 1 1 1 1 1 3 3 1 1 1 1 2 3 3 38 1 2 ...
        32 42 3 32 42 43 44 44 1 46 1 1 37 13 49 1 32 1 ...
        44 53 1 53 44 45 54 1 1 20 41 44 1 39 52 60 15 58 ...
        61 3 63 41 43 1 15 35 51 61 41 43 59 65 53 3 58 63 ...
        42 1 41 44 1 46 47 1 10 15 32 35 62 66 2 19 39 58 ...
        61 66 69 37 53 1 9 72 42 8 21 32 42 77 1 7 15 46 ...
        47 49 6 42 66 1 3 25 41 43 59 65 67 80 33 52 59 27 ...
        28 46 74 31 51 46 1 22 23 24 25 35 41 52 81 3 21 55 ...
        60 44 58 62 4 17 12 19 23 72 77 89 1 23 86 1 23 86 ...
        91 4 72 90 19 90 48 1 18 25 77 82 86 17 52 66 86 89 ...
        16 32 34 58 66 80 81 1 34 74 77 14 82 77 79 49 102 26 ...
        32 39 44 48 51 52 56 63 69 74 78 81 83 90];
   j = [ 2 3 5 10 11 29 30 32 34 35 36 36 38 40 40 41 42 ...
        42 43 44 44 44 44 45 46 47 47 48 49 49 50 50 51 51 53 ...
        54 54 55 55 56 56 57 58 59 59 59 59 60 60 60 61 62 62 ...
        62 63 64 65 65 66 66 66 66 66 67 67 67 67 68 69 69 69 ...
        70 71 71 71 72 72 72 73 73 73 73 73 73 73 74 74 74 74 ...
        74 74 74 75 75 76 76 76 77 78 78 78 78 78 79 79 79 79 ...
        79 79 80 80 80 81 81 81 81 81 81 81 81 81 82 82 82 83 ...
        83 83 83 84 84 85 86 86 86 86 86 86 86 86 86 87 87 87 ...
        87 88 88 88 89 89 90 90 90 90 90 90 91 91 91 92 92 92 ...
        92 93 93 93 94 94 95 96 96 96 96 96 96 97 97 97 97 97 ...
        98 98 98 98 98 98 98 99 99 99 99 100 100 101 102 103 103 104 ...
        104 104 104 104 104 104 104 104 104 104 104 104 104 104];
   U = sparse(k,j,1,104,104);
   A = logical(U + U');
end
```

```
function names = arrowhead_names
   names = {'Golub', 'Wilkinson', 'TChan', 'He', 'Varah', 'Kenney', ...
      'Ashby', 'LeBorne', 'Modersitzki', 'Overton', 'Ernst', 'Borges', ...
      'Kincaid', 'Crevelli', 'Boley', 'Anjos', 'Byers', 'Benzi', 'Kaufman', ...
      'Gu', 'Fierro', 'Nagy', 'Harrod', 'Pan', 'Funderlic', 'Edelman', ...
      'Cullum', 'Strakos', 'Saied', 'Ong', 'Wold', 'VanLoan', ...
      'Chandrasekaran', 'Saunders', 'Bojanczyk', 'Dubrulle', 'Marek', ...
      'Kuo', 'Bai', 'Tong', 'George', 'Moler', 'Gilbert', 'Schreiber', ...
      'Pothen', 'NTrefethen', 'Nachtigal', 'Kahan', 'Varga', 'Young', ...
      'Kagstrom', 'Barlow', 'Widlund', 'Bjorstad', 'OLeary', 'NHigham', ...
      'Boman', 'Bjorck', 'Eisenstat', 'Zha', 'VanHuffel', 'Park', 'Arioli', ...
      'MuntheKaas', 'Ng', 'VanDooren', 'Liu', 'Smith', 'Duff', 'Henrici', ...
      'Tang', 'Reichel', 'Luk', 'Hammarling', 'Szyld', 'Fischer', ...
      'Stewart', 'Bunch', 'Gutknecht', 'Laub', 'Heath', 'Ipsen', ...
      'Greenbaum', 'Ruhe', 'ATrefethen', 'Plemmons', 'Hansen', 'Elden', ...
      'BunseGerstner', 'Gragg', 'Berry', 'Sameh', 'Ammar', 'Warner', ...
      'Davis', 'Meyer', 'Nichols', 'Paige', 'Gill', 'Jessup', 'Mathias', ...
      'Hochbruck', 'Starke', 'Demmel'};
end
```

Special thanks to Christine Tobler and Cosmin Ionita at MathWorks for help with today's post.


The adjacency matrix of a hypercube demonstrates the new MATLAB graph object.

My previous blog post was about the Patience Chinese Rings Puzzle. The puzzle's state lives in a six-dimensional space. There are 64 vertices, numbered from 0 to 63. Each vertex is connected to six other vertices, the ones whose numbers, expressed in binary, differ from its own by one bit. That connectivity arrangement -- those vertices and the connections between them -- forms a *graph* known as a *hypercube*.

Recent versions of MATLAB include two new objects, `graph` and `digraph`. Here is an abbreviated `help` entry for `graph`.

```
>> help graph
 graph Undirected Graph
    G = graph(A) uses the square symmetric matrix A as an adjacency
    matrix and constructs a weighted graph with edges corresponding
    to the nonzero entries of A. If A is logical then no weights
    are added.

    G = graph(A,NAMES) additionally uses the cell array of strings
    NAMES as the names of the nodes in G. NAMES must have as many
    elements as size(A,1).
```

And here is a function that generates the adjacency matrix for a hypercube.

```
type hypercube
```

```
function A = hypercube(d)
% hypercube(d) is a logical adjacency matrix for a d-dimensional hypercube.
   pow2 = 2.^(0:d-1);
   f2b = @(j) floor(rem(j./pow2,2));   % flint to binary
   b2f = @(b) b*pow2';                 % binary to flint
   n = 2^d;
   A = zeros(n,n,'logical');
   for j = 0:n-1
      % Convert column index to binary.
      b = f2b(j);
      % Flip bits to get corresponding row indices.
      for i = 1:d
         b(i) = ~b(i);
         k = b2f(b);
         b(i) = ~b(i);
         A(k+1,j+1) = 1;
      end
   end
end
```
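The bit-flipping in the inner loop can be written compactly with integer XOR: node `j` is adjacent to node `j ^ (1 << i)` for each bit position `i`. A Python transcription of the same construction (my sketch, not the original code):

```python
# Adjacency matrix of the d-dimensional hypercube via XOR bit flips.
def hypercube(d):
    """Return the n-by-n 0/1 adjacency matrix, n = 2^d, as a list of lists."""
    n = 1 << d
    A = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in range(d):
            k = j ^ (1 << i)      # flip bit i of j to get a neighbor
            A[k][j] = 1
    return A

A3 = hypercube(3)
assert all(sum(row) == 3 for row in A3)       # every node of Q3 has degree 3
assert A3[0][1] and A3[0][2] and A3[0][4]     # 000 is joined to 001, 010, 100
```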

I first encountered hypercubes thirty years ago. For a couple of years in the late 1980s I was involved with one of the world's first commercial parallel computers, the IPSC, the *Intel Personal Super Computer*. Here is a link to my blog post in 2013.

The IPSC was constructed from single board computers, similar to the mother boards in the PCs of the day. We had three models -- *d5*, *d6* and *d7* -- with 32, 64 or 128 processors. Since it was impractical to connect every processor directly to every other processor, the connection network formed a hypercube. For a machine with $n = 2^d$ processors, that's $n \log_2{n}$ instead of $n^2$ connections. For the d6 that's 384 connections instead of 4096.

The d6 model has the same graph as the patience puzzle that I wrote about in my previous post. (Actually, the IPSC's graph is a directed graph, a *digraph*, because the connections are two-way. But undirected graphs have simpler plots.)

A 3-dimensional hypercube is not a conventional cube. The 3-d hypercube is only the vertices and edges of the cube. The cube itself includes its surfaces and interior. If you had a wire-frame cube made from rubber bands and squished it around without destroying the connectivity, you wouldn't have a cube any more, but you would still have a model of a hypercube.

Here is the adjacency matrix of a 3-d hypercube. (It's so small that it's not necessary to make use of the sparsity.)

```
A = hypercube(3)
```

```
A =
  8×8 logical array
   0   1   1   0   1   0   0   0
   1   0   0   1   0   1   0   0
   1   0   0   1   0   0   1   0
   0   1   1   0   0   0   0   1
   1   0   0   0   0   1   1   0
   0   1   0   0   1   0   0   1
   0   0   1   0   1   0   0   1
   0   0   0   1   0   1   1   0
```

One view of this matrix is its `spy` plot.

spy(A)

To get other views, let's use the new MATLAB graph object. We'll label the nodes with '0' through '7'.

```
nodes = num2cellstr(0:7)
G = graph(A,nodes)
```

```
nodes =
  1×8 cell array
    '0'    '1'    '2'    '3'    '4'    '5'    '6'    '7'

G =
  graph with properties:

    Edges: [12×1 table]
    Nodes: [8×1 table]
```

By itself, a graph doesn't have any geometric structure. In order to plot it you specify a *layout*, or coordinates, for the nodes. Coming up with good layouts is an art, as well as a science. Here are the currently available layouts.

```
'auto'       (Default) Automatic choice of layout method based on
             the structure of the graph.
'circle'     Circular layout.
'force'      Force-directed layout. Uses attractive and repulsive
             forces on nodes.
'layered'    Layered layout. Places nodes in a set of layers.
'subspace'   Subspace embedding layout. Uses projection onto an
             embedded subspace.
'force3'     3-D force-directed layout.
'subspace3'  3-D Subspace embedding layout.
```

The default layout of the 3-d hypercube graph is a perspective view of an ordinary cube.

```
plot(G)
axis square
```

Let's move up to four dimensions.

```
A = hypercube(4);
spy(A)
```

I think the default layout of the 4-d hypercube is attractive.

```
nodes = num2cellstr(0:15)';
G = graph(A,nodes);
plot(G)
axis square
```

But `'circle'` is another attractive layout. Notice that '3' is not connected to '4' because 0011 and 0100 differ by more than a single bit. The blue dot in the center is not a node -- it is just the point where many edges cross.

```
plot(G,'layout','circle')
axis square
```

Here is yet another view.

```
plot(G,'layout','layered')
axis square
```

I'm still not satisfied. I want one cube inside of another. I can supply my own coordinates for the vertices.

```
type cubelet
```

```
function x = cubelet
% x = cubelet
% 8-by-3 array of vertices of a unit cube.
   x = [-1 -1 -1
         1 -1 -1
        -1  1 -1
         1  1 -1
        -1 -1  1
         1 -1  1
        -1  1  1
         1  1  1];
end
```

```
x = cubelet;
x = [3*x; x];
plot(G,'xdata',x(:,1),'ydata',x(:,2),'zdata',x(:,3))
axis equal off
view(-20,11)
```

This served as a logo for the IPSC.

The Laplacian of a graph is a matrix that has the degrees of the nodes on the diagonal and -1's off the diagonal marking the edges. For the 4-d hypercube the degrees are all 4. So here is the Laplacian.

```
L = full(laplacian(G));
frmt = [' ' repmat('%3d',1,16) '\n'];
fprintf('L = \n')
fprintf(frmt,L)
```

```
L =
   4 -1 -1  0 -1  0  0  0 -1  0  0  0  0  0  0  0
  -1  4  0 -1  0 -1  0  0  0 -1  0  0  0  0  0  0
  -1  0  4 -1  0  0 -1  0  0  0 -1  0  0  0  0  0
   0 -1 -1  4  0  0  0 -1  0  0  0 -1  0  0  0  0
  -1  0  0  0  4 -1 -1  0  0  0  0  0 -1  0  0  0
   0 -1  0  0 -1  4  0 -1  0  0  0  0  0 -1  0  0
   0  0 -1  0 -1  0  4 -1  0  0  0  0  0  0 -1  0
   0  0  0 -1  0 -1 -1  4  0  0  0  0  0  0  0 -1
  -1  0  0  0  0  0  0  0  4 -1 -1  0 -1  0  0  0
   0 -1  0  0  0  0  0  0 -1  4  0 -1  0 -1  0  0
   0  0 -1  0  0  0  0  0 -1  0  4 -1  0  0 -1  0
   0  0  0 -1  0  0  0  0  0 -1 -1  4  0  0  0 -1
   0  0  0  0 -1  0  0  0 -1  0  0  0  4 -1 -1  0
   0  0  0  0  0 -1  0  0  0 -1  0  0 -1  4  0 -1
   0  0  0  0  0  0 -1  0  0  0 -1  0 -1  0  4 -1
   0  0  0  0  0  0  0 -1  0  0  0 -1  0 -1 -1  4
```

The Laplacian could also be obtained from the original adjacency matrix.

L = diag(sum(A)) - A;
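As a quick check of that identity, here is a NumPy sketch (mine, using the smaller 3-d hypercube for brevity): the row sums of a Laplacian are zero, and its eigenvalues are again all integers.

```python
# L = diag(degrees) - A for the 3-d hypercube, plus two sanity checks.
import numpy as np

n, d = 8, 3
A = np.zeros((n, n), dtype=int)
for j in range(n):
    for i in range(d):
        A[j ^ (1 << i), j] = 1          # neighbors differ in one bit

L = np.diag(A.sum(axis=0)) - A          # degree matrix minus adjacency
assert (L.sum(axis=1) == 0).all()       # Laplacian row sums vanish

# Eigenvalues of the Q3 Laplacian: 2k with multiplicity C(3,k).
evals = sorted(np.round(np.linalg.eigvalsh(L)).astype(int).tolist())
assert evals == [0, 2, 2, 2, 4, 4, 4, 6]
```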

The coordinates of possible layouts for the plot of the graph can be obtained by picking three of the eigenvectors of the Laplacian. Here are all of the eigenvalues and three of the eigenvectors. (The eigenvalues are all integers.)

```
[V,E] = eig(L);
evals = round(diag(E)');
fprintf('evals = \n')
fprintf(frmt,evals)
evecs3 = V(:,1:3)
```

```
evals =
     0   2   2   2   2   4   4   4   4   4   4   6   6   6   6   8

evecs3 =
   -0.2500    0.0216    0.0025
   -0.2500    0.1620    0.4067
   -0.2500    0.0245   -0.1449
   -0.2500    0.1649    0.2593
   -0.2500    0.2546   -0.2521
   -0.2500    0.3950    0.1521
   -0.2500    0.2575   -0.3996
   -0.2500    0.3979    0.0046
   -0.2500   -0.3979   -0.0046
   -0.2500   -0.2575    0.3996
   -0.2500   -0.3950   -0.1521
   -0.2500   -0.2546    0.2521
   -0.2500   -0.1649   -0.2593
   -0.2500   -0.0245    0.1449
   -0.2500   -0.1620   -0.4067
   -0.2500   -0.0216   -0.0025
```

And here is the resulting plot.

```
G = graph(A,nodes);
plot(G,'xdata',V(:,1),'ydata',V(:,2),'zdata',V(:,3))
axis square off
set(gca,'clipping','off')
zoom(2)
view(-75,45)
```

The state space of both the puzzle that got me started on this blog post and one of the models of the parallel computer that consumed me thirty years ago is six dimensional.

```
A = hypercube(6);
spy(A)
```

A plot of a cube inside of a cube inside of a cube inside of a cube is a bit messy. The circle layout is an interesting design, but it is hard to see which nodes are connected to which.

```
nodes = num2cellstr(0:63)';
G = graph(A,nodes);
plot(G,'layout','circle')
axis square
```

Here is a three-dimensional layout.

```
plot(G,'layout','subspace3')
axis off square
set(gca,'clipping','off')
zoom(2)
view(-108,-21)
```

The Bucky Ball provides an elegant example of a graph. Here's the `help` entry.

```
help bucky
```

```
 BUCKY  Connectivity graph of the Buckminster Fuller geodesic dome.
    B = BUCKY is the 60-by-60 sparse adjacency matrix of the
    connectivity graph of the geodesic dome, the soccer ball,
    and the carbon-60 molecule.

    [B,V] = BUCKY also returns xyz coordinates of the vertices.
```

When we put `bucky` into MATLAB years ago we went to a whole lot of trouble to compute those vertex coordinates. (To see how much trouble, `type bucky`.) But now, with the three dimensional `force3` layout, the `graph/plot` function does that work. Putting a little spin on it makes it interesting to watch.

```
type bucky_gif
```

```
B = graph(bucky);
plot(B,'layout','force3')
axis square off vis3d
set(gca,'clipping','off')
zoom(2)
[a,e] = view;
gif_frame('bucky_spin.gif')
for d = 10:10:360
   view(a+d,e+d)
   gif_frame
end
gif_frame('reset')
```

It's more fun to use the `cameratoolbar` and put the spin on the ball yourself.


MIT's Professor Daniel Frey recently introduced me to an ancient mechanical puzzle known as "Chinese Rings", "Patience", or "Baguenaudier." I have modified Daniel's simulator to produce a new app. The state space of the puzzle forms a hypercube.

There are dozens of YouTube videos showing solutions and speed contests for the tavern puzzle called "Patience". The best video is by Ty Yann, "Baguenaudier Chinese Rings." Here is the link to his puzzle tutorial.

Wikipedia tells us that *Baguenaudier* is French for "time-waster".

Prof. Frey has written a nice MATLAB simulator for his course this spring at MIT, 2.086, "Numerical Computation for Mechanical Engineers." My new app is based on his program.

The puzzle has six steel hooks, often made from nails, attached to a base. A steel ring is attached to each hook. A long movable piece known as the shuttle is initially threaded around the nails and through the rings. The objective is to free the shuttle. It ain't easy.

Image credit: Daniel Frey, MIT

My new interactive app, `patience`, is included with version 2.2 of Cleve's Laboratory in MATLAB Central. I hope you will try to solve the puzzle yourself before you resort to the "no patience" automated solve button. The following animated gif shows several steps at the start of the solution, then skips many intermediate steps before showing the last several.

The position of the six hooks, up or down, provides a six-bit binary number that characterizes the state of the mechanism. Initially, the rings are all engaged, so the state in binary is `000000`, which, of course, is a decimal 0. The objective is to free all the rings, giving a binary `111111`, or decimal 63. The app shows the decimal value at each step in the figure title.

The hooks are numbered from 1 through 6 starting on the left. So, we will use "little endian" binary, with the least significant bit at the left.

This one line in `patience.m`, which refers to the state vector `S` and a hook number `n`, controls the possible moves.

dbtype 52:53 patience

```
52       % Is the move allowed ?
53       ok = (n==1) || (n==2)&&~S(1) || (n>2)&&(~S(n-1)&&all(S(1:n-2)));
```

The rule says

- Hook 1 can always be toggled.
- Hook 2 can be toggled if hook 1 is up.
- Any hook numbered 3 through 6 can be toggled if the hook immediately to its left is up and all the others before it are down.

That's all you need to know to operate the puzzle.
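That one line transcribes directly into other languages. Here is a Python sketch of the same predicate (my transcription), with `S` a 0/1 list and `n` a 1-based hook number:

```python
# The legal-move rule from patience.m, with 0-based list indexing:
# MATLAB's S(1), S(n-1), S(1:n-2) become S[0], S[n-2], S[:n-2].
def ok(S, n):
    """Is toggling hook n (1-based) a legal move from state S?"""
    return (n == 1
            or (n == 2 and not S[0])
            or (n > 2 and not S[n - 2] and all(S[:n - 2])))

S = [1, 1, 0, 0, 1, 0]
assert [n for n in range(1, 7) if ok(S, n)] == [1, 4]   # matches the example below
```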

It turns out that at any step at most two of the six hooks, and sometimes only one, can legally be toggled.

For example, if the state is

S = [1 1 0 0 1 0]

there are two possible moves. You can click on hook 1 to give

S = [0 1 0 0 1 0]

Or, you can click on hook 4 to give

S = [1 1 0 1 1 0]

The optimum path from a given starting state, such as 0, to the goal, 63, can be found by a brute force recursive search that tries all possible moves at each step. The code is at the end of this post. Here is the shortest path from 0 to 63.

path = shortest_path(0)

```
path =
  Columns 1 through 13
     0     2     3    11    10     8     9    13    12    14    15    47    46
  Columns 14 through 26
    44    45    41    40    42    43    35    34    32    33    37    36    38
  Columns 27 through 39
    39    55    54    52    53    49    48    50    51    59    58    56    57
  Columns 40 through 43
    61    60    62    63
```

Notice that a shortest path never repeats any state because if we find ourselves reaching a state that we've already visited, we could shorten that path by skipping the extra steps.
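Since there are only 64 states, the same shortest paths can also be found by breadth-first search instead of the recursive enumeration listed at the end of this post. A Python sketch (mine), reusing the toggle rule, confirms the 42-move count of the path shown above:

```python
# Breadth-first search over the 64 puzzle states; hook n flips bit n-1.
from collections import deque

def moves(state):
    """Legal successor states, using the toggle rule from patience.m."""
    S = [(state >> k) & 1 for k in range(6)]
    for n in range(1, 7):
        if n == 1 or (n == 2 and not S[0]) or \
           (n > 2 and not S[n - 2] and all(S[:n - 2])):
            yield state ^ (1 << (n - 1))

def distance(start, goal=63):
    """Minimum number of moves from start to goal."""
    dist = {start: 0}
    q = deque([start])
    while q:
        s = q.popleft()
        if s == goal:
            return dist[s]
        for t in moves(s):
            if t not in dist:
                dist[t] = dist[s] + 1
                q.append(t)

assert distance(0) == 42    # the path shown above has 43 states, 42 moves
```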

It turns out that the only starting state that visits all vertices is `31 = [1 1 1 1 1 0]`, which corresponds to starting with the sixth hook up and all the others down.

Each step changes the state by a power of two. You can recover the sequence of mouse clicks that would generate this path by taking the log2 of the changes.

hooks = log2(abs(diff(path))) + 1

```
hooks =
  Columns 1 through 13
     2     1     4     1     2     1     3     1     2     1     6     1     2
  Columns 14 through 26
     1     3     1     2     1     4     1     2     1     3     1     2     1
  Columns 27 through 39
     5     1     2     1     3     1     2     1     4     1     2     1     3
  Columns 40 through 42
     1     2     1
```

Let's plot the decimal value of the state for the shortest path from 0 to 63. Here is the progress of this value over the 42 steps. The even numbered steps involve toggling the first hook, which changes the value of the state by plus or minus one. The sixth hook is moved only once, at step 11, changing the value by +32. The fifth hook contributes its +16 at step 28.

plot_path

Here are the lengths of the shortest paths from each starting position.

```
L = zeros(64,1);
for n = 0:63
   L(n+1) = length(shortest_path(n));
end
bar(L)
```

The puzzle's state lives in a six-dimensional space. There are 64 vertices, numbered from 0 to 63. Each vertex is connected to the six vertices whose numbers, expressed in binary, differ from its own by one bit. That connectivity arrangement -- those vertices and the connections between them -- forms a *graph* known as a *hypercube*.

Recent versions of MATLAB include methods for a new object, the *graph*. I plan to discuss hypercubes and graphs in my next blog post.

Here is the code for the recursive search that finds the shortest path. The code for the complete `patience` app is now included with Cleve's Laboratory in MATLAB Central.

```
type shortest_path
```

```
function path = shortest_path(state,m,path)
% shortest_path, or shortest_path(state), where state is a decimal state
% value between 0 and 63, is the shortest path of allowable puzzle moves
% from that state to the objective, 63.
%
% shortest_path(S,m,path) is the recursive call.

   if nargin == 0
      state = 0;       % Default
   end
   if nargin <= 1      % Initial call
      m = 0;           % Previous n
      path = [];
   end
   if state == 63      % Terminate recursion
      path = state;
      return
   end
   if state == 31
      m = 0;
   end

   % Search for shortest path
   pow2 = 2.^(0:5);
   lmin = inf;
   for n = [1:m-1 m+1:6]
      Sn = mod(floor(state./pow2),2);            % Convert state to binary
      Sn(n) = ~Sn(n);                            % Flip one bit
      ok = (n==1) || (n==2)&&~Sn(1) || ...
           (n>2)&&(~Sn(n-1)&&all(Sn(1:n-2)));    % Allowable move
      if ok
         staten = Sn*pow2';
         % Recursive call
         pn = shortest_path(staten,n,[path staten]);
         if length(pn) < lmin
            lmin = length(pn);
            pbest = pn;
         end
      end
   end
   path = [state pbest];
end % shortest_path
```

Thanks to Daniel Frey.


"ULP" stands for "unit in the last place." An *ulps plot* samples a fundamental math function such as $\sin{x}$, or a more esoteric function like a Bessel function. The samples are compared with more accurate values obtained from a higher precision computation. A plot of the accuracy, measured in ulps, reveals valuable information about the underlying algorithms.... read more >>

"ULP" stands for "unit in the last place." An *ulps plot* samples a fundamental math function such as $\sin{x}$, or a more esoteric function like a Bessel function. The samples are compared with more accurate values obtained from a higher precision computation. A plot of the accuracy, measured in ulps, reveals valuable information about the underlying algorithms.

`libm` is the library of elementary math functions that supports the C compiler. `fdlibm` is "freely distributable" source for `libm` developed and put into the public domain over 25 years ago by K. C. Ng and perhaps a few others at Sun Microsystems. I wrote about `fdlibm` in our newsletter in 2002, The Tetragamma Function and Numerical Craftsmanship.

Mathematically `fdlibm` shows immaculate craftsmanship. We still use it today for our elementary transcendental functions. And I suspect all other mathematical software projects do as well. If they don't, they should.

`ulps(x)` is the distance from `x` to the next larger floating point number. It's the same as `eps(x)`.

To assess the accuracy of a computed value

y = f(x)

compare it with the more accurate value obtained from the Symbolic Math Toolbox

```
Y = f(sym(x,'f'))
```

The `'f'` flag says to convert `x` to a `sym` exactly, without trying to guess that it is an inverse power of 10 or the `sqrt` of something. The relative error in `y`, measured in units in the last place, is then

u = (y - Y)/eps(abs(Y))

Since this is *relative* error, it is a stringent measure near the zeros of `f(x)`.

If `y` is the floating point number obtained by correctly rounding `Y` to double precision, then

-0.5 <= u <= 0.5

This is the best that can be hoped for. Compute the exact mathematical value of `f(x)` and make a single rounding error to obtain the final result. Half-ulp accuracy is difficult to obtain algorithmically, and too expensive in execution time. All of the functions in MATLAB that are derived from `fdlibm` have better than one-ulp accuracy.

-1.0 < u < 1.0
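The measurement itself is easy to reproduce outside MATLAB. Here is a Python sketch of the same recipe (mine, not taken from `ulpsapp.m`): `math.ulp` plays the role of `eps`, and an exact rational Taylor series stands in for the Symbolic Math Toolbox reference value.

```python
# Measure the ulps error of the library sin at one argument.
import math
from fractions import Fraction

def sin_accurate(x, terms=25):
    """sin(x) as an exact rational Taylor sum; for modest x the truncation
    error is far below one ulp of a double."""
    xf = Fraction(x)               # exact binary-to-rational, like sym(x,'f')
    total, term = Fraction(0), xf
    for k in range(terms):
        total += term
        term *= -xf * xf / ((2 * k + 2) * (2 * k + 3))
    return total

x = 0.5
y = math.sin(x)                    # the value under test
Y = sin_accurate(x)                # the reference value
u = float((Fraction(y) - Y) / Fraction(math.ulp(abs(float(Y)))))
assert abs(u) < 1.0                # better than one-ulp accuracy
```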

Each of the following plots involves 100,000 random arguments `x`, uniformly distributed in an appropriate interval.

We see about 0.8 ulp accuracy from this sample. That's typical.

Argument reduction is the first step in computing $\sin{x}$. An integer multiple $n$ of $\pi/2$ is subtracted from the argument to bring it into the interval

$$ -\frac{\pi}{4} \le x - n \frac{\pi}{2} \le \frac{\pi}{4} $$

Then, depending upon whether $n$ is odd or even, a polynomial approximation of degree 13 to either $\sin$ or $\cos$ gives the nearly correctly rounded result for the reduced argument, and hence for the original argument. The ulps plot shows a slight degradation in accuracy at odd multiples of $\pi/4$, which are the extreme points for the polynomial approximations.

It is important to note that the accuracy is better than one ulp even near the end-points of the sample interval, $0$ and $2\pi$. This is where $\sin{x}$ approaches $0$ and the approximation must follow carefully so that the relative error remains bounded.

Again, roughly 0.8 ulp accuracy.

Similar argument reduction results in similar behavior near odd multiples of $\pi/4$. In between these points, at $\pi/2$ and $3\pi/2$, $\tan{x}$ has a pole and the approximation must follow suit. The algorithm uses a reciprocal and the identity

$$ \tan x = -1/\tan{(x+\frac{\pi}{2})} $$

This comes close to dividing by zero as you approach a pole, but the resulting approximation remains better than one ulp.

Good to within slightly more than 0.8 ulp. The underlying approximation is a piecewise polynomial with breakpoints at a few multiples of 1/16 that are evident in the plot and marked on the axis.

About the same accuracy as the previous plots.

The argument reduction involves the key value

$$ r = \ln{2} \approx 0.6931 $$

and the identity

$$ \exp{x} = 2^n \exp{(x-n r)} $$

The resulting ulps plot shows the extremes of the error at odd multiples of $r/2$.

Now for two functions that are not in `fdlibm`. If you follow this blog, you might have noticed that I am a big fan of the Lambert W function. I blogged about it a couple of years ago, The Lambert W Function. The Wikipedia article is excellent, Lambert W Function. And, you can just Google "lambert w function" for many more interesting links.

The Lambert W function is not available in MATLAB itself. The Symbolic Math Toolbox has an @double method that accesses the symbolic code for type double arguments. Or, my blog post mentioned above has some simple (but elegant, if I do say so myself) code.

Neither code is one-ulp accurate. The primary branch of the function has a zero at the origin. As we get near that zero, the relative error measured in *ulps* is unbounded. The absolute accuracy is OK, but the relative accuracy is not. In fact, you might see a billion ulps error. That's a *gigaulp*, or *gulp* for short.

As `lambertw(x)` crosses a power of two, the unit in the last place, `eps(lambertw(x))`, jumps by a factor of two. Three of these points are marked by ticks on the x-axis in the ulps plot.

```
for a = [1/8 1/4 1/2]
   z = fzero(@(x) lambertw(x)-a,.5)
end
```

```
z =
    0.1416
z =
    0.3210
z =
    0.8244
```

Our codes for Bessel functions have fairly good, although not one-ulp, absolute accuracy. But they do not have high relative accuracy near the zeros. Here are the first five zeros, and an ulps plot covering the first two, for $J_0(x)$, the zero-th order Bessel function of the first kind.

```
for a = (1:5)*pi
   z = fzero(@(x) besselj(0,x), a)
end
```

```
z =
    2.4048
z =
    5.5201
z =
    8.6537
z =
   11.7915
z =
   14.9309
```

Here is the inverse of the error function. It looks very interesting, but I haven't investigated it yet.

I have recently updated Cleve's Laboratory in the MATLAB Central file exchange. The latest version includes `ulpsapp.m`, which generates the ulps plots in this blog.
