Converting between a Float and its Bits in memory
0x3F800000 – Generalized fast power method
Requested by averageisnothalfway: Differences between a1a and df
1. Let be the floattobit function (that takes in a floating point number and outputs a 32bit long representing its IEEE 754 encoding used on your everyday computer) and be the bittofloat function, then the fast inverse square root method can be rewritten as
2. It turns out that for some nonlinear function that doesn’t vary a lot with , so it becomes an optimization game to find such that .
3. We know that when no matter what is, so if we want to find such that , all we need to do is set because then
so , which let’s us make arbitrary fast power methods on demand.
4. ???
5. Profit!
The first time I saw the magical fast inverse square root method, I was a freshman in college and the code pretty much killed my brain. I’d just learned about algorithms and data structures and I thought that binary search trees were like the coolest things ever. Turns out I didn’t know them well enough to tackle this problem back then.
If you’re not familiar with the fast inverse square root method, here’s what it looks like in C (source: wikipedia)
float Q_rsqrt( float number ) { long i; float x2, y; const float threehalfs = 1.5F; x2 = number * 0.5F; y = number; i = * ( long * ) &y; // evil floating point bit level hacking i = 0x5f3759df  ( i >> 1 ); // what the fuck? y = * ( float * ) &i; y = y * ( threehalfs  ( x2 * y * y ) ); // 1st iteration // y = y * ( threehalfs  ( x2 * y * y ) ); // 2nd iteration, this can be removed return y; }
[Note: The use of pointer aliasing here is illegal, as Maristic pointed out in the followup discussion on reddit. Use unions instead.]
Okay, so first you declare i, x2, y, and a constant threehalfs. Next, you set x2 to be and set . So far so good!
Okay, next, i gets the dereferenced value of the float* pointer to y casted into a long* pointer instead?? Okay…
Next, i = 0x5f3759df – (i/2)??? Huh?
y = *(float*)&i????
y = y * ( threehalfs – ( x2 * y * y ) )??????
What. The. Fucking. Fuck.
Before we go on, I should probably warn you that this code was a constant obsession of mine for the better part of my adult life. Because of this, a lot of what I’m about to say may sound preachy, cheesy, or even outright annoying. I’ll try to tone it down, but truth be told, I am proud of this work; I’ve come to see the fast inverse square root method as my baby and because of that, I’ve become quite possessive of it. As much as this is a post on the possible derivation of 0x5f3759df, this is also a memoir of my college life (minus the fun parts ;).
From the very beginning, I tried to wrap my head around this strange game of algebraic reasoning, and for the next 4 years, my academic life largely revolved around this mystery: I spent my winter break freshman year learning C and figured out that the “bit level” hack gave you the array of bits that your computer encoded floating point numbers as; sophomore year, I took a first course on numerical methods and understood how iterating the sequence
ad infinitum will eventually converge to ; then, during the summer before my junior year, I worked through a lot of tedious algebra dealing with the quirks of IEEE754 and found another bitlevel hack to quickly compute the square root a la the spirit of the fast inverse square root method, but this time exploiting the distribution of floats and a convenient analogy to second order taylor expansions.
Indeed, my obsession with this method has given me an appetite for numerical analysis, programming languages, software verification, compilers, and even combinatorics and dynamical systems.
Without it, I would have never explored most of the areas of Noise Control science that I currently consider my home.
But through it all, there was always a constant question at the back of my mind that I’ve failed to answer time and time again: where did the 0x5f3759df come from?
I never got to a point where I can claim that I have a satisfactory answer to that question, but I think I did come to discover something along the road: a different magical constant 0x5f400000 that I believe lay at the very beginning of the history of this method, and it is the joy of discovering this constant that I would like to gift to my readers.
Towards the end of my college years, I realized that there was a big problem in the way I approached the problem of looking at 0x5f3759df. Most of the traditional research into the subject has been through the perspective that whoever wrote the algorithm went after the most optimal method possible. In the process of hunting for the jackpot, a lot of great intuition was left behind because those ideas weren’t efficient or effective enough. This is what led me down the path of looking for inefficient but elegant methods that could inspire the fast inverse square root method.
But first, let’s ramp up on some preliminaries. A lot of work and paper on this matter jump immediately to the representation of floating point numbers; I will completely forgo that, instead relying on mathematical intuition to do the job. (This is in part why I didn’t become a mathematician :P)
Let’s first consider two functions: , which stands for floattobits will return the IEEE 754 bit array representation of the floating number , effectively performing *(long*)&y; and , which stands for bitstofloat and takes a bit array and turns it back into a float. Note that and vice versa so that .
It turns out that the magical fast inverse square root method is equivalent to
With the conversion functions out of the way, I next turned my attention to everything but the magic constant. Specifically, what are the effects of multiplying inside the . I fired up my favorite plotting library and looked at
After a little bit of wrestling around with the plot, I realized that when plotting in loglog style, it has the same shape as . This means that
for some constant .
I fitted a “regression” line (this isn’t a real regression however, don’t be mislead) through the data onto and found that the relative error is quite small as can be seen in the figure below.

In fact, if you tick through the slider at the bottom of the figure, you’ll come to find that for any arbitrary , it seems to be the case that
for some constant .
Let’s now consider the following function:
Notice here that we now have added an additive term to the equation. Again, we can rewrite the fast inverse square root method as
A natural question one might consider is the effect of varying on the behavior of . It turns out that on the loglog graph of , varying does not change the slope of the line, but only its yintercept. In other words, varying will only affect magnitude of the constant , as we can see below.

This then suggests that determines the degree of while determines its magnitude. Symbolically:
where is some constant.
We are now left at an interesting point in the puzzle. We have on one hand the knowledge that
for some function and we want to determine such that
Canceling out the on either side, we are left with the functional equation
So right now we need to find such that . But for , for any , so the naive thing to do here is to reduce our approximation to
Let’s take a moment and appreciate this statement:
This says that for any exponent (positive or negative, integer or fractional), we can approximate with
That’s mind blowing because we can immediately generate “fast power” methods using nothing but !
float fast_pow(float x, float n) { long bit = (*((long*)&x))*n + 0x3f800000*(1n); return *((float*)&bit); }
Going back to inverse square roots, we have , which is the namesake of this article.
After discovering the existence of 0x5f400000, things became much clearer. However, it was still quite off from SGI’s 0x5f3759df or Lomont’s 0x5f375a86, and so the question of how this constant came to be still nagged at the back of my head. Nevertheless, I am certain that 0x5f400000 appeared at least in some shape or form in the earlier versions of this numerical recipe.
Many others have previously discussed the possible origins of 0x5f3759df, and in the rest of this post, I will offer my own views on the matter.
For the most part, I believe that the discovery of 0x5f3759df is an iterative process: it starts with 0x5f40, then it gets refined to 0x5f375X, and finally someone must have decided on 0x5f3759df. As engineers are practitioners by nature, I don’t believe that anyone that worked on 0x5f3759df cared for its optimality; therefore, I believe that intuition and visualization play the most important part in its discovery.
In this section, I will first outline one path to refine 0x5f40 to 0x5f375X through intuition about the maximum error bounds of the recipe, followed by a discussion on the final discovery as a product of analyzing one step of newton’s iteration.
If we look at the graph of the relative error of on a logarithmic xaxis below, you will notice a couple of peculiar things.
First and foremost, changing the value of the magic constant by small amounts seems to lift the relative error up. Next, the error is periodic with respect to with a period of (so that ). Finally, the position of the maximum and the minimum of the relative error is largely unchanged when we vary the magic constant.
We can use these observations to our advantage. Notice that the error for is completely negative. However, if we were to use a smaller magic constant, we can shift part of the error up into the positive zone, which reduces the absolute error as can be seen below.
We are trying to reduce the error bounds as much as possible (under the infinity norm optimization as we would say). In order to do so, we need to ensure that the maximum point plus the minimum point in the nonabsolute relative error plot is as close to 0 as possible.
Here’s what we’re going to do. First, we’re going to zoom in onto the section of the graph within the interval because it is periodic. Next, we will look at the maximum point and minimum point and look at their sum.

You will see that the optimal constant here lies somewhere close to 0x5f376000, which is about away from the SGI constant.
In our final trick tonight, we will refine this constant down further to within 60 away from SGI’s 0x5f3759df.
One of the important things to notice from the quake code is that the second newton iteration was commented away. This suggests that whoever worked on the latest version of the fast inverse square root method must have optimized the constant with the effects of newton iteration applied. Let’s look at how well the constants do with the help of an iteration.
Wow, using 0x5f3759df as the magic constant improved the accuracy nearly tenfolds! This likely suggests that whoever worked on quake’s version of fast inverse square root method sought after 4 digits of accuracy. In order to achieve this with the vanilla constant 0x5f400000, they would need to run two newton iterations.
Let’s dig in a little bit and try to find the magic constant that minimizes the maximum error after one newton iteration.

Incidentally, this occurs when the right most two peeks in the above graph collide. We find that the magic constant 0x5f375a1a works best here, which is only 59 away from 0x5f3759df.
We can look at the mean error as well (the onenorm):

but as the bulk of the mass of the error is tucked under the two humps, the 1norm/meanmetric (and in fact, the first few reasonable norms) will tend to disregard the significance of the outer edges and their maximumbounds. Therefore, these will attempt to decrease the constant by as much as possible.
Finally, let’s look at the maximum relative error after two iterations:

This ended up giving us the same magic constant of 0x5f375a1a, which I guess makes sense.
In this article, we’ve come across a couple of different constants: 0x5f400000, 0x54376000, 0x54375a1a. None of these are exactly the 0x543759df that we’ve been searching for, but the last one is really close, which leads me to believe that the discovery of 0x543759df went through a similar process: 1. intuition, and 2. refinement.
The dark art of fast reciprocal square roots have been largely done away with nowadays with faster intrinsics. If anything, this article contributes little technical value besides proposing a halfassed solution to a great puzzle. Nevertheless, I sincerely hope that, having made it this far, you’ve also found some value in this research. Good Luck!
]]>
where each internal subtree with nodes is a vertex plus two trees whose nodesum is , or
This is pretty fucking hard to solve. Here’s a way where we do not need to appeal to generating functions.
Let’s begin with a reasonable hypothesis: is a first order recurrence. After groping around for a while, we find that this is actually a nonlinear homogeneous recurrence:
Here’s the first couple of terms:
No obvious pattern yet, but if we look at the sequence , we see that it looks like
Notice how the primes in the denominator are exactly the right amount of spacing apart ( is 2 away from , 4 away from , etc). Furthermore, for the nonprime denominators, they are always a divisor of what the expected value there is supposed to be (for example, the in the number right before the ), this seems to suggest that there’s always an inverse affine factor between two elements of the sequence. Since the denominator of is , let’s conjecture that
In fact, the sequence looks really nice!
why, this is just the sequence , so we can pretty accurately guess that
Warning: This is a really tedious proof; the point is to construct a series that cancels out the on the denominator of at each step of the induction.
Proof: By strong induction, for some , suppose that we have already shown that .
Consider the series
In the case, this looks like
notice that we can flip the order of this series and add with itself to get
By appealing to the induction hypothesis on , we also have
So, above, we’ve derived the following:
Substituting the left into the right, we get
which concludes the proof.
This recurrence allows us to write
Notice that this is the same as
Phew.
]]>The sequence is essentially the same as the Fibonacci sequence, with the exception that the initial conditions are altered slightly (so that , where is the fibonacci sequence). This similarity lets us use many of the same algebraic machinery that can be used to tackle the fibonacci numbers.
I will defer the derivation until the appendix, but it can be quickly shown using a little bit of linear algebra that letting be the golden ratio (and it’s “conjugate” wrt to the root), then
which looks to have the same form as the fibonacci numbers.
The numerical value of and , therefore as grows large, the contribution of becomes insignificant, serving only to ensure that is a whole number, therefore , in fact, from simple observation, it seems that for , (or rounded to the nearest integer).
Unfortunately, we don’t have quite a good way to compute with irrational objects: if we use floating point arithmetic, we quickly lose information about the trailing terms, if we use some algebraic package, then simplifying expressions with irrational numbers may become computationally infeasible.
However, we can exploit the connection between and to help us out. Suppose we have some software that can efficiently compute for any arbitrary , can we possibly use that to our advantage in computing ?
It can be shown (similar to the steps used in the appendix) that . Notice that
therefore
Alternatively, we can also compute this sequence directly in time. Recall that , then
so . Unfortunately, there’s no elegant solution for all odd terms, but recall that within the consecutive triple , at least one of these must be a multiple of three. Now, consider
Therefore, we can compute using
Memoizing gives a solution
C = {0:2,1:1} d = lambda k: 1 if k % 2 else 1 def c(k): if k in C: return C[k] if k % 2 is 0: r = c(k/2) C[k] = r*r  2*d(k/2) elif k % 3 is 0: r = c(k/3) C[k] = r*(r*r  3*d(k/3)) else: C[k] = c(k1) + c(k2) return C[k]
Suppose we have the Fibonacci sequence
defined by
Let’s construct a sequence of points such that
so that it looks like the sequence
Let’s connect all of the points in together (so that it forms a jagged line) and look at the area under the curve; we will call this quantity .
Give an algorithm that computes in time and compute the area for the case where .
If you look at the figure above, you’ll notice that the total region is made up of these trapezoids. Let’s call the area of the trapezoid defined by and , then as illustrated below,
we know that
which is just the area of the rectangle on the bottom plus the triangle on the top. From the relation , we know that , and similarly for , so that we have
Let’s make a connection to the exponential form of the Fibonacci sequence:
where are the golden ratio and its conjugate respectively. If we actually substitute this equation into the definition of and expand everything out, we will find that
Since is ‘s conjugate, we have , so each of the term can be rewritten as . Grouping related terms together, we get
simplifying, we find that
This is awesome! But we need to find the sum of , so let’s do just that
Sweet!
Since I was a little vague on specifying the range of trapezoids to consider for , I will accept both (the literalist interpretation) and (the more casual interpretation).
Consider the sequence , we can represent this sequence as an infinite dimensional vector called (so that indexing is the same as finding the element of the sequence). Now, consider the vectorspace in which inhabits, we can define a transformation (that I will call the rightshift operator) that basically takes in and outputs shifted one index over, so that . e.g.
It’s clear that this is a linear transform, because we can inductively define as the limit of the matrixvalued sequence of matrices whose first upper diagonal is constant 1:
Consider the subspace spanned by all of the shifted sequences:
the second equality comes from the fact that because . Now, because we don’t actually know what is (only that it exists), we can alternatively look at it as the kernel of the system
(because we’ve already established above that is a linear operator). Furthermore, and all scalar matrices commute. Appealing to a little bit of algebra, we know that, letting be the characteristic and be scalar roots of the quadratic such that , then
Therefore, let and , then the solution must be some linear combination of the solutions and :
for some scalar , because then
as expected.
Now, we know that , but this is the same as saying that letting , then , so . Using a similar argument, we can also argue that , so
We can then use our initial values that to solve for
so
]]>
Consider the placement of the zeros in an word: this is equivalent to the problem where we choose an arbitrary word prefixed with zeros, like
and permuting it in every possible way, excluding overlaps. Now, there are permutations of , but of those are zerooverlapped and of those are overlapped, so there are
unique permutations of a single word . How many possible s can we possibly make? Well, there are ‘s that we can fill out and each of these can take any value besides , so there are such words, therefore there are
words of length with exactly zeros.
We can boil this down into a decision problem. Suppose we use the decision variable to denote the total number of words of length with exactly zeros, then we can consider a componentbycomponent decision problem. At each cell of the word, we can either fill in a , or a . Suppose we begin at the left end of the word, if we fill in a at the current slot, then there are words available to us (because we used up one slot, so decrease by 1, and we used up one zero, so decrease by 1), and if we fill in either a or a , then we still have zeros to choose from, so there are such words available to us, this could be useful for site that use these kind of combinations to calculate prices in their products or services such as game boosting, so if you have the prices then is easier to buy overwatch boost. Combining the two together (in the spirit of dynamic programming), we get the recurrence
If we make a table of this
Hmm, there doesn’t seem to be an obvious pattern here. Since we know that , let’s attempt to do this inductively on :
We know that
Great, let’s go on to the case
For the case, we would get
and inductively, because
we have that
which we derived earlier.
]]>Notice that
recall that all terms of the form , so this becomes
Okay, let’s suppose that , then the above tells us that
If we compute the first few out, we will see that it looks like
Why, this is the Fibonacci sequence! It seems that for the first few terms we looked at, obeys the fibonacci recurrence
For ,
Proof: Substituting the definition of in, we have
Finally, recall that , so
so we would expect
If we want a closed form, we can find that, letting
then
Finally, we can compute
]]>
The infinity norm of a matrix is just the absolute maximum row sum of that matrix, or alternatively
so computing the norm shouldn’t be all that difficult. For a fullmatrix , it should take approximately time to compute the sum of a single row, and such computations to get the maximum row sum for a total of time.
If is also lower bidiagonal, then it would have the shape
that is, the diagonal and the first lowersubdiagonal are filled, but everywhere else are zero. Being bidiagonal grants the matrix a lot of nice properties; for one thing, it only takes operations to get the row sum now, so it only takes time to find the infinity norm of a bidiagonal matrix. Similarly, as we shall see later, it also only takes time to solve a system of linear equation of the form
if is bidiagonal.
Suppose you’re given a lowerbidiagonal matrix (defined above). How do you find the infinity norm of its inverse
One naive solution is to first compute the inverse explicitly, and then take the infinity norm of it. As we’ve hinted above, it takes time to solve a single system, then the system
can be solved as
where is the solution to the system
or, letting denote the solution of the system
and then we would look at the absolute row sum. The bottleneck here is the solves w.r.t the columns of the identity at cost each, so the entire process is expected to take time. This is actually pretty good, because the inverse takes to print out, so finding its maximum rowsum in this amount of time is pretty reasonable. But the world is a cruel place, and somewhere out there, and billion variable dynamic program awaits us that needs to shave off every possible second possible.
Can we do better than time and space?
Let’s begin with a reasonable direction: let’s figure out what the inverse of a bidiagonal matrix looks like. We’ll mainly be looking at the case but generalizing for any . Suppose that looks like
Now, if we take the inverse of , we’re not guaranteed that its bidiagonal structure will be preserved, but we know that it will at least be lower triangular. Let’s, for the sake of being punny, denote , then we know that
let’s start writing out the individual equations generated by this system:
Inductively, we can show that
furthermore, the zero equations can be inductively generalized as
Suppose that , then we know that . We also have an equation:
so that
and so on, we can compute every gamma with the recurrence
Great, this gives us a nice algorithm to fill out the entries of the inverse in time, but that doesn’t really help us.
Now,
so it doesn’t matter what the sign of is. Now, the s are entirely multiplicative, and hence their magnitudes do not get changed if any of or are changed. As such, we get the following lemma:
Suppose that such that , then
this basically says that we are allowed to arbitrarily flip the signs of the entries of without changing the value of . This follows from the fact that the entries of the inverse, , depend only multiplicatively on the entries of so flipping the signs of the elements of will only flip the signs of , but the magnitudes of each element of will remain the same.
In other words, is invariant under transformations during which the signs of are flipped. This is going to come in very handy, and the above recurrence for is constructive in that it gives an algorithm for which elements of to flip to reduce the problem into an easier space.
Now, consider the operation
it seems that multiplying any matrix by the column of ones ( from here on) will return a column of row sums. Great! Shouldn’t it then be the case that we just need to find the maximum column of ? Unfortunately, what we really want is the sum of the absolute values of each row, so if ever contains a negative number, this will break :(
Some of you may already see how to resolve this now; but for the moment, let’s just check whether it’s actually possible to at least do this rowsum operation in linear time! The problem is solving , or equivalently, finding an such that
here
so it takes time to solve this system. If only we can just do this…
By now, you might be a little perplexed by lemma , which we took a lot of pain and agony to develop, but never used. Now, the problem of solving this problem by maximizing the column of is that the elements of aren’t guaranteed positive. However, and its construction seems to say that if we use the correct signs in the elements of , then we might be able to manipulate the signs of the elements of without disturbing its final infinity norm. At this point, I had an idea:
can I somehow force all of the entries of the inverse to be positive by flipping the signs of the right elements? If we can, then from we can generate a whose inverse only contains positive numbers, so
which can be done in time! This is quite exciting, but how do we do it?
Recall in the proof of , we used the recurrence for the elements of the inverse as
Now, suppose that for some , we know immediately that , so the diagonals of must be positive. By the strong induction hypothesis, let’s assume that for any , , then for to also be positive, it better be the case that
so
so must be negative in , which gives a construction of
Taking the above into consideration, the following algorithm can be used to compute the infinity norm of the inverse.
diagonal of L = abs(diag(L)); subdiagonal of L = abs(diag(L, 1)); x(1) = 1/L(1,1); for k = 2 to n x(k) = (1  L(k,k1)*x(k1))/L(k,k); end return the maximum of x]]>
has a least upper bound
A continuous function between two cpos and is one that preserves the limit/least upper bound on all monotone chains:
All continuous functions are monotone in the sense that .
Proof: For the sake of deriving a contradiction, suppose that there exists some nonmonotone continuous function for some pairs of cpos and . Then it must be the case that for some pair of elements , but . Now, consider an chain
that is terminal and monotone. Obviously, its least upper bound , but
where
because otherwise . Hence, we can show that
and hence cannot be continuous; a contradiction! Therefore, all continuous functions are monotone.
]]>There was recently a good article on scientific computing, defined loosely as the dark art, as it may have seemed to the uninitiated, of deriving solutions to equations, dynamical systems, or whatnot that would have made your Mechanics professor scream in horror at the thought that they will need to somehow “solve” these systems. Of course, in the rich computer world today, almost any problem imaginable (exaggeration of course!) can already be solved by some existing tool. Hence more and more, the focus gets shifted from “how do I solve this differential equation” to “what do I ask google?”
My dad once told me of a glorious time “back in [his] days” when every respectable engineering institution would have a crash course on this dark art on top of Fortran. That was only 19 years ago.
Even now, when computer science departments everywhere no longer believes in the necessity in forcing all of their graduates to have a basic grasp on numerical analysis, there is still some draw in the subject that either makes people deathly afraid of it or embrace it as their life ambition. I am personally deathly afraid of it. Even then, there are quite many cute gems in the field, and as such, I am still very much so attracted to the field.
Scientific computing is the all encompassing field involving the design and analysis of numerical methods. I intend to start a survey of some of the basic (but also most useful) tools such as methods that: solve linear and nonlinear systems of equations, interpolate data, compute integrals, and solve differential equations. We will often do this on problems for which there exists no “analytical” solution (in terms of the common transcendental functions that we’re all used to).
In an ideal world, there would be a direct correspondence between numerical algorithms their implementation. Everything would work out of the box and there would be no need to worry that, even if you’ve implemented the onpaper algorithm correctly, it would somehow behave “differently”.
Of course, this isn’t the case. We’ve all heard of the age old saying that computers are finitary, and therefore it cannot represent all real numbers, specifically, there’s no way to represent irrationals, and in most of the cases, we do not fully represent all rationals either. Instead, we round numbers to a certain digit.
On the surface, this doesn’t seem too unfortunate. You probably did the same in your Chemistry lab report, to even more horrendous precision than what your computer will likely do for you. If you aced your Chemistry lab, then this will likely seem like a perfectly good scheme. But if you, like me, were troubled by the “sig fig” rules, then you are probably a little wary.
In fact, more often than not, you will not be bothered by the lack of a full spectrum of real numbers to choose from. However, when these little nasty “roundoff” errors are the culprit, they are often resolved through hours upon hours of debugging and general sense of hopelessness. To inherit a roundoff bug from someone else is like contracting the spanish flu: either you know what you’re doing and your system (with a little bit of agonizing) successfully resolve it, or you just won’t have any idea what you’re going to do (and die in the spanish flu metaphor). Think of this article as a vaccination against the roundoff bugs :)
Suppose little Gauss lived in the modern age. little Gauss’ teacher wanted to surf the internet, so he assigned all of his students the following integral to evaluate:
Being the clever alterego of the boy who immediately saw , little Gauss constructed a sequence
and with a little bit of clever manipulation (integration by parts), he found that, using
and
Why, this is a simple recursive function! Little Gauss was absolutely thrilled, he has at his disposal a programmable calculator capable of python (because he’s Gauss, he can have whatever the fuck he wants), and he quickly coded up the recurrence:
Python Code
import math def s(k): if k == 0: return 1  math.exp(1) else: return 1  k*s(k1)
He verified the first few cases by hand, and upon finding his code satisfactory, he computed the 100th element of the sequence as `1.1599285429663592e+141` and turned in the first few digits.
His teacher glanced at his solution, and knowing that there’s no way little Gauss could have done his school work with such proficiency, immediately declared it wrong. Upon repeated appeal, the teach finally relented and looked up the solution in his solution manual and, bewildered… again told little Gauss that he was WAAAAY off. Unfortunately for our tragic hero, this would not be a modern day retelling of a clever little boy who outsmarted his teacher. No, this is a tragic story of a clever little boy who succumbed to a fatal case of the roundoff bugs.
In fact, had his teacher been a little bit more clever, he would have been able to immediately see why Gauss’ solution wouldn’t have worked. It’s immediately obvious that between and , is always positive.
Furthermore, if a function is positive inside an interval, and suppose is also a positive in side the same interval but is everywhere smaller than , then obviously the area under must be bigger than that of . Similarly, if is everywhere larger than , then the area under must also be larger than that of . Now suppose , then because inside , then inside , and
or
and in general
of course can’t be on the order of !
So what went wrong? Well, whenever you’re in trouble, just make a plot!
Hmm. Everything seems to be going fine until around .
It turns out that there was nothing wrong with little Gauss’ method and the integral is perfectly wellbehaved. The culprit lies in the fact that can only be represented approximately on his calculator. As we know already, is irrational, and cannot be represented in finite amount of memory. is only represented approximately, slightly perturbed so that to the computer, we’re actually giving them a initial for that small perturbation (think of it as a really really really tiny number). Now, let’s see what Gauss’ calculator is computing once we unravel the recursion (we’ll use the notation to mean the calculated value of on the calculator):
Oh god! No wonder the method produced the wrong answer, the slight perturbation in the computed value of “propagates” throughout the computation and at the step, manifests itself as factorial times that original perturbation . Even at , we will see around fudged into the calculation. For , the answer will have an additional factor of ! Little Gauss definitely should have learned about roundoff errors.
Before we dig into the floating point encoding underlying most modern computing platforms, let’s talk about errors. Suppose that we’re computing the value of something and the true value of that computation is “stored” in a variable , but our computation is inexact, and in the end, we shall put that inexact result in a “hatted” version of the known result, that is .
Since there’s this notion of inexactness, we can talk about the “error” of our computation. There are two kinds of errors:
Absolute Error
This is just the difference between the true value of the computation and the inexact value . We shall define this as
note that ( subscript ) can be taken to mean the error of the computation of .
Relative Error
This the the ratio of the absolute error to the true value of the computation, or in other words
we read to mean the relative error in the computation of .
First, we can’t compute the absolute or relative errors, because if we can, then we would have know the true value of the computation already! Errors are purely analytic objects that can help us determine how wellbehaving our computations are. Second, in the analysis of floating point roundoff, we will typically exclusively use relative error. This may seem counterintuitive, but it has a few nice properties that simplifies error analysis, as we will see.
One more thing to add, if we allow to have any sign, then through some simple algebra, we will find that
Suppose that, through some series of computations, that we have the computed values of and . Now, in the next instruction, we wish to compute the value of . Because, we have already the (potentially inexact) computed values and , then it seems natural that to compute . we would just add :
Now, suppose that and similarly for , then it seems that
now, if we were doing error analysis, then we would want the relative error of , which from our earlier notation we already called . To find this, we merely need to solve the above equation for :
distributing both sides, we get
canceling the from both sides, we end up with
which gives .
Wow, what a mouthful. This defies intuition, as you would expect error to accumulate additively. Of course, this is true of the absolute errors:
but this no longer holds when you consider the relative error.
So why use relative error at all for analysis? It turns out that while convenient here, it becomes less tractable when reasoning about roundoff. Whenever you do an addition operation in floating point, you accumulate a small bit of absolute error from that operation itself! This absolute error unfortunately depends on the value of the computations itself much like relative error does, so sooner or later, you’re going to need to start reasoning about relative error. Worry not, we will develop a systematic formula for reasoning about the propagation of relative error that will boil down to high school level algebra (and some calculus).
Now, we’ve done addition. What about subtraction? You should easily verify for yourself that
where the relative error is defined as
Let’s now derive the propagated relative error of multiplication:
again, solving for boils down to solving the equation above
assuming that neither nor are 0, we can divide out the to get
which tells us that
It is traditional to assume that the relative errors of your computations are small (), for otherwise, there’s no point in computing a value if you can’t even get one digit correct! Therefore, if is tiny, and is tiny, then is insignificant in comparison. Therefore, we typically discard “higher order” terms. In effect, we just say that
I’ll leave it to your to verify that division propagates as
The propagation schemes we’ve talked about so far are all fine and dandy, but if we already have and we want to compute for some arbitrary but [twice] differentiable function ?
Well, let’s just call with as the argument then!
now, we can do some algebra and get
but we can no longer use our typical algebraic tools to solve the above equation for , since could be anything! Whatever will we do?
If you’ve had a little bit of calculus, then the section heading should probably give the answer away. We’ve already reasoned earlier that we don’t care about second order error terms because they’re so insignificant. But then, we can express as a polynomial in the error term through taylor series centered around !
if is twice differentiable, then is bounded to some constant, then that second order term tends to , which we can disregard (with some skepticism of how nonlinear/curvy is around the neighborhood of ). Using the first order taylor approximation as the right hand side, we can rewrite the above equation as
which, as long as , gives
Now, the restriction might seem a bit unfair, but recall that if , then we won’t have any digits of accuracy, so in effect, the relative error is in fact .
Summarized, suppose that we want to run the computation , but we have inexact computed values and each with relative error and respectively, then the computation will also be inexact, and its relative error will be
Suppose that we’ve computed with relative error and with no error. Furthermore, suppose that the true value of and . What will be the computed error of ?
We’re looking to compute
Now, we need to figure out a few things:
We can simplify this to , but even then, we’re still going to take a first order taylor expansion to get
Since we’re looking for the relative error, we need to factor out a out to get
Now, we’re just left with
From our table of error propagation above, we see that
so we’re just left with
From our table of error propagation, we see that
which finally simplifies our computed value as
so that the final relative error is . Note that the final relative error isn’t just , because we need to also take into account the error of computing . Also note that we are not working with above. To see this more concretely, we are essentially looking for in the following system
which gives the same solution .
This concludes our brief overview of error propagation. We’ll start on IEEE floating point encoding of rational numbers and how to avoid errors in computer arithmetic next time.
]]>true ? 1 : "Hello"
Take a look at the Java expression on the left hand side. Off of the top of your head, can you tell me if it will type check? If so what will be its type?
If you had asked me this question out of the blue, I would have probably said no; I would reason it as follows: this is a Java expression, so it must be assigned a type, meaning that we’re giving a system of equations saying that the type of 1
and the type of "Hello"
must be equal; this is clearly not the case as one is an Integer and the other a String, so this can’t possibly type check!
Of course, I’m actually just suffering from a common case of overlooking the obvious. It turns out that I’m forgetting about the whole subtyping thing built into Java and that I can actually assign Object o = true ? 1 : "Hello";
, because the type Object
sits at the of the type ordering.
Since we’re speaking of subtyping as a type relation anyways, let’s give it a symbolic name to make it easier for us: we’ll let and range over types in Java, and we’ll say
to mean that is a subtype of . Under one interpretation, we can think of this as saying that if we’re expecting a somewhere, we should theoretically be allowed to plug in a in its place and not expect weird or undefined behaviors as a result. Another perhaps more practical view is to claim that there exists some function called the coercion function (which we will denote, for the sake of historical consistency, ) that, given some subtype relation , will be able to translate an object of type into an object of type that behaves “similarly”.
Anyways, It’s obvious that both and (since everything’s a subtype of Object), so we can draw a subtyping hierarchy as a tree:
and we can type check true?1:"Hello" : Object
. This seems to suggest that we can typecheck all instances of to some type if we can typecheck to , to , and . In otherwords, is an upper bound of the types of and .
Of course, upon closer inspection, it turns out that both Integer and String share another common supertype: Serializable.
So just as well, we could claim that the ternary expression above extends Serializable. But if we want to “assign” a type to an expression, we don’t just want its supertypes.
Looking at the type hierarchy above, it’s clear that the lower a type is on that hierarchy, the more “precise” it is. When we say that an expression has a type, we don’t mean that it is bounded above by some set of super types; we actually mean that its most “precise” type is so and so. For example, we could just as well draw the type hierarchy for Integers as
and claim that a is a Serializable; but when I typically talk about the type of the expression , I’m usually talking about Integer, not Serializable.
From this, it’s clear that we also want a way to find the most precise (smallest) type of true ? 1 : "Hello"
such that it is also an upper bound of both and “Hello”; this is called the least upper bound, and we will for historical reasons call this the “join” of two types.
Symbolically, if we want to find the most precise upperbound (join) type of two types and , we will denote it as such that
the reason that the least upper bound is usually denoted as a square union symbol is actually somewhat natural: the union of two sets is the least upper bound of those two sets in the partial order (set inclusion), convince yourself that this is actually the case.
Note that whenever we join two types, the resulting type is “less precise”. We are after all finding an upper bound!
Going back to the problem at hand, the entire point of this exercise is then very simple: we want to find the least upper bound of Integer and String:
We’ve already seen earlier that both Object and Serializable are upper bounds of both Integer and String, but we also know that , so Serializable is a more precise (less) upper bound than Object is, but is Serializable the most precise?
In order to answer this question and find the least upper bound, we’re going to need the complete subtyping hierarchy. Consulting the Java documentation, we find that Integer is a subtype of Numbers, which are Serializable, and are themselves Comparable with other Integers. Furthermore, Strings are also Serializable and Comparable with other Strings! This gives the slightly more convoluted typing hierarchy:
At first first glance, it doesn’t seem possible that there’s a least upper bound: it seems like both Comparable and Serializable are the most precise upper bounds. Furthermore, this isn’t even necessarily true as Comparable itself might be too imprecise.
Consider the constraint
now, we know that
so we can rewrite this as
by plugging in into the in . (Wildcards, amiright?) Similarly, we can say the same about integers, so we get
but we , so a more precise upper bound of is . But now, we can construct a sequence
each of which a more precise approximation than the previous, which undoubtedly converges to the most precise type: and here’s the kicker, that type cannot be expressed finitely!
So how does Java handle these types? Let’s try it out. I wanted Java to output the internal type representation, so I need to get a typemismatch error during compile time.
It seems as if Java is reporting the type of the two as an unbounded quantification: just Comparable<?>
. This is intentional, Java treats the ? extends ...
as a constraint (or a system of inequalities) and discards them during type checking and coercion in order to use them later solely for typechecking, so we seemed to have for now solved the problem (So even though , we’re going to treat them as if they weren’t).
But we still have another problem, there’s is no least upper bound in the above hierarchy! Again, let’s see what Java does:
It turns out that the least upper bound is just the “intersection” over the three types, which should then be the LEAST upperbound of all three.
]]>If you’re like me, you spend your flight pondering the deepest and darkest corners of human knowledge, like
Of course, after a day of mentally exhausting travel, I’d like to sit back, relax, and ask myself how, given a specific type declaration in , I can figure out the number of all possible instances of that type.
For example, there is just one instance of the type unit (kind of funny because its name is unit) corresponding to the object in . There are two distinct instances of the type bool corresponding to the set , and about distinct instances of the type int.
Now, before we go on any further, let’s first take a look at the syntax of types.
type suit = Diamond  Club  Heart  Spade
this type declaration says that we want to create a new type whose instances are either labeled Diamond, Club, Heart, or Spade. The labels are capitalized to mean that they can serve as constructors of the type. So if you type in Diamond into the OCaml, it’ll construct an object of type suit corresponding to the Diamond label. Note that Diamond isn’t itself a type here.
It’s quite obvious that there are 4 distinct instances of the type suit that we can construct, namely .
We can also give each type constructor some extra stuff. For example, if we want types corresponding to the individual cards rather than suits, we can write something like
type card = Diamond of int  Club of int  Heart of int  Spade of int
so that we can construct an instance of the card Jack of Clubs as Club 11. Here, there are different instances of the type (we’re only familiar with 52 of these). Alternatively, we can also write each card as a pair of a suit and an integer.
type card = Card of (suit * int)
here, the notation suit * int stands for types whose objects are pairs, the first element of which is an instance of a suit and the second an integer; for example, Card (Club, 11) would be the jack of clubs.
One really interesting thing to observe here is that the two different ways of writing the card type above should be equivalent. In each, we are specifying a suit and also a value. Therefore, if there are distinct instances of the first type, there should also be instances of the second type, and it turns out, there are.
Furthermore, if we want to “parameterize” the types to be more generalized, we can do so by introducing type variables (variables that corresponds to types) as 'x that starts with a single apostrophe. So for example, if we want to build a linked list of whatever type we want, we could do something like
type 'x list = Null  Cons ('x * 'x list)
where the list of type int list (so we make the substitution ) is constructed from Cons(1, Cons(2, Cons(3, Null))).
A binary tree can be build up like
type 'x tree = Leaf  Node of ('x tree * 'x * 'x tree)
where the tree
has type int tree (so we make the substitution ) is constructed from
Node( Node(Leaf, 2, Leaf), 1, Node( Node(Leaf, 4, Leaf), 3, Node(Leaf, 5, Leaf) )
Now, we get to the interesting part. As discussed previously, we have already established that there’s only one object of the unit type (just ) and two (True and False) of the boolean type. Of course, there’s also only one object of the type
type one = One
and in fact, we can rename the constructor to anything (assuming that we don’t care about it being valid ocaml code) including the unit object , so it turns out that is equivalent to all types with a single constructor and equivalence isn’t effected under renaming of the constructor (well, there are certain rules to this renaming/substitution, and we’ll call this property equivalence). But is this the only property that preserves the count of the number of instances? Well, no, as it turns out, the following all only have one element.
type one' = One' of unit
type one'' = One'' of one'
...
this is intuitive, if we have only one possible object for the argument of the constructor, and we only have that one constructor, then we can only construct one object for that type. In fact, from now on, we will say that an argumentless type construct C is just a syntax sugar for C of unit because doing so will not change the number of instances for that type.
We’ve also established that there are two instances of booleans, and . In type parlance, this would be
type bool = True  False
but remember earlier that we said if we give a type constructor not expecting any argument the unit argument, the type’s size is preserved. Therefore, we can rewrite the above as
type bool = True of unit  False of unit
and since renaming things doesn’t affect the size of the set of objects satisfying this type, this is also equivalent to
type two = One of unit  Two of unit
Neat. Let’s count up one more.
There’s no native type in \f{OCaml} that only contains three instances, but it’s not too hard to guess how we’re going to construct this.
type three = One of unit  Two of unit  Three of unit
Here, we can either have or as instances of the type three. This gives a natural meaning to the notion of “or”. If I have types and , then the type
can either be one the or one of the , but since the two cases are necessarily distinct (because the label ), then if there are objects of type and objects of type , there must be objects of type .
It’s no coincidence either that the type is called a sum type between and (albeit for different reasons). Now that we’re warming to the notion that we can use these types almost as if we did in highschool algebra, let’s make some simplifications:
Since we do not ever care about what the labels themselves are, only that they are distinct, we can entirely leave out the labels for the type constructors. Therefore, we can write suit (type suit = Diamond of unit  Club of unit  ...) as
and the card type as either
or as
where the means the objects are of a pair type (with both a first and a second element). Note that we can also substitute suit directly in this type as in the second declaration of :
it’s no coincidence that there are also objects of type , but it’s also obvious that the two types are not equal syntactically. That is, .
Of course, it’s useful to construct another equivalence relation whereby type is translationally equivalent to type if both have the same number of type objects. In that case, we’ll say . As you probably have already guessed, we’re going to define some kind of a translation between these types into numbers denoting the number of objects of that type. We’ll go into detail more, but needless to say, we would translate as .
Okay, at this point, we kind of already know what we are after. But first, I want to establish some basic laws seen in school algebra.
Recall that
and that
Now, obviously,
but let’s look at something else. Suppose we have a pair type such that
now, there are again four instances of , they are
so it seems as if as well. In fact it’s not hard to show that product types correspond to multiplication in the same that cartesian products corresponds to multiplication of the cardinality. (In fact, this question reduces to the cardinality of sets case).
Just to get to the point, how many functions are there for which the domain is of size and the codomain is of size ? For example, suppose that I have the type
where is taken to mean a function that maps a type with three elements to a type with two. Well, in general, suppose I have a function , then it turns out that there are such functions: reduce this problem to counting in base .
Suppose I have elements in that need to be mapped into . Suppose both and are finite and are order into the sequences and respectively, then I can encode one such function
and a second function
but since we don’t really want to clutter up the notation with all of the braces, the subscripts, and everything, we could just encode each function as a stream of digits where the digit = corresponds to which the element maps to. Then, we can just write
but this is the same as counting up to in base .
Anyways, from this argument, we would expect to have distinct objects, and hence .
Let’s now define a translation between a simplified type declaration and a number denoting the total number of objects with that type declaration. We’ll write this relation . For now, pretend that is the naturals , but this will cause some subtle issues later on because we might not be able to translate certain types (like the list) directly into . Furthermore, is almost a function, but as discussed a few seconds ago, if we let be the naturals, it will not be able to translate certain types, so at best, it is a partial function. In that case, we will use the syntax to mean that .
To define this translation, let’s do it “recursively”. The validity of such a technique is beyond my scope, but convince yourself that because the right hand side is always translating smaller and smaller pieces, then eventually, the translation of any arbitrary type will terminate. Here, I will use to mean some type expression; that is, we will use it as a metavariable over the domain of types. Assume the same precedence level of as and power in school algebra.
Next, we define a notion of translational equivalence.
note that is a weaker notion of syntactic equality of , therefore, if , then so too must owing to the fact that is a partial function. Furthermore, because of the construction of the sequence
then is obviously onto . However, because we need to have this notion of , it’s also very obvious that two different type expressions could have the same number of instances and hence is not oneone.
We’re going to take some extraordinary risks here (because we’re not mathematicians) and assume that this translation is correct (there are in fact many problems). Let’s do something with this new translation of ours
in fact, under translational equality, we can manipulate types algebraically as if in school algebra. More amazingly, that means that we can encode everything the same exact way. Therefore, if we wanted to pass around an object of type , then we can instead pass around an object of type instead!
Suppose we wanted to translate the list type above:
unfortunately, we don’t have a definition of . Rather, we merely have an equation that is expected to be satisfied in order for an object to be considered an instance of the type. This equation is
it turns out that is the least fixed point of the function . This is going a bit beyond the scope of this post, but that least fixed point can be computed on the ordering of immediate subtype expression which turns the sequence into a complete partial order (complete because all subsets contain a least upper bound). Therefore, by the fixed point theorem, that least upper bound is the mysterious expression (where means the least upper bound)
now, the translation function is “continuous” in the sense that it preserves least upper bounds on the CPO . (Of course, this is why we can’t have = ), but loosely speaking, the operations can be substituted as
therefore,
unfortunately, if , the above sum diverges to the top element in , .
However, an interesting question arises that this type of analysis can readily answer: how many instances of the list type are translationally equivalent to the tuple. In this case, it turns out that for whatever basetype , there are exactly lists.
Of course, there’s a slightly easier hack which comes from the above derivation directly (namely because the translation preserves monotonicity, verify this!).
Why don’t we just translate the equation directly?
We’ll get
In fact, we could even get this by solving the equation and finding its generating function.
Neat. However, this brings up a good question. Can all such equations be solved? Unfortunately, no.
Let’s define a type
if we translate it, we get
now, if our codomain of the translation is just , then we would be forced to abandon hope. However, what we really did was we packed that codomain to include that you can think of as . What this can be interpreted as is that no matter what type variable becomes, type is going to diverge.
However, using the above trick, we still get the expansion
which says that all instances of are translationally equivalent to .
Let
which under translation gives
where corresponds to the catalan numbers (i.e., we’re generating the sequence that count the number of binary trees translationally equivalent to the tuple, which is the number of binary trees with nodes).
]]>