Which of the following two statements is logically true?

- All planets of the Solar System orbit the Sun. The Earth orbits the Sun. Consequently, the Earth is a planet of the Solar System.
- God is the creator of all things which exist. The Earth exists. Consequently, God created the Earth.

I've seen this question or variations of it pop up as "provocative" posts in social networks several times. At times they might invite lengthy discussions, where the participants would split into camps - some claim that the first statement is true, because Earth is indeed a planet of the Solar System and God did not create the Earth. Others would laugh at the stupidity of their opponents and argue that, obviously, only the second statement is correct, because it makes a valid logical implication, while the first one does not.

Not once, however, have I ever seen a proper *formal* explanation of what is happening here. And although it is fairly trivial (once you know it), I guess it is worth writing up. The root of the problem here is the difference between *implication* and *provability* - something I myself remember struggling a bit to understand when I first encountered these notions in a course on mathematical logic years ago.

Indeed, what any textbook on propositional logic will tell you in one of the first chapters is that you can write

$$A \to B$$

to express the statement "$A$ *implies* $B$". A chapter or so later you will learn that there is also a possibility to write

$$A \vdash B$$

to express a confusingly similar statement, that "$B$ is *provable* from $A$". To confirm your confusion, another chapter down the road you should discover that $A \vdash B$ is the same as $\vdash A \to B$, which, in turn, is logically equivalent to $A \to B$. Therefore, indeed, whenever $A \to B$ is true, $A \vdash B$ is true, and vice-versa. Is there a difference between $\vdash$ and $\to$ then, and why do we need the two different symbols at all? The "provocative" question above provides an opportunity to illustrate this.

Spoken language is rather informal, and there can be several ways of formally interpreting the same statement. Both statements in the puzzle are given in the form "$A$, $B$, consequently $C$". Here are at least four different ways to put them formally, which make the two statements true or false in different ways.

Anyone who has enough experience solving logic puzzles would know that both statements should be interpreted as abstract claims about provability (i.e. deducibility):

As mentioned above, this is equivalent to

or

In this interpretation the first statement is wrong and the second is a correct implication.
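As a sketch of what these missing formalizations might look like (the notation below is my reconstruction, not the author's original), write $P(x)$ for "$x$ is a planet of the Solar System", $O(x)$ for "$x$ orbits the Sun", $E(x)$ for "$x$ exists" and $C(y, x)$ for "$y$ created $x$". The two claims then read:

$$\{\forall x\, (P(x) \to O(x)),\; O(\mathrm{Earth})\} \;\vdash\; P(\mathrm{Earth}),$$

$$\{\forall x\, (E(x) \to C(\mathrm{God}, x)),\; E(\mathrm{Earth})\} \;\vdash\; C(\mathrm{God}, \mathrm{Earth}).$$

The second deduction is valid (instantiate the universal statement and apply modus ponens), while the first is the classic fallacy of affirming the consequent.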

People who have less experience with math puzzles would often assume that they should not exclude their common sense knowledge from the task. The corresponding formal statement of the problem then becomes the following:

In this case *both* statements become true. The first one is true simply because the consequent is true on its own, given common knowledge (the Earth is indeed a planet) - the antecedents and provability do not play any role at all. The second is true because it is a valid reasoning, independently of the common knowledge.

This type of interpretation is used in rhetorical phrases like "If this is true, I am a Dutchman".

Some people may prefer to believe that a logical statement should only be deemed correct if every single part of it is true and logically valid. The two claims must then be interpreted as follows:

Here the issue of provability is combined with the question about the truthfulness of the facts used. Both statements are false - the first fails on logic, and the second on facts (assuming that God creating the Earth is not part of common knowledge).

Finally, people very unfamiliar with strict logic would sometimes tend to ignore the words "consequently", "therefore" or "then", interpreting them as a kind of an extended synonym for "and". In their minds the two statements could be regarded as follows:

From this perspective, the first statement becomes true and the second (again, assuming the aspects of creation are not commonly known) is false.

Although the author of the original question most probably did really assume the "pure logic" interpretation, as is customary for such puzzles, note how much leeway there can be when converting a seemingly simple phrase in English to a formal statement. In particular, observe that questions about *provability*, where you deliberately have to abstain from relying on common knowledge, may be different from questions about *implication*, where common sense may be assumed and you can skip the whole "reasoning" part if you know the consequent is true anyway.

Here is a quiz question to check whether you understood what I meant to explain.

]]>"The sky is blue, and therefore the Earth is round." True or false?

The basic notion in quantum mechanics is a *quantum system*. Pretty much anything could be modeled as a quantum system, but the most common examples are elementary particles, such as electrons or photons. A quantum system is described by its *state*. For example, a photon has *polarization*, which could be vertical or horizontal. Another prominent example of a particle's state is its wave function, which represents its position in space.

There is nothing special about saying that things have state. For example, we may say that any cat has a "liveness state", because it can be either "dead" or "alive". In quantum mechanics we would denote these basic states using the bra-ket notation as $|\text{dead}\rangle$ and $|\text{alive}\rangle$. The strange thing about quantum mechanical systems, though, is the fact that quantum states can be combined together to form superpositions. Not only could a photon have a purely vertical polarization $|V\rangle$ or a purely horizontal polarization $|H\rangle$, but it could also be in a *superposition* of both vertical and horizontal states:

$$\frac{1}{\sqrt{2}}\left(|V\rangle + |H\rangle\right).$$
This means that if you asked the question "is this photon polarized vertically?", you would get a positive answer with 50% probability - in another 50% of cases the measurement would report the photon as horizontally-polarized. This is not, however, the same kind of uncertainty that you get from flipping a coin. The photon is not *either* horizontally or vertically polarized. It is *both *at the same time.

Amazed by this property of quantum systems, Schrödinger attempted to construct an example where a domestic cat could be considered to be in the state

$$\frac{1}{\sqrt{2}}\left(|\text{dead}\rangle + |\text{alive}\rangle\right),$$
which means being *both* dead and alive at the same time. The example he came up with, in his own words (citing from Wikipedia), is the following:

A cat is penned up in a steel chamber, along with the following device (which must be secured against direct interference by the cat): in a Geiger counter, there is a tiny bit of radioactive substance, so small, that perhaps in the course of the hour one of the atoms decays, but also, with equal probability, perhaps none; if it happens, the counter tube discharges and through a relay releases a hammer that shatters a small flask of hydrocyanic acid. If one has left this entire system to itself for an hour, one would say that the cat still lives if meanwhile no atom has decayed. The first atomic decay would have poisoned it.

The idea is that after an hour of waiting, the radioactive substance must be in the state

$$\frac{1}{\sqrt{2}}\left(|\text{decayed}\rangle + |\text{not decayed}\rangle\right),$$

the poison flask should thus be in the state

$$\frac{1}{\sqrt{2}}\left(|\text{broken}\rangle + |\text{intact}\rangle\right),$$

and the cat, consequently, should be

$$\frac{1}{\sqrt{2}}\left(|\text{dead}\rangle + |\text{alive}\rangle\right).$$
Correct, right? No.

Superposition, i.e. being "in *both* states at once", is not the only type of uncertainty possible in quantum mechanics. There is also the "usual" kind of uncertainty, where a particle is in *either* of two states, we just do not exactly know which one. For example, if we measure the polarization of a photon, which was originally in the superposition $\frac{1}{\sqrt{2}}(|V\rangle + |H\rangle)$, there is a 50% chance the photon will end up in the state $|V\rangle$ after the measurement, and a 50% chance the resulting state will be $|H\rangle$. If we do the measurement, but *do not look at the outcome*, we know that the resulting state of the photon must be *either* of the two options. It is *not* a superposition anymore. Instead, the corresponding situation is described by a statistical ensemble:

$$\left\{\,|V\rangle \text{ with probability } 0.5;\quad |H\rangle \text{ with probability } 0.5\,\right\}.$$
Although it may seem that the difference between a superposition and a statistical ensemble is a matter of terminology, it is not. The two situations are truly different and can be distinguished experimentally. Essentially, every time a quantum system is measured (which happens, among other things, every time it interacts with a non-quantum system) all the quantum superpositions are "converted" to ensembles - concepts native to the non-quantum world. This process is sometimes referred to as decoherence.
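To get a feeling for why the two situations can be told apart, here is a standard check (the worked example is mine, not from the original post): measure the photon in the "diagonal" basis $|\pm\rangle = \frac{1}{\sqrt{2}}\left(|V\rangle \pm |H\rangle\right)$. The superposition state is exactly $|+\rangle$, so it always yields the outcome "+":

$$\frac{1}{\sqrt{2}}\left(|V\rangle + |H\rangle\right) = |+\rangle \quad\Rightarrow\quad \Pr(+) = 1.$$

The ensemble, on the other hand, contains either $|V\rangle$ or $|H\rangle$, and each of those yields "+" only with probability $|\langle +|V\rangle|^2 = |\langle +|H\rangle|^2 = \frac{1}{2}$, so $\Pr(+) = \frac{1}{2}$. Repeating such a measurement over many identically prepared photons distinguishes the two preparations.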

Now recall Schrödinger's cat. For the cat to die, a Geiger counter must register a decay event, triggering a killing procedure. The registration within the Geiger counter is effectively an act of *measurement*, which will, of course, "convert" the superposition state into a statistical ensemble, just like in the case of a photon which we just measured without looking at the outcome. Consequently, the poison flask will never be in a superposition of being "*both* broken and not". It will be *either* broken *or* not, just like any non-quantum object should. Similarly, the cat will also end up being *either* dead or alive - you just cannot know exactly which option it is before you peek into the box. Nothing special or quantum'y about this.

"But what gives us the right to claim that the Geiger counter, the flask and the cat in the box are "non-quantum" objects?", an attentive reader might ask here. Could we *imagine* that everything, including the cat, is a quantum system, so that no actual measurement or decoherence would happen inside the box? Could the cat be "both dead and alive" then?

Indeed, we could try to *model* the cat as a quantum system with $|\text{dead}\rangle$ and $|\text{alive}\rangle$ being its basis states. In this case the cat indeed could end up in the state of being both dead and alive. However, this would not be its most exciting capability. Way more surprisingly, we could then *kill and revive* our cat at will, back and forth, by simply *measuring* its liveness state appropriately. It is easy to see how this model is unrepresentative of real cats in general, and the worry about them being able to be in superposition is just one of the many inconsistencies. The same goes for the flask and the Geiger counter, which, if considered to be quantum systems, get the magical abilities to "break" and "un-break", "measure" and "un-measure" particles at will. Those would certainly not be a real-world flask or counter anymore.

There is one way to bring quantum superposition back into the picture, although it requires some rather abstract thinking. There is a theorem in quantum mechanics, which states that any statistical ensemble can be regarded as a *partial view of a higher-dimensional superposition*. Let us see what this means. Consider a (non-quantum) Schrödinger's cat. As it might be hopefully clear from the explanations above, the cat must be *either* dead *or* alive (not both), and we may formally represent this as a statistical ensemble:

$$\left\{\,|\text{dead}\rangle \text{ with probability } 0.5;\quad |\text{alive}\rangle \text{ with probability } 0.5\,\right\}.$$
It turns out that this ensemble is *mathematically equivalent* in all respects to a superposition state of a higher order:

$$\frac{1}{\sqrt{2}}\left(|\text{Universe A}\rangle \otimes |\text{dead}\rangle + |\text{Universe B}\rangle \otimes |\text{alive}\rangle\right),$$
where "Universe A" and "Universe B" are some abstract, unobservable "states of the world". The situation can be interpreted by imagining two parallel universes: one where the cat is dead and one where it is alive. These universes exist *simultaneously* in a superposition, and we are present in *both* of them at the same time, until we open the box. When we do, the universe superposition collapses to a single choice of the two options and we are presented with either a dead, or a live cat.

Yet, although the *universes* happen to be in a superposition here, existing both at the same time, the cat itself remains *completely ordinary*, being either totally dead or fully alive, depending on the chosen universe. The Schrödinger's cat is just a cat, after all.

Ever since the "Prior Confusion" post I was planning to formulate one of its paragraphs as the following abstract puzzle, but somehow it took me 8 years to write it up.

According to fictional statistical studies, the following is known about a fictional chronic disease "statistite":

- About 30% of people in the world have statistite.
- About 35% of men in the world have it.
- In Estonia, 20% of people have statistite.
- Out of people younger than 20 years, just 5% have the disease.
- A recent study of a random sample of visitors to the Central Hospital demonstrated that 40% of them suffer from statistite.

Mart, a 19-year-old Estonian male medical student, is standing in the foyer of the Central Hospital, reading these facts from an information sheet and wondering: what are his current chances of having statistite? How should he *model himself*: should he consider himself as primarily "an average man", "a typical Estonian", "just a young person", or "an average visitor of the hospital"? Could he combine the different aspects of his personality to make better use of the available information? How? In general, what would be the best possible probability estimate, given the data?

Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

$$\mathrm{cov}(X, Y) = E\left[(X - E X)(Y - E Y)\right].$$

The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the *covariance matrix* for multiple variables as $\Sigma = E\left[(\mathbf{x} - E\mathbf{x})(\mathbf{x} - E\mathbf{x})^T\right]$, and stop there. Later on the covariance matrix would pop up here and there in seemingly random ways. In one place you would have to take its inverse, in another - compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

The best way to truly understand the covariance matrix is to forget the textbook definitions completely and depart from a different point instead. Namely, from the definition of the multivariate Gaussian distribution:

We say that the vector $x$ has a *normal* (or *Gaussian*) distribution with mean $\mu$ and covariance $\Sigma$ if:

$$\Pr(x) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}}\exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$

To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. $\mu = 0$) and refrain from writing out the normalizing constant $\frac{1}{\sqrt{(2\pi)^d \det \Sigma}}$. Now, the definition of the (centered) multivariate Gaussian looks as follows:

$$\Pr(x) \propto \exp\left(-\frac{1}{2}x^T \Sigma^{-1} x\right).$$
Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but *the parameter of the Gaussian distribution*. That's it. You will see where it will lead us in a moment.

Consider a symmetric Gaussian distribution, i.e. the one with $\Sigma = I$ (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

We know from above that the likelihood of each point $x$ in this sample is

$$\Pr(x) \propto \exp\left(-\frac{1}{2}x^T x\right). \tag{1}$$

Now let us apply a linear transformation $A$ to the points, i.e. let $y = Ax$. Suppose that, for the sake of this example, $A$ scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points $y$:

What is the distribution of $y$? Just substitute $x = A^{-1}y$ into (1), to get:

$$\Pr(y) \propto \exp\left(-\frac{1}{2}\left(A^{-1}y\right)^T\left(A^{-1}y\right)\right) = \exp\left(-\frac{1}{2}y^T \left(AA^T\right)^{-1} y\right). \tag{2}$$

This is exactly the Gaussian distribution with covariance $\Sigma = AA^T$. The logic works both ways: if we have a Gaussian distribution with covariance $\Sigma$, we can regard it as a *distribution which was obtained by transforming the symmetric Gaussian by some $A$*, and we are given $AA^T = \Sigma$.

More generally, if we have *any* data, then, when we compute its covariance to be $\Sigma$, we can say that *if our data were Gaussian*, then *it could have been obtained* from a symmetric cloud using some transformation $A$, and we just estimated the matrix $AA^T$, corresponding to this transformation.

Note that we do not know the actual $A$, and it is mathematically totally fair. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if $A$ is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, $AA^T = I$ for all rotation matrices $A$. When we see a unit covariance matrix we really do not know whether it is the “originally symmetric” distribution, or a “rotated symmetric distribution”. And we should not really care - those two are identical.
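Here is a small numeric sketch of this logic, reusing the example transformation from above (scale the vertical axis by 0.5, then rotate by 30 degrees); the sample size is an arbitrary choice of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric ("round") Gaussian cloud in 2D, i.e. Sigma = I.
X = rng.standard_normal((100_000, 2))

# The transformation A: scale the vertical axis by 0.5, then rotate by 30 degrees.
S = np.diag([1.0, 0.5])
phi = np.deg2rad(30)
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
A = R @ S

# Transform every point: y = A x (points are stored as rows, hence the transpose).
Y = X @ A.T

# The sample covariance of the transformed cloud is approximately A A^T.
print(np.round(np.cov(Y, rowvar=False), 3))
print(np.round(A @ A.T, 3))
```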

There is a theorem in linear algebra, which says that any symmetric matrix $\Sigma$ can be represented as:

$$\Sigma = V D V^T, \tag{3}$$

where $V$ is orthogonal (i.e. a rotation) and $D$ is diagonal (i.e. a coordinate-wise scaling). If we rewrite it slightly, we will get:

$$\Sigma = \left(V D^{1/2}\right)\left(V D^{1/2}\right)^T = A A^T, \tag{4}$$

where $A = V D^{1/2}$. This, in simple words, means that *any covariance matrix* could have been the result of transforming the data using *a coordinate-wise scaling* $D^{1/2}$ followed by *a rotation* $V$. Just like in our example with the scaling by 0.5 and the rotation by 30 degrees above.

Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us *assume (or “pretend”)* it came from a normal distribution, and let us ask the following questions:

- What could have been the rotation $V$ and scaling $D^{1/2}$, which produced our data from a symmetric cloud?
- What were the original, “symmetric-cloud” coordinates before this transformation was applied?
- Which original coordinates were scaled the most by $D$ and thus contribute most to the spread of the data now? Can we keep only those and throw the rest out?

All of those questions can be answered in a straightforward manner if we just decompose $\Sigma$ into $V$ and $D$ according to (3). But (3) is exactly the eigenvalue decomposition of $\Sigma$. I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.
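As an illustration of how little is left to do, here is a minimal PCA sketch along these lines (a toy sketch under the assumptions above, not a reference implementation):

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the directions of largest variance."""
    Xc = X - X.mean(axis=0)                  # center the data
    Sigma = np.cov(Xc, rowvar=False)         # estimate the covariance matrix
    eigval, eigvec = np.linalg.eigh(Sigma)   # Sigma = V D V^T, i.e. equation (3)
    order = np.argsort(eigval)[::-1]         # most stretched directions first
    V = eigvec[:, order[:n_components]]
    return Xc @ V                            # coordinates in the rotated basis

# Toy data: a cloud stretched mostly along one direction.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2)) @ np.diag([3.0, 0.3])
Z = pca(X, n_components=1)
print(Z.shape)   # (500, 1)
```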

Bear with me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard $\Sigma^{-1}$ as a metric tensor. A metric tensor is just a fancy formal name for a matrix, which summarizes the *deformation of space*. However, rather than claiming that it in some sense determines a particular transformation $A$ (which it does not, as we saw), we shall say that it affects the way we compute *angles and distances* in our transformed space.

Namely, let us redefine, for any two vectors $x$ and $y$, their inner product as:

$$\langle x, y \rangle = x^T \Sigma^{-1} y. \tag{5}$$

To stay consistent we will also need to redefine the *norm* of any vector as

$$\|x\| = \sqrt{x^T \Sigma^{-1} x}, \tag{6}$$

and the *distance* between any two vectors as

$$d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}. \tag{7}$$

Those definitions now describe a kind of a “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

$$\langle x, y \rangle = x^T \left(A A^T\right)^{-1} y = \left(A^{-1}x\right)^T \left(A^{-1}y\right). \tag{8}$$

The following illustration might shed light on what is actually happening in this -“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).

Getting back to our example with the transformed points, we could now say that the point-cloud $y$ is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a *skewed space*. The deformation of this space is described by the tensor $\Sigma^{-1}$ (which is, as we know, equal to $\left(AA^T\right)^{-1}$). The PCA now becomes a method for analyzing the *deformation of space*, how cool is that.

We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

Let us forget the whole skewed space story for a moment, and get back to the usual inner product $w^T x$. Think of this inner product as a function $f_w(x)$, which takes a vector $x$ and maps it to a real number, the dot product of $w$ and $x$. Regard the $w$ here as the *parameter (“weight vector”)* of the function. If you have done any machine learning at all, you have certainly come across such *linear functionals* over and over, sometimes in disguise. Now, the set of *all possible linear functionals* $\{f_w\}$ is known as the *dual space* to your “data space”.

Note that each linear functional is determined uniquely by the parameter vector , which has the same dimensionality as , so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element of your “data space” denotes, well, a data point. An element of the dual space denotes a function , which *projects* your data points on the direction (recall that if is unit-length, is exactly the length of the perpendicular projection of upon the direction ). So, in some sense, if -s are “vectors”, -s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of would have to correspond to “units per kilogram”. Still with me?

Let us now get back to the skewed space. If are elements of a skewed Euclidean space with the metric tensor , is a function an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the *parameterization *of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon nor can we say that is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy . Instead, they satisfy . Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form . The new parameters could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

However when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. $w^T x$, without any ugly $\Sigma^{-1}$-s inside. What happens if we turn a “skewed space” linear functional from its natural representation $v^T \Sigma^{-1} x$ into a simple inner product?

$$v^T \Sigma^{-1} x = \left(\Sigma^{-1} v\right)^T x = w^T x, \tag{9}$$

where $w = \Sigma^{-1} v$. (Note that we can lose the transpose because $\Sigma^{-1}$ is symmetric).

What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form $w = \Sigma^{-1} v$. Or, in other words, $\Sigma^{-1}$ is a *transformation, which maps from “skewed perpendiculars” to “true perpendiculars”*. Let me show you what this means visually.

Consider again the two “orthogonal” vectors from the skewed world example above:

Let us interpret the blue vector as an element of the *dual space*. That is, it is the vector in a linear functional . The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression to evaluate the functional. However, why don’t we find the actual *normal* to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

It is not too hard to see that $\Sigma^{-1} v$ will give us that normal. Here it is, the black arrow:

Therefore, next time, whenever you see expressions like or , remember that those are simply *inner products and (squared) distances *in a skewed space, while is a *conversion from a skewed normal to a true normal. *Also remember that the “skew” was estimated by pretending that the data were normally-distributed.

Once you see it, the role of the covariance matrix in some methods like the Fisher’s discriminant or Canonical correlation analysis might become much more obvious.

“But wait”, you should say here. “You have been talking about expressions like all the time, while things like are also quite common in practice. What about those?”

Hopefully you know enough now to suspect that is again an inner product or a squared norm in some deformed space, just not the “internal data metric space”, that we considered so far. Which space is it? It turns out it is the “internal *dual *metric space”. That is, whilst the expression denoted the “new inner product” between the *points*, the expression denotes the “new inner product” between the *parameter vectors*. Let us see why it is so.

Consider an example again. Suppose that our space transformation scaled all points by 2 along the axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric, will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

What should happen to the *parameter ("direction")* vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (”1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”,0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

More formally, if $x' = Ax$, then

$$v' = \left(A^{-1}\right)^T v. \tag{10}$$

This means that the transformation $A$ of the data points implies the transformation $\left(A^{-1}\right)^T$ of the dual vectors. The metric tensor for the dual space must thus be:

$$\left(\left(A^{-1}\right)^T A^{-1}\right)^{-1} = A A^T = \Sigma. \tag{11}$$

Remember the illustration of the “unit circle” in the $\Sigma^{-1}$ metric? This is how the unit circle looks in the corresponding $\Sigma$ metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that $v^T \Sigma v$ is exactly the variance of the data along the direction $v$ (assuming $\|v\| = 1$).

Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: *why is the inverse covariance of the data a good estimate for that metric tensor*? After all, it is not obvious that $\left(\frac{1}{n}X^T X\right)^{-1}$ (where $X$ is the data matrix) must be related to the $\Sigma^{-1}$ in the distribution equation $\exp\left(-\frac{1}{2}x^T \Sigma^{-1} x\right)$.

Here is one possible way to see the connection. Firstly, let us take it for granted that if $X$ is sampled from a symmetric Gaussian, then $\frac{1}{n}X^T X$ is, on average, a unit matrix. This has nothing to do with transformations, but is just a consequence of pairwise independence of variables in the symmetric Gaussian.

Now, consider the transformed data, $XA^T$ (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of $XA^T$?

$$\frac{1}{n}\left(XA^T\right)^T\left(XA^T\right) = A\left(\frac{1}{n}X^T X\right)A^T \approx A A^T, \tag{12}$$

the familiar tensor.

This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from *any* distribution with uncorrelated variables of unit variance and then transformed using some matrix $A$, the expression $\frac{1}{n}X^T X$ will still be an estimate of $AA^T$, the metric tensor for the corresponding (dual) space deformation.

However, note that out of *all *possible initial distributions , the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember, that whenever you even *mention* the word "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.

Imagine a weight hanging on a spring. Let us pull the weight a bit and release it into motion. What will its motion look like? If you remember some of your high-school physics, you should probably answer that the resulting motion is a simple harmonic oscillation, best described by a sinewave. Although this is a fair answer, it actually misses an interesting property of real-life springs. A property most people don't think much about, because it goes a bit beyond the high school curriculum. This property is best illustrated by the "slinky drop" experiment.

The "slinky drop" is a fun little experiment which has got its share of internet fame.

When the top end of a suspended slinky is released, the bottom seems to patiently wait for the top to arrive before starting to fall as well. This looks rather unexpected. After all, we know that things fall down according to a parabola, and we know that springs collapse according to a sinewave, however neither of the two rules seem to apply here. If you browse around, you will see lots of awesome videos demonstrating or explaining this effect. There are news articles, forum discussions, blog posts and even research papers dedicated to the magical slinky. However, most of them are either too sketchy or too complex, and none seem to mention the important general implications, so let me give a shot at another explanation here.

Let us start with the classical, "high school" model of a spring. The spring has some length $L$ in the relaxed state, and if we stretch it, making it longer by $\Delta x$, the two ends of the spring exert a contracting force of $k\,\Delta x$. Assume we hold the top of the spring at the vertical coordinate $0$ and have it balance out. The lower end will then position at the coordinate where the gravity force is balanced out exactly by the spring force.
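In formulas (my reconstruction, assuming a point mass $m$ hanging at the lower end and a spring constant $k$): the balance condition is $k\,\Delta x = m g$, so the lower end settles at

$$x_{\text{bottom}} = 0 - L - \Delta x = -\left(L + \frac{m g}{k}\right).$$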

How would the two ends of the spring behave if we let go of the top now? Here's how:

The horizontal axis here denotes time, the vertical axis the vertical position. The blue curve is the trajectory of the top end of the spring, the green curve - the trajectory of the bottom end. The dotted blue line is offset from the blue line by exactly $L$ - the length of the spring in the relaxed state.

Observe that the lower end (the green curve), similarly to the slinky, "waits" for quite a long time for the top to approach before starting to move with discernible velocity. Why is it the case? The trajectory of the lower point can be decomposed into two separate movements. Firstly, the point is trying to fall down due to gravity, following a parabola. Secondly, the point is being affected by spring tension and thus follows a cosine trajectory. Here's how the two trajectories look separately:

They are surprisingly similar at the start, aren't they? And indeed, the cosine function does resemble a parabola up to a term of order $t^4$. Recall the corresponding Taylor expansion:

$$\cos t = 1 - \frac{t^2}{2} + \frac{t^4}{24} - \dots$$
If we align the two curves above, we can see how well they match up at the beginning:

Consequently, the two forces happen to "cancel" each other long enough to leave an impression that the lower end "waits" for the upper for some time. This effect is, however, much more pronounced in the slinky. Why so?

Because, of course, a single spring *is not a good model* for the slinky. It is more correct to regard a slinky as a *chain* of springs. Observe what happens if we model the slinky as a chain of just three simple springs:

Each curve here is the trajectory of one of the nodes in between the three individual springs. We can see that the top two curves behave just like a single spring did - the green node waits a bit for the blue and then starts moving. The red one, however, has to wait longer, until the green node moves sufficiently far away. The bottom, in turn, will only start moving observably when the red node approaches it close enough, which means it has to wait even longer yet - by that time the top has already arrived. If we consider a more detailed model, the movement of a slinky composed of, say, 9 basic springs, the effect of intermediate nodes "waiting" becomes even more pronounced:
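If you would like to experiment with this discrete "chain of springs" model yourself, here is a minimal simulation sketch (the number of nodes, masses and spring constants are arbitrary choices of mine, not the values behind the figures above):

```python
import numpy as np

# A slinky modeled as n point masses connected by n-1 springs.
n, m, k, g, dt, steps = 10, 0.05, 5.0, 9.81, 1e-3, 1500

# Initial condition: hanging in equilibrium with the top node held at coordinate 0.
# The spring above node i carries the weight of all the nodes from i downwards.
x = np.zeros(n)
for i in range(1, n):
    stretch = (n - i) * m * g / k       # equilibrium stretch of the segment above node i
    x[i] = x[i - 1] - 1.0 - stretch     # relaxed segment length is 1.0
v = np.zeros(n)

trajectory = [x.copy()]
for _ in range(steps):                  # release the top and integrate (semi-implicit Euler)
    stretch = (x[:-1] - x[1:]) - 1.0    # current stretch of each segment
    f = np.full(n, -m * g)              # gravity on every node
    f[:-1] -= k * stretch               # each segment pulls its upper node down...
    f[1:] += k * stretch                # ...and its lower node up
    v += (f / m) * dt
    x += v * dt
    trajectory.append(x.copy())

trajectory = np.array(trajectory)       # node positions over time, shape (steps + 1, n)
print(trajectory[-1])                   # e.g. plot trajectory[:, j] to see node j "waiting"
```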

To make a "mathematically perfect" model of a slinky we have to go to the limit of having *infinitely many* infinitely small springs. Let's briefly take a look at what that solution looks like.

Let denote the coordinate of a point on a "relaxed" slinky. For example, in the two discrete models above the slinky had 4 and 10 points, numbered and respectively. The continuous slinky will have infinitely many points numbered .

Let denote the vertical coordinate of a point at time . The acceleration of point at time is then, by definition , and there are two components affecting it: the gravitational pull and the force of the spring.

The spring force acting on a point is proportional to the *stretch* of the spring at that point. As each point is affected by the stretch from above and below, we have to consider a difference of the "top" and "bottom" stretches, which is thus the derivative of the stretch. Consequently, the dynamics of the slinky can be described by the equation:

where is some positive constant. Let us denote the second derivatives by and , replace with and rearrange to get:

(1)

which is known as the wave equation. The name stems from the fact that solutions to this equation always resemble "waves" propagating at a constant speed through some medium. In our case the medium will be the slinky itself. Now it becomes apparent that, indeed, the lower end of the slinky should not move before the* wave of disturbance*, unleashed by releasing the top end, reaches it. Most of the explanations of the slinky drop seem to refer to that fact. However when it is stated alone, without the wave-equation-model context, it is at best a rather incomplete explanation.
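For reference, the missing equation above can be reconstructed along the following lines (the notation $u(\xi, t)$ for the vertical coordinate of point $\xi$ at time $t$ is my choice): Newton's law for each infinitesimal piece of the slinky gives

$$\ddot{u}(\xi, t) = k\, u''(\xi, t) - g,$$

where dots denote time derivatives and primes denote derivatives with respect to $\xi$; substituting $k = c^2$ and rearranging yields the wave-equation form

$$\ddot{u} - c^2 u'' = -g.$$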

Given how famous the equation is, it is not too hard to solve it. We'll need to do it twice - first to find the initial configuration of a suspended slinky, then to compute its dynamics when the top is released.

In the beginning the slinky must satisfy (because it is not moving at all), (because the top end is located at coordinate 0), and (because there is no stretch at the bottom). Combining this with (1) and searching for a polynomial solution, we get:

Next, we release the slinky, hence the conditions and disappear and we may use the d'Alembert's formula with reflected boundaries to get the solution:

Here's what the solution looks like visually:

Notice how the part of the slinky to which the wave has not arrived yet, stays completely fixed in place. Here are the trajectories of 4 equally-spaced points on the slinky:

Note how, quite surprisingly, all points of the slinky are actually moving with a constant speed, changing it abruptly at certain moments. Somewhat magically, the mean of all these piecewise-linear trajectories (i.e. the trajectory of the center of mass of the slinky) is still a smooth parabola, just as it should be:

Now let us come back to where we started. Imagine a weight on a spring. What will its motion be like? Obviously, any real-life spring is, just like the slinky, best modeled not as a Hooke's simple spring, but rather via the wave equation. Which means that when you let go of the weight, the weight will send a deformation wave, which will move along the spring back and forth, affecting the pure sinewave movement you might be expecting from the simple Hooke's law. Watch closely:

Here is what the movement of the individual nodes looks like:

The fat red line is the trajectory of the weight, and it is certainly *not* a sinewave. It is a curve in between the piecewise-linear "sawtooth" (which is the limit case when the weight is zero) and the true sinusoid (which is the limit case when the mass of the spring is zero). Here's how the zero-weight case looks:

And this is the other extreme - the massless spring:

These observations can be summarized into the following obviously-sounding conclusion: the basic Hooke's law applies exactly only to the *massless* spring. Any real spring has a mass and thus forms an *oscillation wave* traveling back and forth along its length, which will interfere with the weight's simple harmonic oscillation, making it "less simple and harmonic". Luckily, if the mass of the weight is large enough, this interference is negligible.

And that is, in my opinion, one of the interesting, yet often overlooked aspects of spring motion.

A question on Quora reminded me that I wanted to post this explanation here every time I got a chance to teach SVMs and Kernel methods, but I never found the time. The post expects basic knowledge of those topics from the reader.

The concept of *kernel methods* is probably one of the coolest tricks in machine learning. With most machine learning research nowadays being centered around neural networks, they have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, $x^T y$, and square it, the result can be regarded as a dot product of two "*feature vectors*", where the *features* are all pairwise products of the original inputs:

$$\left(x^T y\right)^2 = \left(\sum_i x_i y_i\right)^2 = \sum_{i,j} (x_i x_j)(y_i y_j) = \langle \phi(x), \phi(y) \rangle, \qquad \phi(x) = \left(x_i x_j\right)_{i,j}.$$
Analogously, if you raise $x^T y$ to the third power, you are essentially computing a dot product within a space of all possible *three-way products* of your inputs, and so on, without ever actually having to see those features explicitly.

If you now take any linear model (e.g. linear regression, linear classification, PCA, etc) it turns out you can replace the "real" dot product in its formulation with such a *kernel function*, and this will magically convert your model into a *linear model with nonlinear features* (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem if there were millions or billions of them.

Consider, for example, plain old linear regression: $f(x) = w^T x$. We can "kernelize" it by first representing $w$ as a linear combination of the data points (this is called a dual representation):

$$f(x) = \left(\sum_i \alpha_i x_i\right)^T x = \sum_i \alpha_i \left(x_i^T x\right),$$

and then swapping all the dot products with a custom *kernel function*:

$$f(x) = \sum_i \alpha_i K(x_i, x).$$

If we now substitute $K(x, y) = \left(x^T y\right)^2$ here, our model becomes a second degree polynomial regression. If $K(x, y) = \left(x^T y\right)^5$, it is the fifth degree polynomial regression, etc. It's like magic, you plug in different functions and things just work.
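As a small illustration (a sketch of mine, not the post's original code), here is how such a "kernelized" regression could look in practice, with ridge regularization added to fit the coefficients $\alpha$:

```python
import numpy as np

def fit_kernel_ridge(X, y, kernel, lam=1e-3):
    """Fit alpha in f(x) = sum_i alpha_i * kernel(x_i, x), with ridge penalty lam."""
    K = kernel(X, X)                                   # Gram matrix K_ij = k(x_i, x_j)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return alpha

def predict(X_train, alpha, kernel, X_new):
    return kernel(X_new, X_train) @ alpha

# A degree-2 polynomial kernel: k(x, y) = (x^T y)^2.
poly2 = lambda A, B: (A @ B.T) ** 2

# Toy data: y is a quadratic function of x plus noise.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.1 * rng.standard_normal(200)

alpha = fit_kernel_ridge(X, y, poly2)
print(predict(X, alpha, poly2, np.array([[0.5]])))     # should be close to 0.75
```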

It turns out that there are lots of valid choices for the kernel function $K(x, y)$, and, of course, the *Gaussian* *function* is one of these choices:

$$K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}.$$
It is not too surprising - the Gaussian function tends to pop up everywhere, after all, but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, usually, so let me do it here.

To see the meaning of the Gaussian kernel we need to understand a couple of ways in which kernel functions can be combined. We saw before that raising a linear kernel to the $n$-th power makes a kernel whose feature space includes all $n$-wise products. Now let us examine what happens if we add two or more kernel functions. Consider $K(x, y) = x^T y + \left(x^T y\right)^2$, for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

$$\phi(x) = \left(x_1, \dots, x_d,\; x_1 x_1, x_1 x_2, \dots, x_d x_d\right),$$
i.e. the *concatenation* of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

Multiplying a kernel function by a constant $c$ is also meaningful. It corresponds to *scaling* the corresponding features by $\sqrt{c}$. For example, $2\,x^T y$ corresponds to the original features, each scaled by $\sqrt{2}$.

Still with me? Great, now let us combine the tricks above and consider the following kernel:

$$K(x, y) = 1 + x^T y + \frac{\left(x^T y\right)^2}{2} + \frac{\left(x^T y\right)^3}{6}.$$
Apparently, it is a kernel which corresponds to a feature mapping, which concatenates a constant feature, all original features, all pairwise products scaled down by $\sqrt{2}$ and all triple products scaled down by $\sqrt{6}$.

Looks impressive, right? Let us continue and add more members to this kernel, so that it would contain all four-wise, five-wise, and so on up to *infinity-wise* products of input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum would resemble a familiar expression:

$$\sum_{n=0}^{\infty} \frac{\left(x^T y\right)^n}{n!} = e^{x^T y}.$$

We can conclude here that $K(x, y) = e^{x^T y}$ is a valid kernel function, which corresponds to a feature space, which includes products of input features of any degree, up to infinity.

But we are not done yet. Suppose that we decide to *normalize* the inputs before applying our linear model. That is, we want to convert each vector $x$ to $\frac{x}{\|x\|}$ before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

Consider again the linear kernel $K(x, y) = x^T y$. If we apply it to normalized vectors, we get

$$K\left(\frac{x}{\|x\|}, \frac{y}{\|y\|}\right) = \frac{x^T y}{\|x\|\,\|y\|} = \frac{K(x, y)}{\sqrt{K(x, x)\,K(y, y)}}.$$
With some reflection you will see that the latter expression would normalize the features for any kernel.

Let us see what happens if we apply this *kernel normalization* to the “infinite polynomial” (i.e. exponential) kernel we just derived:

$$\frac{e^{x^T y}}{\sqrt{e^{x^T x}\, e^{y^T y}}} = e^{x^T y - \frac{1}{2}\|x\|^2 - \frac{1}{2}\|y\|^2} = e^{-\frac{1}{2}\|x - y\|^2}.$$
Voilà, the Gaussian kernel. Well, it still lacks $\sigma^2$ in the denominator, but by now you hopefully see that adding it is equivalent to scaling the inputs by $\frac{1}{\sigma}$.

To conclude: **the Gaussian kernel is a normalized polynomial kernel of infinite degree** (where feature products of $n$-th degree are scaled down by $\sqrt{n!}$ before normalization). Simple, right?

The derivations above may look somewhat theoretic if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let $x = 1$, $y = 2$. The value of the Gaussian kernel for these inputs is:

$$K(1, 2) = e^{-\frac{1}{2}(1 - 2)^2} = e^{-0.5} \approx 0.6065.$$
Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

$$\phi(x) = \left(1,\; x,\; \frac{x^2}{\sqrt{2!}},\; \frac{x^3}{\sqrt{3!}},\; \frac{x^4}{\sqrt{4!}},\; \dots\right).$$
As our inputs are rather small in magnitude, we can hope that the feature sequence quickly approaches zero, so we don't really have to work with infinite vectors. Indeed, here is what the feature sequences look like:

(1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, ...)

(1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, ...)

Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations we just need to normalize:

(0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, ...)

(0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, ...)

Finally, we compute the simple dot product of these two vectors and get approximately $0.6065$, which matches the value of $e^{-0.5}$ computed above to several decimal places. The discrepancy in the last digits is probably more due to lack of floating-point precision than to our approximation.
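Here is a short numeric sketch (my own, not from the original text) that reproduces the computation above:

```python
import numpy as np
from math import exp, factorial

def exp_kernel_features(x, degree=20):
    """Unnormalized features of the e^{xy} kernel: x^n / sqrt(n!) for n = 0..degree."""
    return np.array([x ** n / np.sqrt(factorial(n)) for n in range(degree + 1)])

def gaussian_kernel_features(x, degree=20):
    """Divide by the (infinite-degree) norm of the feature vector, which is exp(x^2 / 2)."""
    return exp_kernel_features(x, degree) / exp(x ** 2 / 2)

x, y = 1.0, 2.0
approx = gaussian_kernel_features(x) @ gaussian_kernel_features(y)
exact = exp(-0.5 * (x - y) ** 2)
print(approx, exact)   # both print approximately 0.6065
```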

The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

Note that many products are repeated, however (e.g. $x_1 x_2 = x_2 x_1$), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

Basic combinatorics will tell us that each feature of the form $x_1^a x_2^b$ must be repeated exactly $\binom{a+b}{a}$ times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by $\sqrt{\binom{a+b}{a}}$. "Why the square root?" you might ask here. Because when combining a repeated feature we must preserve the overall vector norm. Consider a vector $(v, v, v)$, for example. Its norm is $\sqrt{3 v^2} = \sqrt{3}\,|v|$, exactly the same as the norm of the single-element vector $\left(\sqrt{3}\,v\right)$.

As we do this scaling, each feature gets converted to a nice symmetric form:

$$\frac{x_1^a\, x_2^b}{\sqrt{a!\, b!}}.$$
This means that we can compute the 2-dimensional feature vector by first expanding each parameter into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves with maximum degree 20, we would only have to deal with $\binom{22}{2} = 231$ features instead of 2097151. Nice!

Here is a new view of the unnormalized feature vector up to degree 3:

$$\phi(x) = \left(1,\; x_1,\; x_2,\; \frac{x_1^2}{\sqrt{2!}},\; x_1 x_2,\; \frac{x_2^2}{\sqrt{2!}},\; \frac{x_1^3}{\sqrt{3!}},\; \frac{x_1^2 x_2}{\sqrt{2!}},\; \frac{x_1 x_2^2}{\sqrt{2!}},\; \frac{x_2^3}{\sqrt{3!}}\right).$$
Let us limit ourselves to this degree-3 example and let $x = (0.7, 0.2)$, $y = (0.1, 0.4)$ (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

(1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),

(1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).

After normalization:

(0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),

(0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).

The dot product of these vectors is approximately $0.820$. What about the exact Gaussian kernel value?

$$e^{-\frac{1}{2}\|x - y\|^2} = e^{-\frac{1}{2}\left(0.6^2 + 0.2^2\right)} = e^{-0.2} \approx 0.819.$$
Close enough. Of course, this could be done for any number of dimensions, so let us conclude the post with the new observation we made:

**The features of the unnormalized $d$-dimensional Gaussian kernel are:**

$$\phi_{n_1, \dots, n_d}(x) = \frac{x_1^{n_1} x_2^{n_2} \cdots x_d^{n_d}}{\sqrt{n_1!\, n_2! \cdots n_d!}},$$

where $n_1, \dots, n_d \ge 0$.

The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is $e^{\frac{1}{2}\|x\|^2}$. Thus we may also state the following:

**The features of the $d$-dimensional Gaussian kernel are:**

$$\phi_{n_1, \dots, n_d}(x) = e^{-\frac{1}{2}\|x\|^2}\, \frac{x_1^{n_1} x_2^{n_2} \cdots x_d^{n_d}}{\sqrt{n_1!\, n_2! \cdots n_d!}},$$

where $n_1, \dots, n_d \ge 0$.

That's it, now you have seen the soul of the Gaussian kernel.

- In this lecture the concept of a desk is used as a visual metaphor of "structuring the information" as well as to provide an outline to the talk.
- Here and here an imaginary geographical map is used in a similar context.
- For the computer graphics course I had to develop some "slides" as small OpenGL apps for visualizing the concepts during the lecture. This has been later taken to extreme heights in the practical materials designed by Raimond-Hendrik, who went on to give this course (alongside with a seminar) in the following years. Unfortunately the materials are closed for non-participants (yet I still hope they will be opened some day, do you read this, Raimond?), but the point is that every single notion has a tiny WebGL applet made to illustrate it interactively.
- Once I tried to make a short talk about computer graphics, where the slides would be positioned on the walls of a 3D maze, so that in order to show them I'd have to "walk through the maze", like in a tiny first-person shooter game. Although this kind of visualization was not at all useful as a learning aid (as it did not structure anything at all), it none the less looked cool and was very well appreciated by the younger audience of the talk, at whom it was aimed.

I have lost the sources of that last presentation to a computer error and decided to recreate a similar "maze with slides" with AFrame. The night was long and I got sucked into the process to the point of making an automated tool. You upload your slides, and it generates a random maze with your slides hanging on the walls. It is utterly useless, but the domain name "slideamaze.com" was free and I could not resist the pun.

Check it out. If you are into programming-related procrastination, try saving the "mazes" generated by the tool on your computer and editing the A-frame code to, say, add monsters or other fun *educational tools* into the maze.

Hyperlinks could be added to the corresponding words by *annotating* these words with appropriate *mark-up tags*. For example, "this sentence" would become "this <a href="other_document">sentence</a>". This looked ugly on a text-only screen, hence *browsers* were soon born - applications, which could render such mark-up in a nicer way. And once you view your text through a special application which knows how to interpret mark-up (a serious advance back in the 90s), you do not have to limit yourself to only tagging your text with hyperlinks - text appearance (such as whether it should be **bold**, *italic*, large or small) could just as well be specified using the same kind of mark-up.

Fast-forward 25 years, most documents on the web consist nearly *entirely* of mark-up - some have nearly no text at all. The browsers now are extremely complex systems whose job goes way beyond turning hyperlinks into underlined words. They run code, display graphics and show videos. The web experience becomes more graphical and interactive with each year.

There is one area, however, that the web has not yet fully embraced - 3D graphics. First attempts to enrich the web experience with 3D go back as far as 1994, when VRML was born. It gained some traction in the scientific community - projects were made which used VRML to, say, visualize molecules. Unfortunately, the common web developers of the time mostly regarded VRML as an arcane technology irrelevant to their tasks, and the layman user would not care to install a heavyweight VRML plug-in in order to view a molecule in the browser. Now, if it were possible to make, say, an addictive 3D shooter game with VRML, things would probably be different - a critical mass of users would be tempted to install the plug-in to play the game, and the developers would become tempted to learn the arcane tech to develop a selling product for that critical mass. Apparently, no selling games were created with VRML.

It took about fifteen years for the browsers to develop native support for 3D rendering technology. It is now indeed possible to create addictive 3D games, which run in the browser. And although a whole market for in-browser 3D applications has been born, the common web developer of our time still regards 3D as an arcane technology irrelevant to their tasks. It requires writing code in an unfamiliar API and it does not seem to interoperate with the rest of your webpage well. The successors of VRML still look like rather niche products for the specialized audience.

I have recently discovered the A-Frame project and I have a feeling that this might finally bring 3D into the web as a truly common primitive. It interoperates smoothly with HTML and Javascript, it works out of the box on most browsers, it supports 3D virtual reality headsets, and it relies on an extremely intuitive Entity-Component approach to modeling (just like the Unity game engine, if you know what that means). Let me show you by example what I mean:

    <a-scene>
      <a-cube color="#d22" rotation="0 13 0">
        <a-animation attribute="position"
                     dur="1000"
                     easing="ease-in-out-quad"
                     direction="alternate"
                     to="0 2 0"
                     repeat="indefinite">
        </a-animation>
      </a-cube>
    </a-scene>

This piece of markup can be included anywhere within your HTML page, you can use your favourite Javascript framework to add interactivity or even create the scene dynamically. You are not limited to any fixed set of entities or behaviours - adding new components is quite straightforward. The result seems to work on most browsers, even the mobile ones.

Of course, things are not perfect, A-Frame's version number is 0.2.0 at the moment, and there are some rough edges to be polished and components to be developed. None the less, the next time you need to include a visualization on your webpage, try using D3 with A-frame, for example. It's quite enjoyable and feels way more natural than any of the 3D-web technologies I've tried so far.

]]>*"I need a publicly accessible webpage/wiki/web application/my revolutionary web-based startup. Where could I host it?"*

I've been asked for advice on this topic many times by very different people, ranging from aspiring software engineers to their grandmothers. Apparently, finding a good answer to this question in Google is tricky - there are too many options to choose from, and way too many results lead to self-advertising of various hosting companies or overly biased opinions. Moreover, good advice on this topic changes every year, as new services become available. The following is my current answer to the question.

There are many possibilities for putting stuff out on the net, which differ in functionality, convenience and price. The answer to the question thus depends heavily on what actual need you want to solve. Here are the common needs along with possible solutions:

**I need to share some files/photos with my friends.**

This is the most common case to start with, and in this case you *do not need* to "host a webpage" at all. Your first sharing option is, of course, your favourite social network: Facebook, Google+, VK, or wherever you already have an account anyway. If the social network option is unsuitable for some reason (say, your files are too large), consider one of the multitude of file sharing services, such as Dropbox, OneDrive, Google Drive, Yandex Disk, or whatnot. All of them are easy to use and free up to a certain space usage limit (within several gigabytes). If you need more space, consider getting a paid account. Alternatively, if you work at a company or study in a university, ask if your company/university has an internal OwnCloud installation. University of Tartu has one, for example. Finally, if you only need to share pieces of text or code, consider services like pastebin or Gist.

**I need to set up a one-page site, a landing page, a wiki, a blog, or the like.**

There are many great wiki systems, blogging, CMS or e-commerce solutions out there. They are reasonably easy to install and configure, if you have your own server, know the necessary system administration basics and have some hours to spare. In most situations, however, this would be an unnecessary waste of time. Numerous user-friendly services will let you set up a simple site without the need to bother with servers or any other technical details. Thus, if you need a wiki or a one-pager, try Google Sites or Wix, for example. If you are a developer and know your way around Git, consider Github Pages. If you need a typical personal blog, WordPress.com is a reasonable choice, although using your Facebook or Google+ account for blogging may often be even more reasonable. If you need a landing page with a visitor counter, Unbounce will help. If you want to sell stuff online, try E-bay. If all you need is a collaboratively editable document - why not use a simple Google Doc or a PiratePad?

**I need to host a webapp.**

If you need to publicly host a *web application* you created, you have three general choices: deploying it to an *application container*, using a *virtual private server*, or relying on a *backend-as-a-service (BaaS) provider*.

The popular application container providers are Heroku, Google AppEngine, Amazon Elastic Beanstalk, Azure App Service or IBM BlueMix. None of them are free, but all offer some amount of free credits or "free tier" functionality (Amazon and IBM's free tiers are probably the most generous overall). Hosting your application in an app container is simple as you do not need to deal with server configuration, scalability and security. However, it is not very flexible (you cannot always set up every single detail to your liking) nor is it the cheapest option.

If you know your way around server administration, getting a *virtual private server* could be much more comfortable. The popular providers to consider here are Amazon AWS (expensive, but very feature-rich and with a great free tier), Microsoft Azure (expensive, but you get a sign-up credit to try things out), DigitalOcean (affordable and popular, the referral link gives you $10 bonus on sign-up), ScaleWay (cheap) or ArubaCloud (very cheap, albeit with slightly worse server uptime guarantees). If you are a university student, check out whether your university could provide you the server space for free.

Many apps can be cleanly separated into a Javascript-based *frontend* and a simplistic *backend*, where the latter only needs to support a small number of standard operations such as verifying user identity or storing and retrieving data. In this case you could host your frontend pages on, say, Github Pages, and use a dedicated BaaS provider for the backend. Hundreds of those are available. If you need to start with something free and simple, I'd suggest you check out Google Firebase first.

**Wait, what about the domain name?**

**Wait, what about the domain name?**

Once you've published your site (no matter whether it is a Google Doc or a VPS), you may often want to make it accessible at a nice domain name. If you are not very fussy about the choice of the name, check out the freely available subdomain options at FreeDNS. If you want something fancier, you'll need to buy the name from one of the multitude of DNS registrars out there (let me refer you to Namecheap, which is my current personal preference).

This is a (slightly updated) repost of my Quora answer to the corresponding question.

There are many ways in which smart people tend to explain Bayesian statistics and contrast it with a "non-Bayesian" one. One usually highlights that the primary concept of a Bayesian approach is the desire to model everything as a probability distribution. Once this fact is clear, many smart people would proceed to claim that this *is*, in fact, what fundamentally sets Bayesian statistics apart from the "classical" one. However, I feel that this kind of explanation is somewhat incomplete. It is not like classical statisticians do not use complete probability distributions. The difference is in general somewhat more subtle and philosophical.

Consider the question "what is your height?". For a classical statistician there exists some abstract "true answer", say "180cm", which is a fixed number - your one and only height. The problem is, of course, you do not know this number because every measurement is slightly different, so the classical statistician will add that "there is a normally-distributed measurement error". In the world of a pure Bayesian there are almost no "fixed numbers" - everything is a probability distribution, and so is your height! That is, a Bayesian should say that "your height is a Normal distribution centered around 180cm".

Note that from the mathematical perspective there is no difference between the two representations - in both cases the number 180cm and a normal distribution are mentioned. However, from the philosophical, syntactical, methodological and "mental" perspectives this tends to have serious implications, and historically there has been a kind of ongoing intellectual feud between the statisticians who lean more towards the first or the second approach (it is somewhat reminiscent of how the physicists are divided with regard to their support of the Copenhagen interpretation of quantum mechanics).

One of the implications of denying the fact that things in the world are mostly fixed (and treating everything as a distribution instead) is that you may not use many of the common-sense inference methods directly. What is my height if I stand on a chair? "Well, it is your height plus the height of the chair", a classical statistician would say. He can keep the measurement errors in mind, if necessary, but those can be dealt with later. In the Bayesian world heights are not numbers, so the procedure of adding heights means convolving two distributions to get the resulting distribution. If both distributions are Gaussian, the result will match the "common sense" one, but note that the common sense has now somehow become "just one special case". Moreover, a Bayesian might even keep the possibility that "your height and the height of the chair are dependent" in the back of his mind, just in case - because when you speak about two numbers in the Bayesian world, you must immediately start thinking about their joint distribution.
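Here is a small numerical sketch of that point: adding two independent Gaussian "heights" by sampling (which is one way of carrying out the convolution) gives back exactly what common sense would predict - the means add up and the spread grows only slightly. The particular numbers are, of course, made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Your height" and "the height of the chair" as distributions, not numbers.
my_height    = rng.normal(loc=180, scale=1.0, size=100_000)   # cm
chair_height = rng.normal(loc=45,  scale=0.5, size=100_000)   # cm

# Adding the random variables corresponds to convolving their densities.
total = my_height + chair_height

# For independent Gaussians the sum is Gaussian again:
# mean 180 + 45 = 225, std sqrt(1.0**2 + 0.5**2) ~ 1.12
print(total.mean(), total.std())
```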

On the other hand, modeling everything in probabilities lets you use the inference methods of probability theory (the Bayes rule, convolutions, marginalizations, etc.) everywhere, without the need to differentiate between "fixed numbers" and "random measurement errors", and this adds peace of mind as well as tends to make your explanations clearer. A Bayesian confidence interval, for example, is a "fixed interval such that 95% of height measurements fall into it". A classical confidence interval, on the other hand, is "a random interval such that the true height may fall into it with 95% probability". Again, mathematically and numerically those may often be the same, but think how different the two explanations are.
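As a small illustration of "same numbers, different readings", here is a sketch that computes the usual t-based interval for a mean from a handful of simulated height measurements. Under standard noninformative priors the Bayesian interval happens to coincide numerically with the classical one, so the code below produces one pair of numbers that can be narrated either way; the data is simulated and the choice of a noninformative prior is just a convenient assumption for the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
measurements = rng.normal(loc=180, scale=2.0, size=20)   # 20 noisy height measurements, cm

m  = measurements.mean()
se = measurements.std(ddof=1) / np.sqrt(len(measurements))

# One pair of numbers, two readings:
#  * classical: a random interval that covers the fixed true height
#    in 95% of repeated experiments;
#  * Bayesian (noninformative prior): a fixed interval containing 95%
#    of the posterior probability of the height.
interval = stats.t(df=len(measurements) - 1, loc=m, scale=se).interval(0.95)
print(interval)
```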

Bayesian "thinking" tends to be more flexible for complex models. Many classical statistics models would stick to fixed parameters, point or "interval" inferences, and try to "hide" the complexity of probability distributions as much as possible. As a result, reasoning about a system with many highly interconnected concepts becomes flawed. Consider a sequence of three questions:

- What is the height of this truck?
- Will it fit under this 3m bridge?
- Do we need to pick another route?

In the "classical" mindset you would tend to give fixed answers to the questions.

- "Height of the truck is 297".
- "Yes, 297<300, hence it will fit".
- "No, we do not need".

Sometimes you may be more careful and work with confidence intervals, but it still feels unwieldy:

- "The confidence interval on the height of the truck is 290..310"
- ".. aahm, it
*might*not fit..." - "let's pick another route, just in case"

Note that if a follow-up question appears which depends on the previous inferences (e.g. "do we need to remodel the truck?"), answering it becomes even harder, because the true uncertainty is "lost" in the intermediate steps. Such problems never arise if you are a disciplined Bayesian. Note the answers:

- "The height of the truck is a normal distribution N(297, 10)"
- "It will fit under the bridge with probability 60%"
- "We need another route with probability 40%"

At every point the information about the uncertainty is preserved in the distributions, and you are free to combine it further or feed it into a decision-theoretic utility model. This is what makes Bayesian networks possible, for example.
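As a rough check of those numbers, here is a two-line sketch that treats N(297, 10) as a normal distribution with mean 297 cm and standard deviation 10 cm (reading the second parameter as a standard deviation is my assumption) and derives the other two answers from it.

```python
from scipy import stats

# The truck's height as a distribution rather than a single number.
height = stats.norm(loc=297, scale=10)   # cm

p_fits = height.cdf(300)   # probability of fitting under a 3m bridge
print(f"fits: {p_fits:.0%}, need another route: {1 - p_fits:.0%}")
# prints roughly 62% and 38%, i.e. approximately the "60% / 40%" above
```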

It is interesting to see how this largely philosophical preference leads to two completely different (albeit complementary) sets of techniques. Indeed, if you are a true classical statistician, your work revolves around *parameterized probability distributions*. You write them down like $P_\theta(x)$, where $x$ is the "truly random" value from some probability space, and $\theta$ is the "fixed but unknown" parameter. Your whole "school of thought" is now focused on clever ad-hoc techniques for computing *estimates* of this fixed parameter from the provided distribution.

For a pure Bayesian, however, there is no "fixed" $\theta$ that has to be treated somehow separately. Instead, $\theta$ is also a part of some probability space, and instead of writing $P_\theta(x)$ he would safely write $P(x \mid \theta)$, $P(x, \theta)$, or $P(\theta \mid x)$. As a result, the probability distributions he works with are not parameterized any more, and all of the clever techniques that the classical statisticians have invented over the centuries for estimating parameters become seemingly useless. At this point a classical statistician throws up his hands and goes home, as there is nothing left for him to do - there are no "unknowns". The Bayesian is, however, left to struggle with mathematically trivial, yet computationally incredibly heavy methods for extracting essentially the same values that the classical statistician could have obtained using his "parameter estimation" approaches. That's why the Bayesian "school of thought" is mostly focused on computationally efficient methods for marginalization and sampling.
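A tiny sketch of this contrast, under simplifying assumptions of my own (Gaussian data with a known noise level and a flat prior on the mean): the classical estimate is a one-line formula, while the "Bayesian" answer is obtained by brute-force evaluation of the posterior over a grid - computationally heavier, yet landing on essentially the same value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=180, scale=2.0, size=50)   # data; noise sd assumed known to be 2

# Classical: a closed-form estimate of the "fixed but unknown" parameter theta.
theta_hat = x.mean()

# Bayesian: theta is itself a random variable; with a flat prior, evaluate the
# (unnormalized) posterior p(theta | x) on a grid - no clever estimators involved.
grid     = np.linspace(170, 190, 2001)
log_post = np.array([stats.norm.logpdf(x, loc=t, scale=2.0).sum() for t in grid])
post     = np.exp(log_post - log_post.max())
post    /= post.sum()

posterior_mean = (grid * post).sum()
print(theta_hat, posterior_mean)   # essentially the same number
```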

In reality, of course, a Bayesian would quite often give up and "cheat", at least partially parameterizing his models and making use of the classical estimation methods, while a "classical" statistician might happen to write down and apply the Bayes rule here and there, whenever it seems appropriate. A number of computations derived from the two theoretical backgrounds end up being exactly the same.

Thus, in practice, labeling things as "Bayesian" or "non-Bayesian" is still largely a philosophical choice. For example, there are methods in machine learning, such as ensemble learners, that are somehow never labeled or marketed as "Bayesian", and were probably not invented by anyone "Bayesian", although at their core they are among the best examples of how a Bayesian approach differs from a classical one. They also quite often happen to be among the best-performing models, by the way.
