**1) Matrix multiplication scales/rotates/skews a geometric plane.**

This is useful when first learning about vectors: vectors go in, new ones come out. Unfortunately, this can lead to an over-reliance on geometric visualization.

If 20 families are coming to your BBQ, how do you estimate the hotdogs you need? (*Hrm… 20 families, call it 3 people per family, 2 hotdogs each… about 20 * 3 * 2 = 120 hotdogs.*)

You probably don't think "Oh, I need the volume of an invitation-familysize-hunger prism!". With large matrices I don't think about 500-dimensional vectors, just data to be modified.

**2) Matrix multiplication composes linear operations.**

This is the technically accurate definition: yes, matrix multiplication results in a new matrix that composes the original functions. However, sometimes the matrix being operated on is not a linear operation, but a set of vectors or data points. We need another intuition for what's happening.

I'll throw a programmer's viewpoint into the ring:

**3) Matrix multiplication is about information flow, converting data to code and back.**

I think of linear algebra as "math spreadsheets" (if you're new to linear algebra, read this intro):

- We store information in various spreadsheets ("matrices")
- Some of the data are seen as functions to apply, others as data points to use
- We can swap between the vector and function interpretation as needed

Sometimes I'll think of data as geometric vectors, and sometimes I'll see a matrix as composing functions. But mostly I think about information flowing through a system. (Some purists cringe at reducing beautiful algebraic structures into frumpy spreadsheets; I sleep OK at night.)

Take your favorite recipe. If you interpret the words as *instructions*, you'll end up with a pie, muffin, cake, etc.

If you interpret the words as *data*, the text is prose that can be tweaked:

- Convert measurements to metric units
- Swap ingredients due to allergies
- Adjust for altitude or different equipment

The result is a new recipe, which can be further tweaked, or executed as instructions to make a different pie, muffin, cake, etc. (Compilers treat a program as text, modify it, and eventually output "instructions" — which could be text for another layer.)

That's Linear Algebra. We take raw information like "3 4 5" and treat it as a vector or a function, depending on how it's written:

By convention, a vertical column is usually a vector, and a horizontal row is typically a function:

`[3; 4; 5]` means `x = (3, 4, 5)`. Here `x` is a vector of data (I'm using `;` to separate each row).

`[3 4 5]` means `f(a, b, c) = 3a + 4b + 5c`. This is a function taking three inputs and returning a single result.
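As a quick sketch in plain Python (my own illustration, not from the course), the row acts as a function and the column supplies its arguments:

```python
def apply_row(row, column):
    """Treat `row` as a function and `column` as its list of arguments."""
    return sum(r * c for r, c in zip(row, column))

row = [3, 4, 5]       # the function f(a, b, c) = 3a + 4b + 5c
column = [3, 4, 5]    # the data vector x = (3, 4, 5)
print(apply_row(row, column))  # 3*3 + 4*4 + 5*5 = 50
```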

And the aha! moment: data is code, code is data!

The row containing a horizontal function could really be three data points (each with a single element). The vertical column of data could really be three distinct functions, each taking a single parameter.

Ah. This is getting neat: depending on the desired outcome, we can combine data and code in a different order.

The matrix transpose swaps rows and columns. Here's what it means in practice.

If `x` was a column vector with 3 entries (`[3; 4; 5]`), then `x'` can mean:

- A function taking 3 arguments (`[3 4 5]`)
- Or still a data vector, just split into three separate single-element entries. The transpose "split it up".

Similarly, if `f = [3 4 5]` is our row vector, then `f'` can mean:

- A single data vector, in a vertical column
- Or `f` separated into three functions (each taking a single input)
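As a rough Python sketch (representing a matrix as a list of rows, my own convention here), the transpose literally swaps the two readings:

```python
def transpose(matrix):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]

x = [[3], [4], [5]]    # column vector: one data point with 3 entries
f = [[3, 4, 5]]        # row vector: one function of 3 inputs

print(transpose(x))    # [[3, 4, 5]]: the column becomes a function row
print(transpose(f))    # [[3], [4], [5]]: the function splits into 3 data points
```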

Let's use this in practice.

When we see `x' * x` we mean: `x'` (as a single function) is working on `x` (a single vector). The result is the **dot product** (read more). In other words, we've applied the data to itself.

When we see `x * x'` we mean: `x` (as a set of functions) is working on `x'` (a set of individual data points). The result is a grid where we've applied each function to each data point. Here, we've mixed the data with itself in every possible permutation.

(In fact, you can see `xx` as `x(x)`. It's the "function x" working on the "vector x" -- which does simplify to `x * x`.)
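Here's a minimal Python sketch of both products (again with matrices as lists of rows, an assumption of this sketch):

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

x = [[3], [4], [5]]    # column vector
xt = [[3, 4, 5]]       # its transpose

print(matmul(xt, x))   # [[50]]: one function applied to one vector (dot product)
print(matmul(x, xt))   # 3x3 grid: each 1-input function applied to each data point
```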

Phew! How does this help us? When we see an equation like this (from the Machine Learning class):

I now have an instant feel of what's happening. In the first equation, we're treating theta (which is normally a set of data parameters) as a function, and passing in x as an argument. This should give us a single value.

More complex derivations like this:

can be worked through. In some cases it gets tricky because we store the data as rows (not columns) in the matrix, but now I have much better tools to follow along. You can start estimating when you'll get a single value, or when you'll get a "permutation grid" as a result.

Geometric scaling and linear composition have their place, but here I want to think about information. "The information in x is becoming a function, and we're passing x itself in as the parameter."

Long story short, don't get locked into a single intuition. Multiplication evolved from repeated addition, to scaling (decimals), to rotations (imaginary numbers), to "applying" one number to another (integrals), and so on. Why not the same for matrix multiplication?

Happy math.

You may be curious why we can't use the other combinations, like `x x` or `x' x'`. Simply put, the parameters don't line up: we'd have functions expecting 3 inputs being passed a single parameter, or functions expecting single inputs being passed 3.
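In shape terms, an (m x n) matrix times a (p x q) one only works when n = p. A small Python sketch of that check (my own helper, not standard library):

```python
def can_multiply(A, B):
    """A*B is defined only when A's column count equals B's row count."""
    return len(A[0]) == len(B)

x = [[3], [4], [5]]    # 3x1 column
xt = [[3, 4, 5]]       # 1x3 row

print(can_multiply(xt, x))   # True:  (1x3)(3x1), the dot product
print(can_multiply(x, xt))   # True:  (3x1)(1x3), the permutation grid
print(can_multiply(x, x))    # False: single-input functions handed 3 inputs
print(can_multiply(xt, xt))  # False: a 3-input function handed a single input
```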

The dot product `x' * x` could be seen as the following JavaScript command:

`(function(a,b,c){ return 3*a + 4*b + 5*c; })(3,4,5)`

We define a function of 3 arguments and pass it the 3 parameters. This returns 50 (the dot product).

The math notation is super-compact, so we can simply write (in Octave/Matlab):

```
>> [3 4 5] * [3 4 5]'
ans = 50
```

(Remember that `[3 4 5]` is the function, and `[3; 4; 5]` or `[3 4 5]'` is how we'd write the data vector.)

This article came about from a TODO in my class notes:

I wanted to explain to myself, in plain English, why we wanted `x' x` and not the reverse. Now, in plain English: we're treating the information as a function, and passing the same info as the parameter.

- Build a **lasting intuition** for the key ideas.
- During the course, understand it enough to solve problems.
- After the course, enjoy it enough to revisit.

That's why I learn things. Non-goals are transcribing what a teacher says, or cramming only to forget everything. (Yeah, it's a game we play, but we're stepping off the treadmill and only cheating ourselves. Most subjects have useful insights buried somewhere.)

So, here's my strategy when studying:

- If an idea clicks, write down the *Aha!* moment in language you'd use yourself.
- If it doesn't, write down the *Huh?* moment. Move on and try again later (such as with the ADEPT method).

Keep it simple, like the KonMari method of organizing: *Look at everything in your house.* *Does it spark joy? Keep what does, thank and donate what doesn't.*

**A simple study plan: Go through the material. Did it click? Write down what helped, otherwise look for a better explanation.**

My current learning project is the Machine Learning Class on Coursera. I'd read a smattering of blog posts, the subject is growing, and when my friend asked me to join the class, I had to sign up. (It's great.)

Here's where I'm keeping my notes, Aha, and Huh moments:

Machine Learning Notes on Google Docs

This is one of the best learning experiences I can remember. A few examples:

For the major concepts the course depends on, I keep a 5-second summary in mind. Why does this underlying concept exist? In plain English, what does it mean?

- Linear Algebra: spreadsheets for your equations. We "pour" data through various operations.
- Natural log: time needed to grow. Helps normalize widely varying numbers.
- e^x: models continuous growth, has a simple derivative.
- Gradient: direction of greatest change, helps optimize.
- Calculus: the art of breaking a system into steps. With the gradient, we can move in the best direction.

I reference these snippets as I encounter new formulas.

There was a formula that I expected to be positive ("cost" should be positive), yet it had a negative sign out front. What gives?

It turns out I had forgotten a part of the derivation, where we expected the natural log to be negative. (This happens when we take the logarithm of numbers less than 1 — in other words, we are going "back in time" and shrinking.)

I would have preferred the equation written another way, and I made a note of this Huh? moment.

Early in the course, we define a "cost" function which tracks the difference between our predictions and the real value.

Why not call this difference something normal, like error?

It turns out "cost" is used because later in the course, we have items to minimize (like the number of variables in our model) which are not directly related to the error. The "cost" captures things beyond the raw error, like the model's complexity. (If two models make equally accurate predictions, prefer the simpler one.)

Ah, "cost" can include fuzzier concepts. (I'd still prefer that laid out up-front.)

As I go through the course, I have a plain-English definition in mind. What's it all about?

**Machine Learning: Create models with Linear Algebra, then improve them with Calculus.**

- Linear Algebra lets us use many (tens, hundreds, thousands) of variables in a "math spreadsheet".
- Calculus lets us improve our spreadsheet via feedback on how well it's working. Functions like e^x, ln(x), x^2, etc. make it easy to take derivatives. Absolute value, if/then statements, etc. aren't easy to work with.

Now my thinking becomes: What types of predictive models can I make? If Linear Algebra can describe it, let's use it.

After the course is done, you're left with a set of notes that make sense to you: the Ahas, Huhs, and other gotchas. (This website is a running collection of mine.)

Future learning gets that much easier. Remember how you were confused about a topic a few years ago? Well, let's read the explanation *you wrote to yourself* on how to overcome it. Over time you build up a massive collection.

Other tips:

Embrace your confusion. The hesitation you feel when you see a formula is ok. Try to break down each part of the equation, ask what it means, make note of what is confusing and return over time. Every positive sign, every variable, why are they there?

It's ok to forget things - I do all the time. I just want a list of intuitions to load up when needed. Often a single phrase or diagram will bring it all back.

These notes are meant for you. Make them fast and quick. (My notes eventually become articles, but they stay informal and for my own use till then.)

The textbook already exists. Don't simply copy what the teacher/book said; add what *you need* to make it clear.

This course is among the most fun I've had -- this is what learning should feel like, exploration with constant refinement. I'm curious to see if this approach helps you too.

For your next course, try keeping your notes in a single Google doc. Write down your Aha! and Huh? moments. Send me a link and I'll add them to this list:

- Kalid Azad - Coursera Machine Learning
- [you go here]

I'm curious to see what works for you, feedback is always welcome.

Happy math.

However, the numbers follow a grid, with rules nobody told me:

Even numbers go East/West (I-90, I-10), and odd numbers go North/South (I-5, I-95). Think "Even" goes "East".

Numbers increase towards the Northeast. (Hey, NYC thinks it's the center of the world, right?) I-5 is on the West coast, I-95 on the East coast. I-10 must be in Texas, I-90 must be in Massachusetts.

Auxiliary interstates connect to the primary ones, and have 3 digits: 290 connects to 90, 495 connects to 95, etc.

- Odd prefixes (190) connect once into the city from the interstate ("spur").
- Even prefixes (495) typically loop around a city. (Being a man-made system, there are exceptions.)

Whoa. There's so much information conveyed in a simple numbering scheme! Without looking at a map, I know I can drive from Seattle to Boston on I-90. Maybe I'll take I-95 South when I'm there and make my way to Florida. On the way I'll take I-10 West, over to LA, then drive up I-5 North back to Seattle.

How does this work?

We have a concept of a number, and all its properties (even/odd, size, number of digits...)

We noticed a real-world object (a highway) that had various properties (North/South, position, major/minor)

We associated the properties of the number to the properties of the object

*This* is thinking mathematically. It's not about doing arithmetic quickly, or memorizing formulas, it's about connecting patterns. Math is a zoo of made-up objects that we relate to ones in the real world. The "usefulness" of the made-up objects depends on our imagination.

Have we used all the interesting properties of a number? How about whether it's a prime number.

Suppose local routes used small prime numbers: Route 2, 3, 5, 7, 11. (Yep, remember that 2 is prime.)

Once the main routes are numbered, smaller roads that *connect* them can follow this rule:

If you connect two routes, use their product. 3 * 11 = 33, so Route 33 connects Route 3 and 11.

If you loop back to the same route, just square it. 3 * 3 = 9, so Route 9 connects Route 3 to itself.

If you connect three roads, it could be Route 66 (connecting routes 2, 3 and 11).

Will this always work? You bet. Any two primes, when multiplied, give a *unique* number. 33 will never be reached by any other combination of primes. (The fancy math phrase: every number has a unique prime factorization.)
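A quick Python check of the claim, using the small primes above (the variable names are my own):

```python
from itertools import combinations

primes = [2, 3, 5, 7, 11]

# Connector routes: the product of each pair of distinct prime routes.
connectors = {p * q: (p, q) for p, q in combinations(primes, 2)}

# Unique factorization: 10 pairs give 10 distinct route numbers, no collisions.
print(len(connectors))   # 10
print(connectors[33])    # (3, 11): Route 33 connects Routes 3 and 11
```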

See how we're trying to cram a bunch of information into a little number? That's the essence of binary data.

An eight-bit binary number like `01000100` is essentially eight true/false questions:

- Are you East/West? (1 if yes, 0 otherwise)
- Are you a local connection? (1 if yes...)
- Are you a spur road?
- Treating your route number as a set of binary digits:
  - Anything in the ones digit?
  - Anything in the twos digit?
  - Anything in the fours digit?
  - Anything in the eights digit?
  - Anything in the sixteens digit?

An 8-bit binary number packs a bunch of related questions into a single byte, which is what makes binary so efficient.
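A short Python sketch of reading those eight answers back out of one byte (the question-to-bit mapping here is hypothetical):

```python
# The pattern 01000100 stores eight yes/no answers in a single byte.
byte = int("01000100", 2)
print(byte)  # 68

# Pull each answer back out with a shift and a mask (leftmost bit first).
questions = ["east/west", "local", "spur",
             "ones", "twos", "fours", "eights", "sixteens"]
answers = [(byte >> (7 - i)) & 1 for i in range(8)]
print(dict(zip(questions, answers)))  # only "local" and "fours" are set
```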

Numbers have a bunch of properties, right? Aren't we curious to discover more, like the remainder (modular arithmetic)? Maybe Route 12 (which is one set of 11, remainder 1) has some connection to Route 11.

Happy math.

What would you do? Well, you could work out the exact formula:

1 + 2 + 3 + ... + n = n(n+1)/2

and plug in n=100 to get 5050.
and plug in n=100 to get 5050.

But we just want a rough answer. You have a list of numbers, they follow a simple pattern, and want a quick estimate. What to do?

The "easy" way (well, the Calculus way) is to realize 1 + 2 + 3 + 4 is about the same as f(x) = x. The first element is f(1) = 1, the second is f(2) = 2, and so on.

From here, we can take the integral:

We usually see the integral as a formal, elegant operation, which artfully accumulates one function and returns another. Informally, we're squashing everything together in that bad mamma-jamma and seeing how much there is.

The result, (1/2)x^2, should be pretty close to what we want.

The *exact* total is our staircase-like pattern, which accumulates to 5050.

The *approximate* answer is the area of that triangle: (1/2) · base · height = (1/2) · 100 · 100 = 5000. The difference of x/2 comes from the corners of the staircase which overhang: the size of each overhang (1/2) times the number of pieces (x).

The net result is using a smooth, easy-to-measure shape to approximate a jagged, tedious-to-measure one. (This is a bit of Calculus inception, since we usually use rectangles to approximate smooth shapes.)
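The whole estimate fits in a few lines of Python:

```python
n = 100
exact = sum(range(1, n + 1))   # the staircase: 1 + 2 + ... + 100
triangle = n**2 / 2            # smooth approximation: (1/2) * base * height
overhang = n / 2               # half a unit for each of the n steps

print(exact)                # 5050
print(triangle)             # 5000.0
print(triangle + overhang)  # 5050.0: triangle plus the overhanging corners
```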

This tactic works for other sequences:

**What's the sum of the first 10 square numbers? 1 + 4 + 9 + 16 + 25 + ... + 100 = ?**

Hrm. The formula is probably tricky to work out. But with our Calculus-infused Arithmetic, a quick guess would be:

Our first hunch should be "one third of 10^3", or 333. But as we saw before, there's an "overhang" that we missed. Let's call it 10%, for an estimate of 333 + 10% ~ 370.

The exact answer is 385. Not bad! The actual formula is n(n+1)(2n+1)/6.

I'd say x^3/3 isn't bad for a few seconds of work.
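Checking the estimate in Python:

```python
n = 10
exact = sum(k**2 for k in range(1, n + 1))  # 1 + 4 + 9 + ... + 100
estimate = n**3 / 3                          # the integral of x^2
adjusted = estimate * 1.1                    # tack on ~10% for the overhang

print(exact)             # 385
print(round(estimate))   # 333
print(round(adjusted))   # 367: within a few percent of the true total
```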

**Data doubles every year. What does lifetime usage look like?**

The integral (squashed-together total) of an exponential is an exponential. In Calculus terms, the integral of e^x is e^x (plus a constant).

The key insight is that all exponential growth is just a variation of e^x. If e^x accumulates exponentially, so will 2^x.

So the total usage to date will also follow an exponential pattern, doubling every year as well. Contrast this with a usage pattern of "1 + 2 + 3 + 4 ..." -- we grow linearly (f(x) = x), but total usage accumulates quadratically ((1/2)x^2).
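A Python sketch of doubling usage and its running total:

```python
# Yearly usage doubles; the running total then roughly doubles as well.
usage = [2**year for year in range(10)]
totals = [sum(usage[:year + 1]) for year in range(10)]

print(usage)    # [1, 2, 4, ..., 512]
print(totals)   # [1, 3, 7, ..., 1023]: each total is about twice the last
print(totals[-1] / totals[-2])  # ratio approaching 2
```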

My goal is to incorporate math thinking into everyday scenarios. We start with an arithmetic question, convert it to a geometry puzzle (how big is the staircase?), and then use calculus to approximate it.

I know a concept is clicking when I can switch between a few styles of thought. Imagine the problem as a script: how would Spielberg, Tarantino, or Scorsese direct it? Each field takes a different look. (To learn how to think with Calculus, check out the Calculus Guide.)

Happy math.
