`shell-mode` to work as I’d like. Maybe this will save someone else some time if they want to do the same.

I’ve used a Mac occasionally since the days of the beige toasters, but I never owned one until recently. I’ve said for years that I’d buy a Mac as soon as I have a justification, and I recently started a project that needs a Mac.

I’d heard that Emacs was hard to set up on Mac, but that has not been my experience. I’m running Emacs 25.1 on macOS 10.12.1. Maybe there were problems with earlier versions of Emacs or OS X that I skipped. Or maybe there are quirks I haven’t run into yet. So far my only difficulties have been related to running a shell inside Emacs.

The first problem I ran into is that my path is not the same inside `shell-mode` as in a terminal window. A little searching showed a lot of discussion of this problem but no good solutions. My current solution is to run `source .bash_profile` from my bash shell inside Emacs to manually force it to read the configuration file. There’s probably a way to avoid this, and if you know how, please tell me, but this works OK for now.

Manually sourcing the `.bash_profile` file works for bash but doesn’t work for Eshell. I doubt I’ll have much use for Eshell, however. It’s more useful on Windows when you want a Unix-like shell inside Emacs.

**Update**: Dan Schmidt pointed out in the comments that Emacs reads `.bashrc` rather than `.bash_profile`. It seems that Mac doesn’t read `.bashrc` at all, at least not if it can find a `.bash_profile` file. I created a `.bashrc` file that sources `.bash_profile`, and that fixed my problem, though it did not fix the problem with Eshell or the path problem below.

The second problem I had was that Control-up arrow does not scroll through shell history because that key combination has special meaning to the operating system, bringing up Mission Control. Quite a surprise when you expect to scroll through previous commands but instead your entire screen changes.

I got around this by putting the following code in my Emacs config file and using Alt-up and Alt-down instead of Control-up and Control-down to scroll shell history. (I’m using my beloved Microsoft Natural keyboard, so I have an Alt key.)

    (add-hook 'shell-mode-hook
              (lambda ()
                (define-key shell-mode-map (kbd "<M-up>") 'comint-previous-input)
                (define-key shell-mode-map (kbd "<M-down>") 'comint-next-input)))

The last problem I had was running the Clojure REPL inside Emacs. When I ran `lein repl` from bash inside Emacs, I got an error saying the command was not found. Apparently running `source .bash_profile` didn’t give me entirely the same path in Emacs as in a terminal. I was able to fix this by adding the following line to my Emacs config file.

    (add-to-list 'exec-path "/usr/local/bin")

This works, though there are a couple of things I don’t understand. First, I don’t understand why `/usr/local/bin` was missing from my path inside Emacs. Second, I don’t understand why adding the path customizations from my `.bash_profile` to `exec-path` doesn’t work. Until I need to understand this, I’m willing to let it remain a mystery.

By the way, you can use one configuration file across operating systems by putting code like this in your file.

    (cond ((string-equal system-type "windows-nt")
           (progn
             ; Windows-specific configurations
             ...))
          ((string-equal system-type "gnu/linux")
           (progn
             ; Linux-specific configurations
             ...))
          ((string-equal system-type "darwin")
           (progn
             ; Mac-specific configurations
             ...)))

If you need machine-specific configuration for two machines running the same OS, you can test `system-name` rather than `system-type`.

See my answer here and take a look at some of the other answers on the same site.

Yes. See my post Advice for going solo.

Shortly after I went out on my own, I wrote this post responding to questions people had about my particular situation. My answers there remain valid, except one. I said that I planned to do anything I can do well that also pays well. That was true at the time, but I’ve gotten a little more selective since then.

Only in general terms. For example, I did some work with psychoacoustics earlier this year, and lately I’ve been working with medical device startups and giving expert testimony.

Nearly all the work I do is covered under NDA (non-disclosure agreement). Occasionally a project will be public, such as the white paper I wrote for Hitachi Data Systems comparing replication and erasure coding. But usually a project is confidential, though I hope to be able to say more about some projects after they come to market.

I wrote an FAQ post of sorts a few years ago. Here are the questions from that post that people still ask fairly often.

- Can you recommend an introductory statistics book?
- Why do you prefer Python to R?
- How do you run your Twitter accounts?
- Why don’t you use XeTeX?
- Can I use your code?
- How can you test a random number generator?
- How do you type math characters in HTML?

You can use this page to send me a question and see my various contact information. The page also has a link to a vCard you could import into your contact manager.

The University of Texas band’s half time show that year was a beautiful tribute to the fallen A&M students.

I’d say it’s not a book about networks per se but a collection of topics associated with networks: cell phone protocols, search engines, auctions, recommendation engines, etc. It would be a good introduction for non-technical people who are curious about how these things work. More technically inclined folks probably already know much of what’s here.

Productivity tip: Work hard.

— John D. Cook (@JohnDCook) October 8, 2015

I don’t know how people take it, but here’s what I meant by it. Sometimes you can find a smarter way to work, and if you can, I assume you’re doing that. Don’t drive nails with your shoe if you can find a hammer. But ultimately the way to get things done is hard work. You might see some marginal increase in productivity from using some app or another, but there’s nothing that’s going to magically make you 10x more productive without extra effort.

Many people have replied on Twitter “I think you mean ‘work smart.'” At some point “work smarter” wasn’t a cliché, but now it is. The problem of our time isn’t people brute-forcing their way with hard, thoughtless work. We’re more likely to wish for a silver bullet. We’re gnostics.

Smart work is a kind of hard work. It may take less physical work but more mental work. Or less mental work and more emotional work. It’s hard work to try to find a new perspective and take risks.

One last thought: hard work is not necessarily long work. Sometimes it is, but often not. Hard creative work requires bursts of mental or emotional effort that cannot be sustained for long.


We are very good at building complex software systems that work 95% of the time. But we do not know how to build complex software systems that are ultra-reliably safe (i.e. *P*_{f} < 10^{-7}/hour).

Emphasis added.

Developing medium-reliability and high-reliability software are almost entirely different professions. Using typical software development procedures on systems that must be ultra-reliable would invite disaster. But using extremely cautious development methods on systems that can afford to fail relatively often would be an economic disaster.

**Related post**: Formal validation methods let you explore the corners


Rivalries seem sillier to outsiders the more similar the two options are. And yet this makes sense. I’ve forgotten the psychological term for this, but it has a name: Similar things compete for space in your brain more than things that are dissimilar. For example, studying French can make it harder to spell English words. (Does *literature* have two t’s in French and one in English or is it the other way around?) But studying Chinese doesn’t impair English orthography.

It’s been said that academic politics are so vicious because the stakes are so small [1]. Similarly, there are fierce technological loyalties because the differences between competing technologies are so small, small enough to cause confusion. My favorite example: I can’t keep straight which languages use `else if`, `elif`, `elseif`, … in branching.

If you have to learn two similar technologies, it may be easier to devote yourself exclusively to one, then to the other, then use both and learn to keep them straight.

**Related post**: Ford-Chevy arguments in technology

[1] I first heard this attributed to Henry Kissinger, but there’s no agreement on who first said it. Several people have said similar things.


Define *T*_{n}(*x*) to be the Taylor polynomial ∑_{k=0}^{n} *x*^{k}/*k*! for exp(*x*). How does this function compare to its limit, exp(*x*)? We might want to know because it’s often useful to have polynomial upper or lower bounds on exp(*x*).

For *x* > 0 it’s clear that exp(*x*) is larger than *T*_{n}(*x*) since the discarded terms in the power series for exp(*x*) are all positive.

The case of *x* < 0 is more interesting. There exp(*x*) > *T*_{n}(*x*) if *n* is odd and exp(*x*) < *T*_{n}(*x*) if *n* is even.

Define *f*_{n}(*x*) = exp(*x*) – *T*_{n}(*x*). If *x* > 0 then *f*_{n}(*x*) > 0.

We want to show that if *x* < 0 then *f*_{n}(*x*) > 0 for odd *n* and *f*_{n}(*x*) < 0 for even *n*.

For *n* = 1, note that *f*_{1} and its derivative are both zero at 0. Now suppose *f*_{1} is zero at some point *a* < 0. Then by Rolle’s theorem, there is some point *b* with *a* < *b* < 0 where the derivative of *f*_{1} is 0. Since the derivative of *f*_{1} is also zero at 0, there must be some point *c* with *b* < *c* < 0 where the second derivative of *f*_{1} is 0, again by Rolle’s theorem. But the second derivative of *f*_{1} is exp(*x*), which is never 0. So our assumption *f*_{1}(*a*) = 0 leads to a contradiction.

Now *f*_{1}(0) = 0 and *f*_{1}(*x*) ≠ 0 for *x* < 0. So *f*_{1}(*x*) must be always positive or always negative. Which is it? For negative *x*, exp(*x*) is bounded and so

*f*_{1}(*x*) = exp(*x*) – 1 – *x*

is eventually dominated by the –*x* term, which is positive since *x* is negative.

The proof for *n* = 2 is similar. If *f*_{2}(*x*) is zero at some point *a* < 0, then we can use Rolle’s theorem to find a point *b* < 0 where the derivative of *f*_{2} is zero, and a point *c* < 0 where the second derivative is zero, and a point *d* < 0 where the third derivative is zero. But the third derivative of *f*_{2} is exp(*x*) which is never zero.

As before the contradiction shows *f*_{2}(*x*) ≠ 0 for *x* < 0. So is *f*_{2}(*x*) always positive or always negative? This time we have

*f*_{2}(*x*) = exp(*x*) – 1 – *x* – *x*^{2}/2

which is eventually dominated by the –*x*^{2}/2 term, which is negative.

For general *n*, we assume *f*_{n} is zero for some point *x* < 0 and apply Rolle’s theorem *n*+1 times to reach the contradiction that exp(*x*) is zero somewhere. This tells us that *f*_{n}(*x*) is never zero for negative *x*. We then look at the dominant term –*x*^{n}/*n*! to argue that *f*_{n} is positive or negative depending on whether *n* is odd or even.

Another way to show the sign of *f*_{n}(*x*) for negative *x* would be to apply the alternating series theorem to *x* = -1.
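The sign pattern above is easy to check numerically. Here’s a short sketch; the helper `taylor` is mine, not from the post.

```python
import math

def taylor(n, x):
    # Partial sum T_n(x) of the power series for exp(x)
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = -3.0
for n in range(1, 7):
    diff = math.exp(x) - taylor(n, x)
    # f_n(x) > 0 for odd n, f_n(x) < 0 for even n when x < 0
    print(n, "positive" if diff > 0 else "negative")
```

Running this prints "positive" for odd *n* and "negative" for even *n*, as the argument predicts.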

It’s always been “We can’t do it that way. It would be too slow.”

You know what’s slow? Spending all day trying to figure out why it doesn’t work. That’s slow. That’s the slowest thing I know.

In calculus, you’d say more. First you’d say that if a square has side *near* *x*, then it has area *near* *x*^{2}. That is, area is a continuous function of the length of a side. As the length of the side changes, there’s never an abrupt jump in area. Next you could be more specific and say that a small change Δ*x* to a side of length *x* corresponds to approximately a change of 2*x* Δ*x* in the area.

In probability, you ask what is the area of a square like if you pick the length of its side at random. If you pick the length of the side from a distribution with mean μ, does the distribution of the area have mean μ^{2}? No, but if the probability distribution on side length is tightly concentrated around μ, then the distribution on area will be concentrated near μ^{2}. And you can approximate just how near the area is to μ^{2} using the delta method, analogous to the calculus discussion above.
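Here’s a small simulation sketch of that claim. The choice of a normal distribution and the parameter values are mine, not from the post; for *X* ~ N(μ, σ), E[*X*²] = μ² + σ² exactly, so the mean area stays close to μ² when σ is small.

```python
import random

random.seed(42)  # reproducible
mu, sigma, N = 10.0, 0.1, 100_000

# Sample side lengths tightly concentrated around mu and square them.
mean_area = sum(random.gauss(mu, sigma)**2 for _ in range(N)) / N

# E[X^2] = mu^2 + sigma^2 = 100.01, very close to mu^2 = 100.
print(mean_area)
```

With σ = 0.1 the mean area differs from μ² = 100 by only about 0.01, matching the delta-method intuition.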

If the distribution on side lengths is not particularly concentrated, finding the distribution on the area is more interesting. It will depend on the specific distribution on side length, and the mean area might not be particularly close to the square of the mean side length. The function to compute area is trivial, and yet the question of what happens when you stick a random variable into that function is not trivial. **Random variables** behave as you might expect when you stick them into linear functions, but **offer surprises when you stick them into nonlinear functions**.

Suppose you pick the length of the side of a square uniformly from the interval [0, 1]. Then the average side is 1/2, and so you might expect the average area to be 1/4. But the expected area is actually 1/3. You could see this a couple ways, analytically and empirically.

First an analytical derivation. If *X* has a uniform [0, 1] distribution and *Z* = *X*^{2}, then the CDF of *Z* is

Prob(*Z* ≤ *z*) = Prob(*X* ≤ √*z*) = √*z*,

and so the PDF for *Z*, the derivative of the CDF, is 1/(2√*z*). From there you can compute the expected value by integrating *z* times the PDF:

E(*Z*) = ∫_{0}^{1} *z*/(2√*z*) *dz* = ∫_{0}^{1} (√*z*/2) *dz* = 1/3.

You could check your calculations by seeing whether simulation gives you similar results. Here’s a little Python code to do that.

    from random import random

    N = 1000000
    print( sum([random()**2 for _ in range(N)])/N )

When I run this, I get 0.33386, close to 1/3.

Now let’s look at an exponential distribution on side length with mean 1. Then a calculation similar to the one above shows that the expected value of the area is 2. You can also check this with simulation. This time we’ll be a little fancier and let SciPy generate our random values for us.

    from scipy.stats import expon

    N = 1000000
    print( sum(expon.rvs(size=N)**2)/N )

When I ran this, I got 1.99934, close to the expected value of 2.

You’ll notice that in both examples, the expected value of the area is more than the square of the expected value of the side. This is not a coincidence but a consequence of Jensen’s inequality. Squaring is a convex function, so the expected value of the square is larger than the square of the expected value for any non-degenerate random variable.

For the standard normal distribution, the hazard function is

*h*(*x*) = φ(*x*)/(1 – Φ(*x*)),

where φ and Φ are the standard normal PDF and CDF, and it has a surprisingly simple continued fraction representation:

*h*(*x*) = *x* + 1/(*x* + 2/(*x* + 3/(*x* + 4/(*x* + …)))).

Aside from being an elegant curiosity, this gives an efficient way to compute the hazard function for large *x*. (It’s valid for any positive *x*, but most efficient for large *x*.)

Source: A&S equation 26.2.14


Suppose there are only finitely many primes and let *P* be their product. Then

0 < ∏_{p} sin(π/*p*) = ∏_{p} sin(π(1 + 2*P*)/*p*) = 0,

where the products run over all primes *p*. The original publication gives the calculation above with no explanation. Here’s a little commentary to explain the calculation.

Since prime numbers are greater than 1, sin(π/*p*) is positive for every prime. And a finite product of positive terms is positive. (An infinite product of positive terms could converge to zero.)

Since *p* is a factor of *P*, the arguments of sine in the second product differ from those in the first product by an integer multiple of 2π, so the corresponding terms in the two products are the same.

There must be some *p* that divides 1 + 2*P*, and that value of *p* contributes the sine of an integer multiple of π to the product, i.e. a zero. Since one of the terms in the product is zero, the product is zero. And since zero is not greater than zero, we have a contradiction.

* * *

[1] A One-Line Proof of the Infinitude of Primes, *The American Mathematical Monthly*, Vol. 122, No. 5 (May 2015), p. 466

Surprisingly, the same is true for a catenary with scale *k*.

With the flat line, the length of the segment of the graph is the same as the length of the segment [*a*, *b*] on the *x*-axis, but in general the curve will be longer. The catenary is convex, and it bends so that the area under the curve increases exactly enough to balance out the increase in arc length.

The area under a curve *f*(*x*) and over the interval [*a*, *b*] is simply the integral of *f* from *a* to *b*:

∫_{a}^{b} *f*(*x*) *dx*.

The length of the curve from *a* to *b* is also given by an integral:

∫_{a}^{b} √(1 + *f*′(*x*)^{2}) *dx*.

You can prove the claim above by showing that the first integral is *k* times the second integral when *f*(*x*) = *k* cosh((*x* – *c*)/*k*), the catenary centered at *c* with scale *k*.
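Here’s a quick numerical check of the claim, approximating both integrals with the midpoint rule. The function name and parameter values are mine, chosen for illustration.

```python
import math

def area_and_length(k, c, a, b, n=10_000):
    # Midpoint-rule approximations to the area under f(x) = k*cosh((x-c)/k)
    # and to its arc length, both over [a, b].
    f  = lambda x: k * math.cosh((x - c) / k)
    fp = lambda x: math.sinh((x - c) / k)   # f'(x)
    h = (b - a) / n
    mids = [a + (i + 0.5) * h for i in range(n)]
    area   = sum(f(x) for x in mids) * h
    length = sum(math.hypot(1.0, fp(x)) for x in mids) * h
    return area, length

area, length = area_and_length(k=2.0, c=0.5, a=-1.0, b=3.0)
print(area / length)  # approximately k = 2
```

Since √(1 + sinh²) = cosh, the arc length integrand is cosh((*x* – *c*)/*k*), exactly 1/*k* times the area integrand, which is why the ratio comes out to *k*.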

By the way, this result was discovered independently by Johann Bernoulli and Gottfried Leibniz three centuries ago.

Nature always, so to speak, knows where and when to stop. There is a measure in all natural things—in their size, speed, or violence. As a result, the system of nature, of which man is a part, tends to be self-balancing, self-adjusting, self-cleansing. Not so with technology, or perhaps I should say: not so with man dominated by technology and specialization. Technology recognizes no self-limiting principle …

We speak of natural growth more often than natural limits to growth. Maybe we should consider the latter more often.

Schumacher’s book was written in 1973 and seems to embody some of the hippie romanticism of its day. That does not make its arguments right or wrong, but it shows what some of the author’s influences were.

The book’s back cover has an endorsement describing Schumacher as “eminently practical, sensible, … versant in the subtleties of large-scale business management …” I haven’t read the whole book, only parts here and there, but the romantic overtones stand out more to me, maybe because they contrast more with the contemporary atmosphere. When the book was published, maybe the pragmatic overtones stood out more.

Write out the Taylor series of a function *f* centered at *a*:

*f*(*x*) = ∑_{n=0}^{∞} *f*^{(n)}(*a*) (*x* – *a*)^{n}/*n*!

When does the thing on the left equal the thing on the right?

A few things could go wrong:

- Maybe not all the terms on the right side exist, i.e. the function *f* might not be infinitely differentiable.
- Maybe *f* is infinitely differentiable but the series diverges.
- Maybe *f* is infinitely differentiable but the series converges to something other than *f*.

The canonical example of the last problem is the function *g*(*x*) defined to be exp(-1/*x*^{2}) for positive *x* and 0 otherwise. This function is so flat when it approaches the origin that all derivatives are zero there. So all the coefficients in Taylor’s formula are zero. But the function *g* is positive for positive *x*.

Everything above can be found in a standard calculus text, but the following results are hardly known.

Being infinitely differentiable is necessary but not sufficient for a function to be real analytic. What additional requirements would be sufficient?

We start by examining what went wrong with the function *g*(*x*) above. It has a Taylor series at every point, but the radii of convergence go to zero as we get close to the origin. At the origin it is not analytic because the radius of convergence collapses to 0.

Alfred Pringsheim claimed that it is enough to require that the radii of convergence be bounded below on an interval. If we let ρ(*x*) be the radius of convergence for a series centered at *x*, then a function *f* is analytic in an open interval *J* if ρ(*x*) has some lower bound δ > 0 in *J*. Pringsheim’s theorem was correct, but his proof was flawed. The proof was accepted for 40 years until Ralph Boas discovered the flaw.

Here is a generalization of Pringsheim’s theorem. (It’s not clear to me from my source [1] who first proved this theorem. Perhaps Boas, but in the same context Boas mentions the work of others on the problem.)

Let *f* be infinitely differentiable in an open interval *J*. Suppose the radius of convergence ρ(*x*) for a Taylor series centered at *x* is positive for each *x* in the interval *J*. Suppose also that for every point *p* in *J* we have

lim inf_{*x* → *p*} ρ(*x*)/|*x* – *p*| > 1.

Then *f* is analytic on the interval *J*.

Note that if ρ(*x*) is bounded below by δ > 0 then the limit above is infinite and so Pringsheim’s theorem follows as a special case.

The function *g*(*x*) above does not satisfy the hypothesis of the theorem on any open interval around 0 because if we set *p* = 0, the limit above is 1, not greater than 1.

* * *

[1] R. P. Boas. When Is a *C*^{∞} Function Analytic? Mathematical Intelligencer, Vol 11, No. 4, pp. 34–37.


Here’s how I’ve approached this dilemma in the past. Information and outcomes are not directly comparable, so any objective function combining the two is going to add incommensurate things. One way around this is to put a value not on information *per se* but on what you’re going to **do** with the information.

In the context of a clinical trial, you want to treat patients in the trial effectively, and you want a high probability of picking the best treatment at the end. You can’t add patients and probabilities meaningfully. But **why** do you want to know which treatment is best? **So you can treat future patients effectively**. The value of knowing which treatment is best is the increase in the expected number of future successful treatments. This gives you a meaningful objective: maximize the expected number of effective treatments, counting both patients in the trial and future patients treated as a result of the trial.

The hard part is guessing what will be done with the information learned in the trial. Of course this isn’t exactly known, but it’s the key to estimating the value of the information. If nobody will be treated any differently in the future no matter how the trial turns out—and unfortunately this may be the case—then the trial should be optimized strictly for the benefit of the people in the trial. On the other hand, if a trial will determine the standard of care for thousands of future patients, then there is more value in picking the best treatment.

What is a logarithm? It’s the solution to an exponential equation. For example, the logarithm base 10 of 2 is the solution to the equation 10^{x} = 2, and so *x* = 0.30103. Similarly, you could look for the logarithm base 10 of 2 modulo 19. This is an integer value of *x* such that 10^{x} = 2 mod 19, and you can show that 17 is a solution.

If we work modulo an integer *n*, the discrete logarithm base *a* of a number *y* is an integer *x* such that *a*^{x} = *y* mod *n*. For some values of *a* there may not be a solution, but there will be a solution if *a* is a generator of the integers mod *n*.

Brute force is the simplest algorithm for finding discrete logarithms: try *x* = 1, 2, 3, …, *n* until you find a value of *x* satisfying *a*^{x} = *y* mod *n*. The problem with brute force is that the expected run time is on the order of *n*, and *n* is often very large in applications.

Discrete logarithms are expensive to compute, but we can do better than brute force. (Cryptographic applications take advantage of the fact that discrete logarithms are hard to compute.) There’s a simple algorithm by Daniel Shanks, known as the baby-step giant-step algorithm, that reduces the run time from order *n* to order roughly √*n*. (Actually O(√*n* log *n*) for reasons we’ll see soon.)

Let *s* be the ceiling of the square root of *n*. The idea is to search for *x* by taking giant steps forward (multiples of *s*) and baby (single) steps backward.

Let *G* be the set of pairs (*a*^{gs} mod *n*, *gs*) where *g* ranges from 1 to *s*. These are the giant steps.

Let *B* be the set of pairs (*y**a*^{b} mod *n*, *b*) where *b* ranges from 0 to *s*-1. These are the baby steps.

Now look for a pair in *G* and a pair in *B* with the matching first components, i.e. *ya*^{b} = *a*^{gs}. Then *x* = *gs* – *b* is the required solution.

Constructing the sets *G* and *B* requires O(*s*) operations, and matching the two sets takes O(*s* log *s*) operations.

Here’s an example, going back to the problem above of finding the discrete log base 10 of 2 mod 19, using a little Python code to help with the calculations.

The square root of 19 is between 4 and 5 so *s* = 5.

    >>> [(2*10**r % 19, r) for r in range(5)]
    [(2, 0), (1, 1), (10, 2), (5, 3), (12, 4)]

    >>> [(10**(5*t) % 19, 5*t) for t in range(1,6)]
    [(3, 5), (9, 10), (8, 15), (5, 20), (15, 25)]

The number 5 appears as the first element of a pair in both *B* and *G*. We have (5, 3) in *B* and (5, 20) in *G* so *x* = 20 – 3 = 17.
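The whole procedure can be wrapped up in a function. Here’s a sketch; the function name is mine, and it assumes a solution exists.

```python
import math

def discrete_log(a, y, n):
    # Baby-step giant-step: find x with a**x == y (mod n).
    s = math.isqrt(n)
    if s * s < n:
        s += 1                      # s = ceiling of sqrt(n)

    # Baby steps: record y * a^b mod n for b = 0, ..., s-1.
    baby = {}
    v = y % n
    for b in range(s):
        baby.setdefault(v, b)
        v = v * a % n

    # Giant steps: look for a^(g*s) mod n among the baby steps.
    step = pow(a, s, n)
    w = 1
    for g in range(1, s + 1):
        w = w * step % n
        if w in baby:
            return g * s - baby[w]  # y * a^b == a^(g*s), so x = g*s - b
    return None                     # no solution found

print(discrete_log(10, 2, 19))  # 17
```

The dictionary lookup replaces the explicit matching step: each giant step is checked against the hash of baby steps in (amortized) constant time.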

**Related**: Applied number theory

It’s possible that treatment *X* is doing so poorly that you want to end the trial without going any further. It’s also possible that *X* is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with **futility**. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments *X* and *Y* are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.

Maybe treatment *X* is doing a little better than *Y*, but not so much better that you can declare with confidence that *X* is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.

Futility analysis is more about resources than ethics. If *X* is doing poorly, ethics might dictate that you stop giving *X* to patients so you stop early. If *X* is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if *X* is doing so-so, there’s usually not an ethical reason to stop, unless *X* is worse than *Y* on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.

Futility analysis isn’t about your current estimate of effectiveness. It’s about what you think your estimate of effectiveness will be in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the **trial**, not of the **treatment** per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.

This is why **predictive probability** is natural for interim analysis. You’re trying to predict **outcomes**, not **parameters**. (This is subtle: you’re trying to estimate the probability of **outcomes** that lead to certain estimates of **parameters**, namely those that allow you to reach a conclusion with pre-specified significance.)

Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a *p*-value threshold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (**Bayesian**) predictive probability of meeting your **frequentist** criteria!

Take a fraction *a*/*b* in lowest terms and write it as a decimal. How long is the non-repeating part? How long is the period of the repeating part?

The answer depends on the prime factorization of the denominator *b*. If *b* has the form

*b* = 2^{α} 5^{β}

then the decimal expansion has *r* digits where *r* = max(α, β).

Otherwise *b* has the factorization

*b* = 2^{α} 5^{β} *d*

where *d* is some integer greater than 1 and relatively prime to 10. Then the decimal expansion of *a*/*b* has *r* non-repeating digits and a repeating part with period *s* where *r* is as above, and *s* is the smallest positive integer such that

*d* | 10^{s} – 1,

i.e. the smallest *s* such that 10^{s} – 1 is divisible by *d*. How do we know there exists any such integer *s*? This isn’t obvious.

Euler’s theorem (a generalization of Fermat’s little theorem) tells us that

*d* | 10^{φ(d)} – 1,

where φ is Euler’s totient function and φ(*d*) ≤ *d* – 1, and so we could take *s* = φ(*d*), though this may not be the smallest such *s*. So not only does *s* exist, we know that it is at most *d* – 1. This means that the period of the repeating part of *a*/*b* is no more than *d* – 1 where *d* is what’s left after factoring out as many 2’s and 5’s from *b* as possible.

By the way, this means that you can take any integer *d*, not divisible by 2 or 5, and find some integer *k* such that *dk* consists only of 9’s.
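The recipe above translates directly into code. Here’s a sketch; the function name is mine.

```python
def decimal_structure(b):
    # For a fraction a/b in lowest terms, return (r, s): r non-repeating
    # digits followed by a repeating block of period s (s = 0 means the
    # decimal terminates).
    alpha = 0
    while b % 2 == 0:
        b //= 2
        alpha += 1
    beta = 0
    while b % 5 == 0:
        b //= 5
        beta += 1
    r = max(alpha, beta)
    d = b               # what's left after factoring out 2's and 5's
    if d == 1:
        return r, 0
    # s = smallest positive integer with d | 10^s - 1
    s, power = 1, 10 % d
    while power != 1:
        power = power * 10 % d
        s += 1
    return r, s

print(decimal_structure(12))  # (2, 1): 1/12 = 0.08333...
print(decimal_structure(7))   # (0, 6): 1/7 = 0.142857142857...
```

The last loop computes the multiplicative order of 10 mod *d*, which by the argument above is guaranteed to terminate within *d* – 1 steps.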

**Related post**: Cyclic fractions

Not all that speed improvement comes from changing languages. Some of it comes from better algorithms, eliminating redundancy, etc.

If code is running 100 times slower than you’d like, why not just run it on 100 processors? Sometimes that’s the way to go. But maybe the code doesn’t split up easily into pieces that can run in parallel. Or maybe you’d rather run the code on your laptop than send it off to the cloud. Or maybe you’d like to give your code to someone else and you want *them* to be able to run the code conveniently.

It’s sometimes possible to tweak R code to make it faster without rewriting it, especially if it is naively using loops for things that could easily be vectorized. And it’s possible to use better algorithms without changing languages.

Beyond these high-level changes, there are a number of low-level changes that may give you a small speed-up. This way madness lies. I’ve seen blog posts to the effect “I rewrote this part of my code in the following non-obvious way, and for reasons I don’t understand, it ran 30% faster.” Rather than spending hours or days experimenting with such changes and hoping for a small speed up, I use a technique fairly sure to give a 10x speed up, and that is rewriting (part of) the code in C++.

If the R script is fairly small, and if I have C++ libraries to replace all the necessary R libraries, I’ll rewrite the whole thing in C++. But if the script is long, or has dependencies I can’t replace, or only has a small section where nearly all the time is spent, I may just rewrite that portion in C++ and call it from R using Rcpp.

The R programs I’ve worked on often compute something approximately by simulation that could be calculated exactly much faster. This isn’t because the R language encourages simulation, but because the language is used by statisticians who are more inclined to use simulation than analysis.

Sometimes a simulation amounts to computing an integral. It might be possible to compute the integral in closed form with some pencil-and-paper work. Or it might be possible to recognize the integral as a special function for which you have efficient evaluation code. Or maybe you have to approximate the integral, but you can do it more efficiently by numerical analysis than by simulation.

Sometimes it’s possible to speed up code, written in any language, simply by not calculating the same thing unnecessarily. This could be something simple like moving code out of inner loops that doesn’t need to be there, or it could be something more sophisticated like memoization.

The first time it sees a function called with a new set of arguments, memoization saves the result and creates a way to associate the arguments with the result in some sort of look-up table, such as a hash. The next time the function is called with the same arguments, the result is retrieved from memory rather than recomputed.

Memoization works well when the set of unique arguments is fairly small and the calculation is expensive relative to the cost of looking up results. Sometimes the set of *potential* arguments is very large, and it looks like memoization won’t be worthwhile, but the set of *actual* arguments is small because some arguments are used over and over.
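Here’s a minimal sketch of the mechanism in Python, using a dictionary as the look-up table. The names are mine; in practice `functools.lru_cache` does the same job.

```python
def memoize(f):
    cache = {}                      # hash table from arguments to results
    def wrapper(*args):
        if args not in cache:
            cache[args] = f(*args)  # compute once, save the result
        return cache[args]          # afterward, just look it up
    return wrapper

call_count = 0

@memoize
def expensive(x):
    # Stand-in for an expensive computation
    global call_count
    call_count += 1
    return x * x

print(expensive(4), expensive(4), expensive(5))  # 16 16 25
print(call_count)  # 2: the repeated call was a cache hit
```

Note the restriction implicit in the dictionary: arguments must be hashable, which is one reason memoization fits some functions better than others.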


The knee-jerk response [to neural networks] from statisticians was “What’s the big deal? A neural network is just a nonlinear model, not too different from many other generalizations of linear models.”

While this may be true, neural networks brought a new energy to the field. They could be scaled up and generalized in a variety of ways … And most importantly, they were able to solve problems on a scale far exceeding what the statistics community was used to. This was part computing scale expertise, part liberated thinking and creativity on the part of this computer science community.

After enjoying considerable popularity for a number of years, neural networks were somewhat sidelined by new inventions in the mid 1990’s. … Neural networks were passé. But then they re-emerged with a vengeance after 2010 … the reincarnation now being called *deep learning*.

Most people learn R as they learn statistics: Here’s a statistical concept, and here’s how you can compute it in R. Statisticians aren’t that interested in the R language itself but see it as connective tissue between commands that are their primary interest.

This works for statisticians, but it makes the language hard for non-statisticians to approach. Years ago I managed a group of programmers who supported statisticians. At the time, there were no books for learning R without concurrently learning statistics. This created quite a barrier to entry for programmers whose immediate concern was not the statistical content of an R program.

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t *start* there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R *per se*. This is the book I wish I could have handed my programmers who had to pick up R.

Mathematical objects are usually defined internally. For example, the Cartesian product *P* of two sets *A* and *B* is defined to be the set of all ordered pairs (*a*, *b*) where *a* comes from *A* and *b* comes from *B*. The definition of *P* depends on the elements of *A* and *B* but it does not depend on any other sets.

Category theory turns this inside-out. Operations such as taking products are not defined in terms of elements of objects. Category theory makes no use of elements or subobjects [1]. It defines things by how they act, not their inner workings. People often stress what category theory **does not** depend on, but they less often stress what it **does** depend on. The definition of the product of two objects in any category depends on **all** objects in that category: The definition of the product of objects *A* and *B* contains the phrase “such that for any other object *X* …” [More on categorical products].
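Spelled out (this is the standard definition, summarized in my own words), the universal property behind that phrase reads:

```latex
% An object P with morphisms \pi_A \colon P \to A and \pi_B \colon P \to B
% is a product of A and B if, for any other object X with morphisms
% f \colon X \to A and g \colon X \to B, there is a unique
% h \colon X \to P such that
\pi_A \circ h = f
\quad\text{and}\quad
\pi_B \circ h = g.
```

Note that nothing in this definition looks inside *A*, *B*, or *P*; everything is stated in terms of morphisms to and from other objects.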

The payoff for this inside-out approach to products is that you can say something simultaneously about everything that acts like a product, whether it’s products of sets, products of fields (i.e. that they don’t exist), products of groups, etc. You can’t say something valid across multiple categories if you depend on details unique to one category.

**This isn’t unique to products**. Universal properties are everywhere. That is, you see definitions containing “such that for any other object *X* …” all the time. In this sense, category theory is extremely non-local. The definition of a widget often depends on **all** widgets.

There’s a **symmetry** here. Traditional definitions depend on the internal workings of objects, but only on the objects themselves. There are no third parties involved in the definition. Categorical definitions have zero dependence on internal workings, but depend on the behavior of everything in the category. There are an infinite number of third parties involved! [2] You can have a definition that requires complete internal knowledge but zero external knowledge, or a definition that requires zero internal knowledge and an infinite amount of external knowledge.

**Related**: Applied category theory

* * *

[1] Category theory does have notions analogous to elements and subsets, but they are defined the same way everything else is in category theory, in terms of objects and morphisms, not by appealing to the inner structure of objects.

[2] You can have a category with a finite number of objects, but usually categories are infinite. In fact, they are usually so large that they are “classes” of objects rather than sets.

*t* = *W*/*n*

where *t* is the calendar time to completion, *W* is a measure of how much work is to be done, and *n* is the number of people. This assumes everything on the project can be done in parallel. Nobody waits for anybody else.

The next refinement is to take into account the proportion of work that can be done in parallel. Call this *p*. Then we have

*t* = *W*[1 – *p*(*n*-1)/*n*].

If everything can be done in parallel, *p* = 1 and *t* = *W*/*n* as before. But if nothing can be done in parallel, *p* = 0, and so *t* = *W*. In other words, the total time is the same whether the project has one person or more. This is essentially Amdahl’s law.

With the equation above, adding people never slows things down. And if *p* > 0, every additional person helps at least a little bit.

Next we add a term to account for communication cost. Assume communication costs are proportional to the number of communication paths, *n*(*n* – 1)/2. Call the proportionality constant *k*. Now we have

*t* = *W*[1 – *p*(*n*-1)/*n* + *kn*(*n*-1)/2].

If *k* is small but positive, then at first adding more people causes a project to complete sooner. But beyond some optimal team size, adding more people causes the project to take longer.
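Plugging in some illustrative numbers shows the U-shaped behavior. The values of *W*, *p*, and *k* below are my own choices, not from the talk:

```python
def completion_time(n, W=100.0, p=0.9, k=0.01):
    """Calendar time for a team of n people: total work W,
    parallel fraction p, communication cost k per pair of people."""
    return W * (1 - p * (n - 1) / n + k * n * (n - 1) / 2)

times = {n: completion_time(n) for n in range(1, 30)}
best = min(times, key=times.get)
# With these parameters the optimum is a team of 5; a sixth person
# makes the project finish later, not sooner.
```

The parallelism term shrinks like 1/*n* while the communication term grows like *n*², so past the optimum the quadratic term dominates quickly.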

Of course none of this is exact. Project time estimation doesn’t follow any simple formula. Think of these equations more as rough guides or metaphors. It’s certainly true that beyond a certain size, adding more people to a project can slow the project down. Kevlin gave examples of projects that were put back on track by reducing the number of people working on them.

My quibble with the equation above is that I don’t think the cost of more people is primarily communication. Communication paths in a real project are not the simple trees of org charts, but neither are they complete graphs. And if the problem were simply communication, then improved communication would mitigate the cost of adding people to a project, but I suspect it hardly does.

I think the cost of adding people to a project has more to do with Parkinson’s Law which says that **people make work for each other**. (The aphorism form of Parkinson’s Law says that work expands to the time allowed. But the eponymous book explains **why** work expands, and it is in part because people make extra work for each other.)

I wrote about a similar theme in the blog post Maybe you only need it because you have it. Here’s the conclusion of that post:

Suppose a useless project adds staff. These staff need to be managed, so they hire a manager. Then they hire people for IT, accounting, marketing, etc. Eventually they have their own building. This building needs security, maintenance, and housekeeping. No one questions the need for the security guard, but the guard would not have been necessary without the original useless project.

When something seems absolutely necessary, maybe it’s only necessary because of something else that isn’t necessary.

Mads said that with .NET Core, C# code now runs about as fast on Linux as on Windows, maybe a few percent slower on Linux. Scott Hanselman joined the conversation and explained that with .NET Core, the same code runs on every platform. The Mono project duplicated much of the functionality of the .NET Framework, but it was an entirely independent implementation and not as efficient.

I had assumed the difference was due to compiler optimizations or lack thereof, but Scott and Mads said that the difference was mostly the code implementation. There are some compiler optimizations that are better on the Windows side, and so C# might run a little faster on Windows, but the performance is essentially the same on both platforms.

I could recommend C# to a client running Linux if there’s a 5% performance penalty, but a 500% performance penalty would have been a show-stopper. Now I’d consider using C# on a project where I need more performance than Python or R but want something easier to work with than C++.

Years ago I developed software with the Microsoft stack, but I moved away from Microsoft tools when I quit doing the kind of software development the tools are geared for. So I don’t write C# much any more. It’s been about a year since I had a project where I needed to write C# code. But I like the C# language. You can tell that a lot of thought has gone into the design, and the tool support is great. Now that the performance is better on Linux I’d consider using it for numerical simulations.

It’s been fun to see some people I haven’t seen since I spoke at the GOTO and YOW conferences four years ago.

Photo above by conference photographer Fritz Schumann.

Growing compute power and shrinking sensors open up possibilities we’re only beginning to explore. Even when the things we want to observe elude direct measurement, we may be able to infer them from other things that we can now measure accurately, inexpensively, and in high volume.

In order to infer what you’d *like* to measure from what you *can* measure, you need a mathematical model. Or if you’d like to make predictions about the future from data collected in the past, you need a model. And that’s where I come in. Several companies have hired me to help them create medical devices by working on mathematical models. These might be statistical models, differential equations, or a combination of the two. I can’t say much about the projects I’ve worked on, at least not yet. I hope that I’ll be able to say more once the products come to market.

I started my career doing mathematical modeling (partial differential equations) but wasn’t that interested in statistics or medical applications. Then through an unexpected turn of events, I ended up spending a dozen years working in the biostatistics department of the world’s largest cancer center.

Since leaving MD Anderson and starting my consultancy, I’ve been approached by several companies for help with mathematical problems associated with their ideas for medical devices. These are ideal projects because they combine my earlier experience in mathematical modeling with my more recent experience with medical applications.

If you have an idea for a medical device, or know someone who does, let’s talk. I’d like to help.


“A bird may love a fish but where would they build a home together?” — Fiddler on the Roof


Suppose *f*(*x*) is a function of several variables, i.e. *x* is a vector, and *g*(*x*) = *c* is a constraint. Then the Lagrange multiplier theorem says that at the maximum of *f* subject to the constraint *g* we have ∇*f* = λ ∇*g*.

Where does this mysterious λ come from? And why should the gradient of your objective function be related to the gradient of a constraint? These seem like two different things that shouldn’t even be comparable.

Here’s the geometric explanation. The set of points satisfying *g*(*x*) = *c* is a surface. And for any *k*, the set of points satisfying *f*(*x*) = *k* is also a surface. Imagine *k* very large, larger than the maximum of *f* on the surface defined by *g*(*x*) = *c*. You could think of the surface *g*(*x*) = *c* as a balloon inside the larger balloon *f*(*x*) = *k*.

Now gradually decrease *k*, like letting the air out of the outer balloon, until the surfaces *g*(*x*) = *c* and *f*(*x*) = *k* first touch. At that point, the two surfaces will be tangent, and so their normal vectors, given by their gradients, point in the same direction. That is, ∇*f* and ∇*g* are parallel, and so ∇*f* is some multiple of ∇*g*. Call that multiple λ.

I don’t know how well that explanation works when written down. But when I heard Jim Vick explain it, moving his hands in the air, it was an eye-opener.

This is not a rigorous proof, and it does not give the most general result possible, but it explains what’s going on. It’s something to keep in mind when reading proofs that are more rigorous or more general. As I comment on here,

Proofs serve two main purposes: to establish that a proposition is true, and to show *why* it is true.

The literally hand-wavy proof scores low on the former criterion and high on the latter.
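To make the tangency condition concrete, here’s a small worked example of my own, not from the post: maximize *f*(*x*, *y*) = *xy* on the line *g*(*x*, *y*) = *x* + *y* = 2. The condition ∇*f* = λ∇*g* gives (*y*, *x*) = λ(1, 1), so *x* = *y* = λ = 1.

```python
# Scan points along the constraint x + y = 2 and check that f = x*y
# peaks where the Lagrange condition predicts: at x = y = 1.
candidates = [(i / 100, 2 - i / 100) for i in range(201)]
best = max(candidates, key=lambda xy: xy[0] * xy[1])

# At the optimum, grad f = (y, x) and grad g = (1, 1) are parallel.
x, y = best
lam = y  # from the first component of grad f = lam * grad g
```

The scan finds the maximum at (1, 1), where ∇*f* = (1, 1) is exactly 1 times ∇*g*, so λ = 1.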

* * *

Jim Vick was a great teacher. Some of us affectionately called him The Grinning Demon because he was always smiling, even while he gave devilishly hard homework. He was Dean of Natural Sciences when I took a couple classes from him. He later became Vice President for Student Affairs and kept teaching periodically. He has since retired but still teaches.

After taking his topology course, several of us asked him to teach a differential geometry course. He hesitated because it’s a challenge to put together an undergraduate differential geometry course. The usual development of differential geometry uses so much machinery that it’s typically a graduate-level course.

Vick found a book that starts by looking only at manifolds given by level sets of smooth functions, like the surfaces discussed above. Because these surfaces sit inside a Euclidean space, you can quickly get to some interesting geometric results using elementary methods. We eventually got to the more advanced methods, but by then we had experience in a more tangible setting. As Michael Atiyah said, abstraction should follow experience, not precede it.
