Suppose average life expectancy is around 78 years. If everyone is between 0 and 78, then there are 78*365 possible birth dates and twice that many combinations of birth date and sex.

What’s the average population of a US zip code? We’ll use 9,330 for reasons explained in [1].

We have 56,940 possible birth date and sex combinations for 9,330 people. There have to be many unused birth date and sex combinations in a typical zip code, and it’s plausible that many combinations will only be used once. We’ll run a Python simulation to see just how many we’d expect to be used one time.

The array `demo` below will keep track of the possible demographic values, i.e. combinations of birth date and sex. We’ll loop over the population of the zip code, randomly assigning everyone to a demographic value, then see what proportion of demographic values is used only once.

from random import randrange
from numpy import zeros

zctasize = 9330
demosize = 365*78*2
demo = zeros(demosize)
for _ in range(zctasize):
    d = randrange(demosize)
    demo[d] += 1
unique = len(demo[demo == 1])
print(unique/zctasize)

I ran this simulation 10 times and got values ranging from 84.3% to 85.7%.

As Road White points out in the comments, you can estimate the number of unique demographics by a probability calculation.

Suppose there are *z* inhabitants in our zip code and *d* demographic categories. We’re assuming here (and above) that all demographic categories are equally likely, even though that’s not true of birth dates.

We start by looking at a particular demographic category. The probability that exactly one person falls in that category is

*z* (1/*d*) (1 − 1/*d*)^{*z* − 1}.

To find the expected number of demographic slots with exactly one entry we multiply by *d*, and to get the proportion *p* of people this represents we divide by *z*, and so

*p* = (1 − 1/*d*)^{*z* − 1} ≈ exp(−*z*/*d*)

which in our case is 84.8%, consistent with our simulation above.

Here we used the Taylor series approximation

log(1 − 1/*d*) ≈ −1/*d*.

If *z* is of the same magnitude as *d* or smaller, then the error in the approximation above is *O*(1/*d*).

By the way, a similar calculation shows that the expected number of demographic slots containing two people is *r* exp(-*r*)/2 where *r* = *z*/*d*. So while 85% can be uniquely identified, another 7% can be narrowed down to two possibilities.
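Here is a quick sketch checking both closed-form estimates against the numbers above:

```python
from math import exp

z = 9330         # inhabitants per zip code (average ZCTA population)
d = 365*78*2     # possible birth date and sex combinations
r = z/d

p1 = exp(-r)     # proportion in a demographic slot by themselves
p2 = r*exp(-r)/2 # proportion narrowed down to two possibilities

print(round(p1, 3), round(p2, 3))  # about 0.849 and 0.070
```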

[1] I don’t know, but I do know the average population of a “zip code tabulation area” or ZCTA, and that’s 9,330 according to the data linked to here. As I discuss in that post, the Census Bureau reports population by ZCTA, not by zip code per se, for several reasons. Zip codes can cross state boundaries, they are continually adjusted, and some zip codes contain only post office boxes.

In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous. The state of Massachusetts had released data on 135,000 state employees and their families with obvious identifiers removed. However, the data contained zip code, birth date, and sex for each individual. Sweeney was able to cross reference this data with publicly available voter registration data to find the medical records of then Massachusetts governor William Weld.

An estimated 87% of Americans can be identified by the combination of zip code, birth date, and sex. A back-of-the-envelope calculation shows that this should not be surprising, but Sweeney appears to be the first to do this calculation and pursue the results. (**Update**: See such a calculation in the next post.)

In her paper Only You, Your Doctor, and Many Others May Know, Sweeney says that her research was unwelcome. Over 20 journals turned down her paper on the Weld study, and nobody wanted to fund privacy research that might reach uncomfortable conclusions.

A decade ago, funding sources refused to fund re-identification experiments unless there was a promise that results would likely show that no risk existed or that all problems could be solved by some promising new theoretical technology under development. Financial resources were unavailable to support rigorous scientific studies otherwise.

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.

Photo of Latanya Sweeney via Parker Higgins [CC BY 4.0], from Wikimedia Commons

Let’s start by finding the sine of a trillion (10^{12}) using floating point arithmetic. There are a couple ways to think about this. The floating point number `t = 1.0e12` can only be represented to 15 or 16 significant decimal figures (to be precise, 53 bits [1]), and so you could think of it as a representative of the interval of numbers that all have the same floating point representation. Any result that is the sine of a number in that interval should be considered correct.

Another way to look at this is to say that `t` can be represented exactly—its binary representation requires 40 bits and we have 53 bits of significand precision available—and so we expect `sin(t)` to return the sine of exactly one trillion, correct to full precision.
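A quick sketch in Python confirms both claims: a trillion needs only 40 bits, well under the 53 available, so its floating point representation is exact.

```python
t = 10**12
print(t.bit_length())      # 40 bits needed to represent a trillion
print(int(float(t)) == t)  # True: the float representation is exact
```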

It turns out that IEEE arithmetic does the latter, computing `sin(1e12)` correctly to 16 digits.

Here’s the result in Python

>>> sin(1.0e12)
-0.6112387023768895

and verified by Mathematica by computing the value to 20 decimal places

In:= N[Sin[10^12], 20]
Out= -0.61123870237688949819

Note that the result above is *not* what you’d get if you were first to take the remainder when dividing by 2π and then taking the sine.

>>> sin(1.0e12 % (2*pi))
-0.6112078499756778

This makes sense. The result of dividing a trillion by the floating point representation of 2π, 159154943091.89536, is correct to full floating point precision. But most of that precision is in the integer part. The fractional part only has five digits of precision, and so we should expect the result above to be correct to at most five digits. In fact it’s accurate to four digits.
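Here is a quick sketch of where that quotient comes from:

```python
from math import pi

q = 1.0e12 / (2*pi)
print(q)  # 159154943091.89536; the integer part uses up 12 of the ~16 digits
```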

When your processor computes `sin(1e12)` it does not naively take the remainder by 2π and then compute the sine. If it did, you’d get four significant figures rather than 16.

We started out by saying that there were two ways of looking at the problem, and according to the first one, returning only four significant figures would be quite reasonable. If we think of the value `t` as a measured quantity, measured to the precision of floating point arithmetic (though hardly anything can be measured that accurately), then four significant figures would be all we should expect. But the people who designed the sine function implementation on your processor did more than they might be expected to do, finding the sine of a trillion to full precision. They do this using a **range reduction algorithm** that retains far more precision than naively doing a division. (I worked on this problem a long time ago.)

What if we try to take the sine of a ridiculously large number like a googol (10^{100})? This won’t work because a googol cannot be represented exactly as an IEEE 754 double. It’s not too big; floating point numbers can be as large as around 10^{308}. The problem is that a number that big cannot be represented to full precision. But we can represent 2^{333} exactly, and a googol is between 2^{332} and 2^{333}. And amazingly, Python’s sine function (or rather the sine function built into the AMD processor on my computer) returns the correct result to full precision.

>>> sin(2**333)
0.9731320373846353

How do we know this is right? I verified it in Mathematica:

In:= Sin[2.0^333]
Out= 0.9731320373846353

How do we know Mathematica is right? We can do naive range reduction using extended precision, say 200 decimal places.

In:= p = N[Pi, 200]
In:= x = 2^333
In:= theta = x - IntegerPart[x/(2 p)] 2 p
Out= 1.8031286178334699559384196689…

and verify that the sine of the range reduced value is correct.

In:= Sin[theta]
Out= 0.97313203738463526…

Because floating point numbers have 53 bits of precision, all real numbers between 2^{56} – 2^{2} and 2^{56} + 2^{2} are represented as the floating point number 2^{56}. This is a range of width 8, wider than 2π, and so the sines of the numbers in this range cover the possible values of sine, i.e. [-1, 1]. So if you think of a floating point number as a sample from the set of all real numbers with the same binary representation, every possible value of sine is a valid return value for numbers bigger than 2^{56}. But if you think of a floating point number as representing itself exactly, then it makes sense to ask for the sine of numbers like 2^{333} and larger, up to the limits of floating point representation [2].
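Here is a quick sketch of that collapse: distinct integers close to 2^{56} round to the same double.

```python
t = 2**56
print(float(t + 3) == float(t))  # True: different integers, same double
```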

[1] An IEEE 754 double has 52 significand bits, but these bits can represent 53 bits of precision because the first bit of the fractional part is always 1, and so it does not need to be represented. See more here.

[2] The largest exponent of an IEEE double is 1023, and the largest significand is 2 – 2^{-52} (i.e. all 1’s), so the largest possible value of a double is

(2^{53} – 1)2^{1024-53}

and in fact the Python expression `sin((2**53 - 1)*2**(1024-53))` returns the correct value to 16 significant figures.

Since the mathematical equivalent of Six Degrees of Kevin Bacon is Six Degrees of Paul Erdős, I tried looking for the distance between Kevin Bacon and Paul Erdős and found this:

**Kevin Bacon** and **Paul Erdős** are both known for their prolific collaborations. Many actors have acted with Kevin Bacon, or acted with actors who have acted with him, etc. Many mathematicians wrote papers with Erdős, or have written papers with people who have written papers with him, etc. I’m four degrees of separation from Paul Erdős last I checked. [1]

Here’s a more complex graph showing the three degrees of separation between **Thomas Aquinas** and **Thomas the Tank Engine**.

Note that the edges are **directed**: links go from the starting page to pages that lead to the target page. If you reverse the starting and ending pages, you get different results. For example, there are still three degrees of separation if we reverse our last example, going from Thomas the Tank Engine to Thomas Aquinas, but there are about twice as many possible paths.

[1] Your Erdős number, the degrees of separation between you and Paul Erdős, can decrease over time because his collaborators are still writing papers. Even Erdős himself continued to publish posthumously because people who began papers with him finished them years later.

The vertical axis plots the exponents *p*, which are essentially the logs base 2 of the Mersenne primes. The scale is logarithmic, so this plot suggests that the Mersenne primes grow approximately linearly on a double logarithmic scale.

A spherical triangle is a triangle drawn on the surface of a sphere. It has three vertices, given by points on the sphere, and three sides. The sides of the triangle are portions of great circles running between two vertices. A great circle is a circle of maximum radius, a circle with the same center as the sphere.

An interesting aspect of spherical geometry is that both the sides and angles of a spherical triangle are angles. Because the sides of a spherical triangle are arcs, they have angular measure, the angle formed by connecting each vertex to the center of the sphere. The arc length of a side is its angular measure times the radius of the sphere.

Denote the three vertices by *A*, *B*, and *C*. Denote the side opposite *A* by *a*, etc. Denote the angles at *A*, *B*, and *C* by α, β, and γ respectively.

Research Triangle is a (spherical!) triangle with vertices formed by Duke University, North Carolina State University, and University of North Carolina at Chapel Hill.

(That was the origin of the name, though it’s now applied more loosely to the general area around these three universities.)

We’ll take as our vertices

*A* = UNC Chapel Hill (35.9046 N, 79.0468 W)
*B* = Duke University in Durham (36.0011 N, 78.9389 W)
*C* = NCSU in Raleigh (35.7872 N, 78.6705 W)

We’ll illustrate several features of Mathematica using the spherical triangle corresponding to Research Triangle.

The map above was produced with the following Mathematica code.

ptA = GeoPosition[{35.9046, -79.0468}]
ptB = GeoPosition[{36.0011, -78.9389}]
ptC = GeoPosition[{35.7872, -78.6705}]
GeoGraphics[{Red, PointSize[Large],
    Point[ptA], Point[ptB], Point[ptC]},
    GeoScaleBar -> "Imperial",
    GeoRange -> 29000]

Note that longitude is in degrees east, so the longitudes above are negative.

To find the distance between two locations, you can use the function `GeoDistance`. For example,

GeoDistance[ptA, ptB]

tells me that the distance between UNC and Duke is 8.99185 miles. I assume it displays miles by default based on my location in the US, though the `GeoRange` option above defaults to meters. You can make the system of units explicit. For example

GeoDistance[ptA, ptB, UnitSystem -> "Metric"]

returns 14.741 km.

If we want to find the length of the side *c* in degrees, we can use the Earth’s radius.

r = PlanetData["Earth", "Radius"]
c = GeoDistance[ptA, ptB] 360 / (2 Pi r)

This shows that *c* is 0.13014°. Calculating the other sides similarly shows *a* = 0.30504° and *b* = 0.32739°.

Calling `GeoDirection[ptA, ptB]` returns 42.2432°, which says we need to head at an angle of about 42° to walk from UNC to Duke.

The code

GeoDirection[ptA, ptB] - GeoDirection[ptA, ptC]

shows that the angle α is 68.6128°. (The code returns the negative of this angle because the angle is clockwise.) Similarly we find β = 87.9808° and γ = 23.4068°.

The angles add up to 180° only because our triangle is small compared to the earth’s total area. The actual sum should be slightly more than 180°, but we’ve not retained enough precision to detect the difference. In general the “spherical excess,” i.e. the amount by which the sum of the angles exceeds 180°, is proportional to the area of the triangle.
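We can estimate the spherical excess from the angular sides computed above. Here is a sketch using L’Huilier’s theorem, with the side lengths found earlier:

```python
from math import radians, degrees, tan, atan, sqrt

# angular sides of Research Triangle in degrees, computed above
a, b, c = 0.30504, 0.32739, 0.13014
a, b, c = radians(a), radians(b), radians(c)
s = (a + b + c)/2  # semiperimeter

# L'Huilier's theorem for the spherical excess E
t = tan(s/2)*tan((s - a)/2)*tan((s - b)/2)*tan((s - c)/2)
E_deg = degrees(4*atan(sqrt(t)))
print(E_deg)  # a few ten-thousandths of a degree
```

This confirms why we couldn’t see the excess: it is on the order of 10^{-4} degrees, far below the precision we retained.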

The rules for manipulating expressions with real numbers carry over to complex numbers so often that it can be surprising when a rule doesn’t carry over. For example, the rule

(*b*^{x})^{y} = *b*^{xy}

holds when *b* is a positive real number and *x* and *y* are real numbers, but doesn’t necessarily hold when *x* or *y* are complex. In particular, if *x* is complex,

(*e*^{x})^{y} = *e*^{xy}

does not hold in general, though it does hold when *y* is an integer. If it did hold, and **this is where people get into trouble**, you could make computations like

*e*^{2πi/3} = (*e*^{2πi})^{1/3} = 1^{1/3} = 1

even though *e*^{2πi/3} actually equals (-1 + *i*√3)/2.
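A quick numerical sketch with Python’s `cmath` module confirms the correct value:

```python
import cmath, math

z = cmath.exp(2j*cmath.pi/3)
w = complex(-1/2, math.sqrt(3)/2)  # (-1 + i*sqrt(3))/2
print(abs(z - w))  # essentially zero; exp(2*pi*i/3) is certainly not 1
```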

I usually use the notation exp(*x*) rather than *e*^{x} for several reasons. For one thing, it’s easier to read. When the argument is complicated, say if it contains exponents of its own, the superscripts can get tiny and hard to read. For example, here are two ways of writing the probability density function for the Gumbel distribution.

Math publishers often require the exp() notation in their style guides.

Aside from legibility, there is a substantive reason to prefer the exp() notation. The function *f*(*z*) = exp(*z*) has a rigorous definition in terms of a power series. It’s a single-valued function, valid for any complex number you might stick in. By contrast, it’s subtle how *b*^{x} can even be defined rigorously, even if *b* = *e*. If you care to delve into the details, see the footnote [1]. This may seem pedantic, but it’s a hair worth splitting because it can avoid difficulties like the incorrect calculation above.

Now it is true that

exp(*x* + *y*) = exp(*x*) exp(*y*)

for all *x* and *y*, even complex *x* and *y*, and so it is true that

exp(*nx*) = exp(*x*)^{n}

when *n* is an integer. For positive integers you can see this by repeatedly adding *x* to itself. You can extend this to negative integers using

exp(-*x*) = 1/exp(*x*).

[1] For positive *b*, you can bootstrap your way by defining *b*^{x} for integer exponents, then rational exponents, then real exponents by taking limits. But when *b* is negative this doesn’t work. The way you end up defining *b*^{x} in general is as exp(*x* log *b*), taking advantage of the rigorous definition of exp() alluded to above.

But what is log *b*? This gets subtle. The logarithm of *b* is a solution to the equation exp(*x*) = *b*, and this has infinitely many solutions for complex *x*. So now you need to either specify a particular branch to make the log function single valued, or you need to change your idea of a function to include multiply-valued functions.

If you get really technical, there’s a difference between exp(*x*) and *e*^{x} because the latter is exp(*x* log *e*), and log *e* has multiple values! So in that sense exp(*x*) and *e*^{x} are different functions, the former single valued and the latter multiply valued.

The sines of the positive integers have just the right balance of pluses and minuses to keep their sum in a fixed interval. (Not hard to show.) #math pic.twitter.com/RxeoWg6bhn

— Sam Walters (@SamuelGWalters) November 29, 2018

If for some reason your browser doesn’t render the embedded tweet, he points out that

−1/7 < sin 1 + sin 2 + sin 3 + ⋯ + sin *N* < 2

for all positive integers *N*.

Here’s a little Python script to illustrate the sum in his tweet.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 100)
y = np.sin(x).cumsum()
plt.plot(x, y)
plt.plot((1, 100), (-1/7, -1/7), "g--")
plt.plot((1, 100), (2, 2), "g--")
plt.savefig("sinsum.svg")

Exponential sums are interesting because crude application of the triangle inequality won’t get you anywhere. All it will let you conclude is that the sum is between –*N* and *N*.

(Why is this an exponential sum? Because it’s the imaginary part of the sum over *e*^{in}.)
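The boundedness is easy to verify from the closed form for the partial sums, which follows from summing *e*^{in} as a geometric series and taking imaginary parts. A quick sketch:

```python
from math import sin

def sin_sum(N):
    # closed form: sum_{n=1}^N sin(n) = sin(N/2) sin((N+1)/2) / sin(1/2)
    return sin(N/2)*sin((N + 1)/2)/sin(1/2)

direct = [sum(sin(n) for n in range(1, N + 1)) for N in range(1, 200)]
closed = [sin_sum(N) for N in range(1, 200)]
print(max(abs(d - c) for d, c in zip(direct, closed)))  # essentially zero
```

Since each factor of sine in the numerator is at most 1, the partial sums are bounded by 1/sin(1/2) ≈ 2.09 in absolute value.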

For more on exponential sums, you might like the book Uniform Distribution of Sequences.

Also, I have a page that displays the plot of a different exponential sum each day, the coefficients in the sum

being taken from the day’s date. Because consecutive numbers have very different number theoretic properties, the images vary quite a bit from day to day.

Here’s a sneak peek at what the exponential sum for Christmas will be this year.

In that spirit, I thought I’d celebrate some of my Twitter accounts that have come and gone.

My first daily tip Twitter account was SansMouse, an account that posted one keyboard shortcut per day. I later renamed the account ShortcutKeyTip. I also had an account PerlRegex for Perl regular expressions, DailySymbol for symbols, and BasicStatistics for gardening. Just kidding about that last one; it was for basic statistics.

I started RLangTip and then handed it over to people who were more enthusiastic about R. I also started MusicTheoryTip and gave it away.

The account GrokEM was for electricity and magnetism. I renamed it ScienceTip and broadened the content to be science more generally. I also had an account SedAwkTip for, you guessed it, sed and awk. It also broadened its content and became UnixToolTip.

I renamed StatFact to DataSciFact and got an immediate surge in followers. Statistics sells better when you change the label to data science, machine learning, or artificial intelligence.

I stopped posting to my signal processing account DSP_fact for a while and then started it up again. I also let my FormalFact account lie fallow for a while, then renamed it LogicPractice and restarted it. It’s the only one of my accounts I don’t post to regularly.

I had several other ideas for accounts that I never started. They will probably never see the light of day. I have no intention of starting any new accounts at this time, but I might someday. I also might retire some of my existing accounts.

Here’s a collage of the icons for my accounts as of eight years ago:

The icons became more consistent over time, and now all my Twitter accounts have similar icons: blue dots with a symbol inside. I’m happy with the current design for now.

… allow the forest of client data to be studied, without permitting the possibility of looking at individual trees.

[1] Úlfar Erlingsson, Vasyl Pihur, and Alexsandra Korolova. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. Available here.

*M*_{n} = 2^{n} – 1.

A Mersenne prime is a Mersenne number which is also prime. So far 50 have been found [1].

A necessary condition for *M*_{n} to be prime is that *n* is prime, so searches for Mersenne primes only test prime values of *n*. It’s not sufficient for *n* to be prime as you can see from the example

*M*_{11} = 2047 = 23 × 89.

The largest known prime has been a Mersenne prime since 1952, with one exception in 1989. This is because there is an efficient algorithm, the Lucas-Lehmer test, for determining whether a Mersenne number is prime. This is the algorithm used by GIMPS (Great Internet Mersenne Prime Search).

The Lucas-Lehmer test is very simple. The following Python code tests whether *M*_{p} is prime.

def lucas_lehmer(p):
    # test whether the Mersenne number 2**p - 1 is prime, for odd prime p
    M = 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s*s - 2) % M
    return s == 0

Using this code I was able to verify the first 25 Mersenne primes in under 50 seconds. This includes all Mersenne primes that were known as of 40 years ago.
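As a quick check, here is a sketch that recovers the Mersenne prime exponents below 100 (repeating the test function and using a naive trial-division primality check so the snippet is self-contained):

```python
def lucas_lehmer(p):
    # Lucas-Lehmer test for odd prime exponents p
    M = 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s*s - 2) % M
    return s == 0

def is_prime(n):
    return n > 1 and all(n % k for k in range(2, int(n**0.5) + 1))

# Mersenne prime exponents between 3 and 100
exponents = [p for p in range(3, 100) if is_prime(p) and lucas_lehmer(p)]
print(exponents)  # [3, 5, 7, 13, 17, 19, 31, 61, 89]
```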

Mersenne primes are named after the French monk Marin Mersenne (1588–1648) who compiled a list of Mersenne primes.

Édouard Lucas came up with his test for Mersenne primes in the 1870s and in 1876 proved that *M*_{127} is prime. That is, he found a 39-digit prime number by hand. Derrick Lehmer refined the test in 1930.

As of January 2018, the largest known prime is *M*_{77,232,917}.

[1] We’ve found 50 Mersenne primes, but we’re not sure whether we’ve found the *first* 50 Mersenne primes. We know we’ve found the 47 smallest Mersenne primes. It’s possible that there are other Mersenne primes between the 47th and the largest one currently known.

A Fermat number has the form

*F*_{n} = 2^{2^{n}} + 1.

Fermat numbers are prime if *n* = 0, 1, 2, 3, or 4. Nobody has confirmed that any other Fermat numbers are prime. Maybe there are only five Fermat primes and we’ve found all of them. But there might be infinitely many Fermat primes. Nobody knows.

There’s a specialized test for checking whether a Fermat number is prime, Pépin’s test. It says that for *n* ≥ 1, the Fermat number *F*_{n} is prime if and only if

3^{(*F*_{n} − 1)/2} ≡ −1 (mod *F*_{n}).

We can verify fairly quickly that *F*_{1} through *F*_{4} are prime and that *F*_{5} through *F*_{14} are not with the following Python code.

def pepin(n):
    x = 2**(2**n - 1)  # (F_n - 1)/2
    y = 2*x            # F_n - 1
    return y == pow(3, x, y + 1)

for i in range(1, 15):
    print(pepin(i))

After that the algorithm gets noticeably slower.

We have an efficient algorithm for testing whether Fermat numbers are prime, efficient relative to the size of numbers involved. But the size of the Fermat numbers is growing exponentially. The number of digits in *F*_{n} is

⌊2^{n} log_{10} 2⌋ + 1.

So *F*_{14} has 4,933 digits, for example.
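A quick sketch checking that digit count:

```python
from math import log10, floor

n = 14
F = 2**(2**n) + 1
print(len(str(F)))                 # 4933
print(floor(2**n * log10(2)) + 1)  # the formula gives the same count
```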

The Fermat numbers for *n* = 5 to 32 are known to be composite. Nobody knows whether *F*_{33} is prime. In principle you could find out by using Pépin’s test, but the number you’d need to test has 2,585,827,973 digits, and that will take a while. The problem is not so much that the exponent is so big but that the modulus is also very big.

The next post presents an analogous test for whether a Mersenne number is prime.

Conventional notation uses *a* for the equatorial radius and *c* for the polar radius. Oblate means *a* > *c*. The eccentricity *e* is defined by

*e* = √(1 − *c*²/*a*²).

For a perfect sphere, *a* = *c* and so *e* = 0. The eccentricity for earth is small, *e* = 0.08. The eccentricity increases as the polar radius becomes smaller relative to the equatorial radius. For Saturn, the polar radius is 10% less than the equatorial radius, and *e* = 0.43.
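Plugging in numbers (a sketch; the Earth radii in kilometers are assumed values from the WGS-84 ellipsoid):

```python
from math import sqrt

def eccentricity(a, c):
    return sqrt(1 - (c/a)**2)

e_earth = eccentricity(6378.137, 6356.752)  # Earth: about 0.08
e_saturn_like = eccentricity(1.0, 0.9)      # polar radius 10% less: about 0.44
print(e_earth, e_saturn_like)
```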

The volume of an oblate spheroid is simple:

*V* = (4/3)π*a*²*c*.

Clearly we recover the volume of a sphere when *a* = *c*.

The surface area is more interesting. The surface of a spheroid is a surface of rotation, and so its area can easily be derived. It works out to be

*S* = 2π*a*² + (π*c*²/*e*) log((1 + *e*)/(1 − *e*)).

It’s not immediately clear that we get the area of a sphere as *c* approaches *a*, but it becomes clear when we expand the log term in a Taylor series.

This suggests that an oblate spheroid has approximately the same area as a sphere with radius √((*a*² + *c*²)/2), with error on the order of *e*².
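Here is a numerical sketch of that approximation, using the exact area formula for an oblate spheroid:

```python
from math import pi, log, sqrt

def oblate_area(a, c):
    # exact surface area of an oblate spheroid
    e = sqrt(1 - (c/a)**2)
    return 2*pi*a**2 + (pi*c**2/e)*log((1 + e)/(1 - e))

a, c = 1.0, 0.95
e = sqrt(1 - (c/a)**2)
exact = oblate_area(a, c)
approx = 4*pi*((a**2 + c**2)/2)  # sphere with radius sqrt((a^2 + c^2)/2)
print(abs(exact - approx)/exact, e**2)  # relative error is O(e^2)
```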

If we set *a* = 1 and let *c* vary from 0 to 1, we can plot how the surface area varies.

Here’s the corresponding plot where we use the eccentricity *e* as our independent variable rather than the polar radius *c*.

[1] Not in a yellow submarine as some have suggested.

Here are a couple screen shots from the book to give you an idea what it contains.

Here’s an example scale, number 277 out of 344.

And here’s an example of the notes for the accompanying audio files.


To first approximation, Earth is a sphere. But it bulges at the equator, and to second approximation, Earth is an oblate spheroid. Earth is not exactly an oblate spheroid either, but the error in the oblate spheroid model is about 100x smaller than the error in the spherical model.

Finding the distance between two points on a sphere is fairly simple. Here’s a calculator to compute the distance, and here’s a derivation of the formula used in the calculator.

Finding the distance between two points on an ellipsoid is much more complicated. (A spheroid is a kind of ellipsoid.) Wikipedia gives a description of **Vincenty’s algorithm** for finding the distance between two points on Earth using an oblate spheroid model (specifically **WGS-84**). I’ll include a **Python** implementation below.

How much difference does it make when you calculate distance on an oblate spheroid rather than a sphere? To address that question I looked at the coordinates of several cities around the world using the `CityData` function in Mathematica. Latitude is in degrees north of the equator and longitude is in degrees east of the prime meridian.

|--------------+--------+---------|
| City         |    Lat |    Long |
|--------------+--------+---------|
| Houston      |  29.78 |  -95.39 |
| Caracas      |  10.54 |  -66.93 |
| London       |  51.50 |   -0.12 |
| Tokyo        |  35.67 |  139.77 |
| Delhi        |  28.67 |   77.21 |
| Honolulu     |  21.31 | -157.83 |
| Sao Paulo    | -23.53 |  -46.63 |
| New York     |  40.66 |  -73.94 |
| Los Angeles  |  34.02 | -118.41 |
| Cape Town    | -33.93 |   18.46 |
| Sydney       | -33.87 |  151.21 |
| Tromsø       |  69.66 |   18.94 |
| Singapore    |   1.30 |  103.85 |
|--------------+--------+---------|

Here are the error extremes.

The spherical model underestimates the distance from London to Tokyo by 12.88 km, and it overestimates the distance from London to Cape Town by 45.40 km.

The relative error is most negative for London to New York (-0.157%) and most positive for Tokyo to Sydney (0.545%).

**Update**: The comparison above used the same radius for both the spherical and ellipsoidal models. As suggested in the comments, a better comparison would use the *mean* radius (average of equatorial and polar radii) in the spherical model rather than the equatorial radius.

With that change the most negative absolute error is between Tokyo and New York at -30,038 m. The most negative relative error is between London and New York at -0.324%.

The largest positive error, absolute and relative, is between Tokyo and Sydney. The absolute error is 29,289 m and the relative error is 0.376%.

The code below is a direct implementation of the equations in the Wikipedia article.

Note that longitude and latitude below are assumed to be in radians. You can convert from degrees to radians with SciPy’s `deg2rad` function.

from scipy import sin, cos, tan, arctan, arctan2, arccos, pi

def spherical_distance(lat1, long1, lat2, long2):
    phi1 = 0.5*pi - lat1
    phi2 = 0.5*pi - lat2
    r = 0.5*(6378137 + 6356752) # mean radius in meters
    t = sin(phi1)*sin(phi2)*cos(long1-long2) + cos(phi1)*cos(phi2)
    return r * arccos(t)

def ellipsoidal_distance(lat1, long1, lat2, long2):
    a = 6378137.0 # equatorial radius in meters
    f = 1/298.257223563 # ellipsoid flattening
    b = (1 - f)*a
    tolerance = 1e-11 # to stop iteration

    phi1, phi2 = lat1, lat2
    U1 = arctan((1-f)*tan(phi1))
    U2 = arctan((1-f)*tan(phi2))
    L1, L2 = long1, long2
    L = L2 - L1

    lambda_old = L + 0

    while True:
        t = (cos(U2)*sin(lambda_old))**2
        t += (cos(U1)*sin(U2) - sin(U1)*cos(U2)*cos(lambda_old))**2
        sin_sigma = t**0.5
        cos_sigma = sin(U1)*sin(U2) + cos(U1)*cos(U2)*cos(lambda_old)
        sigma = arctan2(sin_sigma, cos_sigma)

        sin_alpha = cos(U1)*cos(U2)*sin(lambda_old) / sin_sigma
        cos_sq_alpha = 1 - sin_alpha**2
        cos_2sigma_m = cos_sigma - 2*sin(U1)*sin(U2)/cos_sq_alpha
        C = f*cos_sq_alpha*(4 + f*(4-3*cos_sq_alpha))/16

        t = sigma + C*sin_sigma*(cos_2sigma_m + C*cos_sigma*(-1 + 2*cos_2sigma_m**2))
        lambda_new = L + (1 - C)*f*sin_alpha*t
        if abs(lambda_new - lambda_old) <= tolerance:
            break
        else:
            lambda_old = lambda_new

    u2 = cos_sq_alpha*((a**2 - b**2)/b**2)
    A = 1 + (u2/16384)*(4096 + u2*(-768+u2*(320 - 175*u2)))
    B = (u2/1024)*(256 + u2*(-128 + u2*(74 - 47*u2)))

    t = cos_2sigma_m + 0.25*B*(cos_sigma*(-1 + 2*cos_2sigma_m**2))
    t -= (B/6)*cos_2sigma_m*(-3 + 4*sin_sigma**2)*(-3 + 4*cos_2sigma_m**2)
    delta_sigma = B * sin_sigma * t
    s = b*A*(sigma - delta_sigma)

    return s

In this post I’ll show how to align two sequences using the sequence alignment algorithms of **Needleman-Wunsch** and **Hirschberg**. These algorithms can be used to compare any sequences, though they are more often used to compare DNA sequences than impenetrable novels and parodies.

I’ll be using Gang Li’s implementation of these algorithms, available on GitHub. I believe the two algorithms are supposed to produce the same results, since Hirschberg’s algorithm is a more space-efficient implementation of the Needleman-Wunsch algorithm, though they produce slightly different results below. I’ll give the output of Hirschberg’s algorithm.

Li’s alignment code uses lists of characters for input and output. I wrote a simple wrapper to take in strings and output strings.

from alignment import Needleman, Hirschberg

def compare(str1, str2):
    seq1 = list(str1)
    seq2 = list(str2)
    for algorithm in [Needleman(), Hirschberg()]:
        a, b = algorithm.align(seq1, seq2)
        print("".join(a))
        print("".join(b))
        print()

The code inserts vertical bars to indicate spaces added for alignment. Here’s the result of using the Needleman-Wunsch algorithm on the opening paragraphs of Finnegans Wake and the parody Finnegans Ewok.

|||riverrun, past Ev|e| and Adam'||||s, mov|i|er|un, past ||new and |||||hopes, from swe|rv||e of shore|||| to bend of from s||tr|ike of |||||back to bend of b|||ay, brings us by a commodius |jeday, brings us by a commodius vic|u||s of recirculation back to |||lucas of recirculation back to H|owth Ca||stle|||| and E|nvi||r|ons. |fo||||||rest||moon and |en||dor.||||

I mentioned in my previous post that I could compare the first four paragraphs easily, but I had some trouble aligning the fifth paragraphs. The fifth paragraphs of each version start out quite similar:

Bygme||ster Fi|nnega||n, of the Bygm|onster ||Ann||akin, of the Stutte||r|||||||ing Hand, f|re|emen'|s ||||||Throatchokin| Hand, for|cemen|’s mau-rer, lived in the broadest way mau-rer, lived in the broadest way immarginable in his rushlit immarginable in his rushlit toofar-|||back for messuages before toofar| — back for messuages before

but then Roberts takes the liberty of skipping over a large section of the original. This is what I suspected by looking at the two texts, but Hirschberg’s algorithm makes the edits obvious by showing two long sequences of vertical bars, one about 600 characters long and another about 90 characters long.

I ran into a delightfully strange blog post today called Finnegans Ewok that edits the first few paragraphs of Finnegans Wake to make it into something like Return of the Jedi.

The author, Adam Roberts, said via Twitter “What I found interesting here was how little I had to change Joyce’s original text. Tweak a couple of names and basically leave it otherwise as was.”

So what I wanted to do is quantify just how much had to change using the **Levenshtein distance**, which is essentially the number of one-character changes necessary to transform one string into another.
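For concreteness, here is a minimal Levenshtein implementation, a sketch of the standard Wagner-Fischer dynamic program (not necessarily the code used for the numbers in this post):

```python
def levenshtein(s, t):
    # Wagner-Fischer dynamic programming with a rolling row:
    # prev[j] holds the distance between s[:i-1] and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```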

Here’s the first paragraph from James Joyce:

riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs.

And here’s the first paragraph from Adam Roberts:

movierun, past new and hopes, from strike of back to bend of jeday, brings us by a commodius lucas of recirculation back to forestmoon and endor.

The original paragraph is 150 characters, the parody is 145 characters, and the Levenshtein distance is 44.

Here’s a summary of the results for the first four paragraphs.

```
|-------+---------+----------|
| Joyce | Roberts | Distance |
|-------+---------+----------|
|   150 |     145 |       44 |
|   700 |     727 |      119 |
|   594 |     615 |      145 |
|  1053 |     986 |      333 |
|-------+---------+----------|
```

The fifth paragraph seems to diverge more from Joyce. I may have gotten something misaligned, and reading enough of Finnegans Wake to debug the problem made my head hurt, so I stopped.

**Update**: See the next post for sequence alignment applied to the two sources. This lets you see not just the number of edits but what the edits are. It also shows why I was having difficulty aligning the fifth paragraphs.

**Rényi differential privacy** (RDP) is a new generalization of ε-differential privacy by Ilya Mironov that is comparable to the (ε, δ) version but has several advantages. For instance, RDP is easier to interpret than (ε, δ)-DP and composes more simply.

My previous post discussed Rényi entropy. **Rényi divergence** is to Rényi entropy what Kullback-Leibler divergence is to Shannon entropy.

Given two random variables *X* and *Y* that take on *n* possible values each with positive probabilities *p*_{i} and *q*_{i} respectively, the Rényi divergence of *X* from *Y* is defined by

*D*_{α}(*X* || *Y*) = 1/(α − 1) log( Σ_{i} *p*_{i}^{α} / *q*_{i}^{α − 1} )

for α > 0 and not equal to 1. The definition extends to α = 0, 1, or ∞ by taking limits.

As α converges to 1, *D*_{α} converges to the Kullback-Leibler divergence.

A couple weeks ago I wrote an introduction to differential privacy that started by saying

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not.

The post develops precisely what it means for a query to be differentially private. The bottom line (literally!) was

An algorithm *A* satisfies ε-differential privacy if for every *t* in the range of *A*, and for every pair of neighboring databases *D* and *D*′,

Prob(*A*(*D*) = *t*) / Prob(*A*(*D*′) = *t*) ≤ exp(ε).

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.

It turns out that this definition is equivalent to saying the Rényi divergence of order ∞ between *A*(*D*) and *A*(*D*′) is less than ε. That is,

*D*_{α}(*A*(*D*) || *A*(*D*′)) ≤ ε

where α = ∞. The idea of **Rényi differential privacy** (RDP) is to allow the possibility that α could be finite [1]. That is, an algorithm is (α, ε)-RDP if the Rényi divergence of order α between its outputs on any two adjacent databases is no more than ε.

One of the nice properties of RDP is how simply algorithms compose: The composition of an (α, ε_{1})-RDP algorithm and an (α, ε_{2})-RDP algorithm is a (α, ε_{1} + ε_{2})-RDP algorithm. This leads to simple **privacy budget accounting**.

The composition of ε-DP algorithms is simple: just substitute α = ∞ in the result above for composing RDP algorithms. However, the resulting bound may be overly pessimistic.

The composition of (ε, δ)-DP algorithms is not as simple: both ε and δ change. With RDP, the order α does not change, and the ε values simply add.

Mironov proves in [1] that every RDP algorithm is also an (ε, δ)-DP algorithm. Specifically, Proposition 3 says

If *f* is an (α, ε)-RDP mechanism, it also satisfies (ε + log(1/δ)/(α − 1), δ)-differential privacy for any 0 < δ < 1.
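As a sketch of that conversion in code (the formula is Proposition 3 of [1]; treat the exact constant as something to check against the paper before relying on it):

```python
from math import log

def rdp_to_dp(alpha, eps, delta):
    """Convert an (alpha, eps)-RDP guarantee to (eps', delta)-DP
    via eps' = eps + log(1/delta) / (alpha - 1)."""
    assert alpha > 1 and 0 < delta < 1
    return eps + log(1 / delta) / (alpha - 1)

# Larger alpha gives a converted epsilon closer to the raw RDP epsilon.
print(rdp_to_dp(2, 1.0, 1e-5))    # 1 + ln(1e5)
print(rdp_to_dp(101, 1.0, 1e-5))  # 1 + ln(1e5)/100
```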

This tells us that Rényi differential privacy gives us privacy guarantees that are somewhere between ε-differential privacy and (ε, δ)-differential privacy.

[1] Ilya Mironov. Rényi Differential Privacy. arXiv:1702.07476

The most common way of measuring information is Shannon entropy, but there are others. Rényi entropy, developed by Hungarian mathematician Alfréd Rényi, generalizes Shannon entropy and includes other entropy measures as special cases.

If a discrete random variable *X* has *n* possible values, where the *i*th outcome has probability *p*_{i}, then the Rényi entropy of order α is defined to be

*H*_{α}(*X*) = 1/(1 − α) log_{2}( Σ_{i} *p*_{i}^{α} )

for 0 ≤ α ≤ ∞. In the case α = 1 or ∞ this expression means the limit as α approaches 1 or ∞ respectively.

The definition of Rényi entropy can be extended to continuous random variables by

*H*_{α}(*X*) = 1/(1 − α) log_{2} ∫ *f*_{X}(*x*)^{α} *dx*

but unlike the discrete case, Rényi entropy can be negative for continuous random variables, and so Rényi entropy is typically only used for discrete variables. For example, let *X* be the random variable defined on [1, ∞) with density

Then

Each value of α gives a possible entropy measure. All are additive for independent random variables. And for each discrete random variable *X*, *H*_{α} is a monotone non-increasing function of α.
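A quick numerical check of the definition across orders (a sketch; `renyi_entropy` is a helper name introduced here, not a library function):

```python
from math import log2, inf

def renyi_entropy(p, alpha):
    # Rényi entropy of order alpha, in bits; alpha = 1 and alpha = inf
    # are handled as their limiting cases.
    if alpha == 1:
        return -sum(pi * log2(pi) for pi in p)   # Shannon entropy
    if alpha == inf:
        return -log2(max(p))                     # min-entropy
    return log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
for a in [0, 0.5, 1, 2, inf]:
    print(a, renyi_entropy(p, a))  # values shrink as alpha grows
```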

Assume all the probabilities *p*_{i} are positive. Then *H*_{0} is known as the max-entropy, or Hartley entropy. It is simply log_{2} *n*, the log of the number of values *X* takes on with positive probability.

When α = 1 we get the more familiar Shannon entropy:

*H*_{1}(*X*) = − Σ_{i} *p*_{i} log_{2} *p*_{i}

When the order α is not specified, its implicit default value is 2. That is, when people speak of Rényi entropy without qualification, they often have in mind the case α = 2. This case is also called collision entropy and is used in quantum information theory.

In the limit as α goes to ∞, the Rényi entropy of *X* converges to the negative log of the probability of the most probable outcome. That is,

*H*_{∞}(*X*) = − log_{2} max_{i} *p*_{i}

Let *p* without a subscript be the vector of all the *p*_{i}. Then for α not equal to 1,

*H*_{α}(*X*) = (α/(1 − α)) log_{2} ‖*p*‖_{α}

where ‖*p*‖_{α} = (Σ_{i} *p*_{i}^{α})^{1/α}.

As with Lebesgue norms, you use varying values of the parameter to emphasize various features.

This post was a warm-up for the next post: Rényi differential privacy.

Other related posts:

The term Curry-Howard **isomorphism** is often used but is an overstatement. Logic and programming are not isomorphic, nor is either isomorphic to category theory.

You’ll also hear the term **computational trinitarianism**. That may be appropriate because logic, programming, and category theory share common attributes without being identical.

The relations between logic, programming, and categories are **more than analogies**, but **less than isomorphisms**.

There are formal correspondences between parts of each theory. And aside from these rigorous connections, there is also a heuristic that something interesting in one area is likely to have an interesting counterpart in the other areas. In that sense Curry-Howard-Lambek is similar to the **Langlands program** relating number theory and geometry, a set of theorems and conjectures as well as a sort of philosophy.

Scott Hanselman interviewed attorney Gary Nissenbaum in show #647 of Hanselminutes. The title was “How GDPR is affecting the American Legal System.”

Can Europe pass laws constraining American citizens? Didn’t we settle that question in 1776, or at least by 1783? And yet it is inevitable that European law affects Americans. And in fact Nissenbaum argues that **every** country has the potential to pass internet regulation affecting citizens of **every other country** in the world.

Hanselman: Doesn’t that imply that we can’t win? There’s two hundred and something plus countries in the world and if any European decides to swing by a web site in Djibouti now they’re going to be subject to laws of Europe?

Nissenbaum: I’ll double down on that. It implies that *any* country that has users of the internet can create a *more stringent* law than even the Europeans, and then on the basis of that being the preeminent regulatory body of the *world*, because it’s a race to who can be the most restrictive. Because the most restrictive is what everyone needs to comply with.

So if Tanzania decides that it is going to be the most restrictive country in terms of the laws … that relate to internet use of their citizens, theoretically, *all* web sites around the world have to be concerned about that because there are users that could be accessing their web site from Tanzania and they wouldn’t even know it.

Will the “world wide web” someday not be worldwide at all? There has been speculation, for example, that we’ll eventually have at least two webs, one Chinese and one non-Chinese. The web could tear into a lot more pieces than that.

As Nissenbaum says toward the end of the podcast

If anyone assumes there’s a simple way of handling this, they’re probably wrong. It is complicated, and you just have to live with that, because that’s the world we’re in.

**Related post**: GDPR and the right to be forgotten

For a small example, take

1/7 = 0.142857142857…

and notice that 142 + 857 = 999. That is, 8, 5, and 7 are the nine’s complements of 1, 4, and 2 respectively.

For a larger example, we can use Mathematica to look at the decimal expansion of 6/47:

```
In: N[6/47, 60]
Out: 0.127659574468085106382978723404255319148936170212765957446809
```

and we can confirm

12765957446808510638297 + 87234042553191489361702 = 99999999999999999999999
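These checks are easy to automate. Here is a sketch that computes the repeating block of a fraction by long division (`repetend` is a helper name introduced here; it assumes the denominator is coprime to 10 so the expansion is purely periodic):

```python
def repetend(p, q):
    # Digits of the repeating block of p/q, found by long division.
    # The block ends when the remainder first repeats.
    digits, r = [], p % q
    r0 = r
    while True:
        r *= 10
        digits.append(r // q)
        r %= q
        if r == r0:
            return digits

d = repetend(6, 47)                 # 46 repeating digits
half = len(d) // 2
a = int("".join(map(str, d[:half])))
b = int("".join(map(str, d[half:])))
print(a + b == 10**23 - 1)          # True: the halves sum to 23 nines
```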

Let’s do another example with 6/43:

```
In: N[6/43, 50]
Out: 0.13953488372093023255813953488372093023255813953488
```

Does the theorem hold here? The hypothesis of the theorem doesn’t hold because the fraction has period 21, which is not even. Can the conclusion of the theorem hold anyway? No, because we can’t split 21 digits in half! But we can split the 21 digits into three parts, and if we do we find

1395348 + 8372093 + 0232558 = 9999999

Let’s finish up with an example in another base. We’ll look at the fraction 3/7 in base 17.

```
In: BaseForm[N[3/7, 24], 17]
Out: 0.74e9c274e9c274e9c27_17
```

If we split the repeating part of the decimal into two parts we get 74e and 9c2, and these add up to the base 17 analog of all 9’s.

```
In: BaseForm[17^^74e + 17^^9c2, 17]
Out: ggg_17
```

That is, the base 17 numbers 9, c, and 2, are the g’s complements of 7, 4, and e respectively.

[The Mathematica operator `^^` interprets its right argument as a number whose base is given by the left argument. The `BaseForm` function takes a number as its first argument and returns its representation in the base given as its second argument.]

**Related post**: Casting out Z’s

The General Conference on Weights and Measures voted today to redefine the **kilogram**. The official definition no longer refers to the mass of the International Prototype of the Kilogram (IPK) stored at the BIPM (Bureau International des Poids et Mesures) in France. The **coulomb**, **kelvin**, and **mole** have also been redefined.

The vote took place today, 2018-11-16, and the new definitions will officially begin 2019-05-20.

Until now, Planck’s constant had been **measured** to be

*h* = 6.62607015 × 10^{-34} kg m^{2} s^{-1}.

Now this is the exact value of **Planck’s constant** by definition, implicitly defining the kilogram.

The other SI units were already defined in terms of fundamental physical phenomena. The second is defined as

the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the cesium-133 atom

and the statement that the speed of light is

*c* = 299,792,458 m s^{-1}

went from being a measurement to a definition, much as the measurement of Planck’s constant became a definition [1].

So now the physics of cesium-133 defines the second, the speed of light defines the meter, and Planck’s constant defines the kilogram. There are no more SI units defined by physical objects.

The elementary charge had been measured as

*e* = 1.602176634 × 10^{-19} C

and now this equation is exact by definition, defining the **coulomb** in terms of the elementary charge. And since the ampere is defined as a flow of one coulomb per second, this also changes the definition of the **ampere**.

The **kelvin** unit of temperature is now defined in terms of the **Boltzmann constant**, again reversing the role of measurement and definition. Now the expression

*k* = 1.380649 × 10^{-23} kg m^{2} s^{-2} K^{-1}

for the Boltzmann constant is exact by definition, defining *K*.

Avogadro’s number is now exact, defining the **mole**:

*N*_{A} = 6.02214076 × 10^{23} mol^{-1}.

Incidentally, when I first started my Twitter account @UnitFact for units of measure, I used the standard kilogram as the icon.

Then some time later I changed the icon to a Greek mu, the symbol for “micro” used with many units.

The icon for @ScienceTip, however, is Planck’s constant!

It would have been cool if I had anticipated the redefinition of the kilogram and replaced my kilogram icon with the Planck constant icon. (Actually, it’s not Planck’s constant *h* but the “reduced Planck constant” h-bar equal to *h*/2π.)

[1] The meter was originally defined in 1791 to be one ten-millionth of the distance from the equator to the North Pole. In 1889, it was redefined to be the distance between two marks on a metal bar stored at BIPM. In 1960 the meter was defined in terms of the spectrum of krypton-86, the first definition that did not depend on an object kept in a vault. In 1983 the definition changed to the one above, making the speed of light exact.

The second was defined in terms of the motion of the earth until 1967 when the definition in terms of cesium-133 was adopted.


**Deep learning** has spurred interest in **novel floating point formats**. Algorithms often don’t need as much precision as standard IEEE-754 doubles or even single precision floats. Lower precision makes it possible to hold more numbers in memory, reducing the time spent swapping numbers in and out of memory. Also, low-precision circuits are far less complex. Together these benefits can give significant speedup.

Here I want to look at **bfloat16**, or BF16 for short, and compare it to 16-bit number formats I’ve written about previously, IEEE and posit. BF16 is becoming a de facto standard for deep learning. It is supported by several deep learning accelerators (such as Google’s TPU), and will be supported in Intel processors two generations from now.

The BF16 format is sort of a cross between FP16 and FP32, the 16- and 32-bit formats defined in the IEEE 754-2008 standard, also known as half precision and single precision.

BF16 has 16 bits like FP16, but has the same number of exponent bits as FP32. Each number has 1 sign bit. The rest of the bits in each of the formats are allocated as in the table below.

```
|--------+------+----------+----------|
| Format | Bits | Exponent | Fraction |
|--------+------+----------+----------|
| FP32   |   32 |        8 |       23 |
| FP16   |   16 |        5 |       10 |
| BF16   |   16 |        8 |        7 |
|--------+------+----------+----------|
```

BF16 has as many bits as a FP16, but as many *exponent* bits as a FP32. The latter makes conversion between BF16 and FP32 simple, except for some edge cases regarding denormalized numbers.
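That simplicity is easy to see in code. Here is a sketch of truncating conversion between FP32 and BF16 bit patterns (real hardware typically rounds to nearest rather than truncating):

```python
import struct

def fp32_to_bf16_bits(x):
    # Reinterpret x as an IEEE single precision bit pattern, then keep
    # only the top 16 bits: sign, 8 exponent bits, 7 fraction bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16

def bf16_bits_to_fp32(b):
    # Widening BF16 to FP32 is exact: append 16 zero fraction bits.
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

x = 3.14159265
y = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
print(y)  # 3.140625: only 7 fraction bits survive
```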

The epsilon value, the smallest number ε such that 1 + ε > 1 in machine representation, is 2^{–e} where *e* is the number of fraction bits. BF16 has much less precision near 1 than the other formats.

```
|--------+------------|
| Format | Epsilon    |
|--------+------------|
| FP32   | 0.00000012 |
| FP16   | 0.00390625 |
| BF16   | 0.03125000 |
|--------+------------|
```

The dynamic range of bfloat16 is similar to that of an IEEE single precision number. Relative to FP32, BF16 sacrifices precision to retain range. Range is mostly determined by the number of exponent bits, though not entirely.

Dynamic range in decades is the log base 10 of the ratio of the largest to smallest representable positive numbers. The dynamic ranges of the numeric formats are given below. (Python code to calculate dynamic range is given here.)

```
|--------+-------|
| Format | DR    |
|--------+-------|
| FP32   | 83.38 |
| BF16   | 78.57 |
| FP16   | 12.04 |
|--------+-------|
```
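The calculation itself is short. A sketch, assuming subnormals count toward the smallest positive value:

```python
from math import log10

def dynamic_range_decades(max_pos, min_pos):
    # log base 10 of the ratio of the largest to smallest
    # representable positive numbers (subnormals included)
    return log10(max_pos / min_pos)

# FP16: 5 exponent bits, 10 fraction bits
fp16 = dynamic_range_decades((2 - 2**-10) * 2**15, 2**-24)
# BF16: 8 exponent bits, 7 fraction bits
bf16 = dynamic_range_decades((2 - 2**-7) * 2**127, 2**-133)
print(round(fp16, 2), round(bf16, 2))  # 12.04 78.57
```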

The precision and dynamic range of posit numbers depends on how many bits you allocate to the maximum exponent, denoted *es* by convention. (Note “maximum.” The number of exponent bits varies for different numbers.) This post explains the anatomy of a posit number.

Posit numbers can achieve more precision and more dynamic range than IEEE-like floating point numbers with the same number of bits. Of course there’s no free lunch. Posits represent large numbers with low precision and small numbers with high precision, but this trade-off is often what you’d want.

For an *n*-bit posit, the number of fraction bits near 1 is *n* – 2 – *es* and so epsilon is 2 to the exponent *es* – *n* – 2. The dynamic range is

which is derived here. The dynamic range and epsilon values for 16-bit posits with *es* ranging from 1 to 4 are given in the table below.

```
|----+--------+-----------|
| es | DR     | epsilon   |
|----+--------+-----------|
|  1 |  16.86 | 0.0000076 |
|  2 |  33.82 | 0.0000153 |
|  3 |  37.43 | 0.0000305 |
|  4 | 143.86 | 0.0000610 |
|----+--------+-----------|
```

For all the values of *es* above, a 16-bit posit number has a smaller epsilon than either FP16 or BF16. The dynamic range of a 16-bit posit is larger than that of a FP16 for all values of *es*, and greater than BF16 and FP32 when *es* = 4.

One of my most popular posts on Twitter was an implicit criticism of the cliché “work smarter, not harder.”

Productivity tip: Work hard.

— John D. Cook (@JohnDCook) October 8, 2015

I agree with the idea that you can often be more productive by stepping back and thinking about what you’re doing. I’ve written before, for example, that programmers need to spend less time in front of a computer.

But one thing I don’t like about “work smarter” is the implication that being smart eliminates the need to work hard. It’s like a form of gnosticism.

Also, “working smarter” is kind of a given. People don’t often say “I know of a smarter way to do this, but I prefer working hard at the dumb way.” [1] Instead, they’re being as smart as they know how, or at least they think they are. To suggest otherwise is to insult their intelligence.

One way to “work smarter, not harder” is to take good advice. This is different from “working smarter” in the sense of thinking alone in an empty room, waiting for a flash of insight. Maybe you’re doing what you’re doing as well as you can, but you should be doing something else. Maybe you’re cleverly doing something that doesn’t need to be done.

- “Hammock-driven development” by Rich Hickey
- Why programmers are not paid in proportion to productivity

[1] If they do, they’re still being smart at a different level. Someone might think “Yeah, I know of a way to do this that would be more impressive. But I’m going to take a more brute-force approach that’s more likely to be correct.” Or they might think “I could imagine a faster way to do this, but I’m too tired right now to do that.” They’re still being optimal, but they’re including more factors in the objective they’re trying to optimize.

**Hypergeometric functions** are something of a “grand unified theory” of special functions. Many functions that come up in applications are special cases of the hypergeometric function, as shown in the diagram below. (Larger version here.)

I give a brief introduction to hypergeometric functions in these notes (4-page pdf).

The confluent hypergeometric function corresponds to Hypergeometric 1F1 in the chart above [2]. In Mathematica it goes by `Hypergeometric1F1` and in Python by `scipy.special.hyp1f1`.

This function is important in probability because you can use it to compute the CDFs for the normal and gamma distributions.
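As a sketch of that connection, here is the gamma CDF computed from the confluent hypergeometric function, using the identity P(a, x) = x^a e^{−x} M(1, a+1; x) / Γ(a+1) for the regularized lower incomplete gamma function. The series helper is written out so the example needs no scipy:

```python
from math import exp, gamma

def hyp1f1_series(a, b, z, terms=200):
    # Truncated power series for M(a, b; z) = sum of (a)_k z^k / ((b)_k k!)
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (a + k) * z / ((b + k) * (k + 1))
    return total

# Gamma CDF with shape a and scale 1, via M(1, a+1; x)
a, x = 2.5, 1.7
p = x**a * exp(-x) * hyp1f1_series(1, a + 1, x) / gamma(a + 1)
print(p)  # matches the regularized incomplete gamma P(2.5, 1.7)
```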

Bujanda et al give series expansions of the confluent hypergeometric function in terms of incomplete gamma functions. That may not sound like progress because the incomplete gamma functions are still special functions, but the confluent hypergeometric function *M*(*a*, *b*; *z*) is a function of three variables whereas the incomplete gamma function γ(*s*, *z*) is a function of two variables.

They also give expansions in terms of elementary functions which may be of more interest. The authors give the series expansion below.

The *A* functions are given recursively by

and

for *n* > 1.

The *F* functions are given by

where

The authors give approximation error bounds in [1]. In the plot below we show that *n* = 3 makes a good approximation for *M*(3, 4.1, *z*). The error converges to zero uniformly as *n* increases.

- Connection between hypergeometric distribution and hypergeometric function
- Relations between special functions
- Stand-alone numerical code

[1] Blanca Bujanda, José L. López, and Pedro J. Pagola. Convergent expansions of the confluent hypergeometric functions in terms of elementary functions. Mathematics of Computation. DOI: https://doi.org/10.1090/mcom/3389

[2] “Confluent” is a slightly archaic name, predating a more systematic naming for hypergeometric functions. The name *m*F*n* means the hypergeometric function has *m* upper parameters and *n* lower parameters. The confluent hypergeometric function has one of each, so it corresponds to 1F1. For more details, see these notes.

How does big data impact privacy? Which is a bigger risk to your privacy, being part of a little database or a big database?

People commonly speak of big data in terms of **volume**—the “four v’s” of big data being volume, variety, velocity, and veracity—but what we’re concerned with here might better be called “area.” We’ll think of our data being in one big table. If there are repeated measures on an individual, think of them as more columns in a denormalized database table.

In what sense is the data big: is it wide or long? That is, if we think of the data as a table with rows for individuals and columns for different fields of information on individuals, are there a lot of rows or a lot of columns?

All else being equal, your **privacy goes down as columns go up**. The more information someone has about you, the more likely some of it may be used in combination to identify you.

How privacy varies with the number of rows is more complicated. Your **privacy could go up or down with the number of rows**.

The more individuals in a dataset, the more likely there are individuals like you in the dataset. From the standpoint of *k*-anonymity, this makes it harder to **identify** you, but easier to **reveal information** about you.

For example, suppose there are 50 people who have all the same quasi-identifiers as you do. Say you’re a Native American man in your 40’s and there are 49 others like you. Then someone who knows your demographics can’t be sure which record is yours; they’d have a 2% chance of guessing the right one. The presence of other people with similar demographics makes you harder to identify. On the other hand, their presence makes it more likely that someone could find out something you all have in common. If the data show that middle aged Native American men are highly susceptible to some disease, then the data imply that *you* are likely to be susceptible to that disease.

Because privacy measures aimed at protecting individuals don’t necessarily protect groups, some minority groups have been reluctant to participate in scientific studies. The **Genetic Information Nondiscrimination Act** of 2008 (GINA) makes some forms of genetic discrimination illegal, but it’s quite understandable that minority groups might still be reluctant to participate in studies.

Your privacy can improve as a dataset gets bigger. If there are a lot of rows, the data curator can afford a substantial amount of **randomization** without compromising the value of the data. The noise in the data will not affect statistical conclusions from the data but will protect individual privacy.

With differential privacy, the data is held by a trusted curator. Noise is added not to the data itself but to the results of *queries* on the data. Noise is added in proportion to the sensitivity of a query. The sensitivity of a query often goes down with the size of the database, and so a differentially private query of a big dataset may only need to add a negligible amount of noise to maintain privacy.

If the dataset is very large, it may be possible to randomize the data itself before it enters the database using randomized response or **local differential privacy**. With these approaches, there’s no need for a trusted data curator. This wouldn’t be possible with a small dataset because the noise would be too large relative to the size of the data.
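Randomized response is simple enough to sketch. In the classic coin-flip version (my parameterization here, not taken from a specific source), each respondent answers truthfully half the time and reports a random coin otherwise, and the true proportion can be recovered from the noisy aggregate:

```python
import random

def randomized_response(truth):
    # Flip a coin: heads, answer truthfully; tails, flip again
    # and report that second coin instead of the truth.
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

random.seed(42)
true_rate = 0.30
n = 100_000
responses = [randomized_response(random.random() < true_rate) for _ in range(n)]
observed = sum(responses) / n
# E[observed] = 0.25 + 0.5 * true_rate, so invert to estimate:
estimate = 2 * observed - 0.5
print(round(estimate, 2))  # close to 0.30
```

No individual response is trustworthy, yet the aggregate is accurate when *n* is large; with a small dataset the noise would swamp the signal, which is the point made above.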

If it cost more to send email, even a fraction of a cent per message, that could be enough to deter spammers. So suppose you want to charge anyone $0.005 to send you an email message. You don’t actually want to collect their money, you just want proof that they’d be *willing* to spend something to email you. You’re not even trying to block robots, you just want to block *cheap* robots.

So instead of asking for a micropayment, you could ask the sender to solve a puzzle, something that would require around $0.005 worth of computing resources. If you’re still getting too much spam, you could raise your rate by giving them a harder puzzle.

Dwork and Naor list several possible puzzles. The key is to find a puzzle that takes a fair amount of effort to solve but whose solution is easy to verify.

Bitcoin uses hash problems for proof-of-work puzzles. Cryptographic hash functions are difficult to predict, and so you can’t do much better than brute force search if you want to come up with input whose hashed value has a specified pattern.

The goal is to add a fixed amount of additional text to a message such that when the hash function is applied, the resulting value is in some narrow range, such as requiring the first *n* bits to be zeros. The number *n* could be adjusted over time as needed to calibrate the problem difficulty. Verifying the solution requires computing only one hash, but finding the solution requires computing 2^{n} hashes on average.
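Here is a sketch of that scheme using SHA-256 (an illustration of the idea, not the exact puzzle from [1] or Bitcoin’s block format):

```python
import hashlib
from itertools import count

def solve(message, n_bits):
    # Brute-force search for a nonce such that SHA-256(message + nonce)
    # begins with n_bits zero bits; takes about 2**n_bits hashes on average.
    for nonce in count():
        digest = hashlib.sha256(f"{message}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - n_bits) == 0:
            return nonce

def verify(message, nonce, n_bits):
    # Verification needs only a single hash.
    digest = hashlib.sha256(f"{message}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - n_bits) == 0

nonce = solve("please read my email", 16)
print(verify("please read my email", nonce, 16))  # True
```

Raising `n_bits` by one doubles the sender’s expected work while leaving the recipient’s verification cost unchanged.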

[1] Cynthia Dwork and Moni Naor (1993). “Pricing via Processing, or, Combatting Junk Mail”. Advances in Cryptology, CRYPTO ’92: Lecture Notes in Computer Science No. 740. Springer: 139–147.

[2] Markus Jakobsson and Ari Juels (1999). “Proofs of Work and Bread Pudding Protocols”. Communications and Multimedia Security. Kluwer Academic Publishers: 258–272.
